* [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-16 23:15 Andres Rodriguez
From: Andres Rodriguez @ 2016-12-16 23:15 UTC (permalink / raw)
  To: amd-gfx@lists.freedesktop.org

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

We are interested in feedback for a mechanism to effectively schedule high
priority VR reprojection tasks (also referred to as time-warping) for Polaris10
running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR users in
scenarios where the game or application would fail to finish rendering a new
frame in time for the next VBLANK. When this happens, the user's head movements
are not reflected on the Head Mounted Display (HMD) for the duration of an
extra frame. This extended mismatch between the inner ear and the eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame using the
user's updated head position in combination with the previous frames. This
avoids a prolonged mismatch between the HMD output and the inner ear.

Because of the adverse effects on the user, we require high confidence that the
reprojection task will complete before the VBLANK interval. Even if the GFX pipe
is currently full of work from the game/application (which is most likely the case).

For more details and illustrations, please refer to the following document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from submission to fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware should
be equivalent to submitting a NOP on idle hardware.

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in Polaris10 we
will not be able to provide a solution compatible with GFX workloads.

But I would love to hear otherwise. So if anyone has an idea, approach or
suggestion that will also be compatible with the GFX ring, please let us know
about it.

    * The above guarantees should also be respected by amdkfd workloads

This would be good to have for consistency, but it is not strictly necessary, as
users running games do not traditionally run HPC workloads in the background.

Proposed approach:
------------------

Similar to the Windows driver, we could expose a high priority compute queue to
userspace.

Submissions to this compute queue will be scheduled with high priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority' field in the HQDs,
which could be programmed by amdgpu or the amdgpu scheduler. The relevant
register fields are:
	* mmCP_HQD_PIPE_PRIORITY
	* mmCP_HQD_QUEUE_PRIORITY
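
For illustration, here is a rough, untested sketch of what that programming
could look like in gfx_v8_0.c, following the usual VI SRBM select pattern.
The helper name is made up, and the priority encoding is an assumption that
has not been validated against the register spec:

/* Hypothetical helper: program the HQD priority fields for one queue.
 * Assumes the value written to both registers is a small integer where
 * higher means more favourable arbitration.
 */
static void gfx_v8_0_set_hqd_priority(struct amdgpu_device *adev,
				      struct amdgpu_ring *ring,
				      u32 priority)
{
	mutex_lock(&adev->srbm_mutex);
	vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);

	WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
	WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);

	vi_srbm_select(adev, 0, 0, 0, 0);
	mutex_unlock(&adev->srbm_mutex);
}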

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We can
statically partition these as follows:
	* 7x regular
	* 1x high priority

The relevant priorities can be set so that submissions to the high priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority rings if the
context is marked as high priority, and a corresponding priority level should
be added to keep track of this information:
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL
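
A minimal sketch of the enum change (the exact contents of the enum in the
tree may differ; AMD_SCHED_PRIORITY_HIGH is the proposed addition):

enum amd_sched_priority {
	AMD_SCHED_PRIORITY_KERNEL = 0,
	AMD_SCHED_PRIORITY_HIGH,	/* proposed */
	AMD_SCHED_PRIORITY_NORMAL,
	AMD_SCHED_MAX_PRIORITY
};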

The user will request a high priority context by setting an appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is at the context level so that we can:
    * Maintain a consistent FIFO ordering of all submissions to a context
    * Create high priority and non-high priority contexts in the same process
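
From userspace, requesting such a context could look roughly like the snippet
below. AMDGPU_CTX_HIGH_PRIORITY is the hypothetical flag from this proposal
(it does not exist in amdgpu_drm.h today, and its value here is arbitrary);
the rest is the regular context allocation ioctl:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <amdgpu_drm.h>

/* Hypothetical flag from this proposal. */
#define AMDGPU_CTX_HIGH_PRIORITY	(1 << 0)

static int alloc_high_priority_ctx(int fd, uint32_t *ctx_id)
{
	union drm_amdgpu_ctx args;

	memset(&args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
	args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

	if (ioctl(fd, DRM_IOCTL_AMDGPU_CTX, &args))
		return -1;

	*ctx_id = args.out.alloc.ctx_id;
	return 0;
}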

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue priorities
dynamically when scheduling a task.

This would involve having a hardware specific callback from the scheduler to
set the appropriate queue priority: set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform the appropriate
HW programming, and I'm not really sure if that is something we should be doing from
the scheduler.
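
To make the discussion concrete, the hook could live next to the existing
scheduler backend ops, roughly as below. The existing ops are reproduced from
memory and may not match the tree exactly; set_priority() is the proposed
addition, shown here taking the job rather than (ring, index) - either shape
would work:

struct amd_sched_backend_ops {
	struct fence *(*dependency)(struct amd_sched_job *sched_job);
	struct fence *(*run_job)(struct amd_sched_job *sched_job);
	void (*timedout_job)(struct amd_sched_job *sched_job);
	void (*free_job)(struct amd_sched_job *sched_job);
	/* proposed: reprogram the HW queue priority before a job runs,
	 * e.g. by writing the HQD registers under the SRBM mutex */
	void (*set_priority)(struct amd_sched_job *sched_job,
			     enum amd_sched_priority priority);
};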

On the positive side, this approach would allow us to program a range of
priorities for jobs instead of a single "high priority" value, achieving
something similar to the niceness API available for CPU scheduling.

I'm not sure if this flexibility is something that we would need for our use
case, but it might be useful in other scenarios (multiple users sharing compute
time on a server).
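
For example, a nice-style value could be mapped onto the HQD queue priority
with a small helper like the one below. The -20..19 input range mirrors CPU
niceness; the 0..15 output range is an assumption for illustration, not taken
from the register spec:

#include <linux/kernel.h>

/* Hypothetical mapping: nice -20 (most favourable) maps to priority 15,
 * nice 19 (least favourable) maps to priority 0.
 */
static u32 amdgpu_nice_to_hqd_priority(int nice)
{
	nice = clamp(nice, -20, 19);

	return (u32)(((19 - nice) * 15) / 39);
}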

This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing
of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and instead it picks
jobs at random. Settings from the shader itself are also disregarded as this is
considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we might not get the
time we need on the SQ.

The current programming would have to be changed to allow priority propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled for all HW IPs
with SW scheduler support. This will function similarly to the current
AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not
yet committed to the HW queue.

The benefits of requesting a high priority context for a non-compute queue will
be smaller (e.g. up to 10s of wait time if a GFX command is stuck in front of
you), but having the API in place will allow us to easily improve the
implementation in the future as new features become available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the implementation.

Also, once the interface is mostly decided, we can start thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above. Our goal is to
obtain a mechanism that will allow us to complete the reprojection job within a
predictable amount of time. So if anyone has any suggestions for improvements
or alternative strategies, we are more than happy to hear them.

If any of the technical information above is also incorrect, feel free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-17  1:15 Sagalovitch, Serguei
From: Sagalovitch, Serguei @ 2016-12-17  1:15 UTC (permalink / raw)
  To: Andres Rodriguez, amd-gfx@lists.freedesktop.org

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
to the high-priority queue when it is in use and "free" them later
(we do not want to take CUs away from e.g. a graphics task forever and
degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or low-priority
compute) takes all (extra) CUs and the high-priority work waits for the needed
resources. This will not be visible with a "NOP" but only when you submit a
"real" compute task, so I would recommend not using "NOP" packets at all for
testing.

It (CU assignment) could be done relatively easily when everything goes via the
kernel (e.g. as part of frame submission) but I must admit that I am not sure
about the best way to handle user level submissions (amdkfd).


2) I would recommend dedicating the whole pipe to the high-priority queue and
having nothing there except it.

BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?


> we will not be able to provide a solution compatible with GFX workloads.
I assume that you are talking about graphics? Am I right?

Sincerely yours,
Serguei Sagalovitch





* RE: [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-17  1:29 Andres Rodriguez
From: Andres Rodriguez @ 2016-12-17  1:29 UTC (permalink / raw)
  To: Sagalovitch, Serguei, amd-gfx@lists.freedesktop.org

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
to the high-priority queue when it is in use and "free" them later
(we do not want to take CUs away from e.g. a graphics task forever and
degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or low-priority
compute) takes all (extra) CUs and the high-priority work waits for the needed
resources. This will not be visible with a "NOP" but only when you submit a
"real" compute task, so I would recommend not using "NOP" packets at all for
testing.

It (CU assignment) could be done relatively easily when everything goes via the
kernel (e.g. as part of frame submission) but I must admit that I am not sure
about the best way to handle user level submissions (amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up!
Is this similar to the CU masking programming?

2) I would recommend dedicating the whole pipe to the high-priority queue and
having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition
of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?

[AR] Vulkan

> we will not be able to provide a solution compatible with GFX workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling
something else using mid-buffer pre-emption has some cases where it doesn't work well. But if it
starts working well with Polaris10, it might be a better solution for us (because the whole reprojection
work uses the Vulkan graphics stack at the moment, and porting it to compute is not trivial).



* Re: [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-17  2:13 Sagalovitch, Serguei
From: Sagalovitch, Serguei @ 2016-12-17  2:13 UTC (permalink / raw)
  To: Andres Rodriguez, amd-gfx@lists.freedesktop.org

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
    
Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
to the high-priority queue when it is in use and "free" them later
(we do not want to take CUs away from e.g. a graphics task forever and
degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or low-priority
compute) takes all (extra) CUs and the high-priority work waits for the needed
resources. This will not be visible with a "NOP" but only when you submit a
"real" compute task, so I would recommend not using "NOP" packets at all for
testing.

It (CU assignment) could be done relatively easily when everything goes via the
kernel (e.g. as part of frame submission) but I must admit that I am not sure
about the best way to handle user level submissions (amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler", when deciding which
queue to run, will check if there are enough resources, and if not it will begin
to check other queues with lower priority.

2) I would recommend dedicating the whole pipe to the high-priority queue and
having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition
of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-) as the MEC defines it. As far as I understand (by simplifying),
some scheduling is per pipe. I know about the current allocation scheme but I do not
think that it is ideal. I would assume that we need to switch to dynamic partitioning
of resources based on the workload, otherwise we will have a resource conflict
between Vulkan compute and OpenCL.


BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
involved.  I would assume that in the case of VR we will have one main
application ("console" mode(?)) so we could temporarily "ignore"
OpenCL/ROCm needs when VR is running.

> we will not be able to provide a solution compatible with GFX workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling
something else using mid-buffer pre-emption has some cases where it doesn't work well. But if it
starts working well with Polaris10, it might be a better solution for us (because the whole reprojection
work uses the Vulkan graphics stack at the moment, and porting it to compute is not trivial).

[Serguei] The problem with pre-emption of a graphics task: (a) it may take time, so
latency may suffer; (b) to preempt we need to have a different "context" - we want
to guarantee that submissions from the same context will be executed in order.
BTW: (a) Do you want to "preempt" and later resume, or "preempt" and
"cancel/abort"? (b) Vulkan is a generic API and could be used
for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).




* RE: [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-17  3:00 Andres Rodriguez
From: Andres Rodriguez @ 2016-12-17  3:00 UTC (permalink / raw)
  To: Sagalovitch, Serguei, amd-gfx@lists.freedesktop.org

Hey Serguei,

> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I understand (by simplifying),
> some scheduling is per pipe. I know about the current allocation scheme but I do not
> think that it is ideal. I would assume that we need to switch to dynamic partitioning
> of resources based on the workload, otherwise we will have a resource conflict
> between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can start with a solution that assumes that
only pipe0 has any work and the other pipes are idle (no HSA/ROCm running on the system).

This should be more or less the use case we expect from VR users.

I agree the split is currently not ideal, but I'd like to consider that a separate task, because
making it dynamic is not straightforward :P

> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
> involved.  I would assume that in the case of VR we will have one main
> application ("console" mode(?)) so we could temporarily "ignore"
> OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority compute queue through
libdrm-amdgpu, so that we can expose it through Vulkan later.

For current VR workloads we actually have 3 separate processes running:
    1) Game process
    2) VR Compositor (this is the process that will require high priority queue)
    3) System compositor (we are looking at approaches to remove this overhead)

For now I think it is okay to assume no OpenCL/ROCm running simultaneously, but
I would also like to be able to address this case in the future (cross-pipe priorities).

> [Serguei] The problem with pre-emption of a graphics task: (a) it may take time, so
> latency may suffer

The latency is our main concern, we want something that is predictable. A good
illustration of what the reprojection scheduling looks like can be found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png

> (b) to preempt we need to have a different "context" - we want
> to guarantee that submissions from the same context will be executed in order.

This is okay, as the reprojection work doesn't have dependencies on the game context, and it
even happens in a separate process.

> BTW: (a) Do you want to "preempt" and later resume, or "preempt" and
> "cancel/abort"?

Preempt the game with the compositor task and then resume it.

> (b) Vulkan is a generic API and could be used for graphics as well as
> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use Vulkan compute. But if you figure out a way for us to get
a guaranteed execution time using Vulkan graphics, then I'll take you out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
to high-priority queue  when it will be in use and "free" them later
(we  do not want forever take CUs from e.g. graphic task to degrade graphics
performance).

Otherwise we could have scenario when long graphics task (or low-priority
compute) will took all (extra) CUs and high--priority will wait for needed resources.
It will not be visible on "NOP " but only when you submit "real" compute task
so I would recommend  not to use "NOP" packets at all for testing.

It (CU assignment) could be relatively easy done when everything is going via kernel
(e.g. as part of frame submission) but I must admit that I am not sure
about the best way for user level submissions (amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that "scheduler" when deciding which
queue to  run will check if there is enough resources and if not then it will begin
to check other queues with lower priority.

2) I would recommend to dedicate the whole pipe to high-priority queue and have
nothing their except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition
of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-)  as MEC define it.  As far as I understand (by simplifying)
some scheduling is per pipe.  I know about the current allocation scheme but I do not think
that it is  ideal.  I would assume that we need  to switch to dynamical partition
of resources  based on the workload otherwise we will have resource conflict
between Vulkan compute and  OpenCL.


BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will be not
involved.  I would assume that in the case of VR we will have one main
application ("console" mode(?)) so we could temporally "ignore"
OpenCL/ROCm needs when VR is running.

> we will not be able to provide a solution compatible with GFX worloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling in
something else using mid-buffer pre-emption has some cases where it doesn't work well. But if with
polaris10 it starts working well, it might be a better solution for us (because the whole reprojection
work uses the vulkan graphics stack at the moment, and porting it to compute is not trivial).

[Serguei]  The problem with pre-emption of graphics task:  (a) it may take time so
latency may suffer (b) to preempt we need to have different "context" - we want
to guarantee that submissions from the same context will be executed in order.
BTW: (a) Do you want  "preempt" and later resume or do you want "preempt" and
"cancel/abort"?  (b) Vulkan is generic API and could be used
for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).


Sincerely yours,
Serguei Sagalovitch



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 6:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249


[RFC] Mechanism for high priority scheduling in amdgpu
gist.github.com
[RFC] Mechanism for high priority scheduling in amdgpu




[RFC] Mechanism for high priority scheduling in amdgpu
gist.github.com
[RFC] Mechanism for high priority scheduling in amdgpu


We are interested in feedback for a mechanism to effectively schedule high
priority VR reprojection tasks (also referred to as time-warping) for Polaris10
running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR users in
scenarios where the game or application would fail to finish rendering a new
frame in time for the next VBLANK. When this happens, the user's head movements
are not reflected on the Head Mounted Display (HMD) for the duration of an
extra frame. This extended mismatch between the inner ear and the eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame using the
user's updated head position in combination with the previous frames. This
avoids a prolonged mismatch between the HMD output and the inner ear.

Because of the adverse effects on the user, we require high confidence that the
reprojection task will complete before the VBLANK interval. Even if the GFX pipe
is currently full of work from the game/application (which is most likely the case).

For more details and illustrations, please refer to the following document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved


Gaming: Asynchronous Shaders Evolved | Community
community.amd.com
One of the most exciting new developments in GPU technology over the past year has been the adoption of asynchronous shaders, which can make more efficient use of ...



Gaming: Asynchronous Shaders Evolved | Community
community.amd.com
One of the most exciting new developments in GPU technology over the past year has been the adoption of asynchronous shaders, which can make more efficient use of ...


Requirements:
-------------

The mechanism must expose the following functionaility:

    * Job round trip time must be predictable, from submission to fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware should
be equivalent to submitting a NOP on idle hardware.

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in Polaris10 we
will not be able to provide a solution compatible with GFX worloads.

But I would love to hear otherwise. So if anyone has an idea, approach or
suggestion that will also be compatible with the GFX ring, please let us know
about it.

    * The above guarantees should also be respected by amdkfd workloads

Would be good to have for consistency, but not strictly necessary as users running
games are not traditionally running HPC workloads in the background.

Proposed approach:
------------------

Similar to the windows driver, we could expose a high priority compute queue to
userspace.

Submissions to this compute queue will be scheduled with high priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority' field in the HQDs
and could be programmed by amdgpu or the amdgpu scheduler. The relevant
register fields are:
        * mmCP_HQD_PIPE_PRIORITY
        * mmCP_HQD_QUEUE_PRIORITY

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We can
statically partition these as follows:
        * 7x regular
        * 1x high priority

The relevant priorities can be set so that submissions to the high priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority rings if the
context is marked as high priority. And a corresponding priority should be
added to keep track of this information:
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL

The user will request a high priority context by setting an appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is in a per context level so that we can:
    * Maintain a consistent FIFO ordering of all submissions to a context
    * Create high priority and non-high priority contexts in the same process

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities and
amdgpu_init() time, the SW scheduler will reprogram the queue priorities
dynamically when scheduling a task.

This would involve having a hardware specific callback from the scheduler to
set the appropriate queue priority: set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform the appropriate
HW programming, and I'm not really sure if that is something we should be doing from
the scheduler.

On the positive side, this approach would allow us to program a range of
priorities for jobs instead of a single "high priority" value", achieving
something similar to the niceness API available for CPU scheduling.

I'm not sure if this flexibility is something that we would need for our use
case, but it might be useful in other scenarios (multiple users sharing compute
time on a server).

This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing
of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and instead it picks
jobs at random. Settings from the shader itself are also disregarded as this is
considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we might not get the
time we need on the SQ.

The current programming would have to be changed to allow priority propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled for all HW IPs
with support of the SW scheduler. This will function similarly to the current
AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not
commited to the HW queue.

The benefits of requesting a high priority context for a non-compute queue will
be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of
you), but having the API in place will allow us to easily improve the implementation
in the future as new features become available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the implementation.

Also, once the interface is mostly decided, we can start thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above. Our goal is to
obtain a mechanism that will allow us to complete the reprojection job within a
predictable amount of time. So if anyone has any suggestions for
improvements or alternative strategies, we are more than happy to hear them.

If any of the technical information above is also incorrect, feel free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                 ` <544E607D03B20249AA404517E498FC4699EC70-Lp/cVzEoVyaisxZYEgh0i620KmCxYQEWVpNB7YpNyf8@public.gmane.org>
@ 2016-12-17  5:05                   ` Sagalovitch, Serguei
       [not found]                     ` <SN1PR12MB0703173C7AD623F6C5AECE7DFE9F0-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Sagalovitch, Serguei @ 2016-12-17  5:05 UTC (permalink / raw)
  To: Andres Rodriguez, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Andres,

> For current VR workloads we have 3 separate processes running actually:
So we could have a potential memory overcommit case, or do you do partitioning
on your own?  I would think that there is a need to avoid overcommit in the VR
case to prevent any BO migration. BTW: Do you mean __real__ processes or threads?
Based on my understanding, sharing BOs between different processes
could introduce additional synchronization constraints.  BTW: I am not sure
if we are able to share Vulkan sync objects across the process boundary.

>    3) System compositor (we are looking at approaches to remove this overhead)
Yes, IMHO the best is to run in "full screen mode".

> The latency is our main concern, 
I would assume that this is a known problem (at least for compute usage).
It looks like amdgpu / kernel submission is rather CPU intensive (at least
in the default configuration).

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
    
Hey Serguei,

> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I understand (by simplifying)
> some scheduling is per pipe.  I know about the current allocation scheme but I do not think
> that it is ideal.  I would assume that we need to switch to dynamic partitioning
> of resources based on the workload, otherwise we will have resource conflicts
> between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can start with a solution that assumes that
only pipe0 has any work and the other pipes are idle (no HSA/ROCm running on the system).

This should be more or less the use case we expect from VR users.

I agree the split is currently not ideal, but I'd like to consider that a separate task, because
making it dynamic is not straightforward :P

> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
> involved.  I would assume that in the case of VR we will have one main
> application ("console" mode(?)) so we could temporarily "ignore"
> OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority compute queue through
libdrm-amdgpu, so that we can expose it through Vulkan later.

For current VR workloads we have 3 separate processes running actually:
    1) Game process
    2) VR Compositor (this is the process that will require high priority queue)
    3) System compositor (we are looking at approaches to remove this overhead)

For now I think it is okay to assume no OpenCL/ROCm running simultaneously, but
I would also like to be able to address this case in the future (cross-pipe priorities).

> [Serguei]  The problem with pre-emption of a graphics task:  (a) it may take time, so
> latency may suffer

The latency is our main concern, we want something that is predictable. A good
illustration of what the reprojection scheduling looks like can be found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png

> (b) to preempt we need to have a different "context" - we want
> to guarantee that submissions from the same context will be executed in order.

This is okay, as the reprojection work doesn't have dependencies on the game context, and it
even happens in a separate process.

> BTW: (a) Do you want to "preempt" and later resume, or do you want to "preempt" and
> "cancel/abort"?

Preempt the game with the compositor task and then resume it.

> (b) Vulkan is a generic API and could be used for graphics as well as
> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use Vulkan compute. But if you figure out a way for us to get
a guaranteed execution time using Vulkan graphics, then I'll take you out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
to the high-priority queue when it is in use and "free" them later
(we do not want to take CUs away from e.g. a graphics task forever and degrade
graphics performance).

Otherwise we could have a scenario where a long graphics task (or low-priority
compute) takes all (extra) CUs and the high-priority work waits for the needed
resources. It will not be visible with "NOP" packets but only when you submit a
"real" compute task, so I would recommend not using "NOP" packets at all for
testing.

It (CU assignment) could be done relatively easily when everything goes via the
kernel (e.g. as part of frame submission) but I must admit that I am not sure
about the best way for user level submissions (amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler", when deciding which
queue to run, will check if there are enough resources, and if not then it will begin
to check other queues with lower priority.
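
For reference, the per-queue CU mask could be programmed much like the HQD
priority registers discussed earlier (a sketch; whether these registers can be
safely poked for a live queue, and the example mask value, are assumptions on
my part):

    /* Sketch: restrict this queue's CU mask so that a set of CUs stays
     * free for the high priority queue. 0x0000ffff is an arbitrary
     * example value. */
    mutex_lock(&adev->srbm_mutex);
    vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
    WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE0, 0x0000ffff);
    WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE1, 0x0000ffff);
    vi_srbm_select(adev, 0, 0, 0, 0);
    mutex_unlock(&adev->srbm_mutex);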

2) I would recommend dedicating the whole pipe to the high-priority queue and
having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition
of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I understand (by simplifying)
some scheduling is per pipe.  I know about the current allocation scheme but I do not think
that it is ideal.  I would assume that we need to switch to dynamic partitioning
of resources based on the workload, otherwise we will have resource conflicts
between Vulkan compute and OpenCL.


BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
involved.  I would assume that in the case of VR we will have one main
application ("console" mode(?)) so we could temporarily "ignore"
OpenCL/ROCm needs when VR is running.

> we will not be able to provide a solution compatible with GFX workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling in
something else using mid-buffer pre-emption has some cases where it doesn't work well. But if it
starts working well with Polaris10, it might be a better solution for us (because the whole reprojection
work uses the Vulkan graphics stack at the moment, and porting it to compute is not trivial).

[Serguei]  The problem with pre-emption of a graphics task:  (a) it may take time, so
latency may suffer (b) to preempt we need to have a different "context" - we want
to guarantee that submissions from the same context will be executed in order.
BTW: (a) Do you want to "preempt" and later resume, or do you want to "preempt" and
"cancel/abort"?  (b) Vulkan is a generic API and could be used
for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).


Sincerely yours,
Serguei Sagalovitch



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                     ` <SN1PR12MB0703173C7AD623F6C5AECE7DFE9F0-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-12-17 22:05                       ` Pierre-Loup A. Griffais
       [not found]                         ` <bd0ba668-3d13-6343-a1c6-de5d0b7b3be3-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Pierre-Loup A. Griffais @ 2016-12-17 22:05 UTC (permalink / raw)
  To: Sagalovitch, Serguei, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Serguei,

I'm also working on bringing up our VR runtime on top of amdgpu; see
replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
> Andres,
>
>>  For current VR workloads we have 3 separate processes running actually:
> So we could have a potential memory overcommit case, or do you do partitioning
> on your own?  I would think that there is a need to avoid overcommit in the VR
> case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, and we're working on
prioritized GPU scheduling and pre-emption (e.g. this thread). In the
future it will make sense to do work to make sure that its
memory allocations do not get evicted, to prevent any unwelcome
additional latency in the event of needing to perform just-in-time
reprojection.

> BTW: Do you mean __real__ processes or threads?
> Based on my understanding, sharing BOs between different processes
> could introduce additional synchronization constraints.  BTW: I am not sure
> if we are able to share Vulkan sync objects across the process boundary.

They are different processes; it is important for the compositor that is 
responsible for quality-of-service features such as consistently 
presenting distorted frames with the right latency, reprojection, etc, 
to be separate from the main application.

Currently we are using unreleased cross-process memory and semaphore 
extensions to fetch updated eye images from the client application, but 
the just-in-time reprojection discussed here does not actually have any 
direct interactions with cross-process resource sharing, since it's 
achieved by using whatever is the latest, most up-to-date eye images 
that have already been sent by the client application, which are already 
available to use without additional synchronization.

>
>>    3) System compositor (we are looking at approaches to remove this overhead)
> Yes, IMHO the best is to run in "full screen mode".

Yes, we are working on mechanisms to present directly to the headset 
display without any intermediaries as a separate effort.

>
>>  The latency is our main concern,
> I would assume that this is a known problem (at least for compute usage).
> It looks like amdgpu / kernel submission is rather CPU intensive (at least
> in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue. However, if
there are high degrees of variance then that would be troublesome and we
would need to account for the worst case.

Hopefully the requirements and approach we described make sense, we're 
looking forward to your feedback and suggestions.

Thanks!
  - Pierre-Loup


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                         ` <bd0ba668-3d13-6343-a1c6-de5d0b7b3be3-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
@ 2016-12-19  3:26                           ` zhoucm1
       [not found]                             ` <58575362.2030100-5C7GfCeVMHo@public.gmane.org>
  2016-12-19 14:37                           ` Serguei Sagalovitch
  1 sibling, 1 reply; 36+ messages in thread
From: zhoucm1 @ 2016-12-19  3:26 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Sagalovitch, Serguei, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

By the way, are you using the all-open driver or the amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
> Hi Serguei,
>
> I'm also working on the bringing up our VR runtime on top of amgpu; 
> see replies inline.
>
> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>> Andres,
>>
>>>  For current VR workloads we have 3 separate processes running 
>>> actually:
>> So we could have potential memory overcommit case or do you do 
>> partitioning
>> on your own?  I would think that there is need to avoid overcomit in 
>> VR case to
>> prevent any BO migration.
>
> You're entirely correct; currently the VR runtime is setting up 
> prioritized CPU scheduling for its VR compositor, we're working on 
> prioritized GPU scheduling and pre-emption (eg. this thread), and in 
> the future it will make sense to do work in order to make sure that 
> its memory allocations do not get evicted, to prevent any unwelcome 
> additional latency in the event of needing to perform just-in-time 
> reprojection.
>
>> BTW: Do you mean __real__ processes or threads?
>> Based on my understanding sharing BOs between different processes
>> could introduce additional synchronization constrains.  btw: I am not 
>> sure
>> if we are able to share Vulkan sync. object cross-process boundary.
>
> They are different processes; it is important for the compositor that 
> is responsible for quality-of-service features such as consistently 
> presenting distorted frames with the right latency, reprojection, etc, 
> to be separate from the main application.
>
> Currently we are using unreleased cross-process memory and semaphore 
> extensions to fetch updated eye images from the client application, 
> but the just-in-time reprojection discussed here does not actually 
> have any direct interactions with cross-process resource sharing, 
> since it's achieved by using whatever is the latest, most up-to-date 
> eye images that have already been sent by the client application, 
> which are already available to use without additional synchronization.
>
>>
>>>    3) System compositor (we are looking at approaches to remove this 
>>> overhead)
>> Yes,  IMHO the best is to run in  "full screen mode".
>
> Yes, we are working on mechanisms to present directly to the headset 
> display without any intermediaries as a separate effort.
>
>>
>>>  The latency is our main concern,
>> I would assume that this is the known problem (at least for compute 
>> usage).
>> It looks like that amdgpu / kernel submission is rather CPU intensive 
>> (at least
>> in the default configuration).
>
> As long as it's a consistent cost, it shouldn't an issue. However, if 
> there's high degrees of variance then that would be troublesome and we 
> would need to account for the worst case.
>
> Hopefully the requirements and approach we described make sense, we're 
> looking forward to your feedback and suggestions.
>
> Thanks!
>  - Pierre-Loup
>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 10:00 PM
>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hey Serguei,
>>
>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I 
>>> understand (by simplifying)
>>> some scheduling is per pipe.  I know about the current allocation 
>>> scheme but I do not think
>>> that it is  ideal.  I would assume that we need  to switch to 
>>> dynamical partition
>>> of resources  based on the workload otherwise we will have resource 
>>> conflict
>>> between Vulkan compute and  OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a 
>> solution that assumes that
>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm 
>> running on the system).
>>
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider 
>> that a separate task, because
>> making it dynamic is not straight forward :P
>>
>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd 
>>> will be not
>>> involved.  I would assume that in the case of VR we will have one main
>>> application ("console" mode(?)) so we could temporally "ignore"
>>> OpenCL/ROCm needs when VR is running.
>>
>> Correct, this is why we want to enable the high priority compute 
>> queue through
>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>
>> For current VR workloads we have 3 separate processes running actually:
>>     1) Game process
>>     2) VR Compositor (this is the process that will require high 
>> priority queue)
>>     3) System compositor (we are looking at approaches to remove this 
>> overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm running 
>> simultaneously, but
>> I would also like to be able to address this case in the future 
>> (cross-pipe priorities).
>>
>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it 
>>> may take time so
>>> latency may suffer
>>
>> The latency is our main concern, we want something that is 
>> predictable. A good
>> illustration of what the reprojection scheduling looks like can be 
>> found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>
>>
>>> (b) to preempt we need to have different "context" - we want
>>> to guarantee that submissions from the same context will be executed 
>>> in order.
>>
>> This is okay, as the reprojection work doesn't have dependencies on 
>> the game context, and it
>> even happens in a separate process.
>>
>>> BTW: (a) Do you want  "preempt" and later resume or do you want 
>>> "preempt" and
>>> "cancel/abort"
>>
>> Preempt the game with the compositor task and then resume it.
>>
>>> (b) Vulkan is generic API and could be used for graphics as well as
>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Yeah, the plan is to use vulkan compute. But if you figure out a way 
>> for us to get
>> a guaranteed execution time using vulkan graphics, then I'll take you 
>> out for a beer :)
>>
>> Regards,
>> Andres
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
>> to high-priority queue  when it will be in use and "free" them later
>> (we  do not want forever take CUs from e.g. graphic task to degrade 
>> graphics
>> performance).
>>
>> Otherwise we could have scenario when long graphics task (or 
>> low-priority
>> compute) will took all (extra) CUs and high--priority will wait for 
>> needed resources.
>> It will not be visible on "NOP " but only when you submit "real" 
>> compute task
>> so I would recommend  not to use "NOP" packets at all for testing.
>>
>> It (CU assignment) could be relatively easy done when everything is 
>> going via kernel
>> (e.g. as part of frame submission) but I must admit that I am not sure
>> about the best way for user level submissions (amdkfd).
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks 
>> for the heads up!
>> Is this similar to the CU masking programming?
>> [Serguei] Yes. To simplify: the problem is that "scheduler" when 
>> deciding which
>> queue to  run will check if there is enough resources and if not then 
>> it will begin
>> to check other queues with lower priority.
>>
>> 2) I would recommend to dedicate the whole pipe to high-priority 
>> queue and have
>> nothing their except it.
>>
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed 
>> to the MEC definition
>> of pipe, which is a grouping of queues). I say this because amdgpu 
>> only has access to 1 pipe,
>> and the rest are statically partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I 
>> understand (by simplifying)
>> some scheduling is per pipe.  I know about the current allocation 
>> scheme but I do not think
>> that it is  ideal.  I would assume that we need  to switch to 
>> dynamical partition
>> of resources  based on the workload otherwise we will have resource 
>> conflict
>> between Vulkan compute and  OpenCL.
>>
>>
>> BTW: Which user level API do you want to use for compute: Vulkan or 
>> OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will 
>> be not
>> involved.  I would assume that in the case of VR we will have one main
>> application ("console" mode(?)) so we could temporally "ignore"
>> OpenCL/ROCm needs when VR is running.
>>
>>>  we will not be able to provide a solution compatible with GFX 
>>> worloads.
>> I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently running 
>> graphics job and scheduling in
>> something else using mid-buffer pre-emption has some cases where it 
>> doesn't work well. But if with
>> polaris10 it starts working well, it might be a better solution for 
>> us (because the whole reprojection
>> work uses the vulkan graphics stack at the moment, and porting it to 
>> compute is not trivial).
>>
>> [Serguei]  The problem with pre-emption of graphics task:  (a) it may 
>> take time so
>> latency may suffer (b) to preempt we need to have different "context" 
>> - we want
>> to guarantee that submissions from the same context will be executed 
>> in order.
>> BTW: (a) Do you want  "preempt" and later resume or do you want 
>> "preempt" and
>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>> for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>>
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of 
>> Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx@lists.freedesktop.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Everyone,
>>
>> This RFC is also available as a gist here:
>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>
>>
>>
>> [RFC] Mechanism for high priority scheduling in amdgpu
>> gist.github.com
>> [RFC] Mechanism for high priority scheduling in amdgpu
>>
>>
>>
>> [RFC] Mechanism for high priority scheduling in amdgpu
>> gist.github.com
>> [RFC] Mechanism for high priority scheduling in amdgpu
>>
>>
>>
>>
>> [RFC] Mechanism for high priority scheduling in amdgpu
>> gist.github.com
>> [RFC] Mechanism for high priority scheduling in amdgpu
>>
>>
>> We are interested in feedback for a mechanism to effectively schedule 
>> high
>> priority VR reprojection tasks (also referred to as time-warping) for 
>> Polaris10
>> running on the amdgpu kernel driver.
>>
>> Brief context:
>> --------------
>>
>> The main objective of reprojection is to avoid motion sickness for VR 
>> users in
>> scenarios where the game or application would fail to finish 
>> rendering a new
>> frame in time for the next VBLANK. When this happens, the user's head 
>> movements
>> are not reflected on the Head Mounted Display (HMD) for the duration 
>> of an
>> extra frame. This extended mismatch between the inner ear and the 
>> eyes may
>> cause the user to experience motion sickness.
>>
>> The VR compositor deals with this problem by fabricating a new frame 
>> using the
>> user's updated head position in combination with the previous frames. 
>> This
>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>
>> Because of the adverse effects on the user, we require high 
>> confidence that the
>> reprojection task will complete before the VBLANK interval. Even if 
>> the GFX pipe
>> is currently full of work from the game/application (which is most 
>> likely the case).
>>
>> For more details and illustrations, please refer to the following 
>> document:
>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>> Requirements:
>> -------------
>>
>> The mechanism must expose the following functionaility:
>>
>>     * Job round trip time must be predictable, from submission to 
>> fence signal
>>
>>     * The mechanism must support compute workloads.
>>
>> Goals:
>> ------
>>
>>     * The mechanism should provide low submission latencies
>>
>> Test: submitting a NOP packet through the mechanism on busy hardware 
>> should
>> be equivalent to submitting a NOP on idle hardware.
>>
>> Nice to have:
>> -------------
>>
>>     * The mechanism should also support GFX workloads.
>>
>> My understanding is that with the current hardware capabilities in 
>> Polaris10 we
>> will not be able to provide a solution compatible with GFX worloads.
>>
>> But I would love to hear otherwise. So if anyone has an idea, 
>> approach or
>> suggestion that will also be compatible with the GFX ring, please let 
>> us know
>> about it.
>>
>>     * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as 
>> users running
>> games are not traditionally running HPC workloads in the background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the windows driver, we could expose a high priority 
>> compute queue to
>> userspace.
>>
>> Submissions to this compute queue will be scheduled with high 
>> priority, and may
>> acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in 
>> the HQDs
>> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
>> register fields are:
>>         * mmCP_HQD_PIPE_PRIORITY
>>         * mmCP_HQD_QUEUE_PRIORITY
>>
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We can
>> statically partition these as follows:
>>         * 7x regular
>>         * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high 
>> priority
>> ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority 
>> rings if the
>> context is marked as high priority. And a corresponding priority 
>> should be
>> added to keep track of this information:
>>      * AMD_SCHED_PRIORITY_KERNEL
>>      * -> AMD_SCHED_PRIORITY_HIGH
>>      * AMD_SCHED_PRIORITY_NORMAL
>>
>> The user will request a high priority context by setting an 
>> appropriate flag
>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>
>>
>> The setting is in a per context level so that we can:
>>     * Maintain a consistent FIFO ordering of all submissions to a 
>> context
>>     * Create high priority and non-high priority contexts in the same 
>> process
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities and
>> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
>> dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the 
>> scheduler to
>> set the appropriate queue priority: set_priority(int ring, int index, 
>> int priority)
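>>
>> Roughly (whether this belongs in amd_sched_backend_ops is exactly the
>> open question, so treat this as a sketch):
>>
>>     /* New hardware-specific backend op: */
>>     void (*set_priority)(int ring, int index, int priority);
>>
>>     /* Invoked by the SW scheduler before it emits a job: */
>>     if (sched->ops->set_priority)
>>             sched->ops->set_priority(ring_idx, queue_idx, job_priority);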
>>
>> During this callback we would have to grab the SRBM mutex to perform 
>> the appropriate
>> HW programming, and I'm not really sure if that is something we 
>> should be doing from
>> the scheduler.
>>
>> On the positive side, this approach would allow us to program a range of
>> priorities for jobs instead of a single "high priority" value,
>> achieving
>> something similar to the niceness API available for CPU scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for 
>> our use
>> case, but it might be useful in other scenarios (multiple users 
>> sharing compute
>> time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or 
>> repurposing
>> of the flags field.
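>>
>> E.g. a hypothetical layout (the existing struct carries a _pad member
>> that could be repurposed):
>>
>>     struct drm_amdgpu_ctx_in {
>>             __u32   op;
>>             __u32   flags;
>>             __u32   ctx_id;
>>             __s32   priority;  /* proposed: nice-style value, was _pad */
>>     };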
>>
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and 
>> instead it picks
>> jobs at random. Settings from the shader itself are also disregarded 
>> as this is
>> considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we 
>> might not get the
>> time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority 
>> propagation
>> from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled 
>> for all HW IPs
>> with support of the SW scheduler. This will function similarly to the 
>> current
>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of 
>> anything not
>> committed to the HW queue.
>>
>> The benefits of requesting a high priority context for a non-compute 
>> queue will
>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in 
>> front of
>> you), but having the API in place will allow us to easily improve the 
>> implementation
>> in the future as new features become available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about
>> exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal 
>> is to
>> obtain a mechanism that will allow us to complete the reprojection 
>> job within a
>> predictable amount of time. So if anyone has any suggestions for
>> improvements or alternative strategies we are more than happy to hear 
>> them.
>>
>> If any of the technical information above is also incorrect, feel 
>> free to point
>> out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                             ` <58575362.2030100-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-19  3:33                               ` Pierre-Loup A. Griffais
       [not found]                                 ` <361f177c-bf55-1525-4f35-86708e4f8d9f-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Pierre-Loup A. Griffais @ 2016-12-19  3:33 UTC (permalink / raw)
  To: zhoucm1, Sagalovitch, Serguei, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

We're currently working with the open stack; I assume that a mechanism 
could be exposed by both open and Pro Vulkan userspace drivers and that 
the amdgpu kernel interface improvements we would pursue following this 
discussion would let both drivers take advantage of the feature, correct?

On 12/18/2016 07:26 PM, zhoucm1 wrote:
> By the way, are you using all-open driver or amdgpu-pro driver?
>
> +David Mao, who is working on our Vulkan driver.
>
> Regards,
> David Zhou
>
> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>> Hi Serguei,
>>
>> I'm also working on bringing up our VR runtime on top of amdgpu;
>> see replies inline.
>>
>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>> Andres,
>>>
>>>>  For current VR workloads we have 3 separate processes running
>>>> actually:
>>> So we could have a potential memory overcommit case, or do you do
>>> partitioning
>>> on your own?  I would think that there is a need to avoid overcommit in
>>> VR case to
>>> prevent any BO migration.
>>
>> You're entirely correct; currently the VR runtime is setting up
>> prioritized CPU scheduling for its VR compositor, we're working on
>> prioritized GPU scheduling and pre-emption (eg. this thread), and in
>> the future it will make sense to do work in order to make sure that
>> its memory allocations do not get evicted, to prevent any unwelcome
>> additional latency in the event of needing to perform just-in-time
>> reprojection.
>>
>>> BTW: Do you mean __real__ processes or threads?
>>> Based on my understanding sharing BOs between different processes
>>> could introduce additional synchronization constraints.  btw: I am not
>>> sure
>>> if we are able to share Vulkan sync. object cross-process boundary.
>>
>> They are different processes; it is important for the compositor that
>> is responsible for quality-of-service features such as consistently
>> presenting distorted frames with the right latency, reprojection, etc,
>> to be separate from the main application.
>>
>> Currently we are using unreleased cross-process memory and semaphore
>> extensions to fetch updated eye images from the client application,
>> but the just-in-time reprojection discussed here does not actually
>> have any direct interactions with cross-process resource sharing,
>> since it's achieved by using whatever is the latest, most up-to-date
>> eye images that have already been sent by the client application,
>> which are already available to use without additional synchronization.
>>
>>>
>>>>    3) System compositor (we are looking at approaches to remove this
>>>> overhead)
>>> Yes,  IMHO the best is to run in  "full screen mode".
>>
>> Yes, we are working on mechanisms to present directly to the headset
>> display without any intermediaries as a separate effort.
>>
>>>
>>>>  The latency is our main concern,
>>> I would assume that this is the known problem (at least for compute
>>> usage).
>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>> (at least
>>> in the default configuration).
>>
>> As long as it's a consistent cost, it shouldn't be an issue. However, if
>> there's a high degree of variance then that would be troublesome and we
>> would need to account for the worst case.
>>
>> Hopefully the requirements and approach we described make sense, we're
>> looking forward to your feedback and suggestions.
>>
>> Thanks!
>>  - Pierre-Loup
>>
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>> Sent: December 16, 2016 10:00 PM
>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>
>>> Hey Serguei,
>>>
>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>> understand (by simplifying)
>>>> some scheduling is per pipe.  I know about the current allocation
>>>> scheme but I do not think
>>>> that it is  ideal.  I would assume that we need  to switch to
>>>> dynamical partition
>>>> of resources  based on the workload otherwise we will have resource
>>>> conflict
>>>> between Vulkan compute and  OpenCL.
>>>
>>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>>> solution that assumes that
>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>> running on the system).
>>>
>>> This should be more or less the use case we expect from VR users.
>>>
>>> I agree the split is currently not ideal, but I'd like to consider
>>> that a separate task, because
>>> making it dynamic is not straightforward :P
>>>
>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>> will not be
>>>> involved.  I would assume that in the case of VR we will have one main
>>>> application ("console" mode(?)) so we could temporally "ignore"
>>>> OpenCL/ROCm needs when VR is running.
>>>
>>> Correct, this is why we want to enable the high priority compute
>>> queue through
>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>
>>> For current VR workloads we have 3 separate processes running actually:
>>>     1) Game process
>>>     2) VR Compositor (this is the process that will require high
>>> priority queue)
>>>     3) System compositor (we are looking at approaches to remove this
>>> overhead)
>>>
>>> For now I think it is okay to assume no OpenCL/ROCm running
>>> simultaneously, but
>>> I would also like to be able to address this case in the future
>>> (cross-pipe priorities).
>>>
>>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it
>>>> may take time so
>>>> latency may suffer
>>>
>>> The latency is our main concern, we want something that is
>>> predictable. A good
>>> illustration of what the reprojection scheduling looks like can be
>>> found here:
>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>
>>>
>>>> (b) to preempt we need to have different "context" - we want
>>>> to guarantee that submissions from the same context will be executed
>>>> in order.
>>>
>>> This is okay, as the reprojection work doesn't have dependencies on
>>> the game context, and it
>>> even happens in a separate process.
>>>
>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>> "preempt" and
>>>> "cancel/abort"
>>>
>>> Preempt the game with the compositor task and then resume it.
>>>
>>>> (b) Vulkan is generic API and could be used for graphics as well as
>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>
>>> Yeah, the plan is to use vulkan compute. But if you figure out a way
>>> for us to get
>>> a guaranteed execution time using vulkan graphics, then I'll take you
>>> out for a beer :)
>>>
>>> Regards,
>>> Andres
>>> ________________________________________
>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>> Sent: Friday, December 16, 2016 9:13 PM
>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>
>>> Hi Andres,
>>>
>>> Please see inline (as [Serguei])
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>> Sent: December 16, 2016 8:29 PM
>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>
>>> Hi Serguei,
>>>
>>> Thanks for the feedback. Answers inline as [AR].
>>>
>>> Regards,
>>> Andres
>>>
>>> ________________________________________
>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>> Sent: Friday, December 16, 2016 8:15 PM
>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>
>>> Andres,
>>>
>>>
>>> Quick comments:
>>>
>>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
>>> to the high-priority queue when it is in use and "free" them later
>>> (we do not want to take CUs away from e.g. a graphics task forever
>>> and degrade graphics
>>> performance).
>>>
>>> Otherwise we could have a scenario where a long graphics task (or
>>> low-priority
>>> compute) takes all (extra) CUs and high-priority work will wait for
>>> needed resources.
>>> It will not be visible with "NOP" but only when you submit a "real"
>>> compute task
>>> so I would recommend  not to use "NOP" packets at all for testing.
>>>
>>> It (CU assignment) could be done relatively easily when everything is
>>> going via kernel
>>> (e.g. as part of frame submission) but I must admit that I am not sure
>>> about the best way for user level submissions (amdkfd).
>>>
>>> [AR] I wasn't aware of this part of the programming sequence. Thanks
>>> for the heads up!
>>> Is this similar to the CU masking programming?
>>> [Serguei] Yes. To simplify: the problem is that "scheduler" when
>>> deciding which
>>> queue to run will check if there are enough resources and if not then
>>> it will begin
>>> to check other queues with lower priority.
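>>>
>>> For illustration, a sketch of what the masking itself looks like at the
>>> register level (register names are from the gfx headers; the values and
>>> the hook point are made up):
>>>
>>>     /* Reserve CUs for the high priority queue while it has work... */
>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE0, reserved_cu_mask);
>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE1, reserved_cu_mask);
>>>     /* ...and restore the full mask once it goes idle. */
>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE0, 0xffffffff);
>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE1, 0xffffffff);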
>>>
>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>> queue and having
>>> nothing there except it.
>>>
>>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
>>> to the MEC definition
>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>> only has access to 1 pipe,
>>> and the rest are statically partitioned for amdkfd usage.
>>>
>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>> understand (by simplifying)
>>> some scheduling is per pipe.  I know about the current allocation
>>> scheme but I do not think
>>> that it is  ideal.  I would assume that we need  to switch to
>>> dynamical partition
>>> of resources  based on the workload otherwise we will have resource
>>> conflict
>>> between Vulkan compute and  OpenCL.
>>>
>>>
>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>> OpenCL?
>>>
>>> [AR] Vulkan
>>>
>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
>>> not be
>>> involved.  I would assume that in the case of VR we will have one main
>>> application ("console" mode(?)) so we could temporally "ignore"
>>> OpenCL/ROCm needs when VR is running.
>>>
>>>>  we will not be able to provide a solution compatible with GFX
>>>> workloads.
>>> I assume that you are talking about graphics? Am I right?
>>>
>>> [AR] Yeah, my understanding is that pre-empting the currently running
>>> graphics job and scheduling in
>>> something else using mid-buffer pre-emption has some cases where it
>>> doesn't work well. But if with
>>> polaris10 it starts working well, it might be a better solution for
>>> us (because the whole reprojection
>>> work uses the vulkan graphics stack at the moment, and porting it to
>>> compute is not trivial).
>>>
>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it may
>>> take time so
>>> latency may suffer (b) to preempt we need to have different "context"
>>> - we want
>>> to guarantee that submissions from the same context will be executed
>>> in order.
>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>> "preempt" and
>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>> for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>>
>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>> Andres Rodriguez <andresr@valvesoftware.com>
>>> Sent: December 16, 2016 6:15 PM
>>> To: amd-gfx@lists.freedesktop.org
>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>
>>> Hi Everyone,
>>>
>>> This RFC is also available as a gist here:
>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>
>>> We are interested in feedback for a mechanism to effectively schedule
>>> high
>>> priority VR reprojection tasks (also referred to as time-warping) for
>>> Polaris10
>>> running on the amdgpu kernel driver.
>>>
>>> Brief context:
>>> --------------
>>>
>>> The main objective of reprojection is to avoid motion sickness for VR
>>> users in
>>> scenarios where the game or application would fail to finish
>>> rendering a new
>>> frame in time for the next VBLANK. When this happens, the user's head
>>> movements
>>> are not reflected on the Head Mounted Display (HMD) for the duration
>>> of an
>>> extra frame. This extended mismatch between the inner ear and the
>>> eyes may
>>> cause the user to experience motion sickness.
>>>
>>> The VR compositor deals with this problem by fabricating a new frame
>>> using the
>>> user's updated head position in combination with the previous frames.
>>> This
>>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>>
>>> Because of the adverse effects on the user, we require high
>>> confidence that the
>>> reprojection task will complete before the VBLANK interval. Even if
>>> the GFX pipe
>>> is currently full of work from the game/application (which is most
>>> likely the case).
>>>
>>> For more details and illustrations, please refer to the following
>>> document:
>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>
>>> Requirements:
>>> -------------
>>>
>>> The mechanism must expose the following functionality:
>>>
>>>     * Job round trip time must be predictable, from submission to
>>> fence signal
>>>
>>>     * The mechanism must support compute workloads.
>>>
>>> Goals:
>>> ------
>>>
>>>     * The mechanism should provide low submission latencies
>>>
>>> Test: submitting a NOP packet through the mechanism on busy hardware
>>> should
>>> be equivalent to submitting a NOP on idle hardware.
>>>
>>> Nice to have:
>>> -------------
>>>
>>>     * The mechanism should also support GFX workloads.
>>>
>>> My understanding is that with the current hardware capabilities in
>>> Polaris10 we
>>> will not be able to provide a solution compatible with GFX workloads.
>>>
>>> But I would love to hear otherwise. So if anyone has an idea,
>>> approach or
>>> suggestion that will also be compatible with the GFX ring, please let
>>> us know
>>> about it.
>>>
>>>     * The above guarantees should also be respected by amdkfd workloads
>>>
>>> Would be good to have for consistency, but not strictly necessary as
>>> users running
>>> games are not traditionally running HPC workloads in the background.
>>>
>>> Proposed approach:
>>> ------------------
>>>
>>> Similar to the windows driver, we could expose a high priority
>>> compute queue to
>>> userspace.
>>>
>>> Submissions to this compute queue will be scheduled with high
>>> priority, and may
>>> acquire hardware resources previously in use by other queues.
>>>
>>> This can be achieved by taking advantage of the 'priority' field in
>>> the HQDs
>>> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
>>> register fields are:
>>>         * mmCP_HQD_PIPE_PRIORITY
>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>
>>> Implementation approach 1 - static partitioning:
>>> ------------------------------------------------
>>>
>>> The amdgpu driver currently controls 8 compute queues from pipe0. We can
>>> statically partition these as follows:
>>>         * 7x regular
>>>         * 1x high priority
>>>
>>> The relevant priorities can be set so that submissions to the high
>>> priority
>>> ring will starve the other compute rings and the GFX ring.
>>>
>>> The amdgpu scheduler will only place jobs into the high priority
>>> rings if the
>>> context is marked as high priority. And a corresponding priority
>>> should be
>>> added to keep track of this information:
>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>
>>> The user will request a high priority context by setting an
>>> appropriate flag
>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>
>>>
>>> The setting is at a per-context level so that we can:
>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>> context
>>>     * Create high priority and non-high priority contexts in the same
>>> process
>>>
>>> Implementation approach 2 - dynamic priority programming:
>>> ---------------------------------------------------------
>>>
>>> Similar to the above, but instead of programming the priorities at
>>> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
>>> dynamically when scheduling a task.
>>>
>>> This would involve having a hardware specific callback from the
>>> scheduler to
>>> set the appropriate queue priority: set_priority(int ring, int index,
>>> int priority)
>>>
>>> During this callback we would have to grab the SRBM mutex to perform
>>> the appropriate
>>> HW programming, and I'm not really sure if that is something we
>>> should be doing from
>>> the scheduler.
>>>
>>> On the positive side, this approach would allow us to program a range of
>>> priorities for jobs instead of a single "high priority" value,
>>> achieving
>>> something similar to the niceness API available for CPU scheduling.
>>>
>>> I'm not sure if this flexibility is something that we would need for
>>> our use
>>> case, but it might be useful in other scenarios (multiple users
>>> sharing compute
>>> time on a server).
>>>
>>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>>> repurposing
>>> of the flags field.
>>>
>>> Known current obstacles:
>>> ------------------------
>>>
>>> The SQ is currently programmed to disregard the HQD priorities, and
>>> instead it picks
>>> jobs at random. Settings from the shader itself are also disregarded
>>> as this is
>>> considered a privileged field.
>>>
>>> Effectively we can get our compute wavefront launched ASAP, but we
>>> might not get the
>>> time we need on the SQ.
>>>
>>> The current programming would have to be changed to allow priority
>>> propagation
>>> from the HQD into the SQ.
>>>
>>> Generic approach for all HW IPs:
>>> --------------------------------
>>>
>>> For consistency purposes, the high priority context can be enabled
>>> for all HW IPs
>>> with support of the SW scheduler. This will function similarly to the
>>> current
>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
>>> anything not
>>> committed to the HW queue.
>>>
>>> The benefits of requesting a high priority context for a non-compute
>>> queue will
>>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in
>>> front of
>>> you), but having the API in place will allow us to easily improve the
>>> implementation
>>> in the future as new features become available in new hardware.
>>>
>>> Future steps:
>>> -------------
>>>
>>> Once we have an approach settled, I can take care of the implementation.
>>>
>>> Also, once the interface is mostly decided, we can start thinking about
>>> exposing the high priority queue through radv.
>>>
>>> Request for feedback:
>>> ---------------------
>>>
>>> We aren't married to any of the approaches outlined above. Our goal
>>> is to
>>> obtain a mechanism that will allow us to complete the reprojection
>>> job within a
>>> predictable amount of time. So if anyone has any suggestions for
>>> improvements or alternative strategies we are more than happy to hear
>>> them.
>>>
>>> If any of the technical information above is also incorrect, feel
>>> free to point
>>> out my misunderstandings.
>>>
>>> Looking forward to hearing from you.
>>>
>>> Regards,
>>> Andres
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                 ` <361f177c-bf55-1525-4f35-86708e4f8d9f-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
@ 2016-12-19  5:11                                   ` zhoucm1
       [not found]                                     ` <58576C15.1070909-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: zhoucm1 @ 2016-12-19  5:11 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Sagalovitch, Serguei, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin



On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
> We're currently working with the open stack; I assume that a mechanism 
> could be exposed by both open and Pro Vulkan userspace drivers and 
> that the amdgpu kernel interface improvements we would pursue 
> following this discussion would let both drivers take advantage of the 
> feature, correct?
Of course.
Does the open stack have Vulkan support?

Regards,
David Zhou
>
> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>> By the way, are you using all-open driver or amdgpu-pro driver?
>>
>> +David Mao, who is working on our Vulkan driver.
>>
>> Regards,
>> David Zhou
>>
>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>> Hi Serguei,
>>>
>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>> see replies inline.
>>>
>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>> Andres,
>>>>
>>>>>  For current VR workloads we have 3 separate processes running
>>>>> actually:
>>>> So we could have a potential memory overcommit case, or do you do
>>>> partitioning
>>>> on your own?  I would think that there is a need to avoid overcommit in
>>>> VR case to
>>>> prevent any BO migration.
>>>
>>> You're entirely correct; currently the VR runtime is setting up
>>> prioritized CPU scheduling for its VR compositor, we're working on
>>> prioritized GPU scheduling and pre-emption (eg. this thread), and in
>>> the future it will make sense to do work in order to make sure that
>>> its memory allocations do not get evicted, to prevent any unwelcome
>>> additional latency in the event of needing to perform just-in-time
>>> reprojection.
>>>
>>>> BTW: Do you mean __real__ processes or threads?
>>>> Based on my understanding sharing BOs between different processes
>>>> could introduce additional synchronization constraints. btw: I am not
>>>> sure
>>>> if we are able to share Vulkan sync. object cross-process boundary.
>>>
>>> They are different processes; it is important for the compositor that
>>> is responsible for quality-of-service features such as consistently
>>> presenting distorted frames with the right latency, reprojection, etc,
>>> to be separate from the main application.
>>>
>>> Currently we are using unreleased cross-process memory and semaphore
>>> extensions to fetch updated eye images from the client application,
>>> but the just-in-time reprojection discussed here does not actually
>>> have any direct interactions with cross-process resource sharing,
>>> since it's achieved by using whatever is the latest, most up-to-date
>>> eye images that have already been sent by the client application,
>>> which are already available to use without additional synchronization.
>>>
>>>>
>>>>>    3) System compositor (we are looking at approaches to remove this
>>>>> overhead)
>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>
>>> Yes, we are working on mechanisms to present directly to the headset
>>> display without any intermediaries as a separate effort.
>>>
>>>>
>>>>>  The latency is our main concern,
>>>> I would assume that this is the known problem (at least for compute
>>>> usage).
>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>> (at least
>>>> in the default configuration).
>>>
>>> As long as it's a consistent cost, it shouldn't be an issue. However, if
>>> there's a high degree of variance then that would be troublesome and we
>>> would need to account for the worst case.
>>>
>>> Hopefully the requirements and approach we described make sense, we're
>>> looking forward to your feedback and suggestions.
>>>
>>> Thanks!
>>>  - Pierre-Loup
>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>> Sent: December 16, 2016 10:00 PM
>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hey Serguei,
>>>>
>>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>>> understand (by simplifying)
>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>> scheme but I do not think
>>>>> that it is  ideal.  I would assume that we need  to switch to
>>>>> dynamical partition
>>>>> of resources  based on the workload otherwise we will have resource
>>>>> conflict
>>>>> between Vulkan compute and  OpenCL.
>>>>
>>>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>>>> solution that assumes that
>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>> running on the system).
>>>>
>>>> This should be more or less the use case we expect from VR users.
>>>>
>>>> I agree the split is currently not ideal, but I'd like to consider
>>>> that a separate task, because
>>>> making it dynamic is not straightforward :P
>>>>
>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>> will not be
>>>>> involved.  I would assume that in the case of VR we will have one 
>>>>> main
>>>>> application ("console" mode(?)) so we could temporally "ignore"
>>>>> OpenCL/ROCm needs when VR is running.
>>>>
>>>> Correct, this is why we want to enable the high priority compute
>>>> queue through
>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>
>>>> For current VR workloads we have 3 separate processes running 
>>>> actually:
>>>>     1) Game process
>>>>     2) VR Compositor (this is the process that will require high
>>>> priority queue)
>>>>     3) System compositor (we are looking at approaches to remove this
>>>> overhead)
>>>>
>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>> simultaneously, but
>>>> I would also like to be able to address this case in the future
>>>> (cross-pipe priorities).
>>>>
>>>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it
>>>>> may take time so
>>>>> latency may suffer
>>>>
>>>> The latency is our main concern, we want something that is
>>>> predictable. A good
>>>> illustration of what the reprojection scheduling looks like can be
>>>> found here:
>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>
>>>>
>>>>
>>>>> (b) to preempt we need to have different "context" - we want
>>>>> to guarantee that submissions from the same context will be executed
>>>>> in order.
>>>>
>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>> the game context, and it
>>>> even happens in a separate process.
>>>>
>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>> "preempt" and
>>>>> "cancel/abort"
>>>>
>>>> Preempt the game with the compositor task and then resume it.
>>>>
>>>>> (b) Vulkan is generic API and could be used for graphics as well as
>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>> Yeah, the plan is to use vulkan compute. But if you figure out a way
>>>> for us to get
>>>> a guaranteed execution time using vulkan graphics, then I'll take you
>>>> out for a beer :)
>>>>
>>>> Regards,
>>>> Andres
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Andres,
>>>>
>>>> Please see inline (as [Serguei])
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>> Sent: December 16, 2016 8:29 PM
>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Serguei,
>>>>
>>>> Thanks for the feedback. Answers inline as [AR].
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Andres,
>>>>
>>>>
>>>> Quick comments:
>>>>
>>>> 1) To minimize "bubbles", etc. we need to "force" CU 
>>>> assignments/binding
>>>> to the high-priority queue when it is in use and "free" them later
>>>> (we do not want to take CUs away from e.g. a graphics task forever
>>>> and degrade graphics
>>>> performance).
>>>>
>>>> Otherwise we could have a scenario where a long graphics task (or
>>>> low-priority
>>>> compute) takes all (extra) CUs and high-priority work will wait for
>>>> needed resources.
>>>> It will not be visible with "NOP" but only when you submit a "real"
>>>> compute task
>>>> so I would recommend  not to use "NOP" packets at all for testing.
>>>>
>>>> It (CU assignment) could be done relatively easily when everything is
>>>> going via kernel
>>>> (e.g. as part of frame submission) but I must admit that I am not sure
>>>> about the best way for user level submissions (amdkfd).
>>>>
>>>> [AR] I wasn't aware of this part of the programming sequence. Thanks
>>>> for the heads up!
>>>> Is this similar to the CU masking programming?
>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" when
>>>> deciding which
>>>> queue to run will check if there are enough resources and if not then
>>>> it will begin
>>>> to check other queues with lower priority.
>>>>
>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>> queue and having
>>>> nothing there except it.
>>>>
>>>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
>>>> to the MEC definition
>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>> only has access to 1 pipe,
>>>> and the rest are statically partitioned for amdkfd usage.
>>>>
>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>> understand (by simplifying)
>>>> some scheduling is per pipe.  I know about the current allocation
>>>> scheme but I do not think
>>>> that it is  ideal.  I would assume that we need  to switch to
>>>> dynamical partition
>>>> of resources  based on the workload otherwise we will have resource
>>>> conflict
>>>> between Vulkan compute and  OpenCL.
>>>>
>>>>
>>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>>> OpenCL?
>>>>
>>>> [AR] Vulkan
>>>>
>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
>>>> not be
>>>> involved.  I would assume that in the case of VR we will have one main
>>>> application ("console" mode(?)) so we could temporally "ignore"
>>>> OpenCL/ROCm needs when VR is running.
>>>>
>>>>>  we will not be able to provide a solution compatible with GFX
>>>>> workloads.
>>>> I assume that you are talking about graphics? Am I right?
>>>>
>>>> [AR] Yeah, my understanding is that pre-empting the currently running
>>>> graphics job and scheduling in
>>>> something else using mid-buffer pre-emption has some cases where it
>>>> doesn't work well. But if with
>>>> polaris10 it starts working well, it might be a better solution for
>>>> us (because the whole reprojection
>>>> work uses the vulkan graphics stack at the moment, and porting it to
>>>> compute is not trivial).
>>>>
>>>> [Serguei]  The problem with pre-emption of graphics task: (a) it may
>>>> take time so
>>>> latency may suffer (b) to preempt we need to have different "context"
>>>> - we want
>>>> to guarantee that submissions from the same context will be executed
>>>> in order.
>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>> "preempt" and
>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>> for graphics as well as for plain compute tasks 
>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>>
>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>> Sent: December 16, 2016 6:15 PM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Everyone,
>>>>
>>>> This RFC is also available as a gist here:
>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>
>>>> We are interested in feedback for a mechanism to effectively schedule
>>>> high
>>>> priority VR reprojection tasks (also referred to as time-warping) for
>>>> Polaris10
>>>> running on the amdgpu kernel driver.
>>>>
>>>> Brief context:
>>>> --------------
>>>>
>>>> The main objective of reprojection is to avoid motion sickness for VR
>>>> users in
>>>> scenarios where the game or application would fail to finish
>>>> rendering a new
>>>> frame in time for the next VBLANK. When this happens, the user's head
>>>> movements
>>>> are not reflected on the Head Mounted Display (HMD) for the duration
>>>> of an
>>>> extra frame. This extended mismatch between the inner ear and the
>>>> eyes may
>>>> cause the user to experience motion sickness.
>>>>
>>>> The VR compositor deals with this problem by fabricating a new frame
>>>> using the
>>>> user's updated head position in combination with the previous frames.
>>>> This
>>>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>>>
>>>> Because of the adverse effects on the user, we require high
>>>> confidence that the
>>>> reprojection task will complete before the VBLANK interval. Even if
>>>> the GFX pipe
>>>> is currently full of work from the game/application (which is most
>>>> likely the case).
>>>>
>>>> For more details and illustrations, please refer to the following
>>>> document:
>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>
>>>> Requirements:
>>>> -------------
>>>>
>>>> The mechanism must expose the following functionality:
>>>>
>>>>     * Job round trip time must be predictable, from submission to
>>>> fence signal
>>>>
>>>>     * The mechanism must support compute workloads.
>>>>
>>>> Goals:
>>>> ------
>>>>
>>>>     * The mechanism should provide low submission latencies
>>>>
>>>> Test: submitting a NOP packet through the mechanism on busy hardware
>>>> should
>>>> be equivalent to submitting a NOP on idle hardware.
>>>>
>>>> Nice to have:
>>>> -------------
>>>>
>>>>     * The mechanism should also support GFX workloads.
>>>>
>>>> My understanding is that with the current hardware capabilities in
>>>> Polaris10 we
>>>> will not be able to provide a solution compatible with GFX workloads.
>>>>
>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>> approach or
>>>> suggestion that will also be compatible with the GFX ring, please let
>>>> us know
>>>> about it.
>>>>
>>>>     * The above guarantees should also be respected by amdkfd 
>>>> workloads
>>>>
>>>> Would be good to have for consistency, but not strictly necessary as
>>>> users running
>>>> games are not traditionally running HPC workloads in the background.
>>>>
>>>> Proposed approach:
>>>> ------------------
>>>>
>>>> Similar to the windows driver, we could expose a high priority
>>>> compute queue to
>>>> userspace.
>>>>
>>>> Submissions to this compute queue will be scheduled with high
>>>> priority, and may
>>>> acquire hardware resources previously in use by other queues.
>>>>
>>>> This can be achieved by taking advantage of the 'priority' field in
>>>> the HQDs
>>>> and could be programmed by amdgpu or the amdgpu scheduler. The 
>>>> relevant
>>>> register fields are:
>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>
>>>> Implementation approach 1 - static partitioning:
>>>> ------------------------------------------------
>>>>
>>>> The amdgpu driver currently controls 8 compute queues from pipe0. 
>>>> We can
>>>> statically partition these as follows:
>>>>         * 7x regular
>>>>         * 1x high priority
>>>>
>>>> The relevant priorities can be set so that submissions to the high
>>>> priority
>>>> ring will starve the other compute rings and the GFX ring.
>>>>
>>>> The amdgpu scheduler will only place jobs into the high priority
>>>> rings if the
>>>> context is marked as high priority. And a corresponding priority
>>>> should be
>>>> added to keep track of this information:
>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>
>>>> The user will request a high priority context by setting an
>>>> appropriate flag
>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>
>>>>
>>>>
>>>> The setting is at a per-context level so that we can:
>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>> context
>>>>     * Create high priority and non-high priority contexts in the same
>>>> process
>>>>
>>>> Implementation approach 2 - dynamic priority programming:
>>>> ---------------------------------------------------------
>>>>
>>>> Similar to the above, but instead of programming the priorities at
>>>> amdgpu_init() time, the SW scheduler will reprogram the queue 
>>>> priorities
>>>> dynamically when scheduling a task.
>>>>
>>>> This would involve having a hardware specific callback from the
>>>> scheduler to
>>>> set the appropriate queue priority: set_priority(int ring, int index,
>>>> int priority)
>>>>
>>>> During this callback we would have to grab the SRBM mutex to perform
>>>> the appropriate
>>>> HW programming, and I'm not really sure if that is something we
>>>> should be doing from
>>>> the scheduler.
>>>>
>>>> On the positive side, this approach would allow us to program a 
>>>> range of
>>>> priorities for jobs instead of a single "high priority" value,
>>>> achieving
>>>> something similar to the niceness API available for CPU scheduling.
>>>>
>>>> I'm not sure if this flexibility is something that we would need for
>>>> our use
>>>> case, but it might be useful in other scenarios (multiple users
>>>> sharing compute
>>>> time on a server).
>>>>
>>>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>>>> repurposing
>>>> of the flags field.
>>>>
>>>> Known current obstacles:
>>>> ------------------------
>>>>
>>>> The SQ is currently programmed to disregard the HQD priorities, and
>>>> instead it picks
>>>> jobs at random. Settings from the shader itself are also disregarded
>>>> as this is
>>>> considered a privileged field.
>>>>
>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>> might not get the
>>>> time we need on the SQ.
>>>>
>>>> The current programming would have to be changed to allow priority
>>>> propagation
>>>> from the HQD into the SQ.
>>>>
>>>> Generic approach for all HW IPs:
>>>> --------------------------------
>>>>
>>>> For consistency purposes, the high priority context can be enabled
>>>> for all HW IPs
>>>> with support of the SW scheduler. This will function similarly to the
>>>> current
>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
>>>> anything not
>>>> committed to the HW queue.
>>>>
>>>> The benefits of requesting a high priority context for a non-compute
>>>> queue will
>>>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in
>>>> front of
>>>> you), but having the API in place will allow us to easily improve the
>>>> implementation
>>>> in the future as new features become available in new hardware.
>>>>
>>>> Future steps:
>>>> -------------
>>>>
>>>> Once we have an approach settled, I can take care of the 
>>>> implementation.
>>>>
>>>> Also, once the interface is mostly decided, we can start thinking 
>>>> about
>>>> exposing the high priority queue through radv.
>>>>
>>>> Request for feedback:
>>>> ---------------------
>>>>
>>>> We aren't married to any of the approaches outlined above. Our goal
>>>> is to
>>>> obtain a mechanism that will allow us to complete the reprojection
>>>> job within a
>>>> predictable amount of time. So if anyone has any suggestions
>>>> for
>>>> improvements or alternative strategies we are more than happy to hear
>>>> them.
>>>>
>>>> If any of the technical information above is also incorrect, feel
>>>> free to point
>>>> out my misunderstandings.
>>>>
>>>> Looking forward to hearing from you.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                     ` <58576C15.1070909-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-19  5:29                                       ` Andres Rodriguez
       [not found]                                         ` <2bf5afce-d5b8-4eaf-0fcd-a7ebfe85f92e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-19  5:29 UTC (permalink / raw)
  To: zhoucm1, Pierre-Loup A. Griffais, Sagalovitch, Serguei,
	Andres Rodriguez, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao,
	David, Zhang, Hawking, Huan, Alvin

Yes, Vulkan is available on all-open through the Mesa radv UMD.

I'm not sure if I'm asking for too much, but if we can coordinate a 
similar interface in radv and amdgpu-pro at the vulkan level that would 
be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:
>
>
> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>> We're currently working with the open stack; I assume that a 
>> mechanism could be exposed by both open and Pro Vulkan userspace 
>> drivers and that the amdgpu kernel interface improvements we would 
>> pursue following this discussion would let both drivers take 
>> advantage of the feature, correct?
> Of course.
> Does open stack have Vulkan support?
>
> Regards,
> David Zhou
>>
>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>
>>> +David Mao, who is working on our Vulkan driver.
>>>
>>> Regards,
>>> David Zhou
>>>
>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>> Hi Serguei,
>>>>
>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>> see replies inline.
>>>>
>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>> Andres,
>>>>>
>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>> actually:
>>>>> So we could have a potential memory overcommit case, or do you do
>>>>> partitioning
>>>>> on your own?  I would think that there is a need to avoid overcommit in
>>>>> VR case to
>>>>> prevent any BO migration.
>>>>
>>>> You're entirely correct; currently the VR runtime is setting up
>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>> prioritized GPU scheduling and pre-emption (eg. this thread), and in
>>>> the future it will make sense to do work in order to make sure that
>>>> its memory allocations do not get evicted, to prevent any unwelcome
>>>> additional latency in the event of needing to perform just-in-time
>>>> reprojection.
>>>>
>>>>> BTW: Do you mean __real__ processes or threads?
>>>>> Based on my understanding sharing BOs between different processes
>>>>> could introduce additional synchronization constraints. btw: I am not
>>>>> sure
>>>>> if we are able to share Vulkan sync. object cross-process boundary.
>>>>
>>>> They are different processes; it is important for the compositor that
>>>> is responsible for quality-of-service features such as consistently
>>>> presenting distorted frames with the right latency, reprojection, etc,
>>>> to be separate from the main application.
>>>>
>>>> Currently we are using unreleased cross-process memory and semaphore
>>>> extensions to fetch updated eye images from the client application,
>>>> but the just-in-time reprojection discussed here does not actually
>>>> have any direct interactions with cross-process resource sharing,
>>>> since it's achieved by using whatever is the latest, most up-to-date
>>>> eye images that have already been sent by the client application,
>>>> which are already available to use without additional synchronization.
>>>>
>>>>>
>>>>>>    3) System compositor (we are looking at approaches to remove this
>>>>>> overhead)
>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>
>>>> Yes, we are working on mechanisms to present directly to the headset
>>>> display without any intermediaries as a separate effort.
>>>>
>>>>>
>>>>>>  The latency is our main concern,
>>>>> I would assume that this is the known problem (at least for compute
>>>>> usage).
>>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>>> (at least
>>>>> in the default configuration).
>>>>
>>>> As long as it's a consistent cost, it shouldn't be an issue. However, if
>>>> there's a high degree of variance then that would be troublesome and we
>>>> would need to account for the worst case.
>>>>
>>>> Hopefully the requirements and approach we described make sense, we're
>>>> looking forward to your feedback and suggestions.
>>>>
>>>> Thanks!
>>>>  - Pierre-Loup
>>>>
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>> Sent: December 16, 2016 10:00 PM
>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hey Serguei,
>>>>>
>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>>>> understand (by simplifying)
>>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>>> scheme but I do not think
>>>>>> that it is  ideal.  I would assume that we need  to switch to
>>>>>> dynamical partition
>>>>>> of resources  based on the workload otherwise we will have resource
>>>>>> conflict
>>>>>> between Vulkan compute and  OpenCL.
>>>>>
>>>>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>>>>> solution that assumes that
>>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>>> running on the system).
>>>>>
>>>>> This should be more or less the use case we expect from VR users.
>>>>>
>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>> that a separate task, because
>>>>> making it dynamic is not straightforward :P
>>>>>
>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>> will not be
>>>>>> involved.  I would assume that in the case of VR we will have one 
>>>>>> main
>>>>>> application ("console" mode(?)) so we could temporally "ignore"
>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>
>>>>> Correct, this is why we want to enable the high priority compute
>>>>> queue through
>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>
>>>>> For current VR workloads we have 3 separate processes running 
>>>>> actually:
>>>>>     1) Game process
>>>>>     2) VR Compositor (this is the process that will require high
>>>>> priority queue)
>>>>>     3) System compositor (we are looking at approaches to remove this
>>>>> overhead)
>>>>>
>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>> simultaneously, but
>>>>> I would also like to be able to address this case in the future
>>>>> (cross-pipe priorities).
>>>>>
>>>>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it
>>>>>> may take time so
>>>>>> latency may suffer
>>>>>
>>>>> The latency is our main concern, we want something that is
>>>>> predictable. A good
>>>>> illustration of what the reprojection scheduling looks like can be
>>>>> found here:
>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>
>>>>>
>>>>>
>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>> to guarantee that submissions from the same context will be executed
>>>>>> in order.
>>>>>
>>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>>> the game context, and it
>>>>> even happens in a separate process.
>>>>>
>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>> "preempt" and
>>>>>> "cancel/abort"
>>>>>
>>>>> Preempt the game with the compositor task and then resume it.
>>>>>
>>>>>> (b) Vulkan is generic API and could be used for graphics as well as
>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>
>>>>> Yeah, the plan is to use vulkan compute. But if you figure out a way
>>>>> for us to get
>>>>> a guaranteed execution time using vulkan graphics, then I'll take you
>>>>> out for a beer :)
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>> ________________________________________
>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Andres,
>>>>>
>>>>> Please see inline (as [Serguei])
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>> Sent: December 16, 2016 8:29 PM
>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Serguei,
>>>>>
>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> ________________________________________
>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Andres,
>>>>>
>>>>>
>>>>> Quick comments:
>>>>>
>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>> assignments/binding
>>>>> to the high-priority queue when it will be in use and "free" them
>>>>> later (we do not want to take CUs forever from e.g. a graphics task
>>>>> and degrade graphics performance).
>>>>>
>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>> low-priority compute) takes all (extra) CUs and high-priority work
>>>>> will wait for needed resources.
>>>>> It will not be visible on "NOP" but only when you submit a "real"
>>>>> compute task, so I would recommend not using "NOP" packets at all
>>>>> for testing.
>>>>>
>>>>> It (CU assignment) could be relatively easily done when everything is
>>>>> going via kernel
>>>>> (e.g. as part of frame submission) but I must admit that I am not 
>>>>> sure
>>>>> about the best way for user level submissions (amdkfd).
>>>>>
>>>>> [AR] I wasn't aware of this part of the programming sequence. Thanks
>>>>> for the heads up!
>>>>> Is this similar to the CU masking programming?
>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" when
>>>>> deciding which
>>>>> queue to run will check if there are enough resources and if not then
>>>>> it will begin
>>>>> to check other queues with lower priority.
>>>>>
>>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>>> queue and having nothing there except it.
>>>>>
>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
>>>>> to the MEC definition
>>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>>> only has access to 1 pipe,
>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>
>>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>>> understand (by simplifying)
>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>> scheme but I do not think
>>>>> that it is ideal.  I would assume that we need to switch to dynamic
>>>>> partitioning of resources based on the workload, otherwise we will
>>>>> have a resource conflict between Vulkan compute and OpenCL.
>>>>>
>>>>>
>>>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>>>> OpenCL?
>>>>>
>>>>> [AR] Vulkan
>>>>>
>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
>>>>> be not
>>>>> involved.  I would assume that in the case of VR we will have one 
>>>>> main
>>>>> application ("console" mode(?)) so we could temporally "ignore"
>>>>> OpenCL/ROCm needs when VR is running.
>>>>>
>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>> workloads.
>>>>> I assume that you are talking about graphics? Am I right?
>>>>>
>>>>> [AR] Yeah, my understanding is that pre-empting the currently running
>>>>> graphics job and scheduling in
>>>>> something else using mid-buffer pre-emption has some cases where it
>>>>> doesn't work well. But if with
>>>>> polaris10 it starts working well, it might be a better solution for
>>>>> us (because the whole reprojection
>>>>> work uses the vulkan graphics stack at the moment, and porting it to
>>>>> compute is not trivial).
>>>>>
>>>>> [Serguei]  The problem with pre-emption of graphics task: (a) it may
>>>>> take time so
>>>>> latency may suffer (b) to preempt we need to have different "context"
>>>>> - we want
>>>>> to guarantee that submissions from the same context will be executed
>>>>> in order.
>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>> "preempt" and
>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>> for graphics as well as for plain compute tasks 
>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>>
>>>>>
>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>> Sent: December 16, 2016 6:15 PM
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> This RFC is also available as a gist here:
>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>
>>>>> We are interested in feedback for a mechanism to effectively schedule
>>>>> high
>>>>> priority VR reprojection tasks (also referred to as time-warping) for
>>>>> Polaris10
>>>>> running on the amdgpu kernel driver.
>>>>>
>>>>> Brief context:
>>>>> --------------
>>>>>
>>>>> The main objective of reprojection is to avoid motion sickness for VR
>>>>> users in
>>>>> scenarios where the game or application would fail to finish
>>>>> rendering a new
>>>>> frame in time for the next VBLANK. When this happens, the user's head
>>>>> movements
>>>>> are not reflected on the Head Mounted Display (HMD) for the duration
>>>>> of an
>>>>> extra frame. This extended mismatch between the inner ear and the
>>>>> eyes may
>>>>> cause the user to experience motion sickness.
>>>>>
>>>>> The VR compositor deals with this problem by fabricating a new frame
>>>>> using the
>>>>> user's updated head position in combination with the previous frames.
>>>>> This
>>>>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>>>>
>>>>> Because of the adverse effects on the user, we require high
>>>>> confidence that the
>>>>> reprojection task will complete before the VBLANK interval. Even if
>>>>> the GFX pipe
>>>>> is currently full of work from the game/application (which is most
>>>>> likely the case).
>>>>>
>>>>> For more details and illustrations, please refer to the following
>>>>> document:
>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>
>>>>> Requirements:
>>>>> -------------
>>>>>
>>>>> The mechanism must expose the following functionality:
>>>>>
>>>>>     * Job round trip time must be predictable, from submission to
>>>>> fence signal
>>>>>
>>>>>     * The mechanism must support compute workloads.
>>>>>
>>>>> Goals:
>>>>> ------
>>>>>
>>>>>     * The mechanism should provide low submission latencies
>>>>>
>>>>> Test: submitting a NOP packet through the mechanism on busy hardware
>>>>> should
>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>
>>>>> Nice to have:
>>>>> -------------
>>>>>
>>>>>     * The mechanism should also support GFX workloads.
>>>>>
>>>>> My understanding is that with the current hardware capabilities in
>>>>> Polaris10 we
>>>>> will not be able to provide a solution compatible with GFX workloads.
>>>>>
>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>> approach or
>>>>> suggestion that will also be compatible with the GFX ring, please let
>>>>> us know
>>>>> about it.
>>>>>
>>>>>     * The above guarantees should also be respected by amdkfd 
>>>>> workloads
>>>>>
>>>>> Would be good to have for consistency, but not strictly necessary as
>>>>> users running
>>>>> games are not traditionally running HPC workloads in the background.
>>>>>
>>>>> Proposed approach:
>>>>> ------------------
>>>>>
>>>>> Similar to the windows driver, we could expose a high priority
>>>>> compute queue to
>>>>> userspace.
>>>>>
>>>>> Submissions to this compute queue will be scheduled with high
>>>>> priority, and may
>>>>> acquire hardware resources previously in use by other queues.
>>>>>
>>>>> This can be achieved by taking advantage of the 'priority' field in
>>>>> the HQDs
>>>>> and could be programmed by amdgpu or the amdgpu scheduler. The 
>>>>> relevant
>>>>> register fields are:
>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>
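As a rough illustration of what programming those two fields could look
like, here is a minimal kernel-side sketch. It assumes amdgpu's
VI-generation helpers (vi_srbm_select(), WREG32, adev->srbm_mutex); the
function name and the priority parameters are illustrative, not code
from this thread:

    static void sketch_set_hqd_priority(struct amdgpu_device *adev,
                                        u32 me, u32 pipe, u32 queue,
                                        u32 pipe_priority, u32 queue_priority)
    {
            /* The SRBM register window must point at the target
             * me/pipe/queue before the per-queue HQD registers can
             * be written. */
            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, me, pipe, queue, 0);
            WREG32(mmCP_HQD_PIPE_PRIORITY, pipe_priority);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, queue_priority);
            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }
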
>>>>> Implementation approach 1 - static partitioning:
>>>>> ------------------------------------------------
>>>>>
>>>>> The amdgpu driver currently controls 8 compute queues from pipe0. 
>>>>> We can
>>>>> statically partition these as follows:
>>>>>         * 7x regular
>>>>>         * 1x high priority
>>>>>
>>>>> The relevant priorities can be set so that submissions to the high
>>>>> priority
>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>
>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>> rings if the
>>>>> context is marked as high priority. And a corresponding priority
>>>>> should be
>>>>> added to keep track of this information:
>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>
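A sketch of what that priority list could look like as code (hedged: the
scheduler's existing enum is assumed, and the exact names and ordering
here are illustrative):

    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_KERNEL = 0, /* kernel-internal, highest */
            AMD_SCHED_PRIORITY_HIGH,       /* proposed: e.g. VR compositor */
            AMD_SCHED_PRIORITY_NORMAL,     /* default for user contexts */
            AMD_SCHED_PRIORITY_MAX
    };
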
>>>>> The user will request a high priority context by setting an
>>>>> appropriate flag
>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>
>>>>> The setting is at a per-context level so that we can:
>>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>>> context
>>>>>     * Create high priority and non-high priority contexts in the same
>>>>> process
>>>>>
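A userspace sketch of requesting such a context (AMDGPU_CTX_OP_ALLOC_CTX
and DRM_IOCTL_AMDGPU_CTX are the existing uapi entry points; the
AMDGPU_CTX_HIGH_PRIORITY flag and its value are the hypothetical part):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/amdgpu_drm.h>

    /* Hypothetical flag from the proposal above; not in the uapi today. */
    #define AMDGPU_CTX_HIGH_PRIORITY (1u << 0)

    static int alloc_high_priority_ctx(int drm_fd, uint32_t *ctx_id)
    {
            union drm_amdgpu_ctx args = {0};

            args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
            args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;
            if (ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args))
                    return -1;
            *ctx_id = args.out.alloc.ctx_id;
            return 0;
    }
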
>>>>> Implementation approach 2 - dynamic priority programming:
>>>>> ---------------------------------------------------------
>>>>>
>>>>> Similar to the above, but instead of programming the priorities at
>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>> priorities
>>>>> dynamically when scheduling a task.
>>>>>
>>>>> This would involve having a hardware specific callback from the
>>>>> scheduler to
>>>>> set the appropriate queue priority: set_priority(int ring, int index,
>>>>> int priority)
>>>>>
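A sketch of the callback's shape (struct amdgpu_ring_funcs is the
existing per-ring function table; adding set_priority to it is the
proposal, and this signature is just one plausible reading of the one
above):

    struct amdgpu_ring_funcs {
            /* ... existing callbacks ... */
            /* proposed: reprogram this ring's HW queue priority before
             * the SW scheduler emits a job of the given priority */
            void (*set_priority)(struct amdgpu_ring *ring, int priority);
    };
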
>>>>> During this callback we would have to grab the SRBM mutex to perform
>>>>> the appropriate
>>>>> HW programming, and I'm not really sure if that is something we
>>>>> should be doing from
>>>>> the scheduler.
>>>>>
>>>>> On the positive side, this approach would allow us to program a 
>>>>> range of
>>>>> priorities for jobs instead of a single "high priority" value,
>>>>> achieving
>>>>> something similar to the niceness API available for CPU scheduling.
>>>>>
>>>>> I'm not sure if this flexibility is something that we would need for
>>>>> our use
>>>>> case, but it might be useful in other scenarios (multiple users
>>>>> sharing compute
>>>>> time on a server).
>>>>>
>>>>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>>>>> repurposing
>>>>> of the flags field.
>>>>>
>>>>> Known current obstacles:
>>>>> ------------------------
>>>>>
>>>>> The SQ is currently programmed to disregard the HQD priorities, and
>>>>> instead it picks
>>>>> jobs at random. Settings from the shader itself are also disregarded
>>>>> as this is
>>>>> considered a privileged field.
>>>>>
>>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>>> might not get the
>>>>> time we need on the SQ.
>>>>>
>>>>> The current programming would have to be changed to allow priority
>>>>> propagation
>>>>> from the HQD into the SQ.
>>>>>
>>>>> Generic approach for all HW IPs:
>>>>> --------------------------------
>>>>>
>>>>> For consistency purposes, the high priority context can be enabled
>>>>> for all HW IPs
>>>>> with support of the SW scheduler. This will function similarly to the
>>>>> current
>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
>>>>> anything not
>>>>> committed to the HW queue.
>>>>>
>>>>> The benefits of requesting a high priority context for a non-compute
>>>>> queue will
>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in
>>>>> front of
>>>>> you), but having the API in place will allow us to easily improve the
>>>>> implementation
>>>>> in the future as new features become available in new hardware.
>>>>>
>>>>> Future steps:
>>>>> -------------
>>>>>
>>>>> Once we have an approach settled, I can take care of the 
>>>>> implementation.
>>>>>
>>>>> Also, once the interface is mostly decided, we can start thinking 
>>>>> about
>>>>> exposing the high priority queue through radv.
>>>>>
>>>>> Request for feedback:
>>>>> ---------------------
>>>>>
>>>>> We aren't married to any of the approaches outlined above. Our goal
>>>>> is to
>>>>> obtain a mechanism that will allow us to complete the reprojection
>>>>> job within a
>>>>> predictable amount of time. So if anyone has any suggestions for
>>>>> suggestions for
>>>>> improvements or alternative strategies we are more than happy to hear
>>>>> them.
>>>>>
>>>>> If any of the technical information above is also incorrect, feel
>>>>> free to point
>>>>> out my misunderstandings.
>>>>>
>>>>> Looking forward to hearing from you.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                         ` <2bf5afce-d5b8-4eaf-0fcd-a7ebfe85f92e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-12-19  5:50                                           ` zhoucm1
       [not found]                                             ` <5857751C.5060409-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: zhoucm1 @ 2016-12-19  5:50 UTC (permalink / raw)
  To: Andres Rodriguez, Pierre-Loup A. Griffais, Sagalovitch, Serguei,
	Andres Rodriguez, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao,
	David, Zhang, Hawking, Huan, Alvin

Do you encounter the priority issue for the compute queue with the current driver?

If the compute queue is occupied only by you, the efficiency is equal to
setting the job queue to high priority, I think.

Regards,
David Zhou

On 12/19/2016 13:29, Andres Rodriguez wrote:
> Yes, vulkan is available on all-open through the mesa radv UMD.
>
> I'm not sure if I'm asking for too much, but if we can coordinate a 
> similar interface in radv and amdgpu-pro at the vulkan level that 
> would be great.
>
> I'm not sure what that's going to be yet.
>
> - Andres
>
> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>
>>
>> On 12/19/2016 11:33, Pierre-Loup A. Griffais wrote:
>>> We're currently working with the open stack; I assume that a 
>>> mechanism could be exposed by both open and Pro Vulkan userspace 
>>> drivers and that the amdgpu kernel interface improvements we would 
>>> pursue following this discussion would let both drivers take 
>>> advantage of the feature, correct?
>> Of course.
>> Does the open stack have Vulkan support?
>>
>> Regards,
>> David Zhou
>>>
>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>
>>>> +David Mao, who is working on our Vulkan driver.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                         ` <bd0ba668-3d13-6343-a1c6-de5d0b7b3be3-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  2016-12-19  3:26                           ` zhoucm1
@ 2016-12-19 14:37                           ` Serguei Sagalovitch
  1 sibling, 0 replies; 36+ messages in thread
From: Serguei Sagalovitch @ 2016-12-19 14:37 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Pierre-Loup,

 > As long as it's a consistent cost, it shouldn't be an issue.
But I assume that you still have some threshold that it must not
exceed?

 > Hopefully the requirements
Do you have any numbers in mind?

BTW: My understanding is that the resolution should be 2160x1200
and the refresh rate 90Hz. Am I right?


 > the user, we require high confidence that the reprojection task
 > will complete before the VBLANK interval.
So could we assume (to simplify) that the requirement is
the following: have the opportunity to submit a task and execute
it in less than one VBLANK interval, measured from the moment Vulkan
calls the kernel driver to the moment the GPU h/w finishes processing?

Sincerely yours,
Serguei Sagalovitch
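
For reference, the submit-to-finish window implied above is one VBLANK
interval; at the 90Hz refresh rate mentioned, a back-of-the-envelope
figure (my arithmetic, not a number from the thread) is:

    /* 90 Hz refresh => per-frame budget for submit + execute */
    double vblank_interval_ms = 1000.0 / 90.0;  /* ~11.1 ms */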

On 2016-12-17 05:05 PM, Pierre-Loup A. Griffais wrote:
> Hi Serguei,
>
> I'm also working on the bringing up our VR runtime on top of amgpu; 
> see replies inline.
>
> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>> Andres,
>>
>>>  For current VR workloads we have 3 separate processes running 
>>> actually:
>> So we could have a potential memory overcommit case, or do you do
>> partitioning on your own?  I would think that there is a need to avoid
>> overcommit in the VR case to prevent any BO migration.
>
> You're entirely correct; currently the VR runtime is setting up 
> prioritized CPU scheduling for its VR compositor, we're working on 
> prioritized GPU scheduling and pre-emption (eg. this thread), and in 
> the future it will make sense to do work in order to make sure that 
> its memory allocations do not get evicted, to prevent any unwelcome 
> additional latency in the event of needing to perform just-in-time 
> reprojection.
>
>> BTW: Do you mean __real__ processes or threads?
>> Based on my understanding sharing BOs between different processes
>> could introduce additional synchronization constraints.  btw: I am not
>> sure if we are able to share a Vulkan sync object across the process
>> boundary.
>
> They are different processes; it is important for the compositor that 
> is responsible for quality-of-service features such as consistently 
> presenting distorted frames with the right latency, reprojection, etc, 
> to be separate from the main application.
>
> Currently we are using unreleased cross-process memory and semaphore 
> extensions to fetch updated eye images from the client application, 
> but the just-in-time reprojection discussed here does not actually 
> have any direct interactions with cross-process resource sharing, 
> since it's achieved by using whatever is the latest, most up-to-date 
> eye images that have already been sent by the client application, 
> which are already available to use without additional synchronization.
>
>>
>>>    3) System compositor (we are looking at approaches to remove this 
>>> overhead)
>> Yes,  IMHO the best is to run in  "full screen mode".
>
> Yes, we are working on mechanisms to present directly to the headset 
> display without any intermediaries as a separate effort.
>
>>
>>>  The latency is our main concern,
>> I would assume that this is a known problem (at least for compute
>> usage).
>> It looks like amdgpu / kernel submission is rather CPU intensive
>> (at least in the default configuration).
>
> As long as it's a consistent cost, it shouldn't be an issue. However, if
> there are high degrees of variance then that would be troublesome and we
> would need to account for the worst case.
>
> Hopefully the requirements and approach we described make sense, we're 
> looking forward to your feedback and suggestions.
>
> Thanks!
>  - Pierre-Loup
>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 10:00 PM
>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hey Serguei,
>>
>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I 
>>> understand (by simplifying)
>>> some scheduling is per pipe.  I know about the current allocation 
>>> scheme but I do not think
>>> that it is ideal.  I would assume that we need to switch to dynamic
>>> partitioning of resources based on the workload, otherwise we will have
>>> a resource conflict between Vulkan compute and OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a 
>> solution that assumes that
>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm 
>> running on the system).
>>
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider 
>> that a separate task, because
>> making it dynamic is not straightforward :P
>>
>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd 
>>> will be not
>>> involved.  I would assume that in the case of VR we will have one main
>>> application ("console" mode(?)) so we could temporally "ignore"
>>> OpenCL/ROCm needs when VR is running.
>>
>> Correct, this is why we want to enable the high priority compute 
>> queue through
>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>
>> For current VR workloads we have 3 separate processes running actually:
>>     1) Game process
>>     2) VR Compositor (this is the process that will require high 
>> priority queue)
>>     3) System compositor (we are looking at approaches to remove this 
>> overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm running 
>> simultaneously, but
>> I would also like to be able to address this case in the future 
>> (cross-pipe priorities).
>>
>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it 
>>> may take time so
>>> latency may suffer
>>
>> The latency is our main concern, we want something that is 
>> predictable. A good
>> illustration of what the reprojection scheduling looks like can be 
>> found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>
>>
>>> (b) to preempt we need to have different "context" - we want
>>> to guarantee that submissions from the same context will be executed 
>>> in order.
>>
>> This is okay, as the reprojection work doesn't have dependencies on 
>> the game context, and it
>> even happens in a separate process.
>>
>>> BTW: (a) Do you want  "preempt" and later resume or do you want 
>>> "preempt" and
>>> "cancel/abort"
>>
>> Preempt the game with the compositor task and then resume it.
>>
>>> (b) Vulkan is generic API and could be used for graphics as well as
>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Yeah, the plan is to use vulkan compute. But if you figure out a way 
>> for us to get
>> a guaranteed execution time using vulkan graphics, then I'll take you 
>> out for a beer :)
>>
>> Regards,
>> Andres
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
>> to the high-priority queue when it will be in use and "free" them later
>> (we do not want to take CUs forever from e.g. a graphics task and
>> degrade graphics performance).
>>
>> Otherwise we could have a scenario where a long graphics task (or
>> low-priority compute) takes all (extra) CUs and high-priority work
>> will wait for needed resources.
>> It will not be visible on "NOP" but only when you submit a "real"
>> compute task, so I would recommend not using "NOP" packets at all
>> for testing.
>>
>> It (CU assignment) could be relatively easily done when everything is 
>> going via kernel
>> (e.g. as part of frame submission) but I must admit that I am not sure
>> about the best way for user level submissions (amdkfd).
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks 
>> for the heads up!
>> Is this similar to the CU masking programming?
>> [Serguei] Yes. To simplify: the problem is that "scheduler" when 
>> deciding which
>> queue to run will check if there are enough resources and if not then 
>> it will begin
>> to check other queues with lower priority.
>>
>> 2) I would recommend dedicating the whole pipe to the high-priority
>> queue and having nothing there except it.
>>
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed 
>> to the MEC definition
>> of pipe, which is a grouping of queues). I say this because amdgpu 
>> only has access to 1 pipe,
>> and the rest are statically partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I 
>> understand (by simplifying)
>> some scheduling is per pipe.  I know about the current allocation 
>> scheme but I do not think
>> that it is ideal.  I would assume that we need to switch to dynamic
>> partitioning of resources based on the workload, otherwise we will have
>> a resource conflict between Vulkan compute and OpenCL.
>>
>>
>> BTW: Which user level API do you want to use for compute: Vulkan or 
>> OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will 
>> be not
>> involved.  I would assume that in the case of VR we will have one main
>> application ("console" mode(?)) so we could temporally "ignore"
>> OpenCL/ROCm needs when VR is running.
>>
>>>  we will not be able to provide a solution compatible with GFX 
>>> workloads.
>> I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently running 
>> graphics job and scheduling in
>> something else using mid-buffer pre-emption has some cases where it 
>> doesn't work well. But if with
>> polaris10 it starts working well, it might be a better solution for 
>> us (because the whole reprojection
>> work uses the vulkan graphics stack at the moment, and porting it to 
>> compute is not trivial).
>>
>> [Serguei]  The problem with pre-emption of graphics task:  (a) it may 
>> take time so
>> latency may suffer (b) to preempt we need to have different "context" 
>> - we want
>> to guarantee that submissions from the same context will be executed 
>> in order.
>> BTW: (a) Do you want  "preempt" and later resume or do you want 
>> "preempt" and
>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>> for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>>
>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of 
>> Andres Rodriguez <andresr@valvesoftware.com>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx@lists.freedesktop.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Everyone,
>>
>> This RFC is also available as a gist here:
>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>
>> We are interested in feedback for a mechanism to effectively schedule 
>> high
>> priority VR reprojection tasks (also referred to as time-warping) for 
>> Polaris10
>> running on the amdgpu kernel driver.
>>
>> Brief context:
>> --------------
>>
>> The main objective of reprojection is to avoid motion sickness for VR 
>> users in
>> scenarios where the game or application would fail to finish 
>> rendering a new
>> frame in time for the next VBLANK. When this happens, the user's head 
>> movements
>> are not reflected on the Head Mounted Display (HMD) for the duration 
>> of an
>> extra frame. This extended mismatch between the inner ear and the 
>> eyes may
>> cause the user to experience motion sickness.
>>
>> The VR compositor deals with this problem by fabricating a new frame 
>> using the
>> user's updated head position in combination with the previous frames. 
>> This
>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>
>> Because of the adverse effects on the user, we require high 
>> confidence that the
>> reprojection task will complete before the VBLANK interval. Even if 
>> the GFX pipe
>> is currently full of work from the game/application (which is most 
>> likely the case).
>>
>> For more details and illustrations, please refer to the following 
>> document:
>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>>
>> Gaming: Asynchronous Shaders Evolved | Community
>> community.amd.com
>> One of the most exciting new developments in GPU technology over the 
>> past year has been the adoption of asynchronous shaders, which can 
>> make more efficient use of ...
>>
>>
>> Requirements:
>> -------------
>>
>> The mechanism must expose the following functionaility:
>>
>>     * Job round trip time must be predictable, from submission to 
>> fence signal
>>
>>     * The mechanism must support compute workloads.
>>
>> Goals:
>> ------
>>
>>     * The mechanism should provide low submission latencies
>>
>> Test: submitting a NOP packet through the mechanism on busy hardware 
>> should
>> be equivalent to submitting a NOP on idle hardware.
>>
>> Nice to have:
>> -------------
>>
>>     * The mechanism should also support GFX workloads.
>>
>> My understanding is that with the current hardware capabilities in 
>> Polaris10 we
>> will not be able to provide a solution compatible with GFX worloads.
>>
>> But I would love to hear otherwise. So if anyone has an idea, 
>> approach or
>> suggestion that will also be compatible with the GFX ring, please let 
>> us know
>> about it.
>>
>>     * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as 
>> users running
>> games are not traditionally running HPC workloads in the background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the windows driver, we could expose a high priority 
>> compute queue to
>> userspace.
>>
>> Submissions to this compute queue will be scheduled with high 
>> priority, and may
>> acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in 
>> the HQDs
>> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
>> register fields are:
>>         * mmCP_HQD_PIPE_PRIORITY
>>         * mmCP_HQD_QUEUE_PRIORITY
>>
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We can
>> statically partition these as follows:
>>         * 7x regular
>>         * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high 
>> priority
>> ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority 
>> rings if the
>> context is marked as high priority. And a corresponding priority 
>> should be
>> added to keep track of this information:
>>      * AMD_SCHED_PRIORITY_KERNEL
>>      * -> AMD_SCHED_PRIORITY_HIGH
>>      * AMD_SCHED_PRIORITY_NORMAL
>>
>> The user will request a high priority context by setting an 
>> appropriate flag
>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>
>>
>> The setting is in a per context level so that we can:
>>     * Maintain a consistent FIFO ordering of all submissions to a 
>> context
>>     * Create high priority and non-high priority contexts in the same 
>> process
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities and
>> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
>> dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the 
>> scheduler to
>> set the appropriate queue priority: set_priority(int ring, int index, 
>> int priority)
>>
>> During this callback we would have to grab the SRBM mutex to perform 
>> the appropriate
>> HW programming, and I'm not really sure if that is something we 
>> should be doing from
>> the scheduler.
>>
>> On the positive side, this approach would allow us to program a range of
>> priorities for jobs instead of a single "high priority" value", 
>> achieving
>> something similar to the niceness API available for CPU scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for 
>> our use
>> case, but it might be useful in other scenarios (multiple users 
>> sharing compute
>> time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or 
>> repurposing
>> of the flags field.
>>
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and 
>> instead it picks
>> jobs at random. Settings from the shader itself are also disregarded 
>> as this is
>> considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we 
>> might not get the
>> time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority 
>> propagation
>> from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled 
>> for all HW IPs
>> with support of the SW scheduler. This will function similarly to the 
>> current
>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of 
>> anything not
>> commited to the HW queue.
>>
>> The benefits of requesting a high priority context for a non-compute 
>> queue will
>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in 
>> front of
>> you), but having the API in place will allow us to easily improve the 
>> implementation
>> in the future as new features become available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about
>> exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal 
>> is to
>> obtain a mechanism that will allow us to complete the reprojection 
>> job within a
>> predictable amount of time. So if anyone anyone has any suggestions for
>> improvements or alternative strategies we are more than happy to hear 
>> them.
>>
>> If any of the technical information above is also incorrect, feel 
>> free to point
>> out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>> amd-gfx Info Page - lists.freedesktop.org
>> lists.freedesktop.org
>> To see the collection of prior postings to the list, visit the 
>> amd-gfx Archives. Using amd-gfx: To post a message to all the list 
>> members, send email ...
>>
>>
>>
>> amd-gfx Info Page - lists.freedesktop.org
>> lists.freedesktop.org
>> To see the collection of prior postings to the list, visit the 
>> amd-gfx Archives. Using amd-gfx: To post a message to all the list 
>> members, send email ...
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>

Sincerely yours,
Serguei Sagalovitch

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                             ` <5857751C.5060409-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-19 14:48                                               ` Serguei Sagalovitch
       [not found]                                                 ` <d8cf437e-af88-c76d-428f-53912bc43d2b-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Serguei Sagalovitch @ 2016-12-19 14:48 UTC (permalink / raw)
  To: zhoucm1, Andres Rodriguez, Pierre-Loup A. Griffais,
	Andres Rodriguez, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao,
	David, Zhang, Hawking, Huan, Alvin

 > If the compute queue is occupied only by you, the efficiency
 > is equal to setting the job queue to high priority, I think.
The only risk is the situation where graphics takes all the
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try to do it, or is it a lot of work for you?


BTW: If there is a non-VR application which uses the high-priority
h/w queue, then the VR application will suffer. Any ideas how
to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
> Do you encounter the priority issue for the compute queue with the
> current driver?
>
> If the compute queue is occupied only by you, the efficiency is equal
> to setting the job queue to high priority, I think.
>
> Regards,
> David Zhou
>
> On 2016-12-19 13:29, Andres Rodriguez wrote:
>> Yes, Vulkan is available on all-open through the Mesa radv UMD.
>>
>> I'm not sure if I'm asking for too much, but if we can coordinate a
>> similar interface in radv and amdgpu-pro at the Vulkan level, that
>> would be great.
>>
>> I'm not sure what that's going to be yet.
>>
>> - Andres
>>
>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>
>>>
>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>> We're currently working with the open stack; I assume that a 
>>>> mechanism could be exposed by both open and Pro Vulkan userspace 
>>>> drivers and that the amdgpu kernel interface improvements we would 
>>>> pursue following this discussion would let both drivers take 
>>>> advantage of the feature, correct?
>>> Of course.
>>> Does the open stack have Vulkan support?
>>>
>>> Regards,
>>> David Zhou
>>>>
>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>
>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>
>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>> Hi Serguei,
>>>>>>
>>>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>>>> see replies inline.
>>>>>>
>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>> Andres,
>>>>>>>
>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>> actually:
>>>>>>> So we could have a potential memory overcommit case, or do you
>>>>>>> do partitioning on your own? I would think that there is a need
>>>>>>> to avoid overcommit in the VR case to prevent any BO migration.
>>>>>>
>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>> prioritized GPU scheduling and pre-emption (eg. this thread), and in
>>>>>> the future it will make sense to do work in order to make sure that
>>>>>> its memory allocations do not get evicted, to prevent any unwelcome
>>>>>> additional latency in the event of needing to perform just-in-time
>>>>>> reprojection.
>>>>>>
>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>> Based on my understanding, sharing BOs between different processes
>>>>>>> could introduce additional synchronization constraints. BTW: I am
>>>>>>> not sure if we are able to share Vulkan sync objects across the
>>>>>>> process boundary.
>>>>>>
>>>>>> They are different processes; it is important for the compositor 
>>>>>> that
>>>>>> is responsible for quality-of-service features such as consistently
>>>>>> presenting distorted frames with the right latency, reprojection, 
>>>>>> etc,
>>>>>> to be separate from the main application.
>>>>>>
>>>>>> Currently we are using unreleased cross-process memory and semaphore
>>>>>> extensions to fetch updated eye images from the client application,
>>>>>> but the just-in-time reprojection discussed here does not actually
>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>> since it's achieved by using whatever is the latest, most up-to-date
>>>>>> eye images that have already been sent by the client application,
>>>>>> which are already available to use without additional 
>>>>>> synchronization.
>>>>>>
>>>>>>>
>>>>>>>>    3) System compositor (we are looking at approaches to remove 
>>>>>>>> this
>>>>>>>> overhead)
>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>
>>>>>> Yes, we are working on mechanisms to present directly to the headset
>>>>>> display without any intermediaries as a separate effort.
>>>>>>
>>>>>>>
>>>>>>>>  The latency is our main concern,
>>>>>>> I would assume that this is the known problem (at least for compute
>>>>>>> usage).
>>>>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>>>>> (at least in the default configuration).
>>>>>>
>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>> However, if there are high degrees of variance then that would be
>>>>>> troublesome and we would need to account for the worst case.
>>>>>>
>>>>>> Hopefully the requirements and approach we described make sense, 
>>>>>> we're
>>>>>> looking forward to your feedback and suggestions.
>>>>>>
>>>>>> Thanks!
>>>>>>  - Pierre-Loup
>>>>>>
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hey Serguei,
>>>>>>>
>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>>>>>> about the current allocation scheme, but I do not think that it
>>>>>>>> is ideal. I would assume that we need to switch to dynamic
>>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>>> will have a resource conflict between Vulkan compute and OpenCL.
>>>>>>>
>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start 
>>>>>>> with a
>>>>>>> solution that assumes that
>>>>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>>>>> running on the system).
>>>>>>>
>>>>>>> This should be more or less the use case we expect from VR users.
>>>>>>>
>>>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>>>> that a separate task, because
>>>>>>> making it dynamic is not straightforward :P
>>>>>>>
>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions), so
>>>>>>>> amdkfd will not be involved. I would assume that in the case of
>>>>>>>> VR we will have one main application ("console" mode(?)), so we
>>>>>>>> could temporarily "ignore" OpenCL/ROCm needs while VR is running.
>>>>>>>
>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>> queue through
>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>
>>>>>>> For current VR workloads we have 3 separate processes running 
>>>>>>> actually:
>>>>>>>     1) Game process
>>>>>>>     2) VR Compositor (this is the process that will require high
>>>>>>> priority queue)
>>>>>>>     3) System compositor (we are looking at approaches to remove 
>>>>>>> this
>>>>>>> overhead)
>>>>>>>
>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>> simultaneously, but
>>>>>>> I would also like to be able to address this case in the future
>>>>>>> (cross-pipe priorities).
>>>>>>>
>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>> (a) it may take time, so latency may suffer
>>>>>>>
>>>>>>> The latency is our main concern; we want something that is
>>>>>>> predictable. A good
>>>>>>> illustration of what the reprojection scheduling looks like can be
>>>>>>> found here:
>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> (b) to preempt we need to have a different "context" - we want
>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>> executed in order.
>>>>>>>
>>>>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>>>>> the game context, and it
>>>>>>> even happens in a separate process.
>>>>>>>
>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>> "preempt" and
>>>>>>>> "cancel/abort"
>>>>>>>
>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>
>>>>>>>> (b) Vulkan is a generic API and could be used for graphics as
>>>>>>>> well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>
>>>>>>> Yeah, the plan is to use Vulkan compute. But if you figure out
>>>>>>> a way for us to get a guaranteed execution time using Vulkan
>>>>>>> graphics, then I'll take you out for a beer :)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>> ________________________________________
>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hi Andres,
>>>>>>>
>>>>>>> Please see inline (as [Serguei])
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hi Serguei,
>>>>>>>
>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Andres,
>>>>>>>
>>>>>>>
>>>>>>> Quick comments:
>>>>>>>
>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>> assignments/binding to the high-priority queue when it is in use
>>>>>>> and "free" them later (we do not want to take CUs away from e.g.
>>>>>>> a graphics task forever and degrade graphics performance).
>>>>>>>
>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>> low-priority compute) takes all (extra) CUs and high-priority
>>>>>>> work waits for the needed resources. It will not be visible with
>>>>>>> "NOP" packets but only when you submit a "real" compute task, so
>>>>>>> I would recommend not to use "NOP" packets at all for testing.
>>>>>>>
>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>> everything goes via the kernel (e.g. as part of frame
>>>>>>> submission), but I must admit that I am not sure about the best
>>>>>>> way for user-level submissions (amdkfd).
>>>>>>>
>>>>>>> [AR] I wasn't aware of this part of the programming sequence. 
>>>>>>> Thanks
>>>>>>> for the heads up!
>>>>>>> Is this similar to the CU masking programming?
>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>>>>> when deciding which queue to run, will check if there are enough
>>>>>>> resources, and if not it will begin to check other queues with
>>>>>>> lower priority.
>>>>>>>
>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>
>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as 
>>>>>>> opposed
>>>>>>> to the MEC definition
>>>>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>>>>> only has access to 1 pipe,
>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>
>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>> understand (by simplifying), some scheduling is per pipe. I know
>>>>>>> about the current allocation scheme, but I do not think that it
>>>>>>> is ideal. I would assume that we need to switch to dynamic
>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>> will have a resource conflict between Vulkan compute and OpenCL.
>>>>>>>
>>>>>>>
>>>>>>> BTW: Which user-level API do you want to use for compute: Vulkan
>>>>>>> or OpenCL?
>>>>>>>
>>>>>>> [AR] Vulkan
>>>>>>>
>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions), so
>>>>>>> amdkfd will not be involved. I would assume that in the case of
>>>>>>> VR we will have one main application ("console" mode(?)), so we
>>>>>>> could temporarily "ignore" OpenCL/ROCm needs while VR is running.
>>>>>>>
>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>> workloads.
>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>
>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>>>>> running graphics job and scheduling in something else using
>>>>>>> mid-buffer pre-emption has some cases where it doesn't work well.
>>>>>>> But if it starts working well with Polaris10, it might be a
>>>>>>> better solution for us (because the whole reprojection work uses
>>>>>>> the Vulkan graphics stack at the moment, and porting it to
>>>>>>> compute is not trivial).
>>>>>>>
>>>>>>> [Serguei] The problems with pre-emption of a graphics task:
>>>>>>> (a) it may take time, so latency may suffer; (b) to preempt we
>>>>>>> need to have a different "context" - we want to guarantee that
>>>>>>> submissions from the same context will be executed in order.
>>>>>>> BTW: (a) Do you want "preempt" and later resume, or "preempt" and
>>>>>>> "cancel/abort"? (b) Vulkan is a generic API and could be used for
>>>>>>> graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> [snip - original RFC text]
>

Sincerely yours,
Serguei Sagalovitch

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found] ` <544E607D03B20249AA404517E498FC4699EBD3-Lp/cVzEoVyaisxZYEgh0i620KmCxYQEWVpNB7YpNyf8@public.gmane.org>
  2016-12-17  1:15   ` Sagalovitch, Serguei
@ 2016-12-19 19:29   ` Andres Rodriguez
  1 sibling, 0 replies; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-19 19:29 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Pierre-Loup Griffais,
	John.Bridgman-5C7GfCeVMHo, Sagalovitch, Serguei,
	Jay.Cornwall-5C7GfCeVMHo

Hey Guys,

One particular piece I'd like to discuss is how to get around the issue below:

> Known current obstacles:
> ------------------------
> 
> The SQ is currently programmed to disregard the HQD priorities, and instead it picks
> jobs at random. Settings from the shader itself are also disregarded as this is
> considered a privileged field.
> 
> Effectively we can get our compute wavefront launched ASAP, but we might not get the
> time we need on the SQ.
>
> The current programming would have to be changed to allow priority propagation
> from the HQD into the SQ.

1) Is this still an issue if we do the CU reservation that Serguei mentioned?

2) If the SQ respected the HQD priorities, would we still need the CU reservation?

3) Would updating the golden register settings be sufficient to change this behavior? Or would we also need a FW change?
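
For reference, the per-queue programming we are discussing would look
roughly like the following sketch (the helper name and the priority
encoding are made up; only the two register names come from the RFC).
The HQD registers are banked per queue, so the target queue has to be
selected via SRBM first, which is why the SRBM mutex keeps coming up:

    /* Illustrative sketch only: bank in one compute queue's HQD via
     * SRBM and program its pipe/queue priority fields.
     */
    static void set_compute_queue_priority(struct amdgpu_device *adev,
                                           u32 me, u32 pipe, u32 queue,
                                           u32 priority)
    {
            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, me, pipe, queue, 0);

            WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);

            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }

But as the quoted section says, none of this helps if the SQ ignores
what the HQD was told.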

Regards,
Andres
________________________________________
From: Andres Rodriguez
Sent: Friday, December 16, 2016 6:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in amdgpu

[snip - original RFC text]


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                 ` <d8cf437e-af88-c76d-428f-53912bc43d2b-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-20 12:56                                                   ` Christian König
       [not found]                                                     ` <5068f779-50ad-5e17-6d7e-8493e8fdd78a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2016-12-20 12:56 UTC (permalink / raw)
  To: Serguei Sagalovitch, zhoucm1, Andres Rodriguez,
	Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

> BTW: If there is a non-VR application which uses the high-priority
> h/w queue, then the VR application will suffer. Any ideas how
> to solve it?
Yeah, that problem came to my mind as well.

Basically we need to restrict those high-priority submissions to the VR
compositor, or otherwise any malfunctioning application could use them.

Just think about some WebGL app suddenly taking all our rendering away
so that we won't get anything drawn any more.
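
One rough idea as a starting point (a sketch only; the check below is
hypothetical and the flag name is just the placeholder from the RFC):
only honor the elevated priority when the context is created by the
current DRM master, which in the VR case is the compositor:

    /* Hypothetical sketch for context creation: refuse priority
     * elevation from anyone who isn't the current DRM master.
     */
    if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
        !drm_is_current_master(filp))
            return -EACCES;

That would keep random clients like the WebGL case out, but it also
means the compositor would have to proxy the high-priority queue for
any process that legitimately needs it.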

Alex or Michel, any ideas on that?

Regards,
Christian.

On 19.12.2016 15:48, Serguei Sagalovitch wrote:
> > If the compute queue is occupied only by you, the efficiency
> > is equal to setting the job queue to high priority, I think.
> The only risk is the situation where graphics takes all the
> needed CUs. But in any case it should be a very good test.
>
> Andres/Pierre-Loup,
>
> Did you try to do it, or is it a lot of work for you?
>
>
> BTW: If there is a non-VR application which uses the high-priority
> h/w queue, then the VR application will suffer. Any ideas how
> to solve it?
>
> Sincerely yours,
> Serguei Sagalovitch
>
> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>> Do you encounter the priority issue for the compute queue with the
>> current driver?
>>
>> If the compute queue is occupied only by you, the efficiency is equal
>> to setting the job queue to high priority, I think.
>>
>> Regards,
>> David Zhou
>>
>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>> Yes, Vulkan is available on all-open through the Mesa radv UMD.
>>>
>>> I'm not sure if I'm asking for too much, but if we can coordinate a
>>> similar interface in radv and amdgpu-pro at the Vulkan level, that
>>> would be great.
>>>
>>> I'm not sure what that's going to be yet.
>>>
>>> - Andres
>>>
>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>
>>>>
>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>> We're currently working with the open stack; I assume that a 
>>>>> mechanism could be exposed by both open and Pro Vulkan userspace 
>>>>> drivers and that the amdgpu kernel interface improvements we would 
>>>>> pursue following this discussion would let both drivers take 
>>>>> advantage of the feature, correct?
>>>> Of course.
>>>> Does the open stack have Vulkan support?
>>>>
>>>> Regards,
>>>> David Zhou
>>>>>
>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>
>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>
>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>> Hi Serguei,
>>>>>>>
>>>>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>>>>> see replies inline.
>>>>>>>
>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>> Andres,
>>>>>>>>
>>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>>> actually:
>>>>>>>> So we could have a potential memory overcommit case, or do you
>>>>>>>> do partitioning on your own? I would think that there is a need
>>>>>>>> to avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>
>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>>> prioritized GPU scheduling and pre-emption (eg. this thread), 
>>>>>>> and in
>>>>>>> the future it will make sense to do work in order to make sure that
>>>>>>> its memory allocations do not get evicted, to prevent any unwelcome
>>>>>>> additional latency in the event of needing to perform just-in-time
>>>>>>> reprojection.
>>>>>>>
>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>> processes could introduce additional synchronization constraints.
>>>>>>>> BTW: I am not sure if we are able to share Vulkan sync objects
>>>>>>>> across the process boundary.
>>>>>>>
>>>>>>> They are different processes; it is important for the compositor 
>>>>>>> that
>>>>>>> is responsible for quality-of-service features such as consistently
>>>>>>> presenting distorted frames with the right latency, 
>>>>>>> reprojection, etc,
>>>>>>> to be separate from the main application.
>>>>>>>
>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>> semaphore
>>>>>>> extensions to fetch updated eye images from the client application,
>>>>>>> but the just-in-time reprojection discussed here does not actually
>>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>> up-to-date
>>>>>>> eye images that have already been sent by the client application,
>>>>>>> which are already available to use without additional 
>>>>>>> synchronization.
>>>>>>>
>>>>>>>>
>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>> remove this
>>>>>>>>> overhead)
>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>
>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>> headset
>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>
>>>>>>>>
>>>>>>>>>  The latency is our main concern,
>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>> compute
>>>>>>>> usage).
>>>>>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>>>>>> (at least in the default configuration).
>>>>>>>
>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>> However, if there are high degrees of variance then that would be
>>>>>>> troublesome and we would need to account for the worst case.
>>>>>>>
>>>>>>> Hopefully the requirements and approach we described make sense, 
>>>>>>> we're
>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>  - Pierre-Loup
>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>> amdgpu
>>>>>>>>
>>>>>>>> Hey Serguei,
>>>>>>>>
>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as
>>>>>>>>> I understand (by simplifying), some scheduling is per pipe. I
>>>>>>>>> know about the current allocation scheme, but I do not think
>>>>>>>>> that it is ideal. I would assume that we need to switch to
>>>>>>>>> dynamic partitioning of resources based on the workload,
>>>>>>>>> otherwise we will have a resource conflict between Vulkan
>>>>>>>>> compute and OpenCL.
>>>>>>>>
>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start 
>>>>>>>> with a
>>>>>>>> solution that assumes that
>>>>>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>>>>>> running on the system).
>>>>>>>>
>>>>>>>> This should be more or less the use case we expect from VR users.
>>>>>>>>
>>>>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>>>>> that a separate task, because
>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>
>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions), so
>>>>>>>>> amdkfd will not be involved. I would assume that in the case of
>>>>>>>>> VR we will have one main application ("console" mode(?)), so we
>>>>>>>>> could temporarily "ignore" OpenCL/ROCm needs while VR is
>>>>>>>>> running.
>>>>>>>>
>>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>>> queue through
>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>
>>>>>>>> For current VR workloads we have 3 separate processes running 
>>>>>>>> actually:
>>>>>>>>     1) Game process
>>>>>>>>     2) VR Compositor (this is the process that will require high
>>>>>>>> priority queue)
>>>>>>>>     3) System compositor (we are looking at approaches to 
>>>>>>>> remove this
>>>>>>>> overhead)
>>>>>>>>
>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>> simultaneously, but
>>>>>>>> I would also like to be able to address this case in the future
>>>>>>>> (cross-pipe priorities).
>>>>>>>>
>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>> (a) it may take time, so latency may suffer
>>>>>>>>
>>>>>>>> The latency is our main concern; we want something that is
>>>>>>>> predictable. A good
>>>>>>>> illustration of what the reprojection scheduling looks like can be
>>>>>>>> found here:
>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> (b) to preempt we need to have a different "context" - we want
>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>> executed in order.
>>>>>>>>
>>>>>>>> This is okay, as the reprojection work doesn't have 
>>>>>>>> dependencies on
>>>>>>>> the game context, and it
>>>>>>>> even happens in a separate process.
>>>>>>>>
>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>>> "preempt" and
>>>>>>>>> "cancel/abort"
>>>>>>>>
>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>
>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics as
>>>>>>>>> well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>
>>>>>>>> Yeah, the plan is to use Vulkan compute. But if you figure out
>>>>>>>> a way for us to get a guaranteed execution time using Vulkan
>>>>>>>> graphics, then I'll take you out for a beer :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Andres
>>>>>>>> ________________________________________
>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>> amdgpu
>>>>>>>>
>>>>>>>> Hi Andres,
>>>>>>>>
>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>> amdgpu
>>>>>>>>
>>>>>>>> Hi Serguei,
>>>>>>>>
>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Andres
>>>>>>>>
>>>>>>>> ________________________________________
>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>> amdgpu
>>>>>>>>
>>>>>>>> Andres,
>>>>>>>>
>>>>>>>>
>>>>>>>> Quick comments:
>>>>>>>>
>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU 
>>>>>>>> assignments/binding
>>>>>>>> to high-priority queue  when it will be in use and "free" them 
>>>>>>>> later
>>>>>>>> (we do not want to take CUs forever from e.g. a graphics task and
>>>>>>>> degrade graphics performance).
>>>>>>>>
>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>> low-priority compute) takes all the (extra) CUs and high-priority
>>>>>>>> work waits for the needed resources.
>>>>>>>> It will not be visible with "NOP" packets but only when you submit a
>>>>>>>> "real" compute task, so I would recommend not using "NOP" packets at
>>>>>>>> all for testing.
>>>>>>>>
>>>>>>>> It (CU assignment) could be done relatively easily when everything is
>>>>>>>> going via the kernel (e.g. as part of frame submission), but I must
>>>>>>>> admit that I am not sure about the best way for user level
>>>>>>>> submissions (amdkfd).
>>>>>>>>
>>>>>>>> [AR] I wasn't aware of this part of the programming sequence. 
>>>>>>>> Thanks
>>>>>>>> for the heads up!
>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler", when
>>>>>>>> deciding which queue to run, will check if there are enough
>>>>>>>> resources, and if not, it will begin to check other queues with
>>>>>>>> lower priority.
>>>>>>>>
>>>>>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>>>>>> queue and having nothing there except it.
>>>>>>>>
>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as 
>>>>>>>> opposed
>>>>>>>> to the MEC definition
>>>>>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>>>>>> only has access to 1 pipe,
>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>
>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it.  As far as I
>>>>>>>> understand (by simplifying)
>>>>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>>>>> scheme but I do not think
>>>>>>>> that it is ideal.  I would assume that we need to switch to dynamic
>>>>>>>> partitioning of resources based on the workload, otherwise we will
>>>>>>>> have a resource conflict between Vulkan compute and OpenCL.
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW: Which user level API do you want to use for compute: 
>>>>>>>> Vulkan or
>>>>>>>> OpenCL?
>>>>>>>>
>>>>>>>> [AR] Vulkan
>>>>>>>>
>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>> will not be involved.  I would assume that in the case of VR we will
>>>>>>>> have one main application ("console" mode(?)) so we could temporarily
>>>>>>>> "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>
>>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>>> workloads.
>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>
>>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently 
>>>>>>>> running
>>>>>>>> graphics job and scheduling in
>>>>>>>> something else using mid-buffer pre-emption has some cases 
>>>>>>>> where it
>>>>>>>> doesn't work well. But if with
>>>>>>>> polaris10 it starts working well, it might be a better solution 
>>>>>>>> for
>>>>>>>> us (because the whole reprojection
>>>>>>>> work uses the vulkan graphics stack at the moment, and porting 
>>>>>>>> it to
>>>>>>>> compute is not trivial).
>>>>>>>>
>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: (a) 
>>>>>>>> it may
>>>>>>>> take time so
>>>>>>>> latency may suffer (b) to preempt we need to have different 
>>>>>>>> "context"
>>>>>>>> - we want
>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>> executed
>>>>>>>> in order.
>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>>>> "preempt" and
>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>> for graphics as well as for plain compute tasks 
>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> This RFC is also available as a gist here:
>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>
>>>>>>>> We are interested in feedback for a mechanism to effectively 
>>>>>>>> schedule
>>>>>>>> high
>>>>>>>> priority VR reprojection tasks (also referred to as 
>>>>>>>> time-warping) for
>>>>>>>> Polaris10
>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>
>>>>>>>> Brief context:
>>>>>>>> --------------
>>>>>>>>
>>>>>>>> The main objective of reprojection is to avoid motion sickness 
>>>>>>>> for VR
>>>>>>>> users in
>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>> rendering a new
>>>>>>>> frame in time for the next VBLANK. When this happens, the 
>>>>>>>> user's head
>>>>>>>> movements
>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the 
>>>>>>>> duration
>>>>>>>> of an
>>>>>>>> extra frame. This extended mismatch between the inner ear and the
>>>>>>>> eyes may
>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>
>>>>>>>> The VR compositor deals with this problem by fabricating a new 
>>>>>>>> frame
>>>>>>>> using the
>>>>>>>> user's updated head position in combination with the previous 
>>>>>>>> frames.
>>>>>>>> This
>>>>>>>> avoids a prolonged mismatch between the HMD output and the 
>>>>>>>> inner ear.
>>>>>>>>
>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>> confidence that the
>>>>>>>> reprojection task will complete before the VBLANK interval. 
>>>>>>>> Even if
>>>>>>>> the GFX pipe
>>>>>>>> is currently full of work from the game/application (which is most
>>>>>>>> likely the case).
>>>>>>>>
>>>>>>>> For more details and illustrations, please refer to the following
>>>>>>>> document:
>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>
>>>>>>>> Requirements:
>>>>>>>> -------------
>>>>>>>>
>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>
>>>>>>>>     * Job round trip time must be predictable, from submission to
>>>>>>>> fence signal
>>>>>>>>
>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>
>>>>>>>> Goals:
>>>>>>>> ------
>>>>>>>>
>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>
>>>>>>>> Test: submitting a NOP packet through the mechanism on busy 
>>>>>>>> hardware
>>>>>>>> should
>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>
>>>>>>>> Nice to have:
>>>>>>>> -------------
>>>>>>>>
>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>
>>>>>>>> My understanding is that with the current hardware capabilities in
>>>>>>>> Polaris10 we
>>>>>>>> will not be able to provide a solution compatible with GFX 
>>>>>>>> workloads.
>>>>>>>>
>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>> approach or
>>>>>>>> suggestion that will also be compatible with the GFX ring, 
>>>>>>>> please let
>>>>>>>> us know
>>>>>>>> about it.
>>>>>>>>
>>>>>>>>     * The above guarantees should also be respected by amdkfd 
>>>>>>>> workloads
>>>>>>>>
>>>>>>>> Would be good to have for consistency, but not strictly 
>>>>>>>> necessary as
>>>>>>>> users running
>>>>>>>> games are not traditionally running HPC workloads in the 
>>>>>>>> background.
>>>>>>>>
>>>>>>>> Proposed approach:
>>>>>>>> ------------------
>>>>>>>>
>>>>>>>> Similar to the windows driver, we could expose a high priority
>>>>>>>> compute queue to
>>>>>>>> userspace.
>>>>>>>>
>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>> priority, and may
>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>
>>>>>>>> This can be achieved by taking advantage of the 'priority' 
>>>>>>>> field in
>>>>>>>> the HQDs
>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. The 
>>>>>>>> relevant
>>>>>>>> register fields are:
>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>
>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> The amdgpu driver currently controls 8 compute queues from 
>>>>>>>> pipe0. We can
>>>>>>>> statically partition these as follows:
>>>>>>>>         * 7x regular
>>>>>>>>         * 1x high priority
>>>>>>>>
>>>>>>>> The relevant priorities can be set so that submissions to the high
>>>>>>>> priority
>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>
>>>>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>>>>> rings if the
>>>>>>>> context is marked as high priority. And a corresponding priority
>>>>>>>> should be
>>>>>>>> added to keep track of this information:
>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>
>>>>>>>> The user will request a high priority context by setting an
>>>>>>>> appropriate flag
>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>>>>>> context
>>>>>>>>     * Create high priority and non-high priority contexts in 
>>>>>>>> the same
>>>>>>>> process
>>>>>>>>
>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>> ---------------------------------------------------------
>>>>>>>>
>>>>>>>> Similar to the above, but instead of programming the priorities at
>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>>>>> priorities dynamically when scheduling a task.
>>>>>>>>
>>>>>>>> This would involve having a hardware specific callback from the
>>>>>>>> scheduler to
>>>>>>>> set the appropriate queue priority: set_priority(int ring, int 
>>>>>>>> index,
>>>>>>>> int priority)
>>>>>>>>
>>>>>>>> During this callback we would have to grab the SRBM mutex to 
>>>>>>>> perform
>>>>>>>> the appropriate
>>>>>>>> HW programming, and I'm not really sure if that is something we
>>>>>>>> should be doing from
>>>>>>>> the scheduler.
>>>>>>>>
>>>>>>>> On the positive side, this approach would allow us to program a range
>>>>>>>> of priorities for jobs instead of a single "high priority" value,
>>>>>>>> achieving something similar to the niceness API available for CPU
>>>>>>>> scheduling.
>>>>>>>>
>>>>>>>> I'm not sure if this flexibility is something that we would 
>>>>>>>> need for
>>>>>>>> our use
>>>>>>>> case, but it might be useful in other scenarios (multiple users
>>>>>>>> sharing compute
>>>>>>>> time on a server).
>>>>>>>>
>>>>>>>> This approach would require a new int field in 
>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>> repurposing
>>>>>>>> of the flags field.
>>>>>>>>
>>>>>>>> Known current obstacles:
>>>>>>>> ------------------------
>>>>>>>>
>>>>>>>> The SQ is currently programmed to disregard the HQD priorities, 
>>>>>>>> and
>>>>>>>> instead it picks
>>>>>>>> jobs at random. Settings from the shader itself are also 
>>>>>>>> disregarded
>>>>>>>> as this is
>>>>>>>> considered a privileged field.
>>>>>>>>
>>>>>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>>>>>> might not get the
>>>>>>>> time we need on the SQ.
>>>>>>>>
>>>>>>>> The current programming would have to be changed to allow priority
>>>>>>>> propagation
>>>>>>>> from the HQD into the SQ.
>>>>>>>>
>>>>>>>> Generic approach for all HW IPs:
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>> For consistency purposes, the high priority context can be enabled
>>>>>>>> for all HW IPs
>>>>>>>> with support of the SW scheduler. This will function similarly 
>>>>>>>> to the
>>>>>>>> current
>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump 
>>>>>>>> ahead of
>>>>>>>> anything not
>>>>>>>> committed to the HW queue.
>>>>>>>>
>>>>>>>> The benefits of requesting a high priority context for a 
>>>>>>>> non-compute
>>>>>>>> queue will
>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is 
>>>>>>>> stuck in
>>>>>>>> front of
>>>>>>>> you), but having the API in place will allow us to easily 
>>>>>>>> improve the
>>>>>>>> implementation
>>>>>>>> in the future as new features become available in new hardware.
>>>>>>>>
>>>>>>>> Future steps:
>>>>>>>> -------------
>>>>>>>>
>>>>>>>> Once we have an approach settled, I can take care of the 
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> Also, once the interface is mostly decided, we can start 
>>>>>>>> thinking about
>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>
>>>>>>>> Request for feedback:
>>>>>>>> ---------------------
>>>>>>>>
>>>>>>>> We aren't married to any of the approaches outlined above. Our 
>>>>>>>> goal
>>>>>>>> is to
>>>>>>>> obtain a mechanism that will allow us to complete the reprojection
>>>>>>>> job within a
>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>> suggestions for
>>>>>>>> improvements or alternative strategies we are more than happy 
>>>>>>>> to hear
>>>>>>>> them.
>>>>>>>>
>>>>>>>> If any of the technical information above is also incorrect, feel
>>>>>>>> free to point
>>>>>>>> out my misunderstandings.
>>>>>>>>
>>>>>>>> Looking forward to hearing from you.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Andres
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>
>
> Sincerely yours,
> Serguei Sagalovitch
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                     ` <5068f779-50ad-5e17-6d7e-8493e8fdd78a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
@ 2016-12-20 15:51                                                       ` Andres Rodriguez
       [not found]                                                         ` <afc51505-7f86-a963-5d3a-be9df538019e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-20 15:51 UTC (permalink / raw)
  To: Christian König, Serguei Sagalovitch, zhoucm1,
	Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Hi Christian,

That is definitely a concern. What we are currently thinking is to make 
the high priority queues accessible to root only.

Therefore, if a non-root user attempts to set the high priority flag on 
context allocation, we would fail the call and return -EPERM.
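
A minimal sketch of what that check could look like in the context
allocation ioctl path, assuming the AMDGPU_CTX_HIGH_PRIORITY flag proposed
in the RFC (the placement and capability choice here are illustrative
assumptions, not a final design):

	/* hypothetical gate in amdgpu_ctx_alloc() */
	if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
	    !capable(CAP_SYS_ADMIN))
		return -EPERM;	/* high priority queues are privileged */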

Regards,
Andres


On 12/20/2016 7:56 AM, Christian König wrote:
>> BTW: If there is a non-VR application which will use the high-priority
>> h/w queue then the VR application will suffer.  Any ideas how
>> to solve it?
> Yeah, that problem came to my mind as well.
>
> Basically we need to restrict those high priority submissions to the 
> VR compositor or otherwise any malfunctioning application could use it.
>
> Just think about some WebGL suddenly taking all our rendering away and 
> we won't get anything drawn any more.
>
> Alex or Michel any ideas on that?
>
> Regards,
> Christian.
>
> On 2016-12-19 15:48, Serguei Sagalovitch wrote:
>> > If the compute queue is occupied only by you, the efficiency
>> > is equal to setting the job queue to high priority, I think.
>> The only risk is the situation when graphics will take all
>> needed CUs. But in any case it should be very good test.
>>
>> Andres/Pierre-Loup,
>>
>> Did you try to do it, or is it a lot of work for you?
>>
>>
>> BTW: If there is a non-VR application which will use the high-priority
>> h/w queue then the VR application will suffer.  Any ideas how
>> to solve it?
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>> Do you encounter the priority issue for compute queue with current 
>>> driver?
>>>
>>> If the compute queue is occupied only by you, the efficiency is equal 
>>> to setting the job queue to high priority, I think.
>>>
>>> Regards,
>>> David Zhou
>>>
>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>
>>>> I'm not sure if I'm asking for too much, but if we can coordinate a 
>>>> similar interface in radv and amdgpu-pro at the vulkan level that 
>>>> would be great.
>>>>
>>>> I'm not sure what that's going to be yet.
>>>>
>>>> - Andres
>>>>
>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>
>>>>>
>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>> We're currently working with the open stack; I assume that a 
>>>>>> mechanism could be exposed by both open and Pro Vulkan userspace 
>>>>>> drivers and that the amdgpu kernel interface improvements we 
>>>>>> would pursue following this discussion would let both drivers 
>>>>>> take advantage of the feature, correct?
>>>>> Of course.
>>>>> Does open stack have Vulkan support?
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>>
>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>
>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>
>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>> Hi Serguei,
>>>>>>>>
>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>> amdgpu;
>>>>>>>> see replies inline.
>>>>>>>>
>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>> Andres,
>>>>>>>>>
>>>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>>>> actually:
>>>>>>>>> So we could have a potential memory overcommit case, or do you do
>>>>>>>>> partitioning on your own?  I would think that there is a need to
>>>>>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>
>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this thread),
>>>>>>>> and in
>>>>>>>> the future it will make sense to do work in order to make sure 
>>>>>>>> that
>>>>>>>> its memory allocations do not get evicted, to prevent any 
>>>>>>>> unwelcome
>>>>>>>> additional latency in the event of needing to perform just-in-time
>>>>>>>> reprojection.
>>>>>>>>
>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>> Based on my understanding sharing BOs between different processes
>>>>>>>>> could introduce additional synchronization constrains. btw: I 
>>>>>>>>> am not
>>>>>>>>> sure
>>>>>>>>> if we are able to share Vulkan sync. object cross-process 
>>>>>>>>> boundary.
>>>>>>>>
>>>>>>>> They are different processes; it is important for the 
>>>>>>>> compositor that
>>>>>>>> is responsible for quality-of-service features such as 
>>>>>>>> consistently
>>>>>>>> presenting distorted frames with the right latency, 
>>>>>>>> reprojection, etc,
>>>>>>>> to be separate from the main application.
>>>>>>>>
>>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>>> semaphore
>>>>>>>> extensions to fetch updated eye images from the client 
>>>>>>>> application,
>>>>>>>> but the just-in-time reprojection discussed here does not actually
>>>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>>> up-to-date
>>>>>>>> eye images that have already been sent by the client application,
>>>>>>>> which are already available to use without additional 
>>>>>>>> synchronization.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>>> remove this
>>>>>>>>>> overhead)
>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>
>>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>>> headset
>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  The latency is our main concern,
>>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>>> compute
>>>>>>>>> usage).
>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU 
>>>>>>>>> intensive
>>>>>>>>> (at least
>>>>>>>>> in the default configuration).
>>>>>>>>
>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>> However, if
>>>>>>>> there's high degrees of variance then that would be troublesome 
>>>>>>>> and we
>>>>>>>> would need to account for the worst case.
>>>>>>>>
>>>>>>>> Hopefully the requirements and approach we described make 
>>>>>>>> sense, we're
>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>  - Pierre-Loup
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>
>>>>>>>>> [snip: full quote of the earlier Serguei/Andres exchange and the
>>>>>>>>> original RFC text, identical to the copies earlier in this thread]
>>>>
>>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                         ` <afc51505-7f86-a963-5d3a-be9df538019e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-12-20 17:20                                                           ` Pierre-Loup A. Griffais
  2016-12-22 11:42                                                           ` Christian König
  1 sibling, 0 replies; 36+ messages in thread
From: Pierre-Loup A. Griffais @ 2016-12-20 17:20 UTC (permalink / raw)
  To: Andres Rodriguez, Christian König, Serguei Sagalovitch,
	zhoucm1, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Resending bits of a message I had already attempted to send, but got 
mangled in various ways by a mobile MUA:

On Serguei's question on just using compute to leverage unused CUs and 
requirements:

The system will be fully loaded by the VR client application when this 
feature needs to be used, with hopefully both a graphics and a compute 
job in flight using 100% of the CU capacity.

Let me try to succinctly sum up the requirements since you asked in the 
other branch:

On a fully loaded system (optimal occupancy by VR client app), we would 
like the VR runtime to be able to submit a task (graphics or compute, 
but we realize only compute might be possible for best results) and get 
results in a consistent amount of time. Ideally that time would be close 
to the time it would take to complete the same task on an otherwise idle 
system, but it's assumed there would be a fixed cost added to it due to 
winding down in-flight CUs. The quality of service provided by the 
feature would depend on how predictably small such a cost would be. 11ms 
would be a current upper limit but not really a useful number for the 
purpose of discussion, as the feature would be beyond useless at that 
point. Being able to intervene 1ms before vblank of the HMD and 
consistently get our task complete in time would be good.
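
(For reference, assuming a 90 Hz HMD: one refresh interval is 1/90 s, or 
roughly 11.1 ms, which is where the 11 ms upper limit above comes from. 
Intervening 1 ms before vblank means preemption plus the reprojection work 
itself would have to fit inside that final 1 ms window.)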

On access control to the interface, and other high-priority clients 
potentially interfering with VR functionality:

The intent is that this interface will require some sort of privilege 
that only the VR compositor would have on a well-configured VR system; 
higher-than-average niceness would be one idea. If you have any 
suggestions there as well, that would be good discussion.
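
As a rough sketch of the niceness idea on the kernel side (the -10 
threshold and the CAP_SYS_NICE fallback are assumptions for illustration, 
not a proposal):

	/* only callers that already hold elevated CPU scheduling
	 * privileges (e.g. the VR compositor) get the high priority flag */
	if (task_nice(current) > -10 && !capable(CAP_SYS_NICE))
		return -EPERM;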

Thanks,
  - Pierre-Loup

On 12/20/2016 07:51 AM, Andres Rodriguez wrote:
> Hi Christian,
>
> That is definitely a concern. What we are currently thinking is to make
> the high priority queues accessible to root only.
>
> Therefore, if a non-root user attempts to set the high priority flag on
> context allocation, we would fail the call and return -EPERM.
>
> Regards,
> Andres
>
>
> On 12/20/2016 7:56 AM, Christian König wrote:
>>> BTW: If there is a non-VR application which will use the high-priority
>>> h/w queue then the VR application will suffer.  Any ideas how
>>> to solve it?
>> Yeah, that problem came to my mind as well.
>>
>> Basically we need to restrict those high priority submissions to the
>> VR compositor or otherwise any malfunctioning application could use it.
>>
>> Just think about some WebGL suddenly taking all our rendering away and
>> we won't get anything drawn any more.
>>
>> Alex or Michel any ideas on that?
>>
>> Regards,
>> Christian.
>>
>> On 2016-12-19 15:48, Serguei Sagalovitch wrote:
>>> > If the compute queue is occupied only by you, the efficiency
>>> > is equal to setting the job queue to high priority, I think.
>>> The only risk is the situation when graphics will take all
>>> needed CUs. But in any case it should be very good test.
>>>
>>> Andres/Pierre-Loup,
>>>
>>> Did you try to do it, or is it a lot of work for you?
>>>
>>>
>>> BTW: If there is a non-VR application which will use the high-priority
>>> h/w queue then the VR application will suffer.  Any ideas how
>>> to solve it?
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>> Do you encounter the priority issue for compute queue with current
>>>> driver?
>>>>
>>>> If the compute queue is occupied only by you, the efficiency is equal
>>>> to setting the job queue to high priority, I think.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>
>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>
>>>>> I'm not sure if I'm asking for too much, but if we can coordinate a
>>>>> similar interface in radv and amdgpu-pro at the vulkan level that
>>>>> would be great.
>>>>>
>>>>> I'm not sure what that's going to be yet.
>>>>>
>>>>> - Andres
>>>>>
>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>
>>>>>>
>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>> mechanism could be exposed by both open and Pro Vulkan userspace
>>>>>>> drivers and that the amdgpu kernel interface improvements we
>>>>>>> would pursue following this discussion would let both drivers
>>>>>>> take advantage of the feature, correct?
>>>>>> Of course.
>>>>>> Does open stack have Vulkan support?
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>>
>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>
>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> David Zhou
>>>>>>>>
>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>> Hi Serguei,
>>>>>>>>>
>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>> amdgpu;
>>>>>>>>> see replies inline.
>>>>>>>>>
>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>> Andres,
>>>>>>>>>>
>>>>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>>>>> actually:
>>>>>>>>>> So we could have a potential memory overcommit case, or do you do
>>>>>>>>>> partitioning on your own?  I would think that there is a need to
>>>>>>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>>
>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this thread),
>>>>>>>>> and in
>>>>>>>>> the future it will make sense to do work in order to make sure
>>>>>>>>> that
>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>> unwelcome
>>>>>>>>> additional latency in the event of needing to perform just-in-time
>>>>>>>>> reprojection.
>>>>>>>>>
>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>> Based on my understanding sharing BOs between different processes
>>>>>>>>>> could introduce additional synchronization constrains. btw: I
>>>>>>>>>> am not
>>>>>>>>>> sure
>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>> boundary.
>>>>>>>>>
>>>>>>>>> They are different processes; it is important for the
>>>>>>>>> compositor that
>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>> consistently
>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>> reprojection, etc,
>>>>>>>>> to be separate from the main application.
>>>>>>>>>
>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>> semaphore
>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>> application,
>>>>>>>>> but the just-in-time reprojection discussed here does not actually
>>>>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>> up-to-date
>>>>>>>>> eye images that have already been sent by the client application,
>>>>>>>>> which are already available to use without additional
>>>>>>>>> synchronization.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>> remove this
>>>>>>>>>>> overhead)
>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>
>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>> headset
>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>> compute
>>>>>>>>>> usage).
>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU
>>>>>>>>>> intensive
>>>>>>>>>> (at least
>>>>>>>>>> in the default configuration).
>>>>>>>>>
>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>> However, if
>>>>>>>>> there's high degrees of variance then that would be troublesome
>>>>>>>>> and we
>>>>>>>>> would need to account for the worst case.
>>>>>>>>>
>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>> sense, we're
>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>  - Pierre-Loup
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hey Serguei,
>>>>>>>>>>
>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>> allocation
>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>> dynamical partition
>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>> resource
>>>>>>>>>>> conflict
>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>
>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start
>>>>>>>>>> with a
>>>>>>>>>> solution that assumes that
>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>>>>>>>> running on the system).
>>>>>>>>>>
>>>>>>>>>> This should be more or less the use case we expect from VR users.
>>>>>>>>>>
>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>> consider
>>>>>>>>>> that a separate task, because
>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>
>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>>>>> will not be
>>>>>>>>>>> involved.  I would assume that in the case of VR we will have
>>>>>>>>>>> one main
>>>>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>
>>>>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>>>>> queue through
>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>
>>>>>>>>>> For current VR workloads we have 3 separate processes running
>>>>>>>>>> actually:
>>>>>>>>>>     1) Game process
>>>>>>>>>>     2) VR Compositor (this is the process that will require high
>>>>>>>>>> priority queue)
>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>> remove this
>>>>>>>>>> overhead)
>>>>>>>>>>
>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>> simultaneously, but
>>>>>>>>>> I would also like to be able to address this case in the future
>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>
>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>> (a) it
>>>>>>>>>>> may take time so
>>>>>>>>>>> latency may suffer
>>>>>>>>>>
>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>> predictable. A good
>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>> can be
>>>>>>>>>> found here:
>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>> executed
>>>>>>>>>>> in order.
>>>>>>>>>>
>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>> dependencies on
>>>>>>>>>> the game context, and it
>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>
>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>>>>> "preempt" and
>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>
>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>
>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics as
>>>>>>>>>>> well as
>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>
>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure out
>>>>>>>>>> a way
>>>>>>>>>> for us to get
>>>>>>>>>> a guaranteed execution time using vulkan graphics, then I'll
>>>>>>>>>> take you
>>>>>>>>>> out for a beer :)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Andres,
>>>>>>>>>>
>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Serguei,
>>>>>>>>>>
>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>>
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Andres,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quick comments:
>>>>>>>>>>
>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>> assignments/binding
>>>>>>>>>> to high-priority queue  when it will be in use and "free" them
>>>>>>>>>> later
>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to
>>>>>>>>>> degrade
>>>>>>>>>> graphics
>>>>>>>>>> performance).
>>>>>>>>>>
>>>>>>>>>> Otherwise we could have a scenario where a long graphics task
>>>>>>>>>> (or low-priority
>>>>>>>>>> compute) takes all (extra) CUs and high-priority work will
>>>>>>>>>> wait for
>>>>>>>>>> needed resources.
>>>>>>>>>> It will not be visible with "NOP" but only when you submit a "real"
>>>>>>>>>> compute task
>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for
>>>>>>>>>> testing.
>>>>>>>>>>
>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>> everything is
>>>>>>>>>> going via kernel
>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I am
>>>>>>>>>> not sure
>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>
>>>>>>>>>> [AR] I wasn't aware of this part of the programming sequence.
>>>>>>>>>> Thanks
>>>>>>>>>> for the heads up!
>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" when
>>>>>>>>>> deciding which
>>>>>>>>>> queue to run will check if there are enough resources and if
>>>>>>>>>> not then
>>>>>>>>>> it will begin
>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>
>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>> high-priority queue and having
>>>>>>>>>> nothing there except it.
>>>>>>>>>>
>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as
>>>>>>>>>> opposed
>>>>>>>>>> to the MEC definition
>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>> amdgpu
>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>
>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>> understand (by simplifying)
>>>>>>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>>>>>>> scheme but I do not think
>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>> dynamical partition
>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>> resource
>>>>>>>>>> conflict
>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>> Vulkan or
>>>>>>>>>> OpenCL?
>>>>>>>>>>
>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>
>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>> amdkfd will
>>>>>>>>>> not be
>>>>>>>>>> involved.  I would assume that in the case of VR we will have
>>>>>>>>>> one main
>>>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>
>>>>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>>>>> workloads.
>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>
>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>>>>>>>> running
>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>> where it
>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>> solution for
>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>> work uses the vulkan graphics stack at the moment, and porting
>>>>>>>>>> it to
>>>>>>>>>> compute is not trivial).
>>>>>>>>>>
>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: (a)
>>>>>>>>>> it may
>>>>>>>>>> take time so
>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>> "context"
>>>>>>>>>> - we want
>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>> executed
>>>>>>>>>> in order.
>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>>>>>> "preempt" and
>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
>>>>>>>>>> behalf of
>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>
>>>>>>>>>> We are interested in feedback for a mechanism to effectively
>>>>>>>>>> schedule
>>>>>>>>>> high
>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>> time-warping) for
>>>>>>>>>> Polaris10
>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>
>>>>>>>>>> Brief context:
>>>>>>>>>> --------------
>>>>>>>>>>
>>>>>>>>>> The main objective of reprojection is to avoid motion sickness
>>>>>>>>>> for VR
>>>>>>>>>> users in
>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>> rendering a new
>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>> user's head
>>>>>>>>>> movements
>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>> duration
>>>>>>>>>> of an
>>>>>>>>>> extra frame. This extended mismatch between the inner ear and the
>>>>>>>>>> eyes may
>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>
>>>>>>>>>> The VR compositor deals with this problem by fabricating a new
>>>>>>>>>> frame
>>>>>>>>>> using the
>>>>>>>>>> user's updated head position in combination with the previous
>>>>>>>>>> frames.
>>>>>>>>>> This
>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>> inner ear.
>>>>>>>>>>
>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>> confidence that the
>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>> Even if
>>>>>>>>>> the GFX pipe
>>>>>>>>>> is currently full of work from the game/application (which is
>>>>>>>>>> most
>>>>>>>>>> likely the case).
>>>>>>>>>>
>>>>>>>>>> For more details and illustrations, please refer to the following
>>>>>>>>>> document:
>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>
>>>>>>>>>> Requirements:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>
>>>>>>>>>>     * Job round trip time must be predictable, from submission to
>>>>>>>>>> fence signal
>>>>>>>>>>
>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>
>>>>>>>>>> Goals:
>>>>>>>>>> ------
>>>>>>>>>>
>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>
>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>> hardware
>>>>>>>>>> should
>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>
>>>>>>>>>> Nice to have:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>
>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>> capabilities in
>>>>>>>>>> Polaris10 we
>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>> workloads.
>>>>>>>>>>
>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>> approach or
>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>> please let
>>>>>>>>>> us know
>>>>>>>>>> about it.
>>>>>>>>>>
>>>>>>>>>>     * The above guarantees should also be respected by amdkfd
>>>>>>>>>> workloads
>>>>>>>>>>
>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>> necessary as
>>>>>>>>>> users running
>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>> background.
>>>>>>>>>>
>>>>>>>>>> Proposed approach:
>>>>>>>>>> ------------------
>>>>>>>>>>
>>>>>>>>>> Similar to the windows driver, we could expose a high priority
>>>>>>>>>> compute queue to
>>>>>>>>>> userspace.
>>>>>>>>>>
>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>> priority, and may
>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>
>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>> field in
>>>>>>>>>> the HQDs
>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. The
>>>>>>>>>> relevant
>>>>>>>>>> register fields are:
>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>
>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>> pipe0. We can
>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>         * 7x regular
>>>>>>>>>>         * 1x high priority
>>>>>>>>>>
>>>>>>>>>> The relevant priorities can be set so that submissions to the
>>>>>>>>>> high
>>>>>>>>>> priority
>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>
>>>>>>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>>>>>>> rings if the
>>>>>>>>>> context is marked as high priority. And a corresponding priority
>>>>>>>>>> should be
>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>
>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>> appropriate flag
>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>>>>>>>> context
>>>>>>>>>>     * Create high priority and non-high priority contexts in
>>>>>>>>>> the same
>>>>>>>>>> process
>>>>>>>>>>
>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>> priorities at
>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>>>>>>> priorities
>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>
>>>>>>>>>> This would involve having a hardware specific callback from the
>>>>>>>>>> scheduler to
>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, int
>>>>>>>>>> index,
>>>>>>>>>> int priority)
>>>>>>>>>>
>>>>>>>>>> During this callback we would have to grab the SRBM mutex to
>>>>>>>>>> perform
>>>>>>>>>> the appropriate
>>>>>>>>>> HW programming, and I'm not really sure if that is something we
>>>>>>>>>> should be doing from
>>>>>>>>>> the scheduler.
>>>>>>>>>>
>>>>>>>>>> On the positive side, this approach would allow us to program
>>>>>>>>>> a range of
>>>>>>>>>> priorities for jobs instead of a single "high priority" value,
>>>>>>>>>> achieving
>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>> scheduling.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>> need for
>>>>>>>>>> our use
>>>>>>>>>> case, but it might be useful in other scenarios (multiple users
>>>>>>>>>> sharing compute
>>>>>>>>>> time on a server).
>>>>>>>>>>
>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>> repurposing
>>>>>>>>>> of the flags field.
>>>>>>>>>>
>>>>>>>>>> Known current obstacles:
>>>>>>>>>> ------------------------
>>>>>>>>>>
>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>> priorities, and
>>>>>>>>>> instead it picks
>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>> disregarded
>>>>>>>>>> as this is
>>>>>>>>>> considered a privileged field.
>>>>>>>>>>
>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>> but we
>>>>>>>>>> might not get the
>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>
>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>> priority
>>>>>>>>>> propagation
>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>
>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>> enabled
>>>>>>>>>> for all HW IPs
>>>>>>>>>> with support of the SW scheduler. This will function similarly
>>>>>>>>>> to the
>>>>>>>>>> current
>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>> ahead of
>>>>>>>>>> anything not
>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>
>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>> non-compute
>>>>>>>>>> queue will
>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>> stuck in
>>>>>>>>>> front of
>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>> improve the
>>>>>>>>>> implementation
>>>>>>>>>> in the future as new features become available in new hardware.
>>>>>>>>>>
>>>>>>>>>> Future steps:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>> implementation.
>>>>>>>>>>
>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>> thinking about
>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>
>>>>>>>>>> Request for feedback:
>>>>>>>>>> ---------------------
>>>>>>>>>>
>>>>>>>>>> We aren't married to any of the approaches outlined above. Our
>>>>>>>>>> goal
>>>>>>>>>> is to
>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>> reprojection
>>>>>>>>>> job within a
>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>> suggestions for
>>>>>>>>>> improvements or alternative strategies we are more than happy
>>>>>>>>>> to hear
>>>>>>>>>> them.
>>>>>>>>>>
>>>>>>>>>> If any of the technical information above is also incorrect, feel
>>>>>>>>>> free to point
>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>
>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                         ` <afc51505-7f86-a963-5d3a-be9df538019e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-12-20 17:20                                                           ` Pierre-Loup A. Griffais
@ 2016-12-22 11:42                                                           ` Christian König
       [not found]                                                             ` <76892a0d-677b-f0cb-d4e7-74d29b4a0aa7-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 36+ messages in thread
From: Christian König @ 2016-12-22 11:42 UTC (permalink / raw)
  To: Andres Rodriguez, Serguei Sagalovitch, zhoucm1,
	Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Hi Andres,

well, using root might cause stability and security problems as well. We
worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master, 
but for example with X the X server is master (and e.g. can change 
resolutions etc..) and not the compositor.
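
To illustrate, the check itself would be simple; this is a rough sketch
only, the function name is made up here, while drm_is_current_master()
is the existing DRM helper for exactly this question:

    #include <drm/drmP.h>  /* drm_is_current_master(), struct drm_file */

    /* Sketch: only the current DRM master may create a high
     * priority context. */
    static int high_prio_ctx_permit(struct drm_file *filp)
    {
            if (!drm_is_current_master(filp))
                    return -EPERM;

            return 0;
    }

But as said, with X this would admit the X server itself rather than
your compositor.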

So another question is also what windowing system (if any) are you 
planning to use? X, Wayland, Flinger or something completely different?

Regards,
Christian.

Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
> Hi Christian,
>
> That is definitely a concern. What we are currently thinking is to 
> make the high priority queues accessible to root only.
>
> Therefore if a non-root user attempts to set the high priority flag on
> context allocation, we would fail the call and return EPERM.
>
> Regards,
> Andres
>
>
> On 12/20/2016 7:56 AM, Christian König wrote:
>>> BTW: If there is a non-VR application which will use the high-priority
>>> h/w queue then the VR application will suffer. Any ideas how
>>> to solve it?
>> Yeah, that problem came to my mind as well.
>>
>> Basically we need to restrict those high priority submissions to the 
>> VR compositor or otherwise any malfunctioning application could use it.
>>
>> Just think about some WebGL suddenly taking all our rendering away 
>> and we won't get anything drawn any more.
>>
>> Alex or Michel any ideas on that?
>>
>> Regards,
>> Christian.
>>
>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>> > If the compute queue is occupied only by you, the efficiency
>>> > is equal to setting the job queue to high priority, I think.
>>> The only risk is the situation when graphics will take all
>>> needed CUs. But in any case it should be very good test.
>>>
>>> Andres/Pierre-Loup,
>>>
>>> Did you try to do it, or is it a lot of work for you?
>>>
>>>
>>> BTW: If there is a non-VR application which will use the high-priority
>>> h/w queue then the VR application will suffer. Any ideas how
>>> to solve it?
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>> Do you encounter the priority issue for the compute queue with the
>>>> current driver?
>>>>
>>>> If the compute queue is occupied only by you, the efficiency is equal
>>>> to setting the job queue to high priority, I think.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>
>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>
>>>>> I'm not sure if I'm asking for too much, but if we can coordinate 
>>>>> a similar interface in radv and amdgpu-pro at the vulkan level 
>>>>> that would be great.
>>>>>
>>>>> I'm not sure what that's going to be yet.
>>>>>
>>>>> - Andres
>>>>>
>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>
>>>>>>
>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>> We're currently working with the open stack; I assume that a 
>>>>>>> mechanism could be exposed by both open and Pro Vulkan userspace 
>>>>>>> drivers and that the amdgpu kernel interface improvements we 
>>>>>>> would pursue following this discussion would let both drivers 
>>>>>>> take advantage of the feature, correct?
>>>>>> Of course.
>>>>>> Does open stack have Vulkan support?
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>>
>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>
>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> David Zhou
>>>>>>>>
>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>> Hi Serguei,
>>>>>>>>>
>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>> amdgpu;
>>>>>>>>> see replies inline.
>>>>>>>>>
>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>> Andres,
>>>>>>>>>>
>>>>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>>>>> actually:
>>>>>>>>>> So we could have a potential memory overcommit case, or do you
>>>>>>>>>> do partitioning on your own? I would think that there is a need
>>>>>>>>>> to avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>>
>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're 
>>>>>>>>> working on
>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this thread), 
>>>>>>>>> and in
>>>>>>>>> the future it will make sense to do work in order to make sure 
>>>>>>>>> that
>>>>>>>>> its memory allocations do not get evicted, to prevent any 
>>>>>>>>> unwelcome
>>>>>>>>> additional latency in the event of needing to perform 
>>>>>>>>> just-in-time
>>>>>>>>> reprojection.
>>>>>>>>>
>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>> constraints. BTW: I am not sure if we are able to share Vulkan
>>>>>>>>>> sync objects across the process boundary.
>>>>>>>>>
>>>>>>>>> They are different processes; it is important for the 
>>>>>>>>> compositor that
>>>>>>>>> is responsible for quality-of-service features such as 
>>>>>>>>> consistently
>>>>>>>>> presenting distorted frames with the right latency, 
>>>>>>>>> reprojection, etc,
>>>>>>>>> to be separate from the main application.
>>>>>>>>>
>>>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>>>> semaphore
>>>>>>>>> extensions to fetch updated eye images from the client 
>>>>>>>>> application,
>>>>>>>>> but the just-in-time reprojection discussed here does not 
>>>>>>>>> actually
>>>>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>>>> up-to-date
>>>>>>>>> eye images that have already been sent by the client application,
>>>>>>>>> which are already available to use without additional 
>>>>>>>>> synchronization.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>>>> remove this
>>>>>>>>>>> overhead)
>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>
>>>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>>>> headset
>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>>>> compute
>>>>>>>>>> usage).
>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU 
>>>>>>>>>> intensive
>>>>>>>>>> (at least
>>>>>>>>>> in the default configuration).
>>>>>>>>>
>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>> However, if
>>>>>>>>> there's high degrees of variance then that would be 
>>>>>>>>> troublesome and we
>>>>>>>>> would need to account for the worst case.
>>>>>>>>>
>>>>>>>>> Hopefully the requirements and approach we described make 
>>>>>>>>> sense, we're
>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>  - Pierre-Loup
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hey Serguei,
>>>>>>>>>>
>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>>> allocation
>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>> dynamical partition
>>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>>> resource
>>>>>>>>>>> conflict
>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>
>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start 
>>>>>>>>>> with a
>>>>>>>>>> solution that assumes that
>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no 
>>>>>>>>>> HSA/ROCm
>>>>>>>>>> running on the system).
>>>>>>>>>>
>>>>>>>>>> This should be more or less the use case we expect from VR 
>>>>>>>>>> users.
>>>>>>>>>>
>>>>>>>>>> I agree the split is currently not ideal, but I'd like to 
>>>>>>>>>> consider
>>>>>>>>>> that a separate task, because
>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>
>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>>> amdkfd
>>>>>>>>>>> will not be
>>>>>>>>>>> involved.  I would assume that in the case of VR we will 
>>>>>>>>>>> have one main
>>>>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>
>>>>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>>>>> queue through
>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>
>>>>>>>>>> For current VR workloads we have 3 separate processes running 
>>>>>>>>>> actually:
>>>>>>>>>>     1) Game process
>>>>>>>>>>     2) VR Compositor (this is the process that will require high
>>>>>>>>>> priority queue)
>>>>>>>>>>     3) System compositor (we are looking at approaches to 
>>>>>>>>>> remove this
>>>>>>>>>> overhead)
>>>>>>>>>>
>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>> simultaneously, but
>>>>>>>>>> I would also like to be able to address this case in the future
>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>
>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:  
>>>>>>>>>>> (a) it
>>>>>>>>>>> may take time so
>>>>>>>>>>> latency may suffer
>>>>>>>>>>
>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>> predictable. A good
>>>>>>>>>> illustration of what the reprojection scheduling looks like 
>>>>>>>>>> can be
>>>>>>>>>> found here:
>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>>> executed
>>>>>>>>>>> in order.
>>>>>>>>>>
>>>>>>>>>> This is okay, as the reprojection work doesn't have 
>>>>>>>>>> dependencies on
>>>>>>>>>> the game context, and it
>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>
>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>>>>> "preempt" and
>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>
>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>
>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics as 
>>>>>>>>>>> well as
>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>
>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure 
>>>>>>>>>> out a way
>>>>>>>>>> for us to get
>>>>>>>>>> a guaranteed execution time using vulkan graphics, then I'll 
>>>>>>>>>> take you
>>>>>>>>>> out for a beer :)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Andres,
>>>>>>>>>>
>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Serguei,
>>>>>>>>>>
>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>>
>>>>>>>>>> ________________________________________
>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>> amdgpu
>>>>>>>>>>
>>>>>>>>>> Andres,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Quick comments:
>>>>>>>>>>
>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU 
>>>>>>>>>> assignments/binding
>>>>>>>>>> to high-priority queue  when it will be in use and "free" 
>>>>>>>>>> them later
>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to 
>>>>>>>>>> degrade
>>>>>>>>>> graphics
>>>>>>>>>> performance).
>>>>>>>>>>
>>>>>>>>>> Otherwise we could have a scenario where a long graphics task
>>>>>>>>>> (or low-priority
>>>>>>>>>> compute) takes all (extra) CUs and high-priority work will
>>>>>>>>>> wait for
>>>>>>>>>> needed resources.
>>>>>>>>>> It will not be visible with "NOP" but only when you submit a "real"
>>>>>>>>>> compute task
>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for 
>>>>>>>>>> testing.
>>>>>>>>>>
>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>> everything is
>>>>>>>>>> going via kernel
>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I am 
>>>>>>>>>> not sure
>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>
>>>>>>>>>> [AR] I wasn't aware of this part of the programming sequence. 
>>>>>>>>>> Thanks
>>>>>>>>>> for the heads up!
>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" when
>>>>>>>>>> deciding which
>>>>>>>>>> queue to run will check if there are enough resources and if
>>>>>>>>>> not then
>>>>>>>>>> it will begin
>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>
>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>> high-priority queue and having
>>>>>>>>>> nothing there except it.
>>>>>>>>>>
>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as 
>>>>>>>>>> opposed
>>>>>>>>>> to the MEC definition
>>>>>>>>>> of pipe, which is a grouping of queues). I say this because 
>>>>>>>>>> amdgpu
>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>
>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>> understand (by simplifying)
>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>> allocation
>>>>>>>>>> scheme but I do not think
>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>> dynamical partition
>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>> resource
>>>>>>>>>> conflict
>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BTW: Which user level API do you want to use for compute: 
>>>>>>>>>> Vulkan or
>>>>>>>>>> OpenCL?
>>>>>>>>>>
>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>
>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>> amdkfd will
>>>>>>>>>> not be
>>>>>>>>>> involved.  I would assume that in the case of VR we will have 
>>>>>>>>>> one main
>>>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>
>>>>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>>>>> workloads.
>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>
>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently 
>>>>>>>>>> running
>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>> something else using mid-buffer pre-emption has some cases 
>>>>>>>>>> where it
>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>> polaris10 it starts working well, it might be a better 
>>>>>>>>>> solution for
>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>> work uses the vulkan graphics stack at the moment, and 
>>>>>>>>>> porting it to
>>>>>>>>>> compute is not trivial).
>>>>>>>>>>
>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: (a) 
>>>>>>>>>> it may
>>>>>>>>>> take time so
>>>>>>>>>> latency may suffer (b) to preempt we need to have different 
>>>>>>>>>> "context"
>>>>>>>>>> - we want
>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>> executed
>>>>>>>>>> in order.
>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>>>>>> "preempt" and
>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>> for graphics as well as for plain compute tasks 
>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on 
>>>>>>>>>> behalf of
>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>
>>>>>>>>>> We are interested in feedback for a mechanism to effectively 
>>>>>>>>>> schedule
>>>>>>>>>> high
>>>>>>>>>> priority VR reprojection tasks (also referred to as 
>>>>>>>>>> time-warping) for
>>>>>>>>>> Polaris10
>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>
>>>>>>>>>> Brief context:
>>>>>>>>>> --------------
>>>>>>>>>>
>>>>>>>>>> The main objective of reprojection is to avoid motion 
>>>>>>>>>> sickness for VR
>>>>>>>>>> users in
>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>> rendering a new
>>>>>>>>>> frame in time for the next VBLANK. When this happens, the 
>>>>>>>>>> user's head
>>>>>>>>>> movements
>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the 
>>>>>>>>>> duration
>>>>>>>>>> of an
>>>>>>>>>> extra frame. This extended mismatch between the inner ear and 
>>>>>>>>>> the
>>>>>>>>>> eyes may
>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>
>>>>>>>>>> The VR compositor deals with this problem by fabricating a 
>>>>>>>>>> new frame
>>>>>>>>>> using the
>>>>>>>>>> user's updated head position in combination with the previous 
>>>>>>>>>> frames.
>>>>>>>>>> This
>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the 
>>>>>>>>>> inner ear.
>>>>>>>>>>
>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>> confidence that the
>>>>>>>>>> reprojection task will complete before the VBLANK interval. 
>>>>>>>>>> Even if
>>>>>>>>>> the GFX pipe
>>>>>>>>>> is currently full of work from the game/application (which is 
>>>>>>>>>> most
>>>>>>>>>> likely the case).
>>>>>>>>>>
>>>>>>>>>> For more details and illustrations, please refer to the 
>>>>>>>>>> following
>>>>>>>>>> document:
>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>
>>>>>>>>>> Requirements:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>
>>>>>>>>>>     * Job round trip time must be predictable, from 
>>>>>>>>>> submission to
>>>>>>>>>> fence signal
>>>>>>>>>>
>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>
>>>>>>>>>> Goals:
>>>>>>>>>> ------
>>>>>>>>>>
>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>
>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy 
>>>>>>>>>> hardware
>>>>>>>>>> should
>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>
>>>>>>>>>> Nice to have:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>
>>>>>>>>>> My understanding is that with the current hardware 
>>>>>>>>>> capabilities in
>>>>>>>>>> Polaris10 we
>>>>>>>>>> will not be able to provide a solution compatible with GFX 
>>>>>>>>>> workloads.
>>>>>>>>>>
>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>> approach or
>>>>>>>>>> suggestion that will also be compatible with the GFX ring, 
>>>>>>>>>> please let
>>>>>>>>>> us know
>>>>>>>>>> about it.
>>>>>>>>>>
>>>>>>>>>>     * The above guarantees should also be respected by amdkfd 
>>>>>>>>>> workloads
>>>>>>>>>>
>>>>>>>>>> Would be good to have for consistency, but not strictly 
>>>>>>>>>> necessary as
>>>>>>>>>> users running
>>>>>>>>>> games are not traditionally running HPC workloads in the 
>>>>>>>>>> background.
>>>>>>>>>>
>>>>>>>>>> Proposed approach:
>>>>>>>>>> ------------------
>>>>>>>>>>
>>>>>>>>>> Similar to the windows driver, we could expose a high priority
>>>>>>>>>> compute queue to
>>>>>>>>>> userspace.
>>>>>>>>>>
>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>> priority, and may
>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>
>>>>>>>>>> This can be achieved by taking advantage of the 'priority' 
>>>>>>>>>> field in
>>>>>>>>>> the HQDs
>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. 
>>>>>>>>>> The relevant
>>>>>>>>>> register fields are:
>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>
>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from 
>>>>>>>>>> pipe0. We can
>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>         * 7x regular
>>>>>>>>>>         * 1x high priority
>>>>>>>>>>
>>>>>>>>>> The relevant priorities can be set so that submissions to the 
>>>>>>>>>> high
>>>>>>>>>> priority
>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>
>>>>>>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>>>>>>> rings if the
>>>>>>>>>> context is marked as high priority. And a corresponding priority
>>>>>>>>>> should be
>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>
>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>> appropriate flag
>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>     * Maintain a consistent FIFO ordering of all submissions 
>>>>>>>>>> to a
>>>>>>>>>> context
>>>>>>>>>>     * Create high priority and non-high priority contexts in 
>>>>>>>>>> the same
>>>>>>>>>> process
>>>>>>>>>>
>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Similar to the above, but instead of programming the 
>>>>>>>>>> priorities at
>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue 
>>>>>>>>>> priorities
>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>
>>>>>>>>>> This would involve having a hardware specific callback from the
>>>>>>>>>> scheduler to
>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, 
>>>>>>>>>> int index,
>>>>>>>>>> int priority)
>>>>>>>>>>
>>>>>>>>>> During this callback we would have to grab the SRBM mutex to 
>>>>>>>>>> perform
>>>>>>>>>> the appropriate
>>>>>>>>>> HW programming, and I'm not really sure if that is something we
>>>>>>>>>> should be doing from
>>>>>>>>>> the scheduler.
>>>>>>>>>>
>>>>>>>>>> On the positive side, this approach would allow us to program 
>>>>>>>>>> a range of
>>>>>>>>>> priorities for jobs instead of a single "high priority" value,
>>>>>>>>>> achieving
>>>>>>>>>> something similar to the niceness API available for CPU 
>>>>>>>>>> scheduling.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if this flexibility is something that we would 
>>>>>>>>>> need for
>>>>>>>>>> our use
>>>>>>>>>> case, but it might be useful in other scenarios (multiple users
>>>>>>>>>> sharing compute
>>>>>>>>>> time on a server).
>>>>>>>>>>
>>>>>>>>>> This approach would require a new int field in 
>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>> repurposing
>>>>>>>>>> of the flags field.
>>>>>>>>>>
>>>>>>>>>> Known current obstacles:
>>>>>>>>>> ------------------------
>>>>>>>>>>
>>>>>>>>>> The SQ is currently programmed to disregard the HQD 
>>>>>>>>>> priorities, and
>>>>>>>>>> instead it picks
>>>>>>>>>> jobs at random. Settings from the shader itself are also 
>>>>>>>>>> disregarded
>>>>>>>>>> as this is
>>>>>>>>>> considered a privileged field.
>>>>>>>>>>
>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, 
>>>>>>>>>> but we
>>>>>>>>>> might not get the
>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>
>>>>>>>>>> The current programming would have to be changed to allow 
>>>>>>>>>> priority
>>>>>>>>>> propagation
>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>
>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>> For consistency purposes, the high priority context can be 
>>>>>>>>>> enabled
>>>>>>>>>> for all HW IPs
>>>>>>>>>> with support of the SW scheduler. This will function 
>>>>>>>>>> similarly to the
>>>>>>>>>> current
>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump 
>>>>>>>>>> ahead of
>>>>>>>>>> anything not
>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>
>>>>>>>>>> The benefits of requesting a high priority context for a 
>>>>>>>>>> non-compute
>>>>>>>>>> queue will
>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is 
>>>>>>>>>> stuck in
>>>>>>>>>> front of
>>>>>>>>>> you), but having the API in place will allow us to easily 
>>>>>>>>>> improve the
>>>>>>>>>> implementation
>>>>>>>>>> in the future as new features become available in new hardware.
>>>>>>>>>>
>>>>>>>>>> Future steps:
>>>>>>>>>> -------------
>>>>>>>>>>
>>>>>>>>>> Once we have an approach settled, I can take care of the 
>>>>>>>>>> implementation.
>>>>>>>>>>
>>>>>>>>>> Also, once the interface is mostly decided, we can start 
>>>>>>>>>> thinking about
>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>
>>>>>>>>>> Request for feedback:
>>>>>>>>>> ---------------------
>>>>>>>>>>
>>>>>>>>>> We aren't married to any of the approaches outlined above. 
>>>>>>>>>> Our goal
>>>>>>>>>> is to
>>>>>>>>>> obtain a mechanism that will allow us to complete the 
>>>>>>>>>> reprojection
>>>>>>>>>> job within a
>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>> suggestions for
>>>>>>>>>> improvements or alternative strategies we are more than happy 
>>>>>>>>>> to hear
>>>>>>>>>> them.
>>>>>>>>>>
>>>>>>>>>> If any of the technical information above is also incorrect, 
>>>>>>>>>> feel
>>>>>>>>>> free to point
>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>
>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Andres
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                             ` <76892a0d-677b-f0cb-d4e7-74d29b4a0aa7-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-22 16:35                                                               ` Andres Rodriguez
       [not found]                                                                 ` <8ab5bb4d-f331-d991-f208-ec7c0a25662a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-22 16:35 UTC (permalink / raw)
  To: Christian König, Serguei Sagalovitch, zhoucm1,
	Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Hey Christian,

We are currently interested in X, but with some distros switching to 
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor in root isn't something that we 
want to do. Too many security concerns. Having a small root helper that 
does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing 
with the "two compositors" scenario a little better in DRM+X. Fullscreen 
isn't really a sufficient approach, since we don't want the HMD to be 
used as part of the Desktop environment when a VR app is not in use 
(this is extremely annoying).

When the above is settled, we should have an auth mechanism besides 
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the HMD 
permanently away from X. Re-using that auth method to gate this IOCTL is 
probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect 
kernel IOCTL compatibility guidelines with the most flexibility. Going 
from a restrictive to a more flexible permission model would be 
inclusive, but going from a general to a restrictive model may exclude 
some apps that used to work.
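
To make that concrete, here is a minimal sketch of the gate I have in mind
(the helper name and error code are placeholders, not final):

    /* amdgpu_ctx.c: reject high priority requests from unprivileged users */
    static int amdgpu_ctx_priority_permit(enum amd_sched_priority priority)
    {
            /* NORMAL and below remain available to everyone */
            if (priority != AMD_SCHED_PRIORITY_HIGH)
                    return 0;

            /* ROOT_ONLY: privileged processes only, for now */
            if (capable(CAP_SYS_ADMIN))
                    return 0;

            return -EACCES;
    }

Context allocation would call this and fail the ioctl on a non-zero return.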

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:
> Hi Andres,
>
> well using root might cause stability and security problems as well. 
> We worked quite hard to avoid exactly this for X.
>
> We could make this feature depend on the compositor being DRM master, 
> but for example with X the X server is master (and e.g. can change 
> resolutions etc..) and not the compositor.
>
> So another question is also what windowing system (if any) are you 
> planning to use? X, Wayland, Flinger or something completely different?
>
> Regards,
> Christian.
>
> On 2016-12-20 16:51, Andres Rodriguez wrote:
>> Hi Christian,
>>
>> That is definitely a concern. What we are currently thinking is to 
>> make the high priority queues accessible to root only.
>>
>> Therefore if a non-root user attempts to set the high priority flag 
>> on context allocation, we would fail the call and return ENOPERM.
>>
>> Regards,
>> Andres
>>
>>
>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>> BTW: If there is  non-VR application which will use high-priority
>>>> h/w queue then VR application will suffer.  Any ideas how
>>>> to solve it?
>>> Yeah, that problem came to my mind as well.
>>>
>>> Basically we need to restrict those high priority submissions to the 
>>> VR compositor or otherwise any malfunctioning application could use it.
>>>
>>> Just think about some WebGL suddenly taking all our rendering away 
>>> and we won't get anything drawn any more.
>>>
>>> Alex or Michel any ideas on that?
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 2016-12-19 15:48, Serguei Sagalovitch wrote:
>>>> > If compute queue is occupied only by you, the efficiency
>>>> > is equal with setting job queue to high priority I think.
>>>> The only risk is the situation when graphics will take all
>>>> needed CUs. But in any case it should be very good test.
>>>>
>>>> Andres/Pierre-Loup,
>>>>
>>>> Did you try to do it, or is it a lot of work for you?
>>>>
>>>>
>>>> BTW: If there is  non-VR application which will use high-priority
>>>> h/w queue then VR application will suffer.  Any ideas how
>>>> to solve it?
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>> Do you encounter the priority issue for compute queue with current 
>>>>> driver?
>>>>>
>>>>> If compute queue is occupied only by you, the efficiency is equal 
>>>>> with setting job queue to high priority I think.
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>
>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>
>>>>>> I'm not sure if I'm asking for too much, but if we can coordinate 
>>>>>> a similar interface in radv and amdgpu-pro at the vulkan level 
>>>>>> that would be great.
>>>>>>
>>>>>> I'm not sure what that's going to be yet.
>>>>>>
>>>>>> - Andres
>>>>>>
>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>> We're currently working with the open stack; I assume that a 
>>>>>>>> mechanism could be exposed by both open and Pro Vulkan 
>>>>>>>> userspace drivers and that the amdgpu kernel interface 
>>>>>>>> improvements we would pursue following this discussion would 
>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>> Of course.
>>>>>>> Does open stack have Vulkan support?
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>>
>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>>
>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>> Hi Serguei,
>>>>>>>>>>
>>>>>>>>>> I'm also working on bringing up our VR runtime on top of 
>>>>>>>>>> amdgpu;
>>>>>>>>>> see replies inline.
>>>>>>>>>>
>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>> Andres,
>>>>>>>>>>>
>>>>>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>>>>>> actually:
>>>>>>>>>>> So we could have potential memory overcommit case or do you do
>>>>>>>>>>> partitioning
>>>>>>>>>>> on your own?  I would think that there is need to avoid 
>>>>>>>>>>> overcommit in
>>>>>>>>>>> VR case to
>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>
>>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're 
>>>>>>>>>> working on
>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this thread), 
>>>>>>>>>> and in
>>>>>>>>>> the future it will make sense to do work in order to make 
>>>>>>>>>> sure that
>>>>>>>>>> its memory allocations do not get evicted, to prevent any 
>>>>>>>>>> unwelcome
>>>>>>>>>> additional latency in the event of needing to perform 
>>>>>>>>>> just-in-time
>>>>>>>>>> reprojection.
>>>>>>>>>>
>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>> Based on my understanding sharing BOs between different 
>>>>>>>>>>> processes
>>>>>>>>>>> could introduce additional synchronization constraints. btw: 
>>>>>>>>>>> I am not
>>>>>>>>>>> sure
>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process 
>>>>>>>>>>> boundary.
>>>>>>>>>>
>>>>>>>>>> They are different processes; it is important for the 
>>>>>>>>>> compositor that
>>>>>>>>>> is responsible for quality-of-service features such as 
>>>>>>>>>> consistently
>>>>>>>>>> presenting distorted frames with the right latency, 
>>>>>>>>>> reprojection, etc,
>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>
>>>>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>>>>> semaphore
>>>>>>>>>> extensions to fetch updated eye images from the client 
>>>>>>>>>> application,
>>>>>>>>>> but the just-in-time reprojection discussed here does not 
>>>>>>>>>> actually
>>>>>>>>>> have any direct interactions with cross-process resource 
>>>>>>>>>> sharing,
>>>>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>>>>> up-to-date
>>>>>>>>>> eye images that have already been sent by the client 
>>>>>>>>>> application,
>>>>>>>>>> which are already available to use without additional 
>>>>>>>>>> synchronization.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>>>>> remove this
>>>>>>>>>>>> overhead)
>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>
>>>>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>>>>> headset
>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>>>>> compute
>>>>>>>>>>> usage).
>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU 
>>>>>>>>>>> intensive
>>>>>>>>>>> (at least
>>>>>>>>>>> in the default configuration).
>>>>>>>>>>
>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue. 
>>>>>>>>>> However, if
>>>>>>>>>> there's high degrees of variance then that would be 
>>>>>>>>>> troublesome and we
>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>
>>>>>>>>>> Hopefully the requirements and approach we described make 
>>>>>>>>>> sense, we're
>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>> amdgpu
>>>>>>>>>>>
>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>>>> allocation
>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>>>> resource
>>>>>>>>>>>> conflict
>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>
>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can 
>>>>>>>>>>> start with a
>>>>>>>>>>> solution that assumes that
>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no 
>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>> running on the system).
>>>>>>>>>>>
>>>>>>>>>>> This should be more or less the use case we expect from VR 
>>>>>>>>>>> users.
>>>>>>>>>>>
>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to 
>>>>>>>>>>> consider
>>>>>>>>>>> that a separate task, because
>>>>>>>>>>> making it dynamic is not straight forward :P
>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>>>> amdkfd
>>>>>>>>>>>> will be not
>>>>>>>>>>>> involved.  I would assume that in the case of VR we will 
>>>>>>>>>>>> have one main
>>>>>>>>>>> application ("console" mode(?)) so we could temporarily 
>>>>>>>>>>>> "ignore"
>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>
>>>>>>>>>>> Correct, this is why we want to enable the high priority 
>>>>>>>>>>> compute
>>>>>>>>>>> queue through
>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>
>>>>>>>>>>> For current VR workloads we have 3 separate processes 
>>>>>>>>>>> running actually:
>>>>>>>>>>>     1) Game process
>>>>>>>>>>>     2) VR Compositor (this is the process that will require 
>>>>>>>>>>> high
>>>>>>>>>>> priority queue)
>>>>>>>>>>>     3) System compositor (we are looking at approaches to 
>>>>>>>>>>> remove this
>>>>>>>>>>> overhead)
>>>>>>>>>>>
>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>> simultaneously, but
>>>>>>>>>>> I would also like to be able to address this case in the future
>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>
>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:  
>>>>>>>>>>>> (a) it
>>>>>>>>>>>> may take time so
>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>
>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>> predictable. A good
>>>>>>>>>>> illustration of what the reprojection scheduling looks like 
>>>>>>>>>>> can be
>>>>>>>>>>> found here:
>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>>>> executed
>>>>>>>>>>>> in order.
>>>>>>>>>>>
>>>>>>>>>>> This is okay, as the reprojection work doesn't have 
>>>>>>>>>>> dependencies on
>>>>>>>>>>> the game context, and it
>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>
>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>
>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>
>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics as 
>>>>>>>>>>>> well as
>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>
>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure 
>>>>>>>>>>> out a way
>>>>>>>>>>> for us to get
>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then I'll 
>>>>>>>>>>> take you
>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andres
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>> amdgpu
>>>>>>>>>>>
>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>
>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>
>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>> amdgpu
>>>>>>>>>>>
>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andres
>>>>>>>>>>>
>>>>>>>>>>> ________________________________________
>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>> amdgpu
>>>>>>>>>>>
>>>>>>>>>>> Andres,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Quick comments:
>>>>>>>>>>>
>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU 
>>>>>>>>>>> assignments/binding
>>>>>>>>>>> to high-priority queue  when it will be in use and "free" 
>>>>>>>>>>> them later
>>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to 
>>>>>>>>>>> degrade
>>>>>>>>>>> graphics
>>>>>>>>>>> performance).
>>>>>>>>>>>
>>>>>>>>>>> Otherwise we could have scenario when long graphics task (or
>>>>>>>>>>> low-priority
>>>>>>>>>>> compute) will take all (extra) CUs and high-priority will 
>>>>>>>>>>> wait for
>>>>>>>>>>> needed resources.
>>>>>>>>>>> It will not be visible on "NOP " but only when you submit 
>>>>>>>>>>> "real"
>>>>>>>>>>> compute task
>>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for 
>>>>>>>>>>> testing.
>>>>>>>>>>>
>>>>>>>>>>> It (CU assignment) could be relatively easy done when 
>>>>>>>>>>> everything is
>>>>>>>>>>> going via kernel
>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I 
>>>>>>>>>>> am not sure
>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>
>>>>>>>>>>> [AR] I wasn't aware of this part of the programming 
>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>> for the heads up!
>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" 
>>>>>>>>>>> when
>>>>>>>>>>> deciding which
>>>>>>>>>>> queue to  run will check if there is enough resources and if 
>>>>>>>>>>> not then
>>>>>>>>>>> it will begin
>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>
>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to 
>>>>>>>>>>> high-priority
>>>>>>>>>>> queue and have
>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>
>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as 
>>>>>>>>>>> opposed
>>>>>>>>>>> to the MEC definition
>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because 
>>>>>>>>>>> amdgpu
>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>
>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>>> allocation
>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>> dynamical partition
>>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>>> resource
>>>>>>>>>>> conflict
>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> BTW: Which user level API do you want to use for compute: 
>>>>>>>>>>> Vulkan or
>>>>>>>>>>> OpenCL?
>>>>>>>>>>>
>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>
>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>>> amdkfd will
>>>>>>>>>>> be not
>>>>>>>>>>> involved.  I would assume that in the case of VR we will 
>>>>>>>>>>> have one main
>>>>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>
>>>>>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>>>>>> workloads.
>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>
>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the 
>>>>>>>>>>> currently running
>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>> something else using mid-buffer pre-emption has some cases 
>>>>>>>>>>> where it
>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>> polaris10 it starts working well, it might be a better 
>>>>>>>>>>> solution for
>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and 
>>>>>>>>>>> porting it to
>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>
>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: 
>>>>>>>>>>> (a) it may
>>>>>>>>>>> take time so
>>>>>>>>>>> latency may suffer (b) to preempt we need to have different 
>>>>>>>>>>> "context"
>>>>>>>>>>> - we want
>>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>>> executed
>>>>>>>>>>> in order.
>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>>>>>>> "preempt" and
>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>> for graphics as well as for plain compute tasks 
>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on 
>>>>>>>>>>> behalf of
>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>
>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>>
>>>>>>>>>>> We are interested in feedback for a mechanism to effectively 
>>>>>>>>>>> schedule
>>>>>>>>>>> high
>>>>>>>>>>> priority VR reprojection tasks (also referred to as 
>>>>>>>>>>> time-warping) for
>>>>>>>>>>> Polaris10
>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>
>>>>>>>>>>> Brief context:
>>>>>>>>>>> --------------
>>>>>>>>>>>
>>>>>>>>>>> The main objective of reprojection is to avoid motion 
>>>>>>>>>>> sickness for VR
>>>>>>>>>>> users in
>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>> rendering a new
>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the 
>>>>>>>>>>> user's head
>>>>>>>>>>> movements
>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the 
>>>>>>>>>>> duration
>>>>>>>>>>> of an
>>>>>>>>>>> extra frame. This extended mismatch between the inner ear 
>>>>>>>>>>> and the
>>>>>>>>>>> eyes may
>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>
>>>>>>>>>>> The VR compositor deals with this problem by fabricating a 
>>>>>>>>>>> new frame
>>>>>>>>>>> using the
>>>>>>>>>>> user's updated head position in combination with the 
>>>>>>>>>>> previous frames.
>>>>>>>>>>> This
>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the 
>>>>>>>>>>> inner ear.
>>>>>>>>>>>
>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>> confidence that the
>>>>>>>>>>> reprojection task will complete before the VBLANK interval. 
>>>>>>>>>>> Even if
>>>>>>>>>>> the GFX pipe
>>>>>>>>>>> is currently full of work from the game/application (which 
>>>>>>>>>>> is most
>>>>>>>>>>> likely the case).
>>>>>>>>>>>
>>>>>>>>>>> For more details and illustrations, please refer to the 
>>>>>>>>>>> following
>>>>>>>>>>> document:
>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>>
>>>>>>>>>>> Requirements:
>>>>>>>>>>> -------------
>>>>>>>>>>>
>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>
>>>>>>>>>>>     * Job round trip time must be predictable, from 
>>>>>>>>>>> submission to
>>>>>>>>>>> fence signal
>>>>>>>>>>>
>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>
>>>>>>>>>>> Goals:
>>>>>>>>>>> ------
>>>>>>>>>>>
>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>
>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy 
>>>>>>>>>>> hardware
>>>>>>>>>>> should
>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>
>>>>>>>>>>> Nice to have:
>>>>>>>>>>> -------------
>>>>>>>>>>>
>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>
>>>>>>>>>>> My understanding is that with the current hardware 
>>>>>>>>>>> capabilities in
>>>>>>>>>>> Polaris10 we
>>>>>>>>>>> will not be able to provide a solution compatible with GFX 
>>>>>>>>>>> workloads.
>>>>>>>>>>>
>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>> approach or
>>>>>>>>>>> suggestion that will also be compatible with the GFX ring, 
>>>>>>>>>>> please let
>>>>>>>>>>> us know
>>>>>>>>>>> about it.
>>>>>>>>>>>
>>>>>>>>>>>     * The above guarantees should also be respected by 
>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>
>>>>>>>>>>> Would be good to have for consistency, but not strictly 
>>>>>>>>>>> necessary as
>>>>>>>>>>> users running
>>>>>>>>>>> games are not traditionally running HPC workloads in the 
>>>>>>>>>>> background.
>>>>>>>>>>>
>>>>>>>>>>> Proposed approach:
>>>>>>>>>>> ------------------
>>>>>>>>>>>
>>>>>>>>>>> Similar to the windows driver, we could expose a high priority
>>>>>>>>>>> compute queue to
>>>>>>>>>>> userspace.
>>>>>>>>>>>
>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>> priority, and may
>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>
>>>>>>>>>>> This can be achieved by taking advantage of the 'priority' 
>>>>>>>>>>> field in
>>>>>>>>>>> the HQDs
>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. 
>>>>>>>>>>> The relevant
>>>>>>>>>>> register fields are:
>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>
>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from 
>>>>>>>>>>> pipe0. We can
>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>
>>>>>>>>>>> The relevant priorities can be set so that submissions to 
>>>>>>>>>>> the high
>>>>>>>>>>> priority
>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>
>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high 
>>>>>>>>>>> priority
>>>>>>>>>>> rings if the
>>>>>>>>>>> context is marked as high priority. And a corresponding 
>>>>>>>>>>> priority
>>>>>>>>>>> should be
>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>
>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>> appropriate flag
>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The setting is in a per context level so that we can:
>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all submissions 
>>>>>>>>>>> to a
>>>>>>>>>>> context
>>>>>>>>>>>     * Create high priority and non-high priority contexts in 
>>>>>>>>>>> the same
>>>>>>>>>>> process
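>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch (the flag and helper names below are
>>>>>>>>>>> placeholders, not final UAPI), context allocation could map
>>>>>>>>>>> the flag to a scheduler priority like this:
>>>>>>>>>>>
>>>>>>>>>>>     /* hypothetical: pick a priority for a new context from
>>>>>>>>>>>      * the flags field of drm_amdgpu_ctx_in */
>>>>>>>>>>>     static int amdgpu_ctx_flags_to_priority(uint32_t flags)
>>>>>>>>>>>     {
>>>>>>>>>>>             if (flags & AMDGPU_CTX_HIGH_PRIORITY)
>>>>>>>>>>>                     return AMD_SCHED_PRIORITY_HIGH;
>>>>>>>>>>>
>>>>>>>>>>>             return AMD_SCHED_PRIORITY_NORMAL;
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>> The scheduler entities for a high priority context would then
>>>>>>>>>>> be created on the high priority compute rings instead of the
>>>>>>>>>>> regular ones.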
>>>>>>>>>>>
>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> Similar to the above, but instead of programming the 
>>>>>>>>>>> priorities at
>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the 
>>>>>>>>>>> queue priorities
>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>
>>>>>>>>>>> This would involve having a hardware specific callback from the
>>>>>>>>>>> scheduler to
>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, 
>>>>>>>>>>> int index,
>>>>>>>>>>> int priority)
>>>>>>>>>>>
>>>>>>>>>>> During this callback we would have to grab the SRBM mutex to 
>>>>>>>>>>> perform
>>>>>>>>>>> the appropriate
>>>>>>>>>>> HW programming, and I'm not really sure if that is something we
>>>>>>>>>>> should be doing from
>>>>>>>>>>> the scheduler.
>>>>>>>>>>>
>>>>>>>>>>> On the positive side, this approach would allow us to 
>>>>>>>>>>> program a range of
>>>>>>>>>>> priorities for jobs instead of a single "high priority" value,
>>>>>>>>>>> achieving
>>>>>>>>>>> something similar to the niceness API available for CPU 
>>>>>>>>>>> scheduling.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if this flexibility is something that we would 
>>>>>>>>>>> need for
>>>>>>>>>>> our use
>>>>>>>>>>> case, but it might be useful in other scenarios (multiple users
>>>>>>>>>>> sharing compute
>>>>>>>>>>> time on a server).
>>>>>>>>>>>
>>>>>>>>>>> This approach would require a new int field in 
>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>> repurposing
>>>>>>>>>>> of the flags field.
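>>>>>>>>>>>
>>>>>>>>>>> To illustrate approach 2, the set_priority() callback could
>>>>>>>>>>> look roughly like this (simplified; helper and register names
>>>>>>>>>>> are assumptions based on the existing gfx_v8_0 code, not a
>>>>>>>>>>> final implementation):
>>>>>>>>>>>
>>>>>>>>>>>     static void gfx_v8_0_ring_set_priority(struct amdgpu_ring *ring,
>>>>>>>>>>>                                            int priority)
>>>>>>>>>>>     {
>>>>>>>>>>>             struct amdgpu_device *adev = ring->adev;
>>>>>>>>>>>
>>>>>>>>>>>             mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>             /* select the HQD backing this ring */
>>>>>>>>>>>             vi_srbm_select(adev, ring->me, ring->pipe,
>>>>>>>>>>>                            ring->queue, 0);
>>>>>>>>>>>
>>>>>>>>>>>             WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>>>>>>>>>>>             WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>>>>>>>>>>>
>>>>>>>>>>>             vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>             mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>     }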
>>>>>>>>>>>
>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>> ------------------------
>>>>>>>>>>>
>>>>>>>>>>> The SQ is currently programmed to disregard the HQD 
>>>>>>>>>>> priorities, and
>>>>>>>>>>> instead it picks
>>>>>>>>>>> jobs at random. Settings from the shader itself are also 
>>>>>>>>>>> disregarded
>>>>>>>>>>> as this is
>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>
>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, 
>>>>>>>>>>> but we
>>>>>>>>>>> might not get the
>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>
>>>>>>>>>>> The current programming would have to be changed to allow 
>>>>>>>>>>> priority
>>>>>>>>>>> propagation
>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>
>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>> --------------------------------
>>>>>>>>>>>
>>>>>>>>>>> For consistency purposes, the high priority context can be 
>>>>>>>>>>> enabled
>>>>>>>>>>> for all HW IPs
>>>>>>>>>>> with support of the SW scheduler. This will function 
>>>>>>>>>>> similarly to the
>>>>>>>>>>> current
>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump 
>>>>>>>>>>> ahead of
>>>>>>>>>>> anything not
>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>
>>>>>>>>>>> The benefits of requesting a high priority context for a 
>>>>>>>>>>> non-compute
>>>>>>>>>>> queue will
>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is 
>>>>>>>>>>> stuck in
>>>>>>>>>>> front of
>>>>>>>>>>> you), but having the API in place will allow us to easily 
>>>>>>>>>>> improve the
>>>>>>>>>>> implementation
>>>>>>>>>>> in the future as new features become available in new hardware.
>>>>>>>>>>>
>>>>>>>>>>> Future steps:
>>>>>>>>>>> -------------
>>>>>>>>>>>
>>>>>>>>>>> Once we have an approach settled, I can take care of the 
>>>>>>>>>>> implementation.
>>>>>>>>>>>
>>>>>>>>>>> Also, once the interface is mostly decided, we can start 
>>>>>>>>>>> thinking about
>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>
>>>>>>>>>>> Request for feedback:
>>>>>>>>>>> ---------------------
>>>>>>>>>>>
>>>>>>>>>>> We aren't married to any of the approaches outlined above. 
>>>>>>>>>>> Our goal
>>>>>>>>>>> is to
>>>>>>>>>>> obtain a mechanism that will allow us to complete the 
>>>>>>>>>>> reprojection
>>>>>>>>>>> job within a
>>>>>>>>>>> predictable amount of time. So if anyone has any 
>>>>>>>>>>> suggestions for
>>>>>>>>>>> improvements or alternative strategies we are more than 
>>>>>>>>>>> happy to hear
>>>>>>>>>>> them.
>>>>>>>>>>>
>>>>>>>>>>> If any of the technical information above is also incorrect, 
>>>>>>>>>>> feel
>>>>>>>>>>> free to point
>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Andres
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>
>>>
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                 ` <8ab5bb4d-f331-d991-f208-ec7c0a25662a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-12-22 16:41                                                                   ` Serguei Sagalovitch
       [not found]                                                                     ` <fd1f1a6f-f72a-3e65-bb6f-17671d8b1d6b-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Serguei Sagalovitch @ 2016-12-22 16:41 UTC (permalink / raw)
  To: Andres Rodriguez, Christian König, zhoucm1,
	Pierre-Loup A. Griffais, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Andres,

Did you measure  latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
> Hey Christian,
>
> We are currently interested in X, but with some distros switching to 
> other compositors by default, we also need to consider those.
>
> We agree, running the full vrcompositor in root isn't something that 
> we want to do. Too many security concerns. Having a small root helper 
> that does the privilege escalation for us is the initial idea.
>
> For a long term approach, Pierre-Loup and Dave are working on dealing 
> with the "two compositors" scenario a little better in DRM+X. 
> Fullscreen isn't really a sufficient approach, since we don't want the 
> HMD to be used as part of the Desktop environment when a VR app is not 
> in use (this is extremely annoying).
>
> When the above is settled, we should have an auth mechanism besides 
> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the 
> HMD permanently away from X. Re-using that auth method to gate this 
> IOCTL is probably going to be the final solution.
>
> I propose to start with ROOT_ONLY since it should allow us to respect 
> kernel IOCTL compatibility guidelines with the most flexibility. Going 
> from a restrictive to a more flexible permission model would be 
> inclusive, but going from a general to a restrictive model may exclude 
> some apps that used to work.
>
> Regards,
> Andres
>
> On 12/22/2016 6:42 AM, Christian König wrote:
>> Hi Andres,
>>
>> well using root might cause stability and security problems as well. 
>> We worked quite hard to avoid exactly this for X.
>>
>> We could make this feature depend on the compositor being DRM master, 
>> but for example with X the X server is master (and e.g. can change 
>> resolutions etc..) and not the compositor.
>>
>> So another question is also what windowing system (if any) are you 
>> planning to use? X, Wayland, Flinger or something completely different?
>>
>> Regards,
>> Christian.
>>
>> On 2016-12-20 16:51, Andres Rodriguez wrote:
>>> Hi Christian,
>>>
>>> That is definitely a concern. What we are currently thinking is to 
>>> make the high priority queues accessible to root only.
>>>
>>> Therefore if a non-root user attempts to set the high priority flag 
>>> on context allocation, we would fail the call and return ENOPERM.
>>>
>>> Regards,
>>> Andres
>>>
>>>
>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>> to solve it?
>>>> Yeah, that problem came to my mind as well.
>>>>
>>>> Basically we need to restrict those high priority submissions to 
>>>> the VR compositor or otherwise any malfunctioning application could 
>>>> use it.
>>>>
>>>> Just think about some WebGL suddenly taking all our rendering away 
>>>> and we won't get anything drawn any more.
>>>>
>>>> Alex or Michel any ideas on that?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 2016-12-19 15:48, Serguei Sagalovitch wrote:
>>>>> > If compute queue is occupied only by you, the efficiency
>>>>> > is equal with setting job queue to high priority I think.
>>>>> The only risk is the situation when graphics will take all
>>>>> needed CUs. But in any case it should be very good test.
>>>>>
>>>>> Andres/Pierre-Loup,
>>>>>
>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>
>>>>>
>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>> to solve it?
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>> Do you encounter the priority issue for compute queue with 
>>>>>> current driver?
>>>>>>
>>>>>> If compute queue is occupied only by you, the efficiency is equal 
>>>>>> with setting job queue to high priority I think.
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>
>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>
>>>>>>> I'm not sure if I'm asking for too much, but if we can 
>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the 
>>>>>>> vulkan level that would be great.
>>>>>>>
>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>
>>>>>>> - Andres
>>>>>>>
>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>> We're currently working with the open stack; I assume that a 
>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan 
>>>>>>>>> userspace drivers and that the amdgpu kernel interface 
>>>>>>>>> improvements we would pursue following this discussion would 
>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>> Of course.
>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>>>
>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>
>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of 
>>>>>>>>>>> amdgpu;
>>>>>>>>>>> see replies inline.
>>>>>>>>>>>
>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>> Andres,
>>>>>>>>>>>>
>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes 
>>>>>>>>>>>>> running
>>>>>>>>>>>>> actually:
>>>>>>>>>>>> So we could have potential memory overcommit case or do you do
>>>>>>>>>>>> partitioning
>>>>>>>>>>>> on your own?  I would think that there is need to avoid 
>>>>>>>>>>>> overcommit in
>>>>>>>>>>>> VR case to
>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>
>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're 
>>>>>>>>>>> working on
>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this 
>>>>>>>>>>> thread), and in
>>>>>>>>>>> the future it will make sense to do work in order to make 
>>>>>>>>>>> sure that
>>>>>>>>>>> its memory allocations do not get evicted, to prevent any 
>>>>>>>>>>> unwelcome
>>>>>>>>>>> additional latency in the event of needing to perform 
>>>>>>>>>>> just-in-time
>>>>>>>>>>> reprojection.
>>>>>>>>>>>
>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>> Based on my understanding sharing BOs between different 
>>>>>>>>>>>> processes
>>>>>>>>>>>> could introduce additional synchronization constraints. btw: 
>>>>>>>>>>>> I am not
>>>>>>>>>>>> sure
>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process 
>>>>>>>>>>>> boundary.
>>>>>>>>>>>
>>>>>>>>>>> They are different processes; it is important for the 
>>>>>>>>>>> compositor that
>>>>>>>>>>> is responsible for quality-of-service features such as 
>>>>>>>>>>> consistently
>>>>>>>>>>> presenting distorted frames with the right latency, 
>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>
>>>>>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>>>>>> semaphore
>>>>>>>>>>> extensions to fetch updated eye images from the client 
>>>>>>>>>>> application,
>>>>>>>>>>> but the just-in-time reprojection discussed here does not 
>>>>>>>>>>> actually
>>>>>>>>>>> have any direct interactions with cross-process resource 
>>>>>>>>>>> sharing,
>>>>>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>>>>>> up-to-date
>>>>>>>>>>> eye images that have already been sent by the client 
>>>>>>>>>>> application,
>>>>>>>>>>> which are already available to use without additional 
>>>>>>>>>>> synchronization.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>>>>>> remove this
>>>>>>>>>>>>> overhead)
>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>
>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>>>>>> headset
>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>>>>>> compute
>>>>>>>>>>>> usage).
>>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU 
>>>>>>>>>>>> intensive
>>>>>>>>>>>> (at least
>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>
>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue. 
>>>>>>>>>>> However, if
>>>>>>>>>>> there's high degrees of variance then that would be 
>>>>>>>>>>> troublesome and we
>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully the requirements and approach we described make 
>>>>>>>>>>> sense, we're
>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>>>>> allocation
>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>>>>> resource
>>>>>>>>>>>>> conflict
>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can 
>>>>>>>>>>>> start with a
>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no 
>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>
>>>>>>>>>>>> This should be more or less the use case we expect from VR 
>>>>>>>>>>>> users.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to 
>>>>>>>>>>>> consider
>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>> making it dynamic is not straight forward :P
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>> will be not
>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will 
>>>>>>>>>>>>> have one main
>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily 
>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Correct, this is why we want to enable the high priority 
>>>>>>>>>>>> compute
>>>>>>>>>>>> queue through
>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>>
>>>>>>>>>>>> For current VR workloads we have 3 separate processes 
>>>>>>>>>>>> running actually:
>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>     2) VR Compositor (this is the process that will require 
>>>>>>>>>>>> high
>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>     3) System compositor (we are looking at approaches to 
>>>>>>>>>>>> remove this
>>>>>>>>>>>> overhead)
>>>>>>>>>>>>
>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>> I would also like to be able to address this case in the 
>>>>>>>>>>>> future
>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:  
>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>
>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>> illustration of what the reprojection scheduling looks like 
>>>>>>>>>>>> can be
>>>>>>>>>>>> found here:
>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>> to guarantee that submissions from the same context will 
>>>>>>>>>>>>> be executed
>>>>>>>>>>>>> in order.
>>>>>>>>>>>>
>>>>>>>>>>>> This is okay, as the reprojection work doesn't have 
>>>>>>>>>>>> dependencies on
>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>
>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you 
>>>>>>>>>>>>> want
>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>
>>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>>
>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics 
>>>>>>>>>>>>> as well as
>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure 
>>>>>>>>>>>> out a way
>>>>>>>>>>>> for us to get
>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then 
>>>>>>>>>>>> I'll take you
>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>
>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>>
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Andres,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU 
>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>> to high-priority queue  when it will be in use and "free" 
>>>>>>>>>>>> them later
>>>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to 
>>>>>>>>>>>> degrade
>>>>>>>>>>>> graphics
>>>>>>>>>>>> performance).
>>>>>>>>>>>>
>>>>>>>>>>>> Otherwise we could have scenario when long graphics task (or
>>>>>>>>>>>> low-priority
>>>>>>>>>>>> compute) will take all (extra) CUs and high-priority will 
>>>>>>>>>>>> wait for
>>>>>>>>>>>> needed resources.
>>>>>>>>>>>> It will not be visible on "NOP " but only when you submit 
>>>>>>>>>>>> "real"
>>>>>>>>>>>> compute task
>>>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for 
>>>>>>>>>>>> testing.
>>>>>>>>>>>>
>>>>>>>>>>>> It (CU assignment) could be relatively easy done when 
>>>>>>>>>>>> everything is
>>>>>>>>>>>> going via kernel
>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I 
>>>>>>>>>>>> am not sure
>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming 
>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler" 
>>>>>>>>>>>> when
>>>>>>>>>>>> deciding which
>>>>>>>>>>>> queue to  run will check if there is enough resources and 
>>>>>>>>>>>> if not then
>>>>>>>>>>>> it will begin
>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to 
>>>>>>>>>>>> high-priority
>>>>>>>>>>>> queue and have
>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? 
>>>>>>>>>>>> (as opposed
>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because 
>>>>>>>>>>>> amdgpu
>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>> some scheduling is per pipe.  I know about the current 
>>>>>>>>>>>> allocation
>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>> of resources  based on the workload otherwise we will have 
>>>>>>>>>>>> resource
>>>>>>>>>>>> conflict
>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> BTW: Which user level API do you want to use for compute: 
>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so 
>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>> be not
>>>>>>>>>>>> involved.  I would assume that in the case of VR we will 
>>>>>>>>>>>> have one main
>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily 
>>>>>>>>>>>> "ignore"
>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>
>>>>>>>>>>>>>  we will not be able to provide a solution compatible with 
>>>>>>>>>>>>> GFX
>>>>>>>>>>>>> workloads.
>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the 
>>>>>>>>>>>> currently running
>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases 
>>>>>>>>>>>> where it
>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>> polaris10 it starts working well, it might be a better 
>>>>>>>>>>>> solution for
>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and 
>>>>>>>>>>>> porting it to
>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: 
>>>>>>>>>>>> (a) it may
>>>>>>>>>>>> take time so
>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different 
>>>>>>>>>>>> "context"
>>>>>>>>>>>> - we want
>>>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>>>> executed
>>>>>>>>>>>> in order.
>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you 
>>>>>>>>>>>> want
>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>> for graphics as well as for plain compute tasks 
>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on 
>>>>>>>>>>>> behalf of
>>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>>>
>>>>>>>>>>>> We are interested in feedback for a mechanism to 
>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>> high
>>>>>>>>>>>> priority VR reprojection tasks (also referred to as 
>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>> Polaris10
>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Brief context:
>>>>>>>>>>>> --------------
>>>>>>>>>>>>
>>>>>>>>>>>> The main objective of reprojection is to avoid motion 
>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>> users in
>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>> rendering a new
>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the 
>>>>>>>>>>>> user's head
>>>>>>>>>>>> movements
>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the 
>>>>>>>>>>>> duration
>>>>>>>>>>>> of an
>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear 
>>>>>>>>>>>> and the
>>>>>>>>>>>> eyes may
>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>
>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a 
>>>>>>>>>>>> new frame
>>>>>>>>>>>> using the
>>>>>>>>>>>> user's updated head position in combination with the 
>>>>>>>>>>>> previous frames.
>>>>>>>>>>>> This
>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the 
>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>
>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>> confidence that the
>>>>>>>>>>>> reprojection task will complete before the VBLANK interval. 
>>>>>>>>>>>> Even if
>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>> is currently full of work from the game/application (which 
>>>>>>>>>>>> is most
>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>
>>>>>>>>>>>> For more details and illustrations, please refer to the 
>>>>>>>>>>>> following
>>>>>>>>>>>> document:
>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>>>
>>>>>>>>>>>> Requirements:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>
>>>>>>>>>>>>     * Job round trip time must be predictable, from 
>>>>>>>>>>>> submission to
>>>>>>>>>>>> fence signal
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> Goals:
>>>>>>>>>>>> ------
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>
>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy 
>>>>>>>>>>>> hardware
>>>>>>>>>>>> should
>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>
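>>>>>>>>>>>> As a rough sketch only (not part of the proposal itself), the test
>>>>>>>>>>>> could be a small libdrm-amdgpu harness that times the round trip of
>>>>>>>>>>>> an already-prepared NOP IB; context/IB setup and error handling are
>>>>>>>>>>>> elided, and the helper name is made up:
>>>>>>>>>>>>
>>>>>>>>>>>>     static uint64_t nop_round_trip_ns(amdgpu_context_handle ctx,
>>>>>>>>>>>>                                       struct amdgpu_cs_request *req,
>>>>>>>>>>>>                                       struct amdgpu_cs_fence *fence)
>>>>>>>>>>>>     {
>>>>>>>>>>>>             struct timespec t0, t1;
>>>>>>>>>>>>             uint32_t expired;
>>>>>>>>>>>>
>>>>>>>>>>>>             clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>>>>>>>>>             amdgpu_cs_submit(ctx, 0, req, 1);
>>>>>>>>>>>>             amdgpu_cs_query_fence_status(fence,
>>>>>>>>>>>>                                          AMDGPU_TIMEOUT_INFINITE,
>>>>>>>>>>>>                                          0, &expired);
>>>>>>>>>>>>             clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>>>>>>>>
>>>>>>>>>>>>             return (t1.tv_sec - t0.tv_sec) * 1000000000ull +
>>>>>>>>>>>>                    (t1.tv_nsec - t0.tv_nsec);
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>> Run it once on an idle GPU and once with the GFX ring saturated; the
>>>>>>>>>>>> two numbers should match.
>>>>>>>>>>>>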
>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> My understanding is that with the current hardware 
>>>>>>>>>>>> capabilities in
>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>> will not be able to provide a solution compatible with GFX 
>>>>>>>>>>>> workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>>> approach or
>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring, 
>>>>>>>>>>>> please let
>>>>>>>>>>>> us know
>>>>>>>>>>>> about it.
>>>>>>>>>>>>
>>>>>>>>>>>>     * The above guarantees should also be respected by 
>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>
>>>>>>>>>>>> Would be good to have for consistency, but not strictly 
>>>>>>>>>>>> necessary as
>>>>>>>>>>>> users running
>>>>>>>>>>>> games are not traditionally running HPC workloads in the 
>>>>>>>>>>>> background.
>>>>>>>>>>>>
>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>> ------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Similar to the Windows driver, we could expose a high priority
>>>>>>>>>>>> compute queue to
>>>>>>>>>>>> userspace.
>>>>>>>>>>>>
>>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>>> priority, and may
>>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>>
>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority' 
>>>>>>>>>>>> field in
>>>>>>>>>>>> the HQDs
>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. 
>>>>>>>>>>>> The relevant
>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>
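>>>>>>>>>>>> A minimal sketch of that programming on a gfx8 part, assuming the
>>>>>>>>>>>> priority encoding below (the actual field values would need to be
>>>>>>>>>>>> confirmed against the CP spec):
>>>>>>>>>>>>
>>>>>>>>>>>>     static void hqd_set_priority(struct amdgpu_device *adev,
>>>>>>>>>>>>                                  struct amdgpu_ring *ring, bool high)
>>>>>>>>>>>>     {
>>>>>>>>>>>>             u32 prio = high ? 2 : 0;        /* assumed encoding */
>>>>>>>>>>>>
>>>>>>>>>>>>             mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>>             /* select the queue's ME/pipe/queue before the HQD writes */
>>>>>>>>>>>>             vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
>>>>>>>>>>>>
>>>>>>>>>>>>             WREG32(mmCP_HQD_PIPE_PRIORITY, prio);
>>>>>>>>>>>>             WREG32(mmCP_HQD_QUEUE_PRIORITY, prio);
>>>>>>>>>>>>
>>>>>>>>>>>>             vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>>             mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>>     }
>>>>>>>>>>>>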
>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from 
>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>
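>>>>>>>>>>>> As a sketch, the split could simply reserve the last queue at init
>>>>>>>>>>>> time, reusing the hqd_set_priority() helper sketched above (queue
>>>>>>>>>>>> numbering assumed):
>>>>>>>>>>>>
>>>>>>>>>>>>     int i;
>>>>>>>>>>>>
>>>>>>>>>>>>     /* queues 0-6 stay regular, queue 7 becomes the high priority ring */
>>>>>>>>>>>>     for (i = 0; i < adev->gfx.num_compute_rings; i++)
>>>>>>>>>>>>             hqd_set_priority(adev, &adev->gfx.compute_ring[i], i == 7);
>>>>>>>>>>>>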
>>>>>>>>>>>> The relevant priorities can be set so that submissions to 
>>>>>>>>>>>> the high
>>>>>>>>>>>> priority
>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>
>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high 
>>>>>>>>>>>> priority
>>>>>>>>>>>> rings if the
>>>>>>>>>>>> context is marked as high priority. And a corresponding 
>>>>>>>>>>>> priority
>>>>>>>>>>>> should be
>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>
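>>>>>>>>>>>> In gpu_scheduler.h the addition would look roughly like:
>>>>>>>>>>>>
>>>>>>>>>>>>     enum amd_sched_priority {
>>>>>>>>>>>>             AMD_SCHED_PRIORITY_KERNEL = 0,
>>>>>>>>>>>>             AMD_SCHED_PRIORITY_HIGH,        /* new */
>>>>>>>>>>>>             AMD_SCHED_PRIORITY_NORMAL,
>>>>>>>>>>>>             AMD_SCHED_MAX_PRIORITY
>>>>>>>>>>>>     };
>>>>>>>>>>>>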
>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
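>>>>>>>>>>>> A sketch of the uapi flag and the corresponding check at context
>>>>>>>>>>>> creation (the flag name and bit are placeholders, not a final ABI):
>>>>>>>>>>>>
>>>>>>>>>>>>     #define AMDGPU_CTX_HIGH_PRIORITY        (1 << 0)
>>>>>>>>>>>>
>>>>>>>>>>>>     /* in amdgpu_ctx_alloc(), roughly: */
>>>>>>>>>>>>     if (args->in.flags & AMDGPU_CTX_HIGH_PRIORITY)
>>>>>>>>>>>>             ctx->priority = AMD_SCHED_PRIORITY_HIGH;
>>>>>>>>>>>>     else
>>>>>>>>>>>>             ctx->priority = AMD_SCHED_PRIORITY_NORMAL;
>>>>>>>>>>>>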
>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all 
>>>>>>>>>>>> submissions to a
>>>>>>>>>>>> context
>>>>>>>>>>>>     * Create high priority and non-high priority contexts 
>>>>>>>>>>>> in the same
>>>>>>>>>>>> process
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Similar to the above, but instead of programming the 
>>>>>>>>>>>> priorities at
>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the 
>>>>>>>>>>>> queue priorities
>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>
>>>>>>>>>>>> This would involve having a hardware specific callback from 
>>>>>>>>>>>> the
>>>>>>>>>>>> scheduler to
>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, 
>>>>>>>>>>>> int index,
>>>>>>>>>>>> int priority)
>>>>>>>>>>>>
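>>>>>>>>>>>> For example, as a new hook in the ring functions (placement and
>>>>>>>>>>>> naming are just a sketch, and the job field is assumed):
>>>>>>>>>>>>
>>>>>>>>>>>>     /* in struct amdgpu_ring_funcs */
>>>>>>>>>>>>     void (*set_priority)(struct amdgpu_ring *ring, int priority);
>>>>>>>>>>>>
>>>>>>>>>>>>     /* invoked by the SW scheduler before emitting the job */
>>>>>>>>>>>>     if (ring->funcs->set_priority)
>>>>>>>>>>>>             ring->funcs->set_priority(ring, job->priority);
>>>>>>>>>>>>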
>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex 
>>>>>>>>>>>> to perform
>>>>>>>>>>>> the appropriate
>>>>>>>>>>>> HW programming, and I'm not really sure if that is 
>>>>>>>>>>>> something we
>>>>>>>>>>>> should be doing from
>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>
>>>>>>>>>>>> On the positive side, this approach would allow us to 
>>>>>>>>>>>> program a range of
>>>>>>>>>>>> priorities for jobs instead of a single "high priority" 
>>>>>>>>>>>> value",
>>>>>>>>>>>> achieving
>>>>>>>>>>>> something similar to the niceness API available for CPU 
>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if this flexibility is something that we would 
>>>>>>>>>>>> need for
>>>>>>>>>>>> our use
>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple 
>>>>>>>>>>>> users
>>>>>>>>>>>> sharing compute
>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>
>>>>>>>>>>>> This approach would require a new int field in 
>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>> repurposing
>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>
>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD 
>>>>>>>>>>>> priorities, and
>>>>>>>>>>>> instead it picks
>>>>>>>>>>>> jobs at random. Settings from the shader itself are also 
>>>>>>>>>>>> disregarded
>>>>>>>>>>>> as this is
>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>
>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, 
>>>>>>>>>>>> but we
>>>>>>>>>>>> might not get the
>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>
>>>>>>>>>>>> The current programming would have to be changed to allow 
>>>>>>>>>>>> priority
>>>>>>>>>>>> propagation
>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>
>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> For consistency purposes, the high priority context can be 
>>>>>>>>>>>> enabled
>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>> with support of the SW scheduler. This will function 
>>>>>>>>>>>> similarly to the
>>>>>>>>>>>> current
>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump 
>>>>>>>>>>>> ahead of
>>>>>>>>>>>> anything not
>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>
>>>>>>>>>>>> The benefits of requesting a high priority context for a 
>>>>>>>>>>>> non-compute
>>>>>>>>>>>> queue will
>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is 
>>>>>>>>>>>> stuck in
>>>>>>>>>>>> front of
>>>>>>>>>>>> you), but having the API in place will allow us to easily 
>>>>>>>>>>>> improve the
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in the future as new features become available in new 
>>>>>>>>>>>> hardware.
>>>>>>>>>>>>
>>>>>>>>>>>> Future steps:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>> Once we have an approach settled, I can take care of the 
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, once the interface is mostly decided, we can start 
>>>>>>>>>>>> thinking about
>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>
>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>
>>>>>>>>>>>> We aren't married to any of the approaches outlined above. 
>>>>>>>>>>>> Our goal
>>>>>>>>>>>> is to
>>>>>>>>>>>> obtain a mechanism that will allow us to complete the 
>>>>>>>>>>>> reprojection
>>>>>>>>>>>> job within a
>>>>>>>>>>>> predictable amount of time. So if anyone has any 
>>>>>>>>>>>> suggestions for
>>>>>>>>>>>> improvements or alternative strategies we are more than 
>>>>>>>>>>>> happy to hear
>>>>>>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> If any of the technical information above is also 
>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>> free to point
>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>>
>>>
>>
>

Sincerely yours,
Serguei Sagalovitch

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                     ` <fd1f1a6f-f72a-3e65-bb6f-17671d8b1d6b-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-22 19:54                                                                       ` Pierre-Loup A. Griffais
       [not found]                                                                         ` <2e8051cb-09b1-c5cb-cb5a-b7ca30f65e89-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Pierre-Loup A. Griffais @ 2016-12-22 19:54 UTC (permalink / raw)
  To: Serguei Sagalovitch, Andres Rodriguez, Christian König,
	zhoucm1, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

Display concerns are a separate issue, and as Andres said we have other 
plans to address them. But yes, in general you don't want another compositor 
in the way, so we'll be acquiring the HMD display directly, separate 
from any desktop or display server. Same with security, we can have a 
separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
> Andres,
>
> Did you measure the latency, etc. impact of __any__ compositor?
>
> My understanding is that VR has pretty strict requirements related to QoS.
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>> Hey Christian,
>>
>> We are currently interested in X, but with some distros switching to
>> other compositors by default, we also need to consider those.
>>
>> We agree, running the full vrcompositor as root isn't something that
>> we want to do. Too many security concerns. Having a small root helper
>> that does the privilege escalation for us is the initial idea.
>>
>> For a long term approach, Pierre-Loup and Dave are working on dealing
>> with the "two compositors" scenario a little better in DRM+X.
>> Fullscreen isn't really a sufficient approach, since we don't want the
>> HMD to be used as part of the Desktop environment when a VR app is not
>> in use (this is extremely annoying).
>>
>> When the above is settled, we should have an auth mechanism besides
>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>> HMD permanently away from X. Re-using that auth method to gate this
>> IOCTL is probably going to be the final solution.
>>
>> I propose to start with ROOT_ONLY since it should allow us to respect
>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>> from a restrictive to a more flexible permission model would be
>> inclusive, but going from a general to a restrictive model may exclude
>> some apps that used to work.
>>
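>> A minimal sketch of that gate at context creation time (the flag is the
>> placeholder name from the RFC, and the capability choice is just my
>> assumption of what "root only" would map to):
>>
>>     if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
>>         !capable(CAP_SYS_ADMIN))
>>             return -EPERM;
>>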
>> Regards,
>> Andres
>>
>> On 12/22/2016 6:42 AM, Christian König wrote:
>>> Hi Andres,
>>>
>>> well using root might cause stability and security problems as well.
>>> We worked quite hard to avoid exactly this for X.
>>>
>>> We could make this feature depend on the compositor being DRM master,
>>> but for example with X the X server is master (and e.g. can change
>>> resolutions etc..) and not the compositor.
>>>
>>> So another question is also what windowing system (if any) are you
>>> planning to use? X, Wayland, Flinger or something completely different?
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
>>>> Hi Christian,
>>>>
>>>> That is definitely a concern. What we are currently thinking is to
>>>> make the high priority queues accessible to root only.
>>>>
>>>> Therefore if a non-root user attempts to set the high priority flag
>>>> on context allocation, we would fail the call and return EPERM.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>>
>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>> to solve it?
>>>>> Yeah, that problem came to my mind as well.
>>>>>
>>>>> Basically we need to restrict those high priority submissions to
>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>> use it.
>>>>>
>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>> and we won't get anything drawn any more.
>>>>>
>>>>> Alex or Michel any ideas on that?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>> > is equal to setting the job queue to high priority, I think.
>>>>>> The only risk is the situation where graphics takes all the
>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>
>>>>>> Andres/Pierre-Loup,
>>>>>>
>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>
>>>>>>
>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>> to solve it?
>>>>>>
>>>>>> Sincerely yours,
>>>>>> Serguei Sagalovitch
>>>>>>
>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>> the current driver?
>>>>>>>
>>>>>>> If the compute queue is occupied only by you, the efficiency is equal
>>>>>>> to setting the job queue to high priority, I think.
>>>>>>>
>>>>>>> Regards,
>>>>>>> David Zhou
>>>>>>>
>>>>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>
>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>> vulkan level that would be great.
>>>>>>>>
>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>
>>>>>>>> - Andres
>>>>>>>>
>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>> Of course.
>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>>>>
>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> David Zhou
>>>>>>>>>>>
>>>>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>> running
>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do you
>>>>>>>>>>>>> do partitioning on your own? I would think that there is a need
>>>>>>>>>>>>> to avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>>>>>
>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>> working on
>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>> thread), and in
>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>> sure that
>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>> unwelcome
>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>> just-in-time
>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>
>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>> processes
>>>>>>>>>>>>> could introduce additional synchronization constraints. btw:
>>>>>>>>>>>>> I am not
>>>>>>>>>>>>> sure
>>>>>>>>>>>>> if we are able to share Vulkan sync objects across the process
>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>
>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>> compositor that
>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>> consistently
>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>
>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>> semaphore
>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>> application,
>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>> actually
>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>> sharing,
>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>> up-to-date
>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>> application,
>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>> headset
>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>> I would assume that this is a known problem (at least for
>>>>>>>>>>>>> compute
>>>>>>>>>>>>> usage).
>>>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>> intensive
>>>>>>>>>>>>> (at least
>>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>>
>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>> However, if
>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>> sense, we're
>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>> dynamic partitioning
>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>> start with a
>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>> users.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>> consider
>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>> will not be
>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporally
>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>> compute
>>>>>>>>>>>>> queue through
>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>>>
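>>>>>>>>>>>>> A sketch of what the libdrm-amdgpu entry point might look like
>>>>>>>>>>>>> (the name and the priority token are assumptions, not an
>>>>>>>>>>>>> existing API):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     int amdgpu_cs_ctx_create2(amdgpu_device_handle dev,
>>>>>>>>>>>>>                               uint32_t priority, /* e.g. high */
>>>>>>>>>>>>>                               amdgpu_context_handle *ctx);
>>>>>>>>>>>>>
>>>>>>>>>>>>> radv could then request the high priority variant for the queue
>>>>>>>>>>>>> backing the VR compositor.
>>>>>>>>>>>>>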
>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>> high
>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>> remove this
>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>
>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>> future
>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>
>>>>>>>>>>>>> The latency is our main concern; we want something that is
>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>> can be
>>>>>>>>>>>>> found here:
>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>> want
>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>
>>>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>> out a way
>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Andres
>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>
>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>> assignments/binding to the high-priority queue when it is in
>>>>>>>>>>>>> use and "free" them later (we do not want to take CUs away
>>>>>>>>>>>>> from e.g. a graphics task forever and degrade graphics
>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task
>>>>>>>>>>>>> (or low-priority compute) takes all (extra) CUs and the
>>>>>>>>>>>>> high-priority work will wait for the needed resources.
>>>>>>>>>>>>> It will not be visible with "NOP" but only when you submit a
>>>>>>>>>>>>> "real" compute task, so I would recommend not to use "NOP"
>>>>>>>>>>>>> packets at all for testing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>>>>> everything is going via the kernel (e.g. as part of frame
>>>>>>>>>>>>> submission), but I must admit that I am not sure about the
>>>>>>>>>>>>> best way for user level submissions (amdkfd).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>> when
>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>> queue to run will check if there are enough resources and
>>>>>>>>>>>>> if not then
>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>
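>>>>>>>>>>>>> For reference, the CU mask of a compute queue is programmed per
>>>>>>>>>>>>> queue under the same SRBM select; a sketch (registers taken
>>>>>>>>>>>>> from the gfx8 path, mask encoding assumed):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>>>     vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
>>>>>>>>>>>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE0, se0_mask);
>>>>>>>>>>>>>     WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE1, se1_mask);
>>>>>>>>>>>>>     vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>>>     mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>>>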
>>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>> allocation
>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>> dynamic partitioning
>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>> resource
>>>>>>>>>>>>> conflict
>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>>> not be
>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>> have one main
>>>>>>>>>>>>> application ("console" mode(?)) so we could temporally
>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  we will not be able to provide a solution compatible with
>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>> currently running
>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>> where it
>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>> solution for
>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>> take time so
>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>> "context"
>>>>>>>>>>>>> - we want
>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>> executed
>>>>>>>>>>>>> in order.
>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>> want
>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>> high
>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>> users in
>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>> user's head
>>>>>>>>>>>>> movements
>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>> duration
>>>>>>>>>>>>> of an
>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>> and the
>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>> new frame
>>>>>>>>>>>>> using the
>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>> This
>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>>>>> Even if
>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>> is most
>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>
>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>> following
>>>>>>>>>>>>> document:
>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>
>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>> submission to
>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>
>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>> ------
>>>>>>>>>>>>>
>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>
>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>> hardware
>>>>>>>>>>>>> should
>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>>>> approach or
>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>> please let
>>>>>>>>>>>>> us know
>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>> users running
>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>> background.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similar to the Windows driver, we could expose a high priority
>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>> field in
>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>
>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>
>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>> the high
>>>>>>>>>>>>> priority
>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>> priority
>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>> priority
>>>>>>>>>>>>> should be
>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>
>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>> context
>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>> in the same
>>>>>>>>>>>>> process
>>>>>>>>>>>>>
>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>> priorities at
>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>> the
>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>> int index,
>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>
>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>> to perform
>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>> something we
>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>> value",
>>>>>>>>>>>>> achieving
>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>> need for
>>>>>>>>>>>>> our use
>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>> users
>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>> as this is
>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>> but we
>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>> priority
>>>>>>>>>>>>> propagation
>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>> enabled
>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>> current
>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>> anything not
>>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>> queue will
>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>> front of
>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>> improve the
>>>>>>>>>>>>> implementation
>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>> is to
>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>> job within a
>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>> them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>> free to point
>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Sincerely yours,
>>>>>> Serguei Sagalovitch
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>>
>>>>
>>>
>>
>
> Sincerely yours,
> Serguei Sagalovitch
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                         ` <2e8051cb-09b1-c5cb-cb5a-b7ca30f65e89-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
@ 2016-12-23 10:54                                                                           ` Christian König
       [not found]                                                                             ` <1c3ea5aa-36ee-5031-5f32-d860e9e0bf7c-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Christian König @ 2016-12-23 10:54 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Serguei Sagalovitch, Andres Rodriguez,
	zhoucm1, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

> But yes, in general you don't want another compositor in the way, so 
> we'll be acquiring the HMD display directly, separate from any desktop 
> or display server.
Assuming that the HMD is attached to the rendering device in some 
way, you have the X server and the compositor which both try to be DRM 
master at the same time.

Please correct me if that was fixed in the meantime, but that sounds 
like it will simply not work. Or is this what Andres mentioned below 
that Dave is working on?

In addition to that, a compositor in combination with X is a bit 
counterproductive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered 
data to be displayed is from the Application -> X server -> compositor 
-> X server.

The extra step between X server and compositor just means extra latency 
and for this use case you probably don't want that.

Targeting something like Wayland, with XWayland when you need X 
compatibility, sounds like the much better idea.

Regards,
Christian.

Am 22.12.2016 um 20:54 schrieb Pierre-Loup A. Griffais:
> Display concerns are a separate issue, and as Andres said we have 
> other plans to address them. But yes, in general you don't want another 
> compositor in the way, so we'll be acquiring the HMD display directly, 
> separate from any desktop or display server. Same with security, we 
> can have a separate conversation about that when the time comes.
>
> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>> Andres,
>>
>> Did you measure the latency, etc. impact of __any__ compositor?
>>
>> My understanding is that VR has pretty strict requirements related to 
>> QoS.
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>> Hey Christian,
>>>
>>> We are currently interested in X, but with some distros switching to
>>> other compositors by default, we also need to consider those.
>>>
>>> We agree, running the full vrcompositor as root isn't something that
>>> we want to do. Too many security concerns. Having a small root helper
>>> that does the privilege escalation for us is the initial idea.
>>>
>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>> with the "two compositors" scenario a little better in DRM+X.
>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>> HMD to be used as part of the Desktop environment when a VR app is not
>>> in use (this is extremely annoying).
>>>
>>> When the above is settled, we should have an auth mechanism besides
>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>> HMD permanently away from X. Re-using that auth method to gate this
>>> IOCTL is probably going to be the final solution.
>>>
>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>> from a restrictive to a more flexible permission model would be
>>> inclusive, but going from a general to a restrictive model may exclude
>>> some apps that used to work.
>>>
>>> Regards,
>>> Andres
>>>
>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>> Hi Andres,
>>>>
>>>> well using root might cause stability and security problems as well.
>>>> We worked quite hard to avoid exactly this for X.
>>>>
>>>> We could make this feature depend on the compositor being DRM master,
>>>> but for example with X the X server is master (and e.g. can change
>>>> resolutions etc..) and not the compositor.
>>>>
>>>> So another question is also what windowing system (if any) are you
>>>> planning to use? X, Wayland, Flinger or something completely 
>>>> different?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
>>>>> Hi Christian,
>>>>>
>>>>> That is definitely a concern. What we are currently thinking is to
>>>>> make the high priority queues accessible to root only.
>>>>>
>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>>
>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>> to solve it?
>>>>>> Yeah, that problem came to my mind as well.
>>>>>>
>>>>>> Basically we need to restrict those high priority submissions to
>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>> use it.
>>>>>>
>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>> and we won't get anything drawn any more.
>>>>>>
>>>>>> Alex or Michel any ideas on that?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>>> > is equal to setting the job queue to high priority, I think.
>>>>>>> The only risk is the situation where graphics takes all the
>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>
>>>>>>> Andres/Pierre-Loup,
>>>>>>>
>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>
>>>>>>>
>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>> to solve it?
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>>> the current driver?
>>>>>>>>
>>>>>>>> If the compute queue is occupied only by you, the efficiency is equal
>>>>>>>> to setting the job queue to high priority, I think.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> David Zhou
>>>>>>>>
>>>>>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>
>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>> vulkan level that would be great.
>>>>>>>>>
>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>
>>>>>>>>> - Andres
>>>>>>>>>
>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>> Of course.
>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>>
>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro 
>>>>>>>>>>>> driver?
>>>>>>>>>>>>
>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>> So we could have potential memory overcommit case or do 
>>>>>>>>>>>>>> you do
>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>> on your own?  I would think that there is need to avoid
>>>>>>>>>>>>>> overcomit in
>>>>>>>>>>>>>> VR case to
>>>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is 
>>>>>>>>>>>>> setting up
>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>> working on
>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>> sure that
>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>> could introduce additional synchronization constrains. btw:
>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>
>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>> consistently
>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>> application,
>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>> actually
>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>> application,
>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>> headset
>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>> intensive
>>>>>>>>>>>>>> (at least
>>>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>>>
>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>> However, if
>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far 
>>>>>>>>>>>>>>> as I
>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to 
>>>>>>>>>>>>>>> switch to
>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>>> will be not
>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan 
>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>> high
>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>> future
>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we 
>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume 
>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>>>> to high-priority queue  when it will be in use and "free"
>>>>>>>>>>>>>> them later
>>>>>>>>>>>>>> (we do not want to take CUs away from e.g. a graphics task
>>>>>>>>>>>>>> forever and degrade graphics
>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>>>>>>>> low-priority
>>>>>>>>>>>>>> compute) takes all (extra) CUs and high-priority work will
>>>>>>>>>>>>>> wait for
>>>>>>>>>>>>>> needed resources.
>>>>>>>>>>>>>> It will not be visible with "NOP" but only when you submit a
>>>>>>>>>>>>>> "real"
>>>>>>>>>>>>>> compute task,
>>>>>>>>>>>>>> so I would recommend not to use "NOP" packets at all for
>>>>>>>>>>>>>> testing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It (CU assignment) could be relatively easy done when
>>>>>>>>>>>>>> everything is
>>>>>>>>>>>>>> going via kernel
>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>>> queue to  run will check if there is enough resources and
>>>>>>>>>>>>>> if not then
>>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to
>>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>>>> be not
>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>>> take time so
>>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>>> "context"
>>>>>>>>>>>>>> - we want
>>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>> want
>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>> high
>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>> This
>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>>>>>> Even if
>>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>>> is most
>>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>> following
>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>> should
>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an 
>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similar to the Windows driver, we could expose a high
>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with 
>>>>>>>>>>>>>> high
>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>> acquire hardware resources previously in use by other 
>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>
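>>>>>>>>>>>>>> For Polaris10, a sketch of the programming sequence (modeled on
>>>>>>>>>>>>>> the existing HQD setup in gfx_v8_0.c; the exact priority values
>>>>>>>>>>>>>> and call site are open questions):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>>>>     vi_srbm_select(adev, mec, pipe, queue, 0);
>>>>>>>>>>>>>>     WREG32(mmCP_HQD_PIPE_PRIORITY, pipe_priority);
>>>>>>>>>>>>>>     WREG32(mmCP_HQD_QUEUE_PRIORITY, queue_priority);
>>>>>>>>>>>>>>     vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>>>>     mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>>>>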
>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The setting is done at a per-context level so that we can:
>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>> priorities at
>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>
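>>>>>>>>>>>>>> As a sketch, this could be a new optional backend op in the
>>>>>>>>>>>>>> scheduler (member name and exact signature illustrative only):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     /* invoked by the scheduler before pushing a job whose
>>>>>>>>>>>>>>      * context priority differs from the ring's current one */
>>>>>>>>>>>>>>     void (*set_priority)(struct amdgpu_ring *ring,
>>>>>>>>>>>>>>                          int priority);
>>>>>>>>>>>>>>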
>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>> value",
>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>> users
>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>> current
>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                             ` <1c3ea5aa-36ee-5031-5f32-d860e9e0bf7c-5C7GfCeVMHo@public.gmane.org>
@ 2016-12-23 16:13                                                                               ` Andres Rodriguez
       [not found]                                                                                 ` <CAFQ_0eFRaCKKk9BaMyahBARzFEdXP9gQWbK+61R0snDz08qGdw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-12-23 18:18                                                                               ` Pierre-Loup A. Griffais
  1 sibling, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-23 16:13 UTC (permalink / raw)
  To: Christian König
  Cc: zhoucm1, Huan, Alvin, Mao, David, Serguei Sagalovitch,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Pierre-Loup A. Griffais, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 43082 bytes --]

Hey Christian,

>> But yes, in general you don't want another compositor in the way, so we'll
>> be acquiring the HMD display directly, separate from any desktop or display
>> server.
>
>
> Assuming that the HMD is attached to the rendering device in some way
> you have the X server and the Compositor which both try to be DRM master at
> the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds like
> it will simply not work. Or is this what Andres mentions below that Dave
> is working on?
>

You are correct on both statements. We can't have two DRM_MASTERs, so the
current DRM+X does not support this use case. And this is what Dave and
Pierre-Loup are currently working on.
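
To make the constraint concrete, here's a minimal userspace sketch using
only the existing libdrm calls (error handling on open() omitted; note the
first opener of the node typically gets master implicitly):

    #include <fcntl.h>
    #include <stdio.h>
    #include <xf86drm.h>

    int main(void)
    {
        /* First client (e.g. the X server) becomes DRM master. */
        int fd_x = open("/dev/dri/card0", O_RDWR);
        drmSetMaster(fd_x);

        /* Second client (e.g. the VR compositor) tries the same. */
        int fd_vr = open("/dev/dri/card0", O_RDWR);
        if (drmSetMaster(fd_vr) < 0)
            perror("drmSetMaster"); /* fails for an unprivileged
                                       process while X holds master */
        return 0;
    }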

> Additional to that, a compositor in combination with X is a bit counter
> productive when you want to keep the latency low.
>

One thing I'd like to correct is that our main goal is to make latency
_predictable_; a secondary goal is to make it low.

The high priority queue feature addresses our main source of
unpredictability: the scheduling latency when the hardware is already full
of work from the game engine.
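
For reference, a sketch of the userspace side with the flag proposed in the
RFC. AMDGPU_CTX_HIGH_PRIORITY is hypothetical (it is not in today's uapi
header); the rest is the existing context-allocation ioctl:

    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>
    #include <amdgpu_drm.h>

    /* Hypothetical flag from the RFC, not in amdgpu_drm.h today. */
    #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0)

    static int alloc_high_prio_ctx(int fd, uint32_t *ctx_id)
    {
        union drm_amdgpu_ctx args;
        int r;

        memset(&args, 0, sizeof(args));
        args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
        args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

        /* With the ROOT_ONLY model, a non-root caller fails here. */
        r = drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
        if (r)
            return r;

        *ctx_id = args.out.alloc.ctx_id;
        return 0;
    }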

The DirectMode feature addresses one of the latency sources: multiple
(unnecessary) context switches to submit a surface to the DRM driver.
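
A sketch of the path DirectMode aims for, once the compositor owns the HMD
output (crtc_id and fb_id come from prior modesetting, not shown here):

    #include <stdint.h>
    #include <xf86drmMode.h>

    /* One call from the VR compositor per frame, no X in the path. */
    static int present_frame(int fd, uint32_t crtc_id, uint32_t fb_id)
    {
        return drmModePageFlip(fd, crtc_id, fb_id,
                               DRM_MODE_PAGE_FLIP_EVENT, NULL);
    }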

> Targeting something like Wayland, and when you need X compatibility, XWayland
> sounds like the much better idea.
>

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25
use Wayland by default). Once we have everything working nicely under X
(where most of the users are currently), I'm sure Pierre-Loup will be
pushing us to get everything optimized under Wayland as well (which should
be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing
I've been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig@amd.com>
wrote:

>> But yes, in general you don't want another compositor in the way, so we'll
>> be acquiring the HMD display directly, separate from any desktop or display
>> server.
>>
> Assuming that the HMD is attached to the rendering device in some way
> you have the X server and the Compositor which both try to be DRM master at
> the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds like
> it will simply not work. Or is this what Andres mentions below that Dave
> is working on?
>
> Additional to that, a compositor in combination with X is a bit counter
> productive when you want to keep the latency low.
>
> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data
> to be displayed is from the Application -> X server -> compositor -> X
> server.
>
> The extra step between X server and compositor just means extra latency
> and for this use case you probably don't want that.
>
> Targeting something like Wayland and when you need X compatibility
> XWayland sounds like the much better idea.
>
> Regards,
> Christian.
>
>
> Am 22.12.2016 um 20:54 schrieb Pierre-Loup A. Griffais:
>
>> Display concerns are a separate issue, and as Andres said we have other
>> plans to address. But yes, in general you don't want another compositor in
>> the way, so we'll be acquiring the HMD display directly, separate from any
>> desktop or display server. Same with security, we can have a separate
>> conversation about that when the time comes.
>>
>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>
>>> Andres,
>>>
>>> Did you measure  latency, etc. impact of __any__ compositor?
>>>
>>> My understanding is that VR has pretty strict requirements related to
>>> QoS.
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>
>>>> Hey Christian,
>>>>
>>>> We are currently interested in X, but with some distros switching to
>>>> other compositors by default, we also need to consider those.
>>>>
>>>> We agree, running the full vrcompositor as root isn't something that
>>>> we want to do. Too many security concerns. Having a small root helper
>>>> that does the privilege escalation for us is the initial idea.
>>>>
>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>> with the "two compositors" scenario a little better in DRM+X.
>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>> in use (this is extremely annoying).
>>>>
>>>> When the above is settled, we should have an auth mechanism besides
>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>> IOCTL is probably going to be the final solution.
>>>>
>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>> from a restrictive to a more flexible permission model would be
>>>> inclusive, but going from a general to a restrictive model may exclude
>>>> some apps that used to work.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>
>>>>> Hi Andres,
>>>>>
>>>>> Well, using root might cause stability and security problems as well.
>>>>> We worked quite hard to avoid exactly this for X.
>>>>>
>>>>> We could make this feature depend on the compositor being DRM master,
>>>>> but for example with X the X server is master (and e.g. can change
>>>>> resolutions etc..) and not the compositor.
>>>>>
>>>>> So another question is also what windowing system (if any) are you
>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>> different?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>> make the high priority queues accessible to root only.
>>>>>>
>>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>>
>>>>>> Regards,
>>>>>> Andres
>>>>>>
>>>>>>
>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>
>>>>>>> BTW: If there is a non-VR application which will use the high-priority
>>>>>>>> h/w queue, then the VR application will suffer. Any ideas how
>>>>>>>> to solve it?
>>>>>>>>
>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>
>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>> use it.
>>>>>>>
>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>> and we won't get anything drawn any more.
>>>>>>>
>>>>>>> Alex or Michel any ideas on that?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>>>>>>
>>>>>>>> > If compute queue is occupied only by you, the efficiency
>>>>>>>> > is equal with setting job queue to high priority I think.
>>>>>>>> The only risk is the situation when graphics will take all
>>>>>>>> needed CUs. But in any case it should be very good test.
>>>>>>>>
>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>
>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW: If there is a non-VR application which will use the high-priority
>>>>>>>> h/w queue, then the VR application will suffer. Any ideas how
>>>>>>>> to solve it?
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>
>>>>>>>>> Do you encounter the priority issue for compute queue with
>>>>>>>>> current driver?
>>>>>>>>>
>>>>>>>>> If compute queue is occupied only by you, the efficiency is equal
>>>>>>>>> with setting job queue to high priority I think.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>>>>>>
>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>
>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>
>>>>>>>>>> - Andres
>>>>>>>>>>
>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>
>>>>>>>>>>> Of course.
>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> David Zhou
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>>>>>>
>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So we could have potential memory overcommit case or do you
>>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>> on your own?  I would think that there is need to avoid
>>>>>>>>>>>>>>> overcomit in
>>>>>>>>>>>>>>> VR case to
>>>>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting
>>>>>>>>>>>>>> up
>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>> could introduce additional synchronization constrains. btw:
>>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>> intensive
>>>>>>>>>>>>>>> (at least
>>>>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>> However, if
>>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>>>> will be not
>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>>>>> to high-priority queue  when it will be in use and "free"
>>>>>>>>>>>>>>> them later
>>>>>>>>>>>>>>> (we do not want to take CUs away from e.g. a graphics task
>>>>>>>>>>>>>>> forever and degrade graphics
>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>>>>>>>>> low-priority
>>>>>>>>>>>>>>> compute) takes all (extra) CUs and high-priority work will
>>>>>>>>>>>>>>> wait for
>>>>>>>>>>>>>>> needed resources.
>>>>>>>>>>>>>>> It will not be visible with "NOP" but only when you submit a
>>>>>>>>>>>>>>> "real"
>>>>>>>>>>>>>>> compute task,
>>>>>>>>>>>>>>> so I would recommend not to use "NOP" packets at all for
>>>>>>>>>>>>>>> testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It (CU assignment) could be relatively easy done when
>>>>>>>>>>>>>>> everything is
>>>>>>>>>>>>>>> going via kernel
>>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>>>> queue to  run will check if there is enough resources and
>>>>>>>>>>>>>>> if not then
>>>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to
>>>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>> amdkfd will not be involved.  I would assume that in the case
>>>>>>>>>>>>>>> of VR we will have one main application ("console" mode(?)) so
>>>>>>>>>>>>>>> we could temporarily "ignore" OpenCL/ROCm needs when VR is
>>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>> GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>> (a) it may take time so latency may suffer (b) to preempt we
>>>>>>>>>>>>>>> need to have a different "context" - we want to guarantee that
>>>>>>>>>>>>>>> submissions from the same context will be executed in order.
>>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on
>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>> Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>> To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>>>>>>> Even if
>>>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>>>> is most
>>>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>> priorities at amdgpu_init() time, the SW scheduler will
>>>>>>>>>>>>>>> reprogram the queue priorities dynamically when scheduling a
>>>>>>>>>>>>>>> task.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>> value, achieving
>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>
>

[-- Attachment #1.2: Type: text/html, Size: 41508 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                 ` <CAFQ_0eFRaCKKk9BaMyahBARzFEdXP9gQWbK+61R0snDz08qGdw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-12-23 16:20                                                                                   ` Bridgman, John
       [not found]                                                                                     ` <BN6PR12MB13485DCB60A2308A3C28A62CE8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Bridgman, John @ 2016-12-23 16:20 UTC (permalink / raw)
  To: Andres Rodriguez, Koenig, Christian
  Cc: Zhou, David(ChunMing),
	Mao, David, Sagalovitch, Serguei, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Pierre-Loup A. Griffais, Huan, Alvin, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 30105 bytes --]

One question I just remembered - the amdgpu driver includes some scheduler logic which maintains per-process queues and therefore avoids loading up the primary ring with a ton of work.


Has there been any experimentation with injecting priorities at that level rather than jumping straight to HW-level changes?

________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hey Christian,

But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.

Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

You are correct on both statements. We can't have two DRM_MASTERs, so the current DRM+X does not support this use case. And this is what Dave and Pierre-Loup are currently working on.

Additional to that a compositor in combination with X is a bit counter productive when you want to keep the latency low.

One thing I'd like to correct is that our main goal is to make latency _predictable_; making it low is a secondary goal.

The high priority queue feature addresses our main source of unpredictability: the scheduling latency when the hardware is already full of work from the game engine.

The DirectMode feature addresses one of the latency sources: multiple (unnecessary) context switches to submit a surface to the DRM driver.

Targeting something like Wayland and when you need X compatibility XWayland sounds like the much better idea.

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use Wayland by default). Once we have everything working nicely under X (where most of the users are currently), I'm sure Pierre-Loup will be pushing us to get everything optimized under Wayland as well (which should be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing I've been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig@amd.com> wrote:
But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

Additional to that a compositor in combination with X is a bit counter productive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to be displayed is from the Application -> X server -> compositor -> X server.

The extra step between X server and compositor just means extra latency and for this use case you probably don't want that.

Targeting something like Wayland and when you need X compatibility XWayland sounds like the much better idea.

Regards,
Christian.


On 22.12.2016 20:54, Pierre-Loup A. Griffais wrote:
Display concerns are a separate issue, and as Andres said we have other plans to address. But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server. Same with security, we can have a separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
Andres,

Did you measure the latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor in root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop environment when a VR app is not
in use (this is extremely annoying).

When the above is settled, we should have an auth mechanism besides
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
HMD permanently away from X. Re-using that auth method to gate this
IOCTL is probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect
kernel IOCTL compatibility guidelines with the most flexibility. Going
from a restrictive to a more flexible permission model would be
inclusive, but going from a general to a restrictive model may exclude
some apps that used to work.
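
A minimal sketch of what that ROOT_ONLY gate could look like in the context-create path (the flag name, capability and errno below are illustrative placeholders, not a settled interface):

    /* In amdgpu_ctx_alloc(): refuse the high priority flag for
     * unprivileged callers; normal priority proceeds as today. */
    if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
        !capable(CAP_SYS_ADMIN))
            return -EPERM;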

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:
Hi Andres,

well using root might cause stability and security problems as well.
We worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master,
but for example with X the X server is master (and e.g. can change
resolutions etc..) and not the compositor.

So another question is also what windowing system (if any) are you
planning to use? X, Wayland, Flinger or something completely different ?

Regards,
Christian.

On 20.12.2016 16:51, Andres Rodriguez wrote:
Hi Christian,

That is definitely a concern. What we are currently thinking is to
make the high priority queues accessible to root only.

Therefore if a non-root user attempts to set the high priority flag
on context allocation, we would fail the call and return EPERM.

Regards,
Andres


On 12/20/2016 7:56 AM, Christian König wrote:
BTW: If there is a non-VR application which will use the high-priority
h/w queue then the VR application will suffer.  Any ideas how
to solve it?
Yeah, that problem came to my mind as well.

Basically we need to restrict those high priority submissions to
the VR compositor or otherwise any malfunctioning application could
use it.

Just think about some WebGL suddenly taking all our rendering away
and we won't get anything drawn any more.

Alex or Michel any ideas on that?

Regards,
Christian.

On 19.12.2016 15:48, Serguei Sagalovitch wrote:
> If compute queue is occupied only by you, the efficiency
> is equal with setting job queue to high priority I think.
The only risk is the situation when graphics will take all
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try to do it, or is it a lot of work for you?


BTW: If there is a non-VR application which will use the high-priority
h/w queue then the VR application will suffer.  Any ideas how
to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
Do you encounter the priority issue for compute queue with
current driver?

If compute queue is occupied only by you, the efficiency is equal
with setting job queue to high priority I think.

Regards,
David Zhou

On 2016-12-19 13:29, Andres Rodriguez wrote:
Yes, vulkan is available on all-open through the mesa radv UMD.

I'm not sure if I'm asking for too much, but if we can
coordinate a similar interface in radv and amdgpu-pro at the
vulkan level that would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:


On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
We're currently working with the open stack; I assume that a
mechanism could be exposed by both open and Pro Vulkan
userspace drivers and that the amdgpu kernel interface
improvements we would pursue following this discussion would
let both drivers take advantage of the feature, correct?
Of course.
Does open stack have Vulkan support?

Regards,
David Zhou

On 12/18/2016 07:26 PM, zhoucm1 wrote:
By the way, are you using all-open driver or amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
Hi Serguei,

I'm also working on bringing up our VR runtime on top of
amdgpu;
see replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
Andres,

 For current VR workloads we have 3 separate processes
running
actually:
So we could have a potential memory overcommit case or do you do
partitioning on your own?  I would think that there is a need to
avoid overcommit in the VR case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, we're
working on
prioritized GPU scheduling and pre-emption (e.g. this
thread), and in
the future it will make sense to do work in order to make
sure that
its memory allocations do not get evicted, to prevent any
unwelcome
additional latency in the event of needing to perform
just-in-time
reprojection.
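
For illustration, the CPU-side piece can be as simple as moving the compositor thread into a real-time scheduling class (this assumes a dedicated compositor thread; the priority value is arbitrary):

    #include <pthread.h>

    /* Keep the compositor runnable ahead of the game's render threads.
     * SCHED_FIFO requires CAP_SYS_NICE or an RLIMIT_RTPRIO allowance. */
    static int make_compositor_realtime(pthread_t thread)
    {
        struct sched_param param = { .sched_priority = 10 };

        return pthread_setschedparam(thread, SCHED_FIFO, &param);
    }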

BTW: Do you mean __real__ processes or threads?
Based on my understanding sharing BOs between different processes
could introduce additional synchronization constraints. btw: I am
not sure if we are able to share Vulkan sync objects across a
process boundary.

They are different processes; it is important for the
compositor that
is responsible for quality-of-service features such as
consistently
presenting distorted frames with the right latency,
reprojection, etc,
to be separate from the main application.

Currently we are using unreleased cross-process memory and
semaphore
extensions to fetch updated eye images from the client
application,
but the just-in-time reprojection discussed here does not
actually
have any direct interactions with cross-process resource
sharing,
since it's achieved by using whatever is the latest, most
up-to-date
eye images that have already been sent by the client
application,
which are already available to use without additional
synchronization.


   3) System compositor (we are looking at approaches to
remove this
overhead)
Yes,  IMHO the best is to run in  "full screen mode".

Yes, we are working on mechanisms to present directly to the
headset
display without any intermediaries as a separate effort.


 The latency is our main concern,
I would assume that this is the known problem (at least for
compute
usage).
It looks like amdgpu / kernel submission is rather CPU
intensive
(at least
in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue.
However, if
there are high degrees of variance then that would be troublesome and we
troublesome and we
would need to account for the worst case.

Hopefully the requirements and approach we described make
sense, we're
looking forward to your feedback and suggestions.

Thanks!
 - Pierre-Loup


Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hey Serguei,

[Serguei] No. I mean pipe :-) as the MEC defines it.  As far as I
understand (by simplifying) some scheduling is per pipe.  I know
about the current allocation scheme but I do not think that it is
ideal.  I would assume that we need to switch to dynamic
partitioning of resources based on the workload, otherwise we will
have a resource conflict between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can
start with a
solution that assumes that
only pipe0 has any work and the other pipes are idle (no
HSA/ROCm
running on the system).

This should be more or less the use case we expect from VR
users.

I agree the split is currently not ideal, but I'd like to
consider that a separate task, because
making it dynamic is not straightforward :P

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
will not be involved.  I would assume that in the case of VR we
will have one main application ("console" mode(?)) so we could
temporarily "ignore" OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority
compute
queue through
libdrm-amdgpu, so that we can expose it through Vulkan later.
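
As a sketch of the plumbing (the entry point below is hypothetical; the existing amdgpu_cs_ctx_create() takes no priority argument, so something along these lines would be added):

    amdgpu_context_handle ctx;

    /* Hypothetical libdrm-amdgpu wrapper that forwards the requested
     * priority to the kernel via the context-create ioctl. */
    int r = amdgpu_cs_ctx_create_with_priority(dev, AMDGPU_CTX_HIGH_PRIORITY,
                                               &ctx);
    if (r)
        fprintf(stderr, "high priority context denied: %d\n", r);

A Vulkan driver could then route high priority VK_QUEUE_COMPUTE_BIT queues onto contexts created this way.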

For current VR workloads we have 3 separate processes
running actually:
    1) Game process
    2) VR Compositor (this is the process that will require
high
priority queue)
    3) System compositor (we are looking at approaches to
remove this
overhead)

For now I think it is okay to assume no OpenCL/ROCm running
simultaneously, but
I would also like to be able to address this case in the
future
(cross-pipe priorities).

[Serguei]  The problem with pre-emption of a graphics task: (a) it
may take time so latency may suffer

The latency is our main concern, we want something that is
predictable. A good
illustration of what the reprojection scheduling looks like
can be
found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png



(b) to preempt we need to have a different "context" - we want
to guarantee that submissions from the same context will be
executed in order.

This is okay, as the reprojection work doesn't have
dependencies on
the game context, and it
even happens in a separate process.

BTW: (a) Do you want "preempt" and later resume or do you
want
"preempt" and
"cancel/abort"

Preempt the game with the compositor task and then resume it.

(b) Vulkan is generic API and could be used for graphics
as well as
for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use vulkan compute. But if you figure
out a way
for us to get
a guaranteed execution time using vulkan graphics, then
I'll take you
out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU
assignments/binding to the high-priority queue when it is in use
and "free" them later (we do not want to take CUs away from e.g. a
graphics task forever and degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or
low-priority compute) takes all (extra) CUs and high-priority work
will wait for the needed resources.
It will not be visible with "NOP" but only when you submit a "real"
compute task, so I would recommend not using "NOP" packets at all
for testing.

It (CU assignment) could be relatively easily done when everything
is going via the kernel (e.g. as part of frame submission) but I
must admit that I am not sure about the best way for user level
submissions (amdkfd).

[AR] I wasn't aware of this part of the programming
sequence. Thanks
for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler",
when deciding which queue to run, will check if there are enough
resources and, if not, will begin to check other queues with lower
priority.

2) I would recommend dedicating the whole pipe to the high-priority
queue and having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue?
(as opposed
to the MEC definition
of pipe, which is a grouping of queues). I say this because
amdgpu
only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
understand (by simplifying) some scheduling is per pipe.  I know
about the current allocation scheme but I do not think that it is
ideal.  I would assume that we need to switch to dynamic
partitioning of resources based on the workload, otherwise we will
have a resource conflict between Vulkan compute and OpenCL.


BTW: Which user level API do you want to use for compute:
Vulkan or
OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
will not be involved.  I would assume that in the case of VR we
will have one main application ("console" mode(?)) so we could
temporarily "ignore" OpenCL/ROCm needs when VR is running.

 we will not be able to provide a solution compatible with GFX
workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the
currently running
graphics job and scheduling in
something else using mid-buffer pre-emption has some cases
where it
doesn't work well. But if with
polaris10 it starts working well, it might be a better
solution for
us (because the whole reprojection
work uses the vulkan graphics stack at the moment, and
porting it to
compute is not trivial).

[Serguei]  The problem with pre-emption of a graphics task: (a) it
may take time so latency may suffer (b) to preempt we need to have
a different "context" - we want to guarantee that submissions from
the same context will be executed in order.
BTW: (a) Do you want  "preempt" and later resume or do you
want
"preempt" and
"cancel/abort"?  (b) Vulkan is generic API and could be used
for graphics as well as for plain compute tasks
(VK_QUEUE_COMPUTE_BIT).


Sincerely yours,
Serguei Sagalovitch



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 6:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in
amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

We are interested in feedback for a mechanism to
effectively schedule
high
priority VR reprojection tasks (also referred to as
time-warping) for
Polaris10
running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion
sickness for VR
users in
scenarios where the game or application would fail to finish
rendering a new
frame in time for the next VBLANK. When this happens, the
user's head
movements
are not reflected on the Head Mounted Display (HMD) for the
duration
of an
extra frame. This extended mismatch between the inner ear
and the
eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a
new frame
using the
user's updated head position in combination with the
previous frames.
This
avoids a prolonged mismatch between the HMD output and the
inner ear.

Because of the adverse effects on the user, we require high
confidence that the
reprojection task will complete before the VBLANK interval.
Even if
the GFX pipe
is currently full of work from the game/application (which
is most
likely the case).

For more details and illustrations, please refer to the
following
document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from
submission to
fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy
hardware
should
be equivalent to submitting a NOP on idle hardware.

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware
capabilities in
Polaris10 we
will not be able to provide a solution compatible with GFX
workloads.

But I would love to hear otherwise. So if anyone has an idea,
approach or
suggestion that will also be compatible with the GFX ring,
please let
us know
about it.

    * The above guarantees should also be respected by
amdkfd workloads

Would be good to have for consistency, but not strictly
necessary as
users running
games are not traditionally running HPC workloads in the
background.

Proposed approach:
------------------

Similar to the windows driver, we could expose a high priority
compute queue to
userspace.

Submissions to this compute queue will be scheduled with high
priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority'
field in
the HQDs
and could be programmed by amdgpu or the amdgpu scheduler.
The relevant
register fields are:
        * mmCP_HQD_PIPE_PRIORITY
        * mmCP_HQD_QUEUE_PRIORITY

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from
pipe0. We can
statically partition these as follows:
        * 7x regular
        * 1x high priority

The relevant priorities can be set so that submissions to
the high
priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high
priority
rings if the
context is marked as high priority. And a corresponding
priority
should be
added to keep track of this information:
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL

The user will request a high priority context by setting an
appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163



The setting is at a per-context level so that we can:
    * Maintain a consistent FIFO ordering of all
submissions to a
context
    * Create high priority and non-high priority contexts
in the same
process

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue
priorities dynamically when scheduling a task.

This would involve having a hardware specific callback from
the
scheduler to
set the appropriate queue priority: set_priority(int ring,
int index,
int priority)

During this callback we would have to grab the SRBM mutex
to perform
the appropriate
HW programming, and I'm not really sure if that is
something we
should be doing from
the scheduler.

On the positive side, this approach would allow us to
program a range of
priorities for jobs instead of a single "high priority" value,
achieving
something similar to the niceness API available for CPU
scheduling.

I'm not sure if this flexibility is something that we would
need for
our use
case, but it might be useful in other scenarios (multiple
users
sharing compute
time on a server).

This approach would require a new int field in
drm_amdgpu_ctx_in, or
repurposing
of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD
priorities, and
instead it picks
jobs at random. Settings from the shader itself are also
disregarded
as this is
considered a privileged field.

Effectively we can get our compute wavefront launched ASAP,
but we
might not get the
time we need on the SQ.

The current programming would have to be changed to allow
priority
propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be
enabled
for all HW IPs
with support of the SW scheduler. This will function
similarly to the
current
AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
ahead of
anything not
committed to the HW queue.

The benefits of requesting a high priority context for a
non-compute
queue will
be lesser (e.g. up to 10s of wait time if a GFX command is
stuck in
front of
you), but having the API in place will allow us to easily
improve the
implementation
in the future as new features become available in new
hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the
implementation.

Also, once the interface is mostly decided, we can start
thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above.
Our goal
is to
obtain a mechanism that will allow us to complete the
reprojection
job within a
predictable amount of time. So if anyone has any
suggestions for
improvements or alternative strategies we are more than
happy to hear
them.

If any of the technical information above is also
incorrect, feel
free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



Sincerely yours,
Serguei Sagalovitch

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx






Sincerely yours,
Serguei Sagalovitch





[-- Attachment #1.2: Type: text/html, Size: 42537 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                     ` <BN6PR12MB13485DCB60A2308A3C28A62CE8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-12-23 16:30                                                                                       ` Andres Rodriguez
       [not found]                                                                                         ` <CAFQ_0eGgYpb-d+OBG-q2S=Ha90GrNGBrTRvfvY64B_ya7Pvyzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-23 16:30 UTC (permalink / raw)
  To: Bridgman, John
  Cc: Zhou, David(ChunMing),
	Mao, David, Sagalovitch, Serguei,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Pierre-Loup A. Griffais, Koenig, Christian, Huan, Alvin, Zhang,
	Hawking


[-- Attachment #1.1: Type: text/plain, Size: 45789 bytes --]

I'm actually testing that out today.

Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12
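
In rough strokes the idea is one run queue per priority level inside the SW scheduler, scanned from highest to lowest when picking the next entity (a sketch only; names such as sched_rq[] and AMD_SCHED_PRIORITY_MAX are illustrative, and the patch above is the authoritative version):

    static struct amd_sched_entity *
    amd_sched_select_entity(struct amd_gpu_scheduler *sched)
    {
        struct amd_sched_entity *entity;
        int i;

        /* A runnable high priority entity always wins over lower
         * priority work that has not yet been committed to a HW ring. */
        for (i = AMD_SCHED_PRIORITY_MAX - 1; i >= 0; i--) {
            entity = amd_sched_rq_select_entity(&sched->sched_rq[i]);
            if (entity)
                return entity;
        }

        return NULL;
    }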

My goal is to first implement this approach, then slowly work my way
towards the HW level optimizations.

The problem I expect to see with this approach is that there will still be
unpredictably long latencies depending on what has been committed to the HW
rings.

But it is definitely a good start.
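
A simple way to quantify that is to time submission-to-fence-signal round trips on a busy vs. an idle GPU (sketched against libdrm-amdgpu; context, IB and fence setup are omitted):

    struct timespec t0, t1;
    uint32_t expired;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    amdgpu_cs_submit(ctx, 0, &ibs_request, 1);
    amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE,
                                 0, &expired);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Compare the (t1 - t0) distributions: predictability means the
     * busy-GPU tail stays close to the idle-GPU one. */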

Regards,
Andres

On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John <John.Bridgman-5C7GfCeVMHo@public.gmane.org>
wrote:

> One question I just remembered - the amdgpu driver includes some scheduler
> logic which maintains per-process queues and therefore avoids loading up
> the primary ring with a ton of work.
>
>
> Has there been any experimentation with injecting priorities at that level
> rather than jumping straight to HW-level changes?
> ------------------------------
> *From:* amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on behalf of
> Andres Rodriguez <andresx7-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> *Sent:* December 23, 2016 11:13 AM
> *To:* Koenig, Christian
> *Cc:* Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch,
> Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Andres Rodriguez; Pierre-Loup A.
> Griffais; Zhang, Hawking
>
> *Subject:* Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hey Christian,
>
> But yes, in general you don't want another compositor in the way, so we'll
>>> be acquiring the HMD display directly, separate from any desktop or display
>>> server.
>>
>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master at
>> the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that Dave
>> is working on?
>>
>
> You are correct on both statements. We can't have two DRM_MASTERs, so the
> current DRM+X does not support this use case. And this is what Dave and
> Pierre-Loup are currently working on.
>
> Additional to that a compositor in combination with X is a bit counter
>> productive when you want to keep the latency low.
>>
>
> One thing I'd like to correct is that our main goal is to make latency
> _predictable_; making it low is a secondary goal.
>
> The high priority queue feature addresses our main source of
> unpredictability: the scheduling latency when the hardware is already full
> of work from the game engine.
>
> The DirectMode feature addresses one of the latency sources: multiple
> (unnecessary) context switches to submit a surface to the DRM driver.
>
> Targeting something like Wayland and when you need X compatibility
>> XWayland sounds like the much better idea.
>>
>
> We are pretty enthusiastic about Wayland (and really glad to see Fedora 25
> use Wayland by default). Once we have everything working nicely under X
> (where most of the users are currently), I'm sure Pierre-Loup will be
> pushing us to get everything optimized under Wayland as well (which should
> be a lot simpler!).
>
> Ever since working with SurfaceFlinger on Android with explicit fencing
> I've been waiting for the day I can finally ditch X altogether :)
>
> Regards,
> Andres
>
>
> On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig@amd.com> wrote:
>
>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop or
>>> display server.
>>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master at
>> the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that Dave
>> is working on?
>>
>> Additional to that a compositor in combination with X is a bit counter
>> productive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor -> X
>> server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland and when you need X compatibility
>> XWayland sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>>
>> On 22.12.2016 20:54, Pierre-Loup A. Griffais wrote:
>>
>>> Display concerns are a separate issue, and as Andres said we have other
>>> plans to address. But yes, in general you don't want another compositor in
>>> the way, so we'll be acquiring the HMD display directly, separate from any
>>> desktop or display server. Same with security, we can have a separate
>>> conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>
>>>> Andres,
>>>>
>>>> Did you measure the latency, etc. impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor in root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>>> in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>>> from a restrictive to a more flexible permission model would be
>>>>> inclusive, but going from a general to a restrictive model may exclude
>>>>> some apps that used to work.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>
>>>>>> Hi Andres,
>>>>>>
>>>>>> well using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM master,
>>>>>> but for example with X the X server is master (and e.g. can change
>>>>>> resolutions etc..) and not the compositor.
>>>>>>
>>>>>> So another question is also what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>
>>>>>>>> BTW: If there is a non-VR application which will use the high-priority
>>>>>>>>> h/w queue then the VR application will suffer.  Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>>> use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>>> and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>>>>>
>>>>>>>>> > If compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equal with setting job queue to high priority I think.
>>>>>>>>> The only risk is the situation when graphics will take all
>>>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which will use the high-priority
>>>>>>>>> h/w queue then the VR application will suffer.  Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>
>>>>>>>>>> Do you encounter the priority issue for compute queue with
>>>>>>>>>> current driver?
>>>>>>>>>>
>>>>>>>>>> If compute queue is occupied only by you, the efficiency is equal
>>>>>>>>>> with setting job queue to high priority I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>>
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro
>>>>>>>>>>>>>> driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do you
>>>>>>>>>>>>>>>> do partitioning on your own?  I would think that there is a
>>>>>>>>>>>>>>>> need to avoid overcommit in the VR case to prevent any BO
>>>>>>>>>>>>>>>> migration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting
>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>> could introduce additional synchronization constraints. btw:
>>>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>>> intensive (at least in the default configuration).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>>> However, if
>>>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far as I
>>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>>>>> will be not
>>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>>>>>> to high-priority queue  when it will be in use and "free"
>>>>>>>>>>>>>>>> them later
>>>>>>>>>>>>>>>> (we do not want to forever take CUs from e.g. the graphics task to
>>>>>>>>>>>>>>>> degrade
>>>>>>>>>>>>>>>> graphics
>>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have scenario when long graphics task (or
>>>>>>>>>>>>>>>> low-priority
>>>>>>>>>>>>>>>> compute) will take all (extra) CUs and the high-priority one will
>>>>>>>>>>>>>>>> wait for
>>>>>>>>>>>>>>>> needed resources.
>>>>>>>>>>>>>>>> It will not be visible on "NOP " but only when you submit
>>>>>>>>>>>>>>>> "real"
>>>>>>>>>>>>>>>> compute task
>>>>>>>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for
>>>>>>>>>>>>>>>> testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be relatively easily done when
>>>>>>>>>>>>>>>> everything is
>>>>>>>>>>>>>>>> going via kernel
>>>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>>>>> queue to run will check if there are enough resources and
>>>>>>>>>>>>>>>> if not then
>>>>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to
>>>>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>>>>>> be not
>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>>>>> take time so
>>>>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>>>>> "context"
>>>>>>>>>>>>>>>> - we want
>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on
>>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>> Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>>> To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>>>>>>>> Even if
>>>>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>>>>> is most
>>>>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>>> acquire hardware resources previously in use by other
>>>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setting is in a per context level so that we can:
>>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>>> priorities at
>>>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>>> value,
>>>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>
>>
>

[-- Attachment #1.2: Type: text/html, Size: 44480 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                         ` <CAFQ_0eGgYpb-d+OBG-q2S=Ha90GrNGBrTRvfvY64B_ya7Pvyzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-12-23 16:49                                                                                           ` Bridgman, John
       [not found]                                                                                             ` <BN6PR12MB1348A8E9B5AAC0A1DC66B2E8E8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Bridgman, John @ 2016-12-23 16:49 UTC (permalink / raw)
  To: Andres Rodriguez
  Cc: Zhou, David(ChunMing),
	Mao, David, Sagalovitch, Serguei,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Pierre-Loup A. Griffais, Koenig, Christian, Huan, Alvin, Zhang,
	Hawking


[-- Attachment #1.1: Type: text/plain, Size: 31855 bytes --]

Excellent, thanks. Agree that it is not a complete solution, just a good start.


I do think we will need to get to setting priorities at the HW level fairly quickly (we want it for ROCm as well as for VR), but we'll need to eliminate the current requirement for randomization at the SQ as part of a HW approach, and I don't think we know how long that will take at the moment.


IIRC randomization was required to avoid deadlock problems with certain OpenCL programs - what I don't know is whether the problem is inherent to the OpenCL API spec or just a function of how specific OpenCL programs were written. I'll try to dig up some history for that and ask around internally as well.

From: Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:30 AM
To: Bridgman, John
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

I'm actually testing that out today.

Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12

My goal is to first implement this approach, then slowly work my way towards the HW level optimizations.
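
For anyone following along, here is a rough sketch of the shape of that
change (simplified pseudo-kernel code, not the actual diff; the enum value
and the reserved ring index are illustrative assumptions):

    /* Sketch only: a context flagged high priority gets steered onto a
     * compute ring whose HQD was programmed with elevated
     * mmCP_HQD_PIPE_PRIORITY/mmCP_HQD_QUEUE_PRIORITY values at init time. */
    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_NORMAL = 0,
            AMD_SCHED_PRIORITY_HIGH,    /* new: may starve other rings */
            AMD_SCHED_PRIORITY_KERNEL,  /* existing: jumps the SW queue */
    };

    static struct amdgpu_ring *
    amdgpu_select_compute_ring(struct amdgpu_device *adev,
                               enum amd_sched_priority prio)
    {
            /* Static partitioning: reserve one ring (here ring 7) for
             * high priority work, leave the rest for regular contexts. */
            if (prio == AMD_SCHED_PRIORITY_HIGH)
                    return &adev->gfx.compute_ring[7];

            return &adev->gfx.compute_ring[0]; /* normal selection elided */
    }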

The problem I expect to see with this approach is that there will still be unpredictably long latencies depending on what has been committed to the HW rings.

But it is definitely a good start.

Regards,
Andres

On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John <John.Bridgman@amd.com> wrote:

One question I just remembered - the amdgpu driver includes some scheduler logic which maintains per-process queues and therefore avoids loading up the primary ring with a ton of work.


Has there been any experimentation with injecting priorities at that level rather than jumping straight to HW-level changes?

________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking

Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hey Christian,

But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.

Assuming that the HMD is attached to the rendering device in some way, you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

You are correct on both statements. We can't have two DRM_MASTERs, so the current DRM+X does not support this use case. And this is what Dave and Pierre-Loup are currently working on.

Additionally, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.

One thing I'd like to correct is that our main goal is to get latency _predictable_; the secondary goal is to make it low.

The high priority queue feature addresses our main source of unpredictability: the scheduling latency when the hardware is already full of work from the game engine.

The DirectMode feature addresses one of the latency sources: multiple (unnecessary) context switches to submit a surface to the DRM driver.

Targeting something like Wayland, with XWayland when you need X compatibility, sounds like the much better idea.

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use Wayland by default). Once we have everything working nicely under X (where most of the users are currently), I'm sure Pierre-Loup will be pushing us to get everything optimized under Wayland as well (which should be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing I've been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig@amd.com> wrote:
But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
Assuming that the HMD is attached to the rendering device in some way, you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

Additionally, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to be displayed is from the Application -> X server -> compositor -> X server.

The extra step between X server and compositor just means extra latency and for this use case you probably don't want that.

Targeting something like Wayland, with XWayland when you need X compatibility, sounds like the much better idea.

Regards,
Christian.


On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
Display concerns are a separate issue, and as Andres said we have other plans to address them. But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server. Same with security, we can have a separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
Andres,

Did you measure  latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor in root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop environment when a VR app is not
in use (this is extremely annoying).

When the above is settled, we should have an auth mechanism besides
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
HMD permanently away from X. Re-using that auth method to gate this
IOCTL is probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect
kernel IOCTL compatibility guidelines with the most flexibility. Going
from a restrictive to a more flexible permission model would be
inclusive, but going from a general to a restrictive model may exclude
some apps that used to work.

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:
Hi Andres,

well using root might cause stability and security problems as well.
We worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master,
but for example with X the X server is master (and e.g. can change
resolutions etc..) and not the compositor.

So another question is also what windowing system (if any) are you
planning to use? X, Wayland, Flinger or something completely different?

Regards,
Christian.

On 20.12.2016 at 16:51, Andres Rodriguez wrote:
Hi Christian,

That is definitely a concern. What we are currently thinking is to
make the high priority queues accessible to root only.

Therefore if a non-root user attempts to set the high priority flag
on context allocation, we would fail the call and return EPERM.
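
A minimal sketch of that gate at context allocation time (the flag and
function names below are placeholders, not a settled interface):

    /* Sketch: reject high priority requests from unprivileged callers. */
    static int amdgpu_ctx_check_priority(struct drm_file *filp, u32 flags)
    {
            if (!(flags & AMDGPU_CTX_FLAG_HIGH_PRIORITY)) /* placeholder */
                    return 0;

            /* Until the better auth mechanism exists, only root (or the
             * small root helper acting for the vrcompositor) may ask for
             * a high priority context. filp is kept around so a DRM
             * master based check could be added later. */
            if (!capable(CAP_SYS_ADMIN))
                    return -EPERM;

            return 0;
    }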

Regards,
Andres


On 12/20/2016 7:56 AM, Christian König wrote:
BTW: If there is a non-VR application which will use the high-priority
h/w queue then the VR application will suffer.  Any ideas how
to solve it?
Yeah, that problem came to my mind as well.

Basically we need to restrict those high priority submissions to
the VR compositor or otherwise any malfunctioning application could
use it.

Just think about some WebGL suddenly taking all our rendering away
and we won't get anything drawn any more.

Alex or Michel any ideas on that?

Regards,
Christian.

On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
> If compute queue is occupied only by you, the efficiency
> is equal with setting job queue to high priority I think.
The only risk is the situation when graphics will take all
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try to do it, or is it a lot of work for you?


BTW: If there is a non-VR application which will use the high-priority
h/w queue then the VR application will suffer.  Any ideas how
to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
Do you encounter the priority issue for compute queue with
current driver?

If compute queue is occupied only by you, the efficiency is equal
with setting job queue to high priority I think.

Regards,
David Zhou

On 2016-12-19 13:29, Andres Rodriguez wrote:
Yes, vulkan is available on all-open through the mesa radv UMD.

I'm not sure if I'm asking for too much, but if we can
coordinate a similar interface in radv and amdgpu-pro at the
vulkan level that would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:


On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
We're currently working with the open stack; I assume that a
mechanism could be exposed by both open and Pro Vulkan
userspace drivers and that the amdgpu kernel interface
improvements we would pursue following this discussion would
let both drivers take advantage of the feature, correct?
Of course.
Does open stack have Vulkan support?

Regards,
David Zhou

On 12/18/2016 07:26 PM, zhoucm1 wrote:
By the way, are you using all-open driver or amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
Hi Serguei,

I'm also working on bringing up our VR runtime on top of
amdgpu;
see replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
Andres,

 For current VR workloads we have 3 separate processes
running
actually:
So we could have a potential memory overcommit case, or do you do
partitioning on your own?  I would think that there is a need to avoid
overcommit in the VR case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, we're
working on
prioritized GPU scheduling and pre-emption (eg. this
thread), and in
the future it will make sense to do work in order to make
sure that
its memory allocations do not get evicted, to prevent any
unwelcome
additional latency in the event of needing to perform
just-in-time
reprojection.

BTW: Do you mean __real__ processes or threads?
Based on my understanding sharing BOs between different
processes
could introduce additional synchronization constraints. btw:
I am not
sure
if we are able to share Vulkan sync. object cross-process
boundary.

They are different processes; it is important for the
compositor that
is responsible for quality-of-service features such as
consistently
presenting distorted frames with the right latency,
reprojection, etc,
to be separate from the main application.

Currently we are using unreleased cross-process memory and
semaphore
extensions to fetch updated eye images from the client
application,
but the just-in-time reprojection discussed here does not
actually
have any direct interactions with cross-process resource
sharing,
since it's achieved by using whatever is the latest, most
up-to-date
eye images that have already been sent by the client
application,
which are already available to use without additional
synchronization.


   3) System compositor (we are looking at approaches to
remove this
overhead)
Yes,  IMHO the best is to run in  "full screen mode".

Yes, we are working on mechanisms to present directly to the
headset
display without any intermediaries as a separate effort.


 The latency is our main concern,
I would assume that this is the known problem (at least for
compute
usage).
It looks like amdgpu / kernel submission is rather CPU intensive
(at least in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue.
However, if
there's high degrees of variance then that would be
troublesome and we
would need to account for the worst case.

Hopefully the requirements and approach we described make
sense, we're
looking forward to your feedback and suggestions.

Thanks!
 - Pierre-Loup


Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hey Serguei,

[Serguei] No. I mean pipe :-) as MEC define it.  As far as I
understand (by simplifying)
some scheduling is per pipe.  I know about the current
allocation
scheme but I do not think
that it is  ideal.  I would assume that we need to switch to
dynamical partition
of resources  based on the workload otherwise we will have
resource
conflict
between Vulkan compute and  OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can
start with a
solution that assumes that
only pipe0 has any work and the other pipes are idle (no
HSA/ROCm
running on the system).

This should be more or less the use case we expect from VR
users.

I agree the split is currently not ideal, but I'd like to
consider
that a separate task, because
making it dynamic is not straightforward :P

[Serguei] Vulkan works via amdgpu (kernel submissions) so
amdkfd
will be not
involved.  I would assume that in the case of VR we will
have one main
application ("console" mode(?)) so we could temporarily
"ignore"
OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority
compute
queue through
libdrm-amdgpu, so that we can expose it through Vulkan later.
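
To sketch how that could look from userspace, a hypothetical libdrm-amdgpu
entry point and flag might be (nothing below is an existing API, just an
illustration of the direction):

    #include <stdint.h>
    #include <amdgpu.h> /* libdrm-amdgpu handle types */

    #define AMDGPU_CTX_PRIORITY_HIGH 1  /* placeholder value */

    /* Hypothetical wrapper around the context-create ioctl that passes
     * the requested priority through to the kernel. */
    int amdgpu_cs_ctx_create_with_priority(amdgpu_device_handle dev,
                                           uint32_t priority,
                                           amdgpu_context_handle *ctx);

    /* The VR compositor would then request its queue roughly like so: */
    static int create_reprojection_ctx(amdgpu_device_handle dev,
                                       amdgpu_context_handle *ctx)
    {
            int err = amdgpu_cs_ctx_create_with_priority(
                            dev, AMDGPU_CTX_PRIORITY_HIGH, ctx);
            if (err == -EPERM) {
                    /* not root: ask the root helper, or fall back to a
                     * normal priority context */
            }
            return err;
    }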

For current VR workloads we have 3 separate processes
running actually:
    1) Game process
    2) VR Compositor (this is the process that will require
high
priority queue)
    3) System compositor (we are looking at approaches to
remove this
overhead)

For now I think it is okay to assume no OpenCL/ROCm running
simultaneously, but
I would also like to be able to address this case in the
future
(cross-pipe priorities).

[Serguei]  The problem with pre-emption of graphics task:
(a) it
may take time so
latency may suffer

The latency is our main concern, we want something that is
predictable. A good
illustration of what the reprojection scheduling looks like
can be
found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png



(b) to preempt we need to have different "context" - we want
to guarantee that submissions from the same context will
be executed
in order.

This is okay, as the reprojection work doesn't have
dependencies on
the game context, and it
even happens in a separate process.

BTW: (a) Do you want "preempt" and later resume or do you
want
"preempt" and
"cancel/abort"

Preempt the game with the compositor task and then resume it.

(b) Vulkan is generic API and could be used for graphics
as well as
for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use vulkan compute. But if you figure
out a way
for us to get
a guaranteed execution time using vulkan graphics, then
I'll take you
out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU
assignments/binding
to high-priority queue  when it will be in use and "free"
them later
(we do not want to forever take CUs from e.g. the graphics task to
degrade
graphics
performance).

Otherwise we could have scenario when long graphics task (or
low-priority
compute) will took all (extra) CUs and high--priority will
wait for
needed resources.
It will not be visible on "NOP " but only when you submit
"real"
compute task
so I would recommend  not to use "NOP" packets at all for
testing.

It (CU assignment) could be relatively easy done when
everything is
going via kernel
(e.g. as part of frame submission) but I must admit that I
am not sure
about the best way for user level submissions (amdkfd).

[AR] I wasn't aware of this part of the programming
sequence. Thanks
for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler", when
deciding which queue to run, will check if there are enough resources
and if not then it will begin to check other queues with lower
priority.

2) I would recommend to dedicate the whole pipe to the high-priority
queue and have nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue?
(as opposed
to the MEC definition
of pipe, which is a grouping of queues). I say this because
amdgpu
only has access to 1 pipe,
and the rest are statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
understand (by simplifying), some scheduling is per pipe. I know about
the current allocation scheme but I do not think that it is ideal. I
would assume that we need to switch to dynamic partitioning of
resources based on the workload, otherwise we will have a resource
conflict between Vulkan compute and OpenCL.


BTW: Which user level API do you want to use for compute:
Vulkan or
OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
will not be involved. I would assume that in the case of VR we will
have one main application ("console" mode(?)) so we could temporarily
"ignore" OpenCL/ROCm needs when VR is running.

 we will not be able to provide a solution compatible with GFX
workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running
graphics job and scheduling in something else using mid-buffer
pre-emption has some cases where it doesn't work well. But if with
Polaris10 it starts working well, it might be a better solution for
us (because the whole reprojection work uses the vulkan graphics
stack at the moment, and porting it to compute is not trivial).

[Serguei] The problem with pre-emption of a graphics task: (a) it may
take time, so latency may suffer (b) to preempt we need to have a
different "context" - we want to guarantee that submissions from the
same context will be executed in order.
BTW: (a) Do you want "preempt" and later resume or do you want
"preempt" and "cancel/abort"? (b) Vulkan is a generic API and could
be used for graphics as well as for plain compute tasks
(VK_QUEUE_COMPUTE_BIT).


Sincerely yours,
Serguei Sagalovitch



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
behalf of
Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 6:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in
amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

We are interested in feedback for a mechanism to
effectively schedule
high
priority VR reprojection tasks (also referred to as
time-warping) for
Polaris10
running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion
sickness for VR
users in
scenarios where the game or application would fail to finish
rendering a new
frame in time for the next VBLANK. When this happens, the
user's head
movements
are not reflected on the Head Mounted Display (HMD) for the
duration
of an
extra frame. This extended mismatch between the inner ear
and the
eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a
new frame
using the
user's updated head position in combination with the
previous frames.
This
avoids a prolonged mismatch between the HMD output and the
inner ear.

Because of the adverse effects on the user, we require high
confidence that the
reprojection task will complete before the VBLANK interval.
Even if
the GFX pipe
is currently full of work from the game/application (which
is most
likely the case).

For more details and illustrations, please refer to the
following
document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from
submission to
fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy
hardware
should
be equivalent to submitting a NOP on idle hardware.
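
As a rough sketch of what that test could look like with the libdrm
amdgpu API (IB setup and error handling omitted; the helper below is
ours for illustration, not an existing libdrm function):

#include <time.h>
#include <amdgpu.h>

/* Sketch: time one trivial submission from amdgpu_cs_submit() to
 * fence signal. 'req' is assumed to describe an IB containing a
 * single NOP packet. */
static double time_nop_roundtrip(amdgpu_context_handle ctx,
                                 struct amdgpu_cs_request *req)
{
        struct amdgpu_cs_fence fence = {};
        struct timespec t0, t1;
        uint32_t expired;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        amdgpu_cs_submit(ctx, 0, req, 1);

        fence.context = ctx;
        fence.ip_type = req->ip_type;
        fence.ring = req->ring;
        fence.fence = req->seq_no;
        amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE,
                                     0, &expired);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

Comparing this number on idle hardware against hardware loaded by a
game engine gives the latency delta the goal above refers to.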

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in
Polaris10 we will not be able to provide a solution compatible with
GFX workloads.

But I would love to hear otherwise. So if anyone has an idea,
approach or
suggestion that will also be compatible with the GFX ring,
please let
us know
about it.

    * The above guarantees should also be respected by
amdkfd workloads

Would be good to have for consistency, but not strictly
necessary as
users running
games are not traditionally running HPC workloads in the
background.

Proposed approach:
------------------

Similar to the windows driver, we could expose a high priority
compute queue to
userspace.

Submissions to this compute queue will be scheduled with high
priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority'
field in
the HQDs
and could be programmed by amdgpu or the amdgpu scheduler.
The relevant
register fields are:
        * mmCP_HQD_PIPE_PRIORITY
        * mmCP_HQD_QUEUE_PRIORITY
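
For illustration, the per-queue programming could be as simple as the
following sketch (the numeric encodings - pipe priority 0..2, queue
priority 0..15 - are an assumption to be checked against the register
specs, and the target queue must already be selected via SRBM):

/* Sketch: mark the currently SRBM-selected HQD as high priority.
 * Caller holds srbm_mutex; the priority values are assumptions. */
static void set_hqd_high_priority(struct amdgpu_device *adev)
{
        WREG32(mmCP_HQD_PIPE_PRIORITY, 2);      /* high pipe priority */
        WREG32(mmCP_HQD_QUEUE_PRIORITY, 15);    /* max in-pipe priority */
}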

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from
pipe0. We can
statically partition these as follows:
        * 7x regular
        * 1x high priority

The relevant priorities can be set so that submissions to
the high
priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high
priority
rings if the
context is marked as high priority. And a corresponding
priority
should be
added to keep track of this information:
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL
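
As a sketch, the enum in gpu_scheduler.h would gain one level between
the existing kernel and normal run queues (AMD_SCHED_PRIORITY_HIGH is
the proposed addition):

enum amd_sched_priority {
        AMD_SCHED_PRIORITY_KERNEL = 0,
        AMD_SCHED_PRIORITY_HIGH,        /* proposed: high priority contexts */
        AMD_SCHED_PRIORITY_NORMAL,
        AMD_SCHED_MAX_PRIORITY
};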

The user will request a high priority context by setting an
appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163



The setting is at a per-context level so that we can:
    * Maintain a consistent FIFO ordering of all submissions to a
context
    * Create high priority and non-high priority contexts in the same
process
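
On the userspace side this could look roughly like the following
sketch against the existing DRM_AMDGPU_CTX ioctl; the
AMDGPU_CTX_HIGH_PRIORITY define is the proposed flag, not current
uapi:

#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

#define AMDGPU_CTX_HIGH_PRIORITY (1 << 0)       /* proposed flag */

/* Sketch: allocate a context and ask for high priority scheduling. */
static int alloc_high_prio_ctx(int fd, uint32_t *ctx_id)
{
        union drm_amdgpu_ctx args;
        int r;

        memset(&args, 0, sizeof(args));
        args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
        args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

        r = drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));
        if (r)
                return r;

        *ctx_id = args.out.alloc.ctx_id;
        return 0;
}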

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the
queue priorities
dynamically when scheduling a task.

This would involve having a hardware specific callback from
the
scheduler to
set the appropriate queue priority: set_priority(int ring,
int index,
int priority)

During this callback we would have to grab the SRBM mutex
to perform
the appropriate
HW programming, and I'm not really sure if that is
something we
should be doing from
the scheduler.
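
To make the concern concrete, on gfx8 the callback would end up
looking roughly like this (a sketch reusing the SRBM helpers from the
existing queue init path; the 'ring, index' pair from the signature
above collapses into the amdgpu_ring here):

/* Sketch: reprogram the queue priority of one compute ring. Note the
 * srbm_mutex + vi_srbm_select() dance that worries us above. */
static void gfx_v8_0_ring_set_priority(struct amdgpu_ring *ring,
                                       int priority)
{
        struct amdgpu_device *adev = ring->adev;

        mutex_lock(&adev->srbm_mutex);
        vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
        WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
        vi_srbm_select(adev, 0, 0, 0, 0);
        mutex_unlock(&adev->srbm_mutex);
}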

On the positive side, this approach would allow us to program a range
of priorities for jobs instead of a single "high priority" value,
achieving something similar to the niceness API available for CPU
scheduling.

I'm not sure if this flexibility is something that we would
need for
our use
case, but it might be useful in other scenarios (multiple
users
sharing compute
time on a server).

This approach would require a new int field in
drm_amdgpu_ctx_in, or
repurposing
of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD
priorities, and
instead it picks
jobs at random. Settings from the shader itself are also
disregarded
as this is
considered a privileged field.

Effectively we can get our compute wavefront launched ASAP,
but we
might not get the
time we need on the SQ.

The current programming would have to be changed to allow
priority
propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be
enabled
for all HW IPs
with support of the SW scheduler. This will function
similarly to the
current
AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
anything not committed to the HW queue.

The benefits of requesting a high priority context for a
non-compute
queue will
be lesser (e.g. up to 10s of wait time if a GFX command is
stuck in
front of
you), but having the API in place will allow us to easily
improve the
implementation
in the future as new features become available in new
hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the
implementation.

Also, once the interface is mostly decided, we can start
thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above.
Our goal
is to
obtain a mechanism that will allow us to complete the
reprojection
job within a
predictable amount of time. So if anyone has any
suggestions for
improvements or alternative strategies we are more than
happy to hear
them.

If any of the technical information above is also
incorrect, feel
free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Sincerely yours,
Serguei Sagalovitch






[-- Attachment #1.2: Type: text/html, Size: 45558 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                             ` <BN6PR12MB1348A8E9B5AAC0A1DC66B2E8E8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-12-23 17:10                                                                                               ` Sagalovitch, Serguei
       [not found]                                                                                                 ` <SN1PR12MB070348C8435374C0C463E0FDFE950-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Sagalovitch, Serguei @ 2016-12-23 17:10 UTC (permalink / raw)
  To: Bridgman, John, Andres Rodriguez
  Cc: Zhou, David(ChunMing),
	Mao, David, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Pierre-Loup A. Griffais, Koenig, Christian, Huan, Alvin, Zhang,
	Hawking

John,


One comment: when Andres is talking about compute he is talking about Vulkan compute, not OpenCL, and it means that it is not the HSA path.


Sincerely yours,
Serguei Sagalovitch



From: Bridgman, John
Sent: December 23, 2016 11:49 AM
To: Andres Rodriguez
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
  

Excellent, thanks. Agree that it is not a complete solution, just a good start. 


I do think we will need to get to setting priorities at the HW level fairly quickly (we want it for ROCm as well as for VR) but we'll need to eliminate the current requirement for randomization at the SQ as part of a HW approach, and I don't think we know how long that will take at the moment.


IIRC randomization was required to avoid deadlock problems with certain OpenCL programs - what I don't know is whether the problem is inherent to the OpenCL API spec or just a function of how specific OpenCL programs were written. I'll try to dig up some history for that and ask around internally as well.




From: Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:30 AM
To: Bridgman, John
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
  



I'm actually testing that out today.

Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12



drm/amdgpu: add flag for high priority contexts · lostgoat/linux@c9d88d4
Add a new context creation flag, AMDGPU_CTX_FLAG_HIGHPRIORITY. This flag results in the allocated context receiving a higher scheduler priority than other contexts system-wide.



My goal is to first implement this approach, then slowly work my way towards the HW level optimizations.
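
For reference, the heart of the prototype is a mapping like the
following at context init time (a sketch; AMDGPU_CTX_FLAG_HIGHPRIORITY
is the flag from the commit above and AMD_SCHED_PRIORITY_HIGH the
matching scheduler level proposed in the RFC):

/* Sketch: translate the context creation flag into a scheduler
 * priority when the context's scheduler entities are set up. */
static enum amd_sched_priority
amdgpu_ctx_flags_to_priority(uint32_t flags)
{
        if (flags & AMDGPU_CTX_FLAG_HIGHPRIORITY)
                return AMD_SCHED_PRIORITY_HIGH;
        return AMD_SCHED_PRIORITY_NORMAL;
}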


The problem I expect to see with this approach is that there will still be unpredictably long latencies depending on what has been committed to the HW rings.


But it is definitely a good start.


 Regards,
 Andres



On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John  <John.Bridgman@amd.com> wrote:


One question I just remembered - the amdgpu driver includes some scheduler logic which maintains per-process queues and therefore avoids loading up the primary ring with a ton of work.



Has there been any experimentation with injecting priorities at that level rather than jumping straight to HW-level changes ?



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org>  on behalf of Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei;  amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking


Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu  
  




Hey Christian,

   But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
   Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

 You are correct on both statements. We can't have two DRM_MASTERs, so the current DRM+X does not support this use case. And this is what Dave and Pierre-Loup are currently working on.

 In addition to that, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.



One thing I'd like to correct is that our main goal is to get latency _predictable_; the secondary goal is to make it low.


The high priority queue feature addresses our main source of unpredictability: the scheduling latency when the hardware is already full of work from the game engine.


The DirectMode feature addresses one of the latency sources: multiple (unnecessary) context switches to submit a surface to the DRM driver.

 Targeting something like Wayland, with XWayland when you need X compatibility, sounds like a much better idea.



We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use Wayland by default). Once we have everything working nicely under X (where most of the users are currently), I'm sure Pierre-Loup will be pushing us to get everything optimized  under Wayland as well (which should be a lot simpler!).


Ever since working with SurfaceFlinger on Android with explicit fencing I've been waiting for the day I can finally ditch X altogether :)


Regards,

Andres 

 


On Fri, Dec 23, 2016 at 5:54 AM, Christian König  <christian.koenig@amd.com> wrote:
   But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
 Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

In addition to that, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to be displayed is from the Application -> X server -> compositor -> X server.

The extra step between X server and compositor just means extra latency and for this use case you probably don't want that.

Targeting something like Wayland, with XWayland when you need X compatibility, sounds like a much better idea.

Regards,
Christian.



On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
 Display concerns are a separate issue, and as Andres said we have other plans to address them. But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server. Same with security, we can have a separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
 Andres,

Did you measure the latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
 Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor as root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop environment when a VR app is not
in use (this is extremely annoying).

When the above is settled, we should have an auth mechanism besides
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
HMD permanently away from X. Re-using that auth method to gate this
IOCTL is probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect
kernel IOCTL compatibility guidelines with the most flexibility. Going
from a restrictive to a more flexible permission model would be
inclusive, but going from a general to a restrictive model may exclude
some apps that used to work.

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:
 Hi Andres,

well using root might cause stability and security problems as well.
We worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master,
but for example with X the X server is master (and e.g. can change
resolutions etc..) and not the compositor.

So another question is also what windowing system (if any) are you
planning to use? X, Wayland, Flinger or something completely different ?

Regards,
Christian.

On 20.12.2016 at 16:51, Andres Rodriguez wrote:
 Hi Christian,

That is definitely a concern. What we are currently thinking is to
make the high priority queues accessible to root only.

Therefore if a non-root user attempts to set the high priority flag
on context allocation, we would fail the call and return -EPERM.
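
As a sketch, that gate would be something like the following (whether
CAP_SYS_ADMIN is the right capability, or whether the future auth
mechanism replaces it, is exactly the open question):

/* Sketch: reject the high priority flag for unprivileged callers. */
static int amdgpu_ctx_priority_permitted(uint32_t flags)
{
        if ((flags & AMDGPU_CTX_FLAG_HIGHPRIORITY) &&
            !capable(CAP_SYS_ADMIN))
                return -EPERM;
        return 0;
}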

Regards,
Andres


On 12/20/2016 7:56 AM, Christian König wrote:
 BTW: If there is a non-VR application which will use the
high-priority h/w queue then the VR application will suffer. Any
ideas how to solve it?
 Yeah, that problem came to my mind as well.

Basically we need to restrict those high priority submissions to
the VR compositor or otherwise any malfunctioning application could
use it.

Just think about some WebGL suddenly taking all our rendering away
and we won't get anything drawn any more.

Alex or Michel any ideas on that?

Regards,
Christian.

On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
 > If the compute queue is occupied only by you, the efficiency
> is equal to setting the job queue to high priority, I think.
The only risk is the situation when graphics will take all
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try to do it, or is it a lot of work for you?


BTW: If there is a non-VR application which will use the
high-priority h/w queue then the VR application will suffer. Any
ideas how to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
 Do you encounter the priority issue for the compute queue with the
current driver?

If the compute queue is occupied only by you, the efficiency is equal
to setting the job queue to high priority, I think.

Regards,
David Zhou

On December 19, 2016 13:29, Andres Rodriguez wrote:
 Yes, vulkan is available on all-open through the mesa radv UMD.

I'm not sure if I'm asking for too much, but if we can
coordinate a similar interface in radv and amdgpu-pro at the
vulkan level that would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:


On December 19, 2016 11:33, Pierre-Loup A. Griffais wrote:
 We're currently working with the open stack; I assume that a
mechanism could be exposed by both open and Pro Vulkan
userspace drivers and that the amdgpu kernel interface
improvements we would pursue following this discussion would
let both drivers take advantage of the feature, correct?
 Of course.
Does open stack have Vulkan support?

Regards,
David Zhou

On 12/18/2016 07:26 PM, zhoucm1 wrote:
 By the way, are you using all-open driver or amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On December 18, 2016 06:05, Pierre-Loup A. Griffais wrote:
 Hi Serguei,

I'm also working on bringing up our VR runtime on top of
amdgpu;
see replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
 Andres,

  For current VR workloads we have 3 separate processes
running
actually:
 So we could have a potential memory overcommit case, or do you do
partitioning on your own? I would think that there is a need to avoid
overcommit in the VR case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, we're
working on
prioritized GPU scheduling and pre-emption (eg. this
thread), and in
the future it will make sense to do work in order to make
sure that
its memory allocations do not get evicted, to prevent any
unwelcome
additional latency in the event of needing to perform
just-in-time
reprojection.

 BTW: Do you mean __real__ processes or threads?
Based on my understanding sharing BOs between different processes
could introduce additional synchronization constraints. BTW: I am not
sure if we are able to share Vulkan sync objects across the process
boundary.

They are different processes; it is important for the
compositor that
is responsible for quality-of-service features such as
consistently
presenting distorted frames with the right latency,
reprojection, etc,
to be separate from the main application.

Currently we are using unreleased cross-process memory and
semaphore
extensions to fetch updated eye images from the client
application,
but the just-in-time reprojection discussed here does not
actually
have any direct interactions with cross-process resource
sharing,
since it's achieved by using whatever is the latest, most
up-to-date
eye images that have already been sent by the client
application,
which are already available to use without additional
synchronization.


    3) System compositor (we are looking at approaches to
remove this
overhead)
 Yes,  IMHO the best is to run in  "full screen mode".

Yes, we are working on mechanisms to present directly to the
headset
display without any intermediaries as a separate effort.


  The latency is our main concern,
 I would assume that this is the known problem (at least for compute
usage).
It looks like amdgpu / kernel submission is rather CPU intensive
(at least in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue.
However, if
there's high degrees of variance then that would be
troublesome and we
would need to account for the worst case.

Hopefully the requirements and approach we described make
sense, we're
looking forward to your feedback and suggestions.

Thanks!
 - Pierre-Loup


Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei;  amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hey Serguei,

 [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
understand (by simplifying), some scheduling is per pipe. I know about
the current allocation scheme but I do not think that it is ideal. I
would assume that we need to switch to dynamic partitioning of
resources based on the workload, otherwise we will have a resource
conflict between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can
start with a
solution that assumes that
only pipe0 has any work and the other pipes are idle (no
HSA/ROCm
running on the system).

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                                 ` <SN1PR12MB070348C8435374C0C463E0FDFE950-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2016-12-23 17:20                                                                                                   ` Bridgman, John
  0 siblings, 0 replies; 36+ messages in thread
From: Bridgman, John @ 2016-12-23 17:20 UTC (permalink / raw)
  To: Sagalovitch, Serguei, Andres Rodriguez
  Cc: Zhou, David(ChunMing),
	Mao, David, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Pierre-Loup A. Griffais, Koenig, Christian, Huan, Alvin, Zhang,
	Hawking


[-- Attachment #1.1: Type: text/plain, Size: 32670 bytes --]

Understood... but my recollection was that the priority settings were global rather than per-context or per-VMID. If I am remembering wrong and they can be set for just one process then that makes things easier at least for prototyping.


We still wouldn't have a production-ready solution even then (we can't set defaults for non-HSA processes that make OpenCL apps hang even if we are doing it for the good of VR) unless we allowed apps to set their own relative priorities, although maybe giving processes the ability to muck up their own interactivity (start a transcode, display stops responding) might be acceptable.

From: Sagalovitch, Serguei
Sent: December 23, 2016 12:10 PM
To: Bridgman, John; Andres Rodriguez
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

John,


One comment: when Andres is talking about compute he is talking about Vulkan compute, not OpenCL, and it means that it is not the HSA path.


Sincerely yours,
Serguei Sagalovitch



From: Bridgman, John
Sent: December 23, 2016 11:49 AM
To: Andres Rodriguez
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu


Excellent, thanks. Agree that it is not a complete solution, just a good start.


I do think we will need to get to setting priorities at the HW level fairly quickly (we want it for ROCm as well as for VR) but we'll need to eliminate the current requirement for randomization at the SQ as part of a HW approach, and I don't think we know how long that will take at the moment.


IIRC randomization was required to avoid deadlock problems with certain OpenCL programs - what I don't know is whether the problem is inherent to the OpenCL API spec or just a function of how specific OpenCL programs were written. I'll try to dig up some history for that and ask around internally as well.




From: Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:30 AM
To: Bridgman, John
Cc: Koenig, Christian; Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu




I'm actually testing that out today.

Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12



drm/amdgpu: add flag for high priority contexts · lostgoat/linux@c9d88d4
Add a new context creation flag, AMDGPU_CTX_FLAG_HIGHPRIORITY. This flag results in the allocated context receiving a higher scheduler priority than other contexts system-wide.



My goal is to first implement this approach, then slowly work my way towards the HW level optimizations.


The problem I expect to see with this approach is that there will still be unpredictably long latencies depending on what has been committed to the HW rings.


But it is definitely a good start.


 Regards,
 Andres



On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John  <John.Bridgman@amd.com> wrote:


One question I just remembered - the amdgpu driver includes some scheduler logic which maintains per-process queues and therefore avoids loading up the primary ring with a ton of work.



Has there been any experimentation with injecting priorities at that level rather than jumping straight to HW-level changes ?



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org>  on behalf of Andres Rodriguez <andresx7@gmail.com>
Sent: December 23, 2016 11:13 AM
To: Koenig, Christian
Cc: Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch, Serguei;  amd-gfx@lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A. Griffais; Zhang, Hawking


Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu





Hey Christian,

   But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
   Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

 You are correct on both statements. We can't have two DRM_MASTERs, so the current DRM+X does not support this use case. And this is what Dave and Pierre-Loup are currently working on.

 In addition to that, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.



One thing I'd like to correct is that our main goal is to get latency _predictable_; the secondary goal is to make it low.


The high priority queue feature addresses our main source of unpredictability: the scheduling latency when the hardware is already full of work from the game engine.


The DirectMode feature addresses one of the latency sources: multiple (unnecessary) context switches to submit a surface to the DRM driver.

 Targeting something like Wayland, with XWayland when you need X compatibility, sounds like a much better idea.



We are pretty enthusiastic about Wayland (and really glad to see Fedora 25 use Wayland by default). Once we have everything working nicely under X (where most of the users are currently), I'm sure Pierre-Loup will be pushing us to get everything optimized  under Wayland as well (which should be a lot simpler!).


Ever since working with SurfaceFlinger on Android with explicit fencing I've been waiting for the day I can finally ditch X altogether :)


Regards,

Andres




On Fri, Dec 23, 2016 at 5:54 AM, Christian König  <christian.koenig@amd.com> wrote:
   But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server.
 Assuming that the HMD is attached to the rendering device in some way you have the X server and the Compositor which both try to be DRM master at the same time.

Please correct me if that was fixed in the meantime, but that sounds like it will simply not work. Or is this what Andres mentioned below that Dave is working on?

In addition to that, a compositor in combination with X is a bit counterproductive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data to be displayed is from the Application -> X server -> compositor -> X server.

The extra step between X server and compositor just means extra latency and for this use case you probably don't want that.

Targeting something like Wayland, with XWayland when you need X compatibility, sounds like a much better idea.

Regards,
Christian.



On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
 Display concerns are a separate issue, and as Andres said we have other plans to address them. But yes, in general you don't want another compositor in the way, so we'll be acquiring the HMD display directly, separate from any desktop or display server. Same with security, we can have a separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
 Andres,

Did you measure the latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch


On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
 Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor as root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop environment when a VR app is not
in use (this is extremely annoying).

When the above is settled, we should have an auth mechanism besides
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
HMD permanently away from X. Re-using that auth method to gate this
IOCTL is probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect
kernel IOCTL compatibility guidelines with the most flexibility. Going
from a restrictive to a more flexible permission model would be
inclusive, but going from a general to a restrictive model may exclude
some apps that used to work.

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:
 Hi Andres,

well using root might cause stability and security problems as well.
We worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master,
but for example with X the X server is master (and e.g. can change
resolutions etc..) and not the compositor.

So another question is also what windowing system (if any) are you
planning to use? X, Wayland, Flinger or something completely different ?

Regards,
Christian.

On 20.12.2016 at 16:51, Andres Rodriguez wrote:
 Hi Christian,

That is definitely a concern. What we are currently thinking is to
make the high priority queues accessible to root only.

Therefore if a non-root user attempts to set the high priority flag
on context allocation, we would fail the call and return -EPERM.

Regards,
Andres


On 12/20/2016 7:56 AM, Christian König wrote:
 BTW: If there is a non-VR application which will use the
high-priority h/w queue then the VR application will suffer. Any
ideas how to solve it?
 Yeah, that problem came to my mind as well.

Basically we need to restrict those high priority submissions to
the VR compositor or otherwise any malfunctioning application could
use it.

Just think about some WebGL suddenly taking all our rendering away
and we won't get anything drawn any more.

Alex or Michel any ideas on that?

Regards,
Christian.

On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
 > If the compute queue is occupied only by you, the efficiency
> is equal to setting the job queue to high priority, I think.
The only risk is the situation when graphics will take all
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try to do it, or is it a lot of work for you?


BTW: If there is a non-VR application which will use the
high-priority h/w queue then the VR application will suffer. Any
ideas how to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
 Do you encounter the priority issue for the compute queue with the
current driver?

If the compute queue is occupied only by you, the efficiency is equal
to setting the job queue to high priority, I think.

Regards,
David Zhou

On 2016-12-19 13:29, Andres Rodriguez wrote:
 Yes, Vulkan is available on all-open through the mesa radv UMD.

I'm not sure if I'm asking for too much, but if we could coordinate a
similar interface in radv and amdgpu-pro at the Vulkan level, that
would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:


On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
 We're currently working with the open stack; I assume that a
mechanism could be exposed by both open and Pro Vulkan
userspace drivers and that the amdgpu kernel interface
improvements we would pursue following this discussion would
let both drivers take advantage of the feature, correct?
Of course.
Does the open stack have Vulkan support?

Regards,
David Zhou

On 12/18/2016 07:26 PM, zhoucm1 wrote:
 By the way, are you using the all-open driver or the amdgpu-pro driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
 Hi Serguei,

I'm also working on bringing up our VR runtime on top of amdgpu;
see replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
 Andres,

> For current VR workloads we have 3 separate processes
> running actually:
So we could have a potential memory overcommit case, or do you do
partitioning on your own? I would think that there is a need to avoid
overcommit in the VR case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, we're working on
prioritized GPU scheduling and pre-emption (e.g. this thread), and in
the future it will make sense to do work to make sure that its memory
allocations do not get evicted, to prevent any unwelcome additional
latency in the event of needing to perform just-in-time reprojection.

> BTW: Do you mean __real__ processes or threads?
> Based on my understanding sharing BOs between different processes
> could introduce additional synchronization constraints. BTW: I am
> not sure if we are able to share Vulkan sync objects across the
> process boundary.

They are different processes; it is important for the compositor that
is responsible for quality-of-service features such as consistently
presenting distorted frames with the right latency, reprojection, etc.,
to be separate from the main application.

Currently we are using unreleased cross-process memory and semaphore
extensions to fetch updated eye images from the client application,
but the just-in-time reprojection discussed here does not actually
have any direct interactions with cross-process resource sharing,
since it's achieved by using whatever are the latest, most up-to-date
eye images that have already been sent by the client application,
which are already available to use without additional synchronization.


>> 3) System compositor (we are looking at approaches to remove this
>> overhead)
> Yes, IMHO the best is to run in "full screen mode".

Yes, we are working on mechanisms to present directly to the headset
display without any intermediaries as a separate effort.


>> The latency is our main concern,
> I would assume that this is the known problem (at least for compute
> usage). It looks like amdgpu / kernel submission is rather CPU
> intensive (at least in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue. However,
if there's a high degree of variance then that would be troublesome
and we would need to account for the worst case.

Hopefully the requirements and approach we described make sense;
we're looking forward to your feedback and suggestions.

Thanks!
 - Pierre-Loup


Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei;  amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hey Serguei,

> [Serguei] No. I mean pipe :-) as MEC defines it. As far as I
> understand (by simplifying) some scheduling is per pipe. I know
> about the current allocation scheme but I do not think that it is
> ideal. I would assume that we need to switch to dynamic partitioning
> of resources based on the workload, otherwise we will have resource
> conflicts between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can start with a
solution that assumes that only pipe0 has any work and the other
pipes are idle (no HSA/ROCm running on the system).

This should be more or less the use case we expect from VR users.

I agree the split is currently not ideal, but I'd like to consider
that a separate task, because making it dynamic is not
straightforward :P

> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
> will not be involved. I would assume that in the case of VR we will
> have one main application ("console" mode(?)) so we could
> temporarily "ignore" OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority compute
queue through libdrm-amdgpu, so that we can expose it through Vulkan
later.

For current VR workloads we have 3 separate processes running
actually:
    1) Game process
    2) VR Compositor (this is the process that will require the high
       priority queue)
    3) System compositor (we are looking at approaches to remove this
       overhead)

For now I think it is okay to assume no OpenCL/ROCm running
simultaneously, but I would also like to be able to address this case
in the future (cross-pipe priorities).

> [Serguei] The problem with pre-emption of graphics task: (a) it
> may take time so latency may suffer

The latency is our main concern; we want something that is
predictable. A good illustration of what the reprojection scheduling
looks like can be found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png



> (b) to preempt we need to have a different "context" - we want to
> guarantee that submissions from the same context will be executed
> in order.

This is okay, as the reprojection work doesn't have dependencies on
the game context, and it even happens in a separate process.

> BTW: (a) Do you want "preempt" and later resume or do you want
> "preempt" and "cancel/abort"?

Preempt the game with the compositor task and then resume it.

> (b) Vulkan is a generic API and could be used for graphics as well
> as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use Vulkan compute. But if you figure out a way
for us to get a guaranteed execution time using Vulkan graphics, then
I'll take you out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez;  amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch


From: Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei;  amd-gfx@lists.freedesktop.org
Subject: RE: [RFC] Mechanism for high priority scheduling
in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez;  amd-gfx@lists.freedesktop.org
Subject: Re: [RFC] Mechanism for high priority scheduling
in amdgpu

Andres,


Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU
assignments/binding to the high-priority queue when it is in use and
"free" them later (we do not want to take CUs away from e.g. a
graphics task forever and degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or
low-priority compute) takes all the (extra) CUs and high-priority
work will wait for the needed resources.
It will not be visible with "NOP" packets but only when you submit a
"real" compute task, so I would recommend not to use "NOP" packets at
all for testing.

It (CU assignment) could be done relatively easily when everything is
going via the kernel (e.g. as part of frame submission), but I must
admit that I am not sure about the best way for user level
submissions (amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks
for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler", when
deciding which queue to run, will check if there are enough resources
and, if not, it will begin to check other queues with lower priority.
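
As a rough illustration of the CU-masking side of this (the register
names below exist in the public VI headers, but the reservation
policy and helper shown are purely assumptions, not driver code):

    /* Sketch: steer a queue away from a set of CUs reserved for the
     * high-priority queue. Assumes this runs inside the gfx IP code
     * with the target queue's registers already SRBM-selected. */
    static void sketch_exclude_reserved_cus(struct amdgpu_device *adev,
                                            u32 reserved_cu_mask)
    {
            WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE0, ~reserved_cu_mask);
            WREG32(mmCOMPUTE_STATIC_THREAD_MGMT_SE1, ~reserved_cu_mask);
    }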

2) I would recommend dedicating the whole pipe to the high-priority
queue and having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed
to the MEC definition of pipe, which is a grouping of queues). I say
this because amdgpu only has access to 1 pipe, and the rest are
statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-) as MEC defines it. As far as I
understand (by simplifying) some scheduling is per pipe. I know about
the current allocation scheme but I do not think that it is ideal. I
would assume that we need to switch to dynamic partitioning of
resources based on the workload, otherwise we will have resource
conflicts between Vulkan compute and OpenCL.


BTW: Which user level API do you want to use for compute: Vulkan or
OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
not be involved. I would assume that in the case of VR we will have
one main application ("console" mode(?)) so we could temporarily
"ignore" OpenCL/ROCm needs when VR is running.

> we will not be able to provide a solution compatible with GFX
> workloads.
I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running
graphics job and scheduling in something else using mid-buffer
pre-emption has some cases where it doesn't work well. But if it
starts working well with Polaris10, it might be a better solution for
us (because the whole reprojection work uses the Vulkan graphics
stack at the moment, and porting it to compute is not trivial).

[Serguei] The problem with pre-emption of graphics task: (a) it may
take time so latency may suffer (b) to preempt we need to have a
different "context" - we want to guarantee that submissions from the
same context will be executed in order.
BTW: (a) Do you want "preempt" and later resume or do you want
"preempt" and "cancel/abort"? (b) Vulkan is a generic API and could
be used for graphics as well as for plain compute tasks
(VK_QUEUE_COMPUTE_BIT).


Sincerely yours,
Serguei Sagalovitch



From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
behalf of
Andres Rodriguez <andresr@valvesoftware.com>
Sent: December 16, 2016 6:15 PM
To: amd-gfx@lists.freedesktop.org
Subject: [RFC] Mechanism for high priority scheduling in
amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

We are interested in feedback for a mechanism to effectively schedule
high priority VR reprojection tasks (also referred to as time-warping)
for Polaris10 running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR
users in scenarios where the game or application would fail to finish
rendering a new frame in time for the next VBLANK. When this happens,
the user's head movements are not reflected on the Head Mounted
Display (HMD) for the duration of an extra frame. This extended
mismatch between the inner ear and the eyes may cause the user to
experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame
using the user's updated head position in combination with the
previous frames. This avoids a prolonged mismatch between the HMD
output and the inner ear.

Because of the adverse effects on the user, we require high
confidence that the reprojection task will complete before the VBLANK
interval, even if the GFX pipe is currently full of work from the
game/application (which is most likely the case).

For more details and illustrations, please refer to the following
document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from submission to
      fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware
should be equivalent to submitting a NOP on idle hardware.

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in
Polaris10 we will not be able to provide a solution compatible with
GFX workloads.

But I would love to hear otherwise. So if anyone has an idea,
approach or suggestion that will also be compatible with the GFX
ring, please let us know about it.

    * The above guarantees should also be respected by amdkfd
      workloads

Would be good to have for consistency, but not strictly necessary as
users running games are not traditionally running HPC workloads in
the background.

Proposed approach:
------------------

Similar to the windows driver, we could expose a high priority
compute queue to userspace.

Submissions to this compute queue will be scheduled with high
priority, and may acquire hardware resources previously in use by
other queues.

This can be achieved by taking advantage of the 'priority' field in
the HQDs, and could be programmed by amdgpu or the amdgpu scheduler
(a rough sketch follows below). The relevant register fields are:
        * mmCP_HQD_PIPE_PRIORITY
        * mmCP_HQD_QUEUE_PRIORITY
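
As a rough sketch of the static flavor of this (the priority
encodings below are assumptions, not values checked against the
register spec), the high priority queue could be programmed during
HQD initialization, where the queue's registers are already
SRBM-selected:

    /* Sketch only: assumed encodings are pipe priority 0 = low ..
     * 2 = high, queue priority a 4-bit field with 15 = highest. */
    static void sketch_hqd_set_high_priority(struct amdgpu_device *adev)
    {
            WREG32(mmCP_HQD_PIPE_PRIORITY, 2);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, 15);
    }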

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We
can statically partition these as follows:
        * 7x regular
        * 1x high priority

The relevant priorities can be set so that submissions to the high
priority ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority
rings if the context is marked as high priority. And a corresponding
priority should be added to keep track of this information (see the
sketch after this list):
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL
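
In code, the addition could look as follows (a sketch; it assumes
lower enum values are scheduled first, matching the ordering of the
list above):

    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_KERNEL = 0,
            AMD_SCHED_PRIORITY_HIGH,        /* proposed addition */
            AMD_SCHED_PRIORITY_NORMAL,
    };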

The user will request a high priority context by setting an
appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
similar); see the userspace sketch after the list below:
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is at a per-context level so that we can:
    * Maintain a consistent FIFO ordering of all submissions to a
      context
    * Create high priority and non-high priority contexts in the same
      process
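
From userspace, requesting such a context could look roughly like
this (AMDGPU_CTX_HIGH_PRIORITY is the proposed flag, not an existing
define; the surrounding ioctl plumbing is the existing libdrm one):

    #include <stdint.h>
    #include <xf86drm.h>
    #include <amdgpu_drm.h>

    /* Proposed flag; the value is purely illustrative. */
    #define AMDGPU_CTX_HIGH_PRIORITY (1u << 0)

    /* Sketch: allocate a context with the proposed flag set. */
    static int alloc_high_prio_ctx(int fd, uint32_t *ctx_id)
    {
            union drm_amdgpu_ctx args = {0};
            int r;

            args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
            args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

            r = drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args,
                                    sizeof(args));
            if (r == 0)
                    *ctx_id = args.out.alloc.ctx_id;
            /* expect -EPERM here for non-root under ROOT_ONLY */
            return r;
    }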

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue
priorities dynamically when scheduling a task.

This would involve having a hardware specific callback from the
scheduler to set the appropriate queue priority:
set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform
the appropriate HW programming, and I'm not really sure if that is
something we should be doing from the scheduler.
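
Roughly, on VI-class hardware the callback body could look like the
sketch below (the priority encoding is an assumption, and the SRBM
select/mutex handling mirrors how other per-queue registers are
programmed):

    /* Sketch of the proposed hw-specific set_priority() hook. */
    static void sketch_ring_set_priority(struct amdgpu_device *adev,
                                         u32 me, u32 pipe, u32 queue,
                                         int priority)
    {
            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, me, pipe, queue, 0);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, priority); /* assumed 0..15 */
            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }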

On the positive side, this approach would allow us to program a range
of priorities for jobs instead of a single "high priority" value,
achieving something similar to the niceness API available for CPU
scheduling.

I'm not sure if this flexibility is something that we would need for
our use case, but it might be useful in other scenarios (multiple
users sharing compute time on a server).

This approach would require a new int field in drm_amdgpu_ctx_in, or
repurposing of the flags field.
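
For instance, the currently unused padding could be repurposed (a
sketch of the uapi change; the field name and signedness are
assumptions):

    struct drm_amdgpu_ctx_in {
            /** AMDGPU_CTX_OP_* */
            __u32   op;
            /** For future use, no flags defined so far */
            __u32   flags;
            __u32   ctx_id;
            __s32   priority;       /* proposed: replaces _pad */
    };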

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and
instead it picks jobs at random. Settings from the shader itself are
also disregarded as this is considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we
might not get the time we need on the SQ.

The current programming would have to be changed to allow priority
propagation from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled
for all HW IPs with support of the SW scheduler. This will function
similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where
the job can jump ahead of anything not committed to the HW queue.

The benefits of requesting a high priority context for a non-compute
queue will be lesser (e.g. up to 10s of wait time if a GFX command is
stuck in front of you), but having the API in place will allow us to
easily improve the implementation in the future as new features
become available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the
implementation.

Also, once the interface is mostly decided, we can start thinking
about exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above. Our goal
is to obtain a mechanism that will allow us to complete the
reprojection job within a predictable amount of time. So if anyone
has any suggestions for improvements or alternative strategies, we
are more than happy to hear them.

If any of the technical information above is also incorrect, feel
free to point out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Sincerely yours,
Serguei Sagalovitch


[-- Attachment #1.2: Type: text/html, Size: 39765 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                             ` <1c3ea5aa-36ee-5031-5f32-d860e9e0bf7c-5C7GfCeVMHo@public.gmane.org>
  2016-12-23 16:13                                                                               ` Andres Rodriguez
@ 2016-12-23 18:18                                                                               ` Pierre-Loup A. Griffais
       [not found]                                                                                 ` <b853e4e3-0ba5-2bda-e129-d9253e7b098d-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  1 sibling, 1 reply; 36+ messages in thread
From: Pierre-Loup A. Griffais @ 2016-12-23 18:18 UTC (permalink / raw)
  To: Christian König, Serguei Sagalovitch, Andres Rodriguez,
	zhoucm1, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

I hate to keep bringing up display topics in an unrelated
conversation, but I'm not sure where you got "Application -> X server
-> compositor -> X server" from. As I was saying before, we need to
be presenting directly to the HMD display, as no display server can
be in the way, both for latency and for quality of service reasons (a
buggy application cannot be allowed to accidentally display
undistorted rendering into the HMD); we intend to do the necessary
work for this, and the extent of X's (or a Wayland implementation's,
or any other display server's) involvement will be to participate
enough to know that the HMD display is off-limits. If you have more
questions on the display aspect, or VR rendering in general, I'm
happy to try to address them out-of-band from this conversation.

On 12/23/2016 02:54 AM, Christian König wrote:
>> But yes, in general you don't want another compositor in the way, so
>> we'll be acquiring the HMD display directly, separate from any desktop
>> or display server.
> Assuming that the HMD is attached to the rendering device in some
> way, you have the X server and the compositor which both try to be
> DRM master at the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds
> like it will simply not work. Or is this what Andres mentions below
> that Dave is working on?
>
> In addition to that, a compositor in combination with X is a bit
> counterproductive when you want to keep the latency low.
>
> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
> data to be displayed is from the Application -> X server -> compositor
> -> X server.
>
> The extra step between X server and compositor just means extra latency
> and for this use case you probably don't want that.
>
> Targeting something like Wayland, with XWayland when you need X
> compatibility, sounds like the much better idea.
>
> Regards,
> Christian.
>
On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>> Display concerns are a separate issue, and as Andres said, we have
>> other plans to address them. But yes, in general you don't want
>> another compositor in the way, so we'll be acquiring the HMD
>> display directly, separate from any desktop or display server. Same
>> with security: we can have a separate conversation about that when
>> the time comes.
>>
>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>> Andres,
>>>
>>> Did you measure  latency, etc. impact of __any__ compositor?
>>>
>>> My understanding is that VR has pretty strict requirements related to
>>> QoS.
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>> Hey Christian,
>>>>
>>>> We are currently interested in X, but with some distros switching to
>>>> other compositors by default, we also need to consider those.
>>>>
>>>> We agree, running the full vrcompositor in root isn't something that
>>>> we want to do. Too many security concerns. Having a small root helper
>>>> that does the privilege escalation for us is the initial idea.
>>>>
>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>> with the "two compositors" scenario a little better in DRM+X.
>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>> in use (this is extremely annoying).
>>>>
>>>> When the above is settled, we should have an auth mechanism besides
>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>> IOCTL is probably going to be the final solution.
>>>>
>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>> from a restrictive to a more flexible permission model would be
>>>> inclusive, but going from a general to a restrictive model may exclude
>>>> some apps that used to work.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>> Hi Andres,
>>>>>
>>>>> well using root might cause stability and security problems as well.
>>>>> We worked quite hard to avoid exactly this for X.
>>>>>
>>>>> We could make this feature depend on the compositor being DRM master,
>>>>> but for example with X the X server is master (and e.g. can change
>>>>> resolutions etc..) and not the compositor.
>>>>>
>>>>> So another question is also what windowing system (if any) are you
>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>> different ?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
>>>>>> Hi Christian,
>>>>>>
>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>> make the high priority queues accessible to root only.
>>>>>>
>>>>>> Therefore is a non-root user attempts to set the high priority flag
>>>>>> on context allocation, we would fail the call and return ENOPERM.
>>>>>>
>>>>>> Regards,
>>>>>> Andres
>>>>>>
>>>>>>
>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>>>>> to solve it?
>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>
>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>> use it.
>>>>>>>
>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>> and we won't get anything drawn any more.
>>>>>>>
>>>>>>> Alex or Michel any ideas on that?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>>>>>>> > If compute queue is occupied only by you, the efficiency
>>>>>>>> > is equal with setting job queue to high priority I think.
>>>>>>>> The only risk is the situation when graphics will take all
>>>>>>>> needed CUs. But in any case it should be very good test.
>>>>>>>>
>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>
>>>>>>>> Did you try to do it or it is a lot of work for you?
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>>>>> to solve it?
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>> Do you encounter the priority issue for compute queue with
>>>>>>>>> current driver?
>>>>>>>>>
>>>>>>>>> If compute queue is occupied only by you, the efficiency is equal
>>>>>>>>> with setting job queue to high priority I think.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>
>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>
>>>>>>>>>> - Andres
>>>>>>>>>>
>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>> Of course.
>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro
>>>>>>>>>>>>> driver?
>>>>>>>>>>>>>
>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also working on the bringing up our VR runtime on top of
>>>>>>>>>>>>>> amgpu;
>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>> So we could have potential memory overcommit case or do
>>>>>>>>>>>>>>> you do
>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>> on your own?  I would think that there is need to avoid
>>>>>>>>>>>>>>> overcomit in
>>>>>>>>>>>>>>> VR case to
>>>>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is
>>>>>>>>>>>>>> setting up
>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>> could introduce additional synchronization constrains. btw:
>>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>> intensive
>>>>>>>>>>>>>>> (at least
>>>>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't an issue.
>>>>>>>>>>>>>> However, if
>>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far
>>>>>>>>>>>>>>>> as I
>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to
>>>>>>>>>>>>>>>> switch to
>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>> making it dynamic is not straight forward :P
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>>>> will be not
>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporally
>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>>>>> to high-priority queue  when it will be in use and "free"
>>>>>>>>>>>>>>> them later
>>>>>>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to
>>>>>>>>>>>>>>> degrade
>>>>>>>>>>>>>>> graphics
>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otherwise we could have scenario when long graphics task (or
>>>>>>>>>>>>>>> low-priority
>>>>>>>>>>>>>>> compute) will took all (extra) CUs and high--priority will
>>>>>>>>>>>>>>> wait for
>>>>>>>>>>>>>>> needed resources.
>>>>>>>>>>>>>>> It will not be visible on "NOP " but only when you submit
>>>>>>>>>>>>>>> "real"
>>>>>>>>>>>>>>> compute task
>>>>>>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for
>>>>>>>>>>>>>>> testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It (CU assignment) could be relatively easy done when
>>>>>>>>>>>>>>> everything is
>>>>>>>>>>>>>>> going via kernel
>>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>>>> queue to  run will check if there is enough resources and
>>>>>>>>>>>>>>> if not then
>>>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to
>>>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>>>> nothing their except it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>>>>> be not
>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporally
>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>>> worloads.
>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>>>> take time so
>>>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>>>> "context"
>>>>>>>>>>>>>>> - we want
>>>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>> gist.github.com
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>> gist.github.com
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>> gist.github.com
>>>>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval.
>>>>>>>>>>>>>>> Even if
>>>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>>>> is most
>>>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>>>>> One of the most exciting new developments in GPU technology
>>>>>>>>>>>>>>> over the
>>>>>>>>>>>>>>> past year has been the adoption of asynchronous shaders,
>>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>>>>> One of the most exciting new developments in GPU technology
>>>>>>>>>>>>>>> over the
>>>>>>>>>>>>>>> past year has been the adoption of asynchronous shaders,
>>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>>>>> One of the most exciting new developments in GPU technology
>>>>>>>>>>>>>>> over the
>>>>>>>>>>>>>>> past year has been the adoption of asynchronous shaders,
>>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The mechanism must expose the following functionaility:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>> worloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>> acquire hardware resources previously in use by other
>>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The setting is in a per context level so that we can:
>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>> priorities and
>>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>> value",
>>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>>> commited to the HW queue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>> predictable amount of time. So if anyone anyone has any
>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>>>>> To see the collection of prior postings to the list,
>>>>>>>>>>>>>>> visit the
>>>>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all
>>>>>>>>>>>>>>> the list
>>>>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>>>>> To see the collection of prior postings to the list,
>>>>>>>>>>>>>>> visit the
>>>>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all
>>>>>>>>>>>>>>> the list
>>>>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                 ` <b853e4e3-0ba5-2bda-e129-d9253e7b098d-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
@ 2016-12-23 22:20                                                                                   ` Andres Rodriguez
       [not found]                                                                                     ` <CAFQ_0eHg=Kf5qV50cgm51m6bTcMYdkgRXkT-sykJnYNzu3Zzsg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-01-02 14:09                                                                                   ` Christian König
  1 sibling, 1 reply; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-23 22:20 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais
  Cc: zhoucm1, Mao, David, Serguei Sagalovitch,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Christian König, Huan, Alvin, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 46590 bytes --]

Hey John,

I've collected a bit of data using high priority SW scheduler queues,
thought you might be interested.

Implementation as per the patch above.
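
For anyone who wants to poke at this from userspace, a rough sketch of
what allocating such a context could look like. This is illustrative
only: AMDGPU_CTX_HIGH_PRIORITY is the placeholder flag name floated in
the RFC and is not part of the current uapi, so the name, value and
behaviour here are assumptions rather than exactly what the patch does.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/amdgpu_drm.h>

    /* Placeholder: proposed in the RFC, not in the current uapi. */
    #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0)

    static int alloc_high_priority_ctx(int drm_fd, uint32_t *ctx_id)
    {
        union drm_amdgpu_ctx args = {0};

        args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
        args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

        /* Expected to fail with EPERM for unprivileged callers if
         * the ROOT_ONLY policy discussed below is adopted. */
        if (ioctl(drm_fd, DRM_IOCTL_AMDGPU_CTX, &args))
            return -1;

        *ctx_id = args.out.alloc.ctx_id;
        return 0;
    }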

Control test 1
==============

Sascha Willems mesh sample running on its own at regular priority

Results
-------

Mesh: ~0.14ms per-frame latency

Control test 2
==============

Two Sascha Willems mesh samples running simultaneously at regular priority

Results
-------

Mesh 1: ~0.26ms per-frame latency
Mesh 2: ~0.26ms per-frame latency

Test 1
======

Two Sascha Willems mesh samples running simultaneously. One at high
priority and the other running in a regular priority graphics context.

Results
-------

Mesh High:    0.14 - 0.24ms per-frame latency
Mesh Regular: 0.24 - 0.40ms per-frame latency

Test 2
======

Ten Sascha Willems mesh samples running simultaneously. One at high
priority and the others running in a regular priority graphics context.

Results
-------

Mesh High:    0.14 - 0.8ms per-frame latency
Mesh Regular: 1.10 - 2.05ms per-frame latency

Test 3
======

Two Sascha Willems mesh samples running simultaneously. One at high
priority and the other running in a regular priority graphics context.

Also running Unigine Heaven at the Extreme preset @ 2560x1600

Results
-------

Mesh High:     7 - 100ms per-frame latency (Lots of fluctuation)
Mesh Regular: 40 - 130ms per-frame latency (Lots of fluctuation)
Unigine Heaven: 20-40 fps


Test 4
======

Two Sascha Willems mesh samples running simultaneously. One at high
priority and the other running in a regular priority graphics context.

Also running Talos Principle @ 4K

Results
-------

Mesh High:    0.14 - 3.97ms per-frame latency (Mostly floats ~0.4ms)
Mesh Regular: 0.43 - 8.11ms per-frame latency (Lots of fluctuation)
Talos: 24.8 fps AVG

Observations
============

The high priority queue based on the SW scheduler provides significant
gains when paired with tasks that submit short duration commands into
the queue. This can be observed in tests 1 and 2.

When the pipe is full of long-running commands, the effects are dampened.
As observed in test 3, the per-frame latency suffers very large spikes,
and the latencies are very inconsistent.

Talos seems to be a better-behaved game. It may be submitting shorter
draw commands, so the SW scheduler is able to interleave the rest of
the work.

The results seem consistent with the hypothetical advantages the SW
scheduler should provide. I.e. your context can be scheduled into the
HW queue ahead of any other context, but everything already committed
to the HW queue is executed in strict FIFO order.

In order to deal with cases similar to Test 3, we will need to take
advantage of further features.

Notes
=====

- Tests were run multiple times, and reboots were performed between runs.
- The mesh sample isn't really designed for benchmarking, but it should
  be decent for ballpark figures (a sketch of one way to gather these
  numbers follows below)
- The high priority mesh app was run with default niceness and also niceness
  at -20. This had no effect on the results, so it was not added above.
- CPU usage was not saturated while running the tests
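
As an aside, for anyone who wants to gather similar ballpark figures,
plain Vulkan timestamp queries are enough. A minimal sketch follows,
assuming `device', `cmd' and `limits' (the VkPhysicalDeviceLimits
queried at init) already exist; this is generic illustration, not the
mesh sample's actual instrumentation, and error handling is omitted.

    /* One-time setup: a query pool with two timestamp slots. */
    VkQueryPoolCreateInfo info = {
        .sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO,
        .queryType = VK_QUERY_TYPE_TIMESTAMP,
        .queryCount = 2,
    };
    VkQueryPool pool;
    vkCreateQueryPool(device, &info, NULL, &pool);

    /* Per frame, while recording the command buffer: */
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    /* ... the frame's draw commands ... */
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                        pool, 1);

    /* After the frame's fence has signaled: */
    uint64_t ts[2];
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ts), ts,
                          sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
    double ms = (double)(ts[1] - ts[0]) * limits.timestampPeriod * 1e-6;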

Regards,
Andres


On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais <
pgriffais-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org> wrote:

> I hate to keep bringing up display topics in an unrelated conversation,
> but I'm not sure where you got "Application -> X server -> compositor -> X
> server" from. As I was saying before, we need to be presenting directly to
> the HMD display as no display server can be in the way, both for latency
> but also quality of service reasons (a buggy application cannot be allowed
> to accidentally display undistorted rendering into the HMD); we intend to
> do the necessary work for this, and the extent of X's (or a Wayland
> implementation, or any other display server) involvment will be to
> participate enough to know that the HMD display is off-limits. If you have
> more questions on the display aspect, or VR rendering in general, I'm happy
> to try to address them out-of-band from this conversation.
>
>
> On 12/23/2016 02:54 AM, Christian König wrote:
>
>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop
>>> or display server.
>>>
>> Assuming that the the HMD is attached to the rendering device in some
>> way you have the X server and the Compositor which both try to be DRM
>> master at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds
>> like it will simply not work. Or is this what Andres mentions below that
>> Dave is working on?
>>
>> Additional to that a compositor in combination with X is a bit counter
>> productive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor
>> -> X server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland and when you need X compatibility
>> XWayland sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>> Am 22.12.2016 um 20:54 schrieb Pierre-Loup A. Griffais:
>>
>>> Display concerns are a separate issue, and as Andres said we have
>>> other plans to address. But yes, in general you don't want another
>>> compositor in the way, so we'll be acquiring the HMD display directly,
>>> separate from any desktop or display server. Same with security, we
>>> can have a separate conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>
>>>> Andres,
>>>>
>>>> Did you measure  latency, etc. impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor as root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>>> in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>>> from a restrictive to a more flexible permission model would be
>>>>> inclusive, but going from a general to a restrictive model may exclude
>>>>> some apps that used to work.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>
>>>>>> Hi Andres,
>>>>>>
>>>>>> well, using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM master,
>>>>>> but for example with X the X server is master (and e.g. can change
>>>>>> resolutions etc..) and not the compositor.
>>>>>>
>>>>>> So another question is also what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different ?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> Am 20.12.2016 um 16:51 schrieb Andres Rodriguez:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>
>>>>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>>> use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>>> and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 19.12.2016 um 15:48 schrieb Serguei Sagalovitch:
>>>>>>>>
>>>>>>>>> > If compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equal with setting job queue to high priority I think.
>>>>>>>>> The only risk is the situation when graphics will take all
>>>>>>>>> needed CUs. But in any case it should be very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it or it is a lot of work for you?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>
>>>>>>>>>> Do you encounter the priority issue for compute queue with
>>>>>>>>>> current driver?
>>>>>>>>>>
>>>>>>>>>> If compute queue is occupied only by you, the efficiency is equal
>>>>>>>>>> with setting job queue to high priority I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>>
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro
>>>>>>>>>>>>>> driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>>> you do
>>>>>>>>>>>>>>>> partitioning
>>>>>>>>>>>>>>>> on your own?  I would think that there is a need to avoid
>>>>>>>>>>>>>>>> overcommit in
>>>>>>>>>>>>>>>> VR case to
>>>>>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is
>>>>>>>>>>>>>>> setting up
>>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding sharing BOs between different
>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>> could introduce additional synchronization constrains. btw:
>>>>>>>>>>>>>>>> I am not
>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>> if we are able to share Vulkan sync. object cross-process
>>>>>>>>>>>>>>>> boundary.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>>> intensive
>>>>>>>>>>>>>>>> (at least
>>>>>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>>> However, if
>>>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC define it.  As far
>>>>>>>>>>>>>>>>> as I
>>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to
>>>>>>>>>>>>>>>>> switch to
>>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd
>>>>>>>>>>>>>>>>> will be not
>>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-
>>>>>>>>>>>>>>>> 1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding
>>>>>>>>>>>>>>>> to high-priority queue  when it will be in use and "free"
>>>>>>>>>>>>>>>> them later
>>>>>>>>>>>>>>>> (we  do not want forever take CUs from e.g. graphic task to
>>>>>>>>>>>>>>>> degrade
>>>>>>>>>>>>>>>> graphics
>>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>>>>>>>>>> low-priority
>>>>>>>>>>>>>>>> compute) will take all (extra) CUs and high-priority will
>>>>>>>>>>>>>>>> wait for
>>>>>>>>>>>>>>>> needed resources.
>>>>>>>>>>>>>>>> It will not be visible on "NOP " but only when you submit
>>>>>>>>>>>>>>>> "real"
>>>>>>>>>>>>>>>> compute task
>>>>>>>>>>>>>>>> so I would recommend  not to use "NOP" packets at all for
>>>>>>>>>>>>>>>> testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be relatively easily done when
>>>>>>>>>>>>>>>> everything is
>>>>>>>>>>>>>>>> going via kernel
>>>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that "scheduler"
>>>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> deciding which
>>>>>>>>>>>>>>>> queue to  run will check if there is enough resources and
>>>>>>>>>>>>>>>> if not then
>>>>>>>>>>>>>>>> it will begin
>>>>>>>>>>>>>>>> to check other queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend to dedicate the whole pipe to
>>>>>>>>>>>>>>>> high-priority
>>>>>>>>>>>>>>>> queue and have
>>>>>>>>>>>>>>>> nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-)  as MEC define it. As far as I
>>>>>>>>>>>>>>>> understand (by simplifying)
>>>>>>>>>>>>>>>> some scheduling is per pipe.  I know about the current
>>>>>>>>>>>>>>>> allocation
>>>>>>>>>>>>>>>> scheme but I do not think
>>>>>>>>>>>>>>>> that it is  ideal.  I would assume that we need to switch to
>>>>>>>>>>>>>>>> dynamical partition
>>>>>>>>>>>>>>>> of resources  based on the workload otherwise we will have
>>>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>>>> conflict
>>>>>>>>>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will
>>>>>>>>>>>>>>>> be not
>>>>>>>>>>>>>>>> involved.  I would assume that in the case of VR we will
>>>>>>>>>>>>>>>> have one main
>>>>>>>>>>>>>>>> application ("console" mode(?)) so we could temporarily
>>>>>>>>>>>>>>>> "ignore"
>>>>>>>>>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>>>> worloads.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>>>> polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>>>>> take time so
>>>>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>>>>> "context"
>>>>>>>>>>>>>>>> - we want
>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

[-- Attachment #1.2: Type: text/html, Size: 44434 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                     ` <CAFQ_0eHg=Kf5qV50cgm51m6bTcMYdkgRXkT-sykJnYNzu3Zzsg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-12-26  2:26                                                                                       ` zhoucm1
       [not found]                                                                                         ` <58607FDF.2080200-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: zhoucm1 @ 2016-12-26  2:26 UTC (permalink / raw)
  To: Andres Rodriguez, Pierre-Loup A. Griffais
  Cc: Huan, Alvin, Mao, David, Serguei Sagalovitch,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Christian König, Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 195949 bytes --]

Nice experiment, and exactly what the SW scheduler can provide. As you
said: "your context can be scheduled into the HW queue ahead of any
other context, but everything already committed to the HW queue is
executed in strict FIFO order."

If you want to keep latency consistent, you will need to enable the hw
priority queue feature.
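
For reference, a rough sketch of the kind of HQD programming the RFC's
proposed set_priority() callback would do on VI parts, using the two
registers named there. This is illustrative and untested; an actual
patch may well pick different values for the pipe and queue priority
fields:

    /* Reprogram the priority of one HQD under the SRBM mutex. */
    static void set_hqd_priority(struct amdgpu_device *adev, u32 me,
                                 u32 pipe, u32 queue, u32 priority)
    {
        mutex_lock(&adev->srbm_mutex);
        vi_srbm_select(adev, me, pipe, queue, 0);

        WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
        WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);

        vi_srbm_select(adev, 0, 0, 0, 0);
        mutex_unlock(&adev->srbm_mutex);
    }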

Regards,
David Zhou

On 2016年12月24日 06:20, Andres Rodriguez wrote:
>
> On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais
> <pgriffais-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org> wrote:
>
>     I hate to keep bringing up display topics in an unrelated
>     conversation, but I'm not sure where you got "Application -> X
>     server -> compositor -> X server" from. As I was saying before, we
>     need to be presenting directly to the HMD display as no display
>     server can be in the way, both for latency but also quality of
>     service reasons (a buggy application cannot be allowed to
>     accidentally display undistorted rendering into the HMD); we
>     intend to do the necessary work for this, and the extent of X's
>     (or a Wayland implementation, or any other display server)
>     involvement will be to participate enough to know that the HMD
>     display is off-limits. If you have more questions on the display
>     aspect, or VR rendering in general, I'm happy to try to address
>     them out-of-band from this conversation.
>
>
>     On 12/23/2016 02:54 AM, Christian König wrote:
>
>             But yes, in general you don't want another compositor in
>             the way, so
>             we'll be acquiring the HMD display directly, separate from
>             any desktop
>             or display server.
>
>         Assuming that the HMD is attached to the rendering device
>         in some way, you have the X server and the compositor which
>         both try to be DRM master at the same time.
>
>         Please correct me if that was fixed in the meantime, but that
>         sounds like it will simply not work. Or is this what Andres
>         mentioned below that Dave is working on?
>
>         In addition to that, a compositor in combination with X is a
>         bit counterproductive when you want to keep the latency low.
>
>         E.g. the "normal" flow of a GL or Vulkan surface filled with
>         rendered
>         data to be displayed is from the Application -> X server ->
>         compositor
>         -> X server.
>
>         The extra step between X server and compositor just means
>         extra latency
>         and for this use case you probably don't want that.
>
>         Targeting something like Wayland, with XWayland when you need X
>         compatibility, sounds like the much better idea.
>
>         Regards,
>         Christian.
>
>         On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>
>             Display concerns are a separate issue, and as Andres said
>             we have other plans to address them. But yes, in general
>             you don't want another compositor in the way, so we'll be
>             acquiring the HMD display directly, separate from any
>             desktop or display server. Same with security, we can have
>             a separate conversation about that when the time comes.
>
>             On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>
>                 Andres,
>
>                 Did you measure the latency (etc.) impact of __any__
>                 compositor?
>
>                 My understanding is that VR has pretty strict
>                 requirements related to QoS.
>
>                 Sincerely yours,
>                 Serguei Sagalovitch
>
>
>                 On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>
>                     Hey Christian,
>
>                     We are currently interested in X, but with some
>                     distros switching to other compositors by default,
>                     we also need to consider those.
>
>                     We agree, running the full vrcompositor as root
>                     isn't something that we want to do. Too many
>                     security concerns. Having a small root helper that
>                     does the privilege escalation for us is the
>                     initial idea.
>
>                     For a long term approach, Pierre-Loup and Dave are
>                     working on dealing with the "two compositors"
>                     scenario a little better in DRM+X. Fullscreen isn't
>                     really a sufficient approach, since we don't want
>                     the HMD to be used as part of the desktop
>                     environment when a VR app is not in use (this is
>                     extremely annoying).
>
>                     When the above is settled, we should have an auth
>                     mechanism besides DRM_MASTER or DRM_AUTH that
>                     allows the vrcompositor to take over the HMD
>                     permanently away from X. Re-using that auth method
>                     to gate this IOCTL is probably going to be the
>                     final solution.
>
>                     I propose to start with ROOT_ONLY since it should
>                     allow us to respect kernel IOCTL compatibility
>                     guidelines with the most flexibility. Going from a
>                     restrictive to a more flexible permission model is
>                     backwards compatible, but going from a general to a
>                     restrictive model may exclude some apps that used
>                     to work.
>
>                     Regards,
>                     Andres
>
>                     On 12/22/2016 6:42 AM, Christian König wrote:
>
>                         Hi Andres,
>
>                         well, using root might cause stability and
>                         security problems as well. We worked quite hard
>                         to avoid exactly this for X.
>
>                         We could make this feature depend on the
>                         compositor being DRM master, but for example
>                         with X the X server is master (and e.g. can
>                         change resolutions etc.) and not the
>                         compositor.
>
>                         So another question is also what windowing
>                         system (if any) are you planning to use? X,
>                         Wayland, Flinger or something completely
>                         different?
>
>                         Regards,
>                         Christian.
>
>                         On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>
>                             Hi Christian,
>
>                             That is definitely a concern. What we are
>                             currently thinking is to make the high
>                             priority queues accessible to root only.
>
>                             Therefore, if a non-root user attempts to
>                             set the high priority flag on context
>                             allocation, we would fail the call and
>                             return -EPERM.
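>
>                             A minimal sketch of that check (assuming
>                             the priority is validated at context-create
>                             time; names illustrative, not the final
>                             patch):
>
>                                 if (priority == AMDGPU_CTX_PRIORITY_HIGH &&
>                                     !capable(CAP_SYS_ADMIN))
>                                         return -EPERM;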
>
>                             Regards,
>                             Andres
>
>
>                             On 12/20/2016 7:56 AM, Christian König wrote:
>
>                                     BTW: If there is a non-VR
>                                     application which will use the
>                                     high-priority h/w queue then the VR
>                                     application will suffer. Any ideas
>                                     how to solve it?
>
>                                 Yeah, that problem came to my mind as
>                                 well.
>
>                                 Basically we need to restrict those
>                                 high priority submissions to the VR
>                                 compositor, or otherwise any
>                                 malfunctioning application could use it.
>
>                                 Just think about some WebGL suddenly
>                                 taking all our rendering away
>                                 and we won't get anything drawn any more.
>
>                                 Alex or Michel, any ideas on that?
>
>                                 Regards,
>                                 Christian.
>
>                                 On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>
>                                     > If the compute queue is occupied
>                                     > only by you, the efficiency is
>                                     > equal to setting the job queue to
>                                     > high priority, I think.
>
>                                     The only risk is the situation when
>                                     graphics will take all the needed
>                                     CUs. But in any case it should be a
>                                     very good test.
>
>                                     Andres/Pierre-Loup,
>
>                                     Did you try to do it, or is it a
>                                     lot of work for you?
>
>
>                                     BTW: If there is a non-VR
>                                     application which will use the
>                                     high-priority h/w queue then the VR
>                                     application will suffer. Any ideas
>                                     how to solve it?
>
>                                     Sincerely yours,
>                                     Serguei Sagalovitch
>
>                                     On 2016-12-19 12:50 AM, zhoucm1 wrote:
>
>                                         Do you encounter the priority
>                                         issue for the compute queue
>                                         with the current driver?
>
>                                         If the compute queue is
>                                         occupied only by you, the
>                                         efficiency is equal to setting
>                                         the job queue to high priority,
>                                         I think.
>
>                                         Regards,
>                                         David Zhou
>
>                                         On 2016-12-19 13:29, Andres Rodriguez wrote:
>
>                                             Yes, Vulkan is available on all-open through
>                                             the mesa radv UMD.
>
>                                             I'm not sure if I'm asking for too much, but
>                                             if we can coordinate a similar interface in
>                                             radv and amdgpu-pro at the Vulkan level, that
>                                             would be great.
>
>                                             I'm not sure what that's going to be yet.
>
>                                             - Andres
>
>                                             On 12/19/2016 12:11 AM, zhoucm1 wrote:
>
>
>
>                                                 On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>
>                                                     We're currently working with the open
>                                                     stack; I assume that a mechanism could be
>                                                     exposed by both open and Pro Vulkan
>                                                     userspace drivers and that the amdgpu
>                                                     kernel interface improvements we would
>                                                     pursue following this discussion would let
>                                                     both drivers take advantage of the
>                                                     feature, correct?
>
>                                                 Of course.
>                                                 Does the open stack have Vulkan support?
>
>                                                 Regards,
>                                                 David Zhou
>
>
>                                                     On 12/18/2016 07:26 PM, zhoucm1 wrote:
>
>                                                         By the way, are you using the
>                                                         all-open driver or the amdgpu-pro
>                                                         driver?
>
>                                                         +David Mao, who is working on our
>                                                         Vulkan driver.
>
>                                                         Regards,
>                                                         David Zhou
>
>                                                         On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>
>                                                             Hi Serguei,
>
>                                                             I'm also working on bringing up our
>                                                             VR runtime on top of amdgpu; see
>                                                             replies inline.
>
>                                                             On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>
>                                                                 Andres,
>
>                                                                      For current VR workloads we
>                                                                      have 3 separate processes
>                                                                      running actually:
>
>                                                                 So we could have a potential
>                                                                 memory overcommit case, or do you
>                                                                 do partitioning on your own? I
>                                                                 would think that there is a need
>                                                                 to avoid overcommit in the VR
>                                                                 case to prevent any BO migration.
>
>
>                                                             You're entirely correct; currently
>                                                             the VR runtime is setting up
>                                                             prioritized CPU scheduling for its VR
>                                                             compositor, we're working on
>                                                             prioritized GPU scheduling and
>                                                             pre-emption (eg. this thread), and in
>                                                             the future it will make sense to do
>                                                             work in order to make sure that its
>                                                             memory allocations do not get
>                                                             evicted, to prevent any unwelcome
>                                                             additional latency in the event of
>                                                             needing to perform just-in-time
>                                                             reprojection.
>
>                                                                 BTW: Do you mean __real__
>                                                                 processes or threads? Based on my
>                                                                 understanding, sharing BOs between
>                                                                 different processes could
>                                                                 introduce additional
>                                                                 synchronization constraints. btw:
>                                                                 I am not sure if we are able to
>                                                                 share Vulkan sync objects across
>                                                                 the process boundary.
>
>
>                                                             They are different processes; it is
>                                                             important for the compositor that is
>                                                             responsible for quality-of-service
>                                                             features such as consistently
>                                                             presenting distorted frames with the
>                                                             right latency, reprojection, etc, to
>                                                             be separate from the main
>                                                             application.
>
>                                                             Currently we are using unreleased
>                                                             cross-process memory and semaphore
>                                                             extensions to fetch updated eye
>                                                             images from the client application,
>                                                             but the just-in-time reprojection
>                                                             discussed here does not actually
>                                                             have any direct interactions with
>                                                             cross-process resource sharing,
>                                                             since it's achieved by using the
>                                                             latest, most up-to-date eye images
>                                                             that have already been sent by the
>                                                             client application, which are
>                                                             already available to use without
>                                                             additional synchronization.
>
>
>                                                                      3) System compositor (we are
>                                                                      looking at approaches to
>                                                                      remove this overhead)
>
>                                                                 Yes, IMHO the best is to run in
>                                                                 "full screen mode".
>
>
>                                                             Yes, we are working on mechanisms to
>                                                             present directly to the headset
>                                                             display without any intermediaries,
>                                                             as a separate effort.
>
>
>                                                                      The latency is our main
>                                                                      concern,
>
>                                                                 I would assume that this is the
>                                                                 known problem (at least for
>                                                                 compute usage). It looks like
>                                                                 amdgpu / kernel submission is
>                                                                 rather CPU intensive (at least in
>                                                                 the default configuration).
>
>
>                                                             As long as it's a consistent cost,
>                                                             it shouldn't be an issue. However,
>                                                             if there are high degrees of
>                                                             variance then that would be
>                                                             troublesome and we would need to
>                                                             account for the worst case.
>
>                                                             Hopefully the requirements and
>                                                             approach we described make sense;
>                                                             we're looking forward to your
>                                                             feedback and suggestions.
>
>                                                             Thanks!
>                                                              - Pierre-Loup
>
>
>                                                                 Sincerely yours,
>                                                                 Serguei Sagalovitch
>
>
>                                                                 From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>                                                                 Sent: December 16, 2016 10:00 PM
>                                                                 To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>                                                                 Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
>                                                                 Hey Serguei,
>
>                                                                     [Serguei] No. I mean pipe :-)
>                                                                     as the MEC defines it. As far
>                                                                     as I understand (by
>                                                                     simplifying), some scheduling
>                                                                     is per pipe. I know about the
>                                                                     current allocation scheme, but
>                                                                     I do not think that it is
>                                                                     ideal. I would assume that we
>                                                                     need to switch to dynamic
>                                                                     partitioning of resources
>                                                                     based on the workload,
>                                                                     otherwise we will have a
>                                                                     resource conflict between
>                                                                     Vulkan compute and OpenCL.
>
>
>                                                                 I agree the partitioning isn't
>                                                                 ideal. I'm hoping we can start
>                                                                 with a solution that assumes that
>                                                                 only pipe0 has any work and the
>                                                                 other pipes are idle (no HSA/ROCm
>                                                                 running on the system).
>
>                                                                 This should be more or less the
>                                                                 use case we expect from VR users.
>
>                                                                 I agree the split is currently not
>                                                                 ideal, but I'd like to consider
>                                                                 that a separate task, because
>                                                                 making it dynamic is not
>                                                                 straightforward :P
>
>                                                                     [Serguei] Vulkan works via
>                                                                     amdgpu (kernel submissions) so
>                                                                     amdkfd will not be involved. I
>                                                                     would assume that in the case
>                                                                     of VR we will have one main
>                                                                     application ("console"
>                                                                     mode(?)), so we could
>                                                                     temporarily "ignore"
>                                                                     OpenCL/ROCm needs when VR is
>                                                                     running.
>
>
>                                                                 Correct, this is why we want to
>                                                                 enable the high priority compute
>                                                                 queue through libdrm-amdgpu, so
>                                                                 that we can expose it through
>                                                                 Vulkan later.
>
>                                                                 For current VR workloads we have 3
>                                                                 separate processes running
>                                                                 actually:
>                                                                     1) Game process
>                                                                     2) VR Compositor (this is the
>                                                                        process that will require
>                                                                        the high priority queue)
>                                                                     3) System compositor (we are
>                                                                        looking at approaches to
>                                                                        remove this overhead)
>
>                                                                 For now I think it is okay to
>                                                                 assume no OpenCL/ROCm running
>                                                                 simultaneously, but I would also
>                                                                 like to be able to address this
>                                                                 case in the future (cross-pipe
>                                                                 priorities).
>
>                                                                     [Serguei] The problem with
>                                                                     pre-emption of a graphics
>                                                                     task: (a) it may take time so
>                                                                     latency may suffer
>
>
> The latency is our main concern; we want something that is
> predictable. A good illustration of what the reprojection scheduling
> looks like can be found here:
> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>
>     (b) to preempt we need to have different "context" - we want to
>     guarantee that submissions from the same context will be executed
>     in order.
>
>
> This is okay, as the reprojection work doesn't have dependencies on
> the game context, and it even happens in a separate process.
>
>     BTW: (a) Do you want "preempt" and later resume, or do you want
>     "preempt" and "cancel/abort"?
>
>
> Preempt the game with the compositor task and then resume it.
>
>     (b) Vulkan is a generic API and could be used for graphics as
>     well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>
>
> Yeah, the plan is to use vulkan compute. But if you figure out a way
> for us to get a guaranteed execution time using vulkan graphics, then
> I'll take you out for a beer :)
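
On the VK_QUEUE_COMPUTE_BIT point, here is a minimal sketch of how
userspace could pick a compute-only queue family. This is standard
Vulkan 1.0 API usage, not amdgpu-specific; the function name is
illustrative, and instance/device setup and error handling are omitted.

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Return the index of a queue family that exposes compute but not
     * graphics - the natural home for a dedicated high-priority
     * compute queue - or -1 if the device only offers combined
     * graphics+compute families. */
    static int32_t find_compute_only_family(VkPhysicalDevice phys)
    {
        uint32_t count = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, NULL);
        if (count == 0)
            return -1;

        VkQueueFamilyProperties props[count];
        vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, props);

        for (uint32_t i = 0; i < count; i++) {
            if ((props[i].queueFlags & VK_QUEUE_COMPUTE_BIT) &&
                !(props[i].queueFlags & VK_QUEUE_GRAPHICS_BIT))
                return (int32_t)i;
        }
        return -1;
    }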
>
> Regards,
> Andres
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
> Sent: Friday, December 16, 2016 9:13 PM
> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Andres,
>
> Please see inline (as [Serguei])
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
> Sent: December 16, 2016 8:29 PM
> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Serguei,
>
> Thanks for the feedback. Answers inline as [AR].
>
> Regards,
> Andres
>
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
> Sent: Friday, December 16, 2016 8:15 PM
> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Andres,
>
> Quick comments:
>
> 1) To minimize "bubbles", etc. we need to "force" CU
> assignments/binding to the high-priority queue when it will be in use,
> and "free" them later (we do not want to take CUs away from e.g. a
> graphics task forever and degrade graphics performance).
>
> Otherwise we could have a scenario where a long graphics task (or
> low-priority compute) takes all the (extra) CUs and the high-priority
> work waits for the needed resources. This will not be visible with
> "NOP" packets but only when you submit a "real" compute task, so I
> would recommend not using "NOP" packets at all for testing.
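
To make the "no NOP packets for testing" point concrete, here is a
hedged sketch of what a "real" compute load could look like when
submitted from Vulkan: a dispatch large enough to actually contend for
CUs. The handles and workgroup counts are illustrative, and pipeline
and descriptor setup are omitted.

    #include <vulkan/vulkan.h>

    /* Record a compute dispatch that actually occupies CUs (unlike a
     * NOP packet), so CU contention between queues becomes visible
     * during testing. 'cmd' must be in the recording state and
     * 'pipeline' is a compute pipeline created elsewhere. */
    static void record_real_compute_load(VkCommandBuffer cmd,
                                         VkPipeline pipeline)
    {
        vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);

        /* A large grid keeps the CUs busy long enough to observe how
         * high-priority submissions behave under load. */
        vkCmdDispatch(cmd, 1024, 1024, 1);
    }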
>
> It (CU assignment) could be done relatively easily when everything
> goes via the kernel (e.g. as part of frame submission), but I must
> admit that I am not sure about the best way to handle user-level
> submissions (amdkfd).
>
> [AR] I wasn't aware of this part of the programming sequence. Thanks
> for the heads up! Is this similar to the CU masking programming?
> [Serguei] Yes. To simplify: the problem is that the "scheduler", when
> deciding which queue to run, will check if there are enough resources
> and, if not, will begin to check other queues with lower priority.
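
For reference, a rough sketch of what the CU masking half of this could
look like on GFX8. It assumes the MQD layout from vi_structs.h (the
compute_static_thread_mgmt_se* fields); the reserve/release policy, the
mask values, and the function names below are hypothetical, and the HQD
deactivate/activate sequencing needed to update a live queue is not
shown. Masking the graphics pipeline would involve different registers
and is also not shown.

    #include "vi_structs.h"  /* struct vi_mqd (GFX8 MQD layout) */

    #define CU_MASK_ALL       0xffffffff
    #define CU_MASK_RESERVED  0x0000000f  /* example: 4 CUs per SE */

    /* While the high-priority queue has work pending, mask the
     * reserved CUs out of a low-priority compute queue's MQD so that
     * high-priority waves never wait for CUs. Hypothetical policy,
     * not existing amdgpu code. */
    static void reserve_cus_for_high_prio(struct vi_mqd *lp_mqd)
    {
        lp_mqd->compute_static_thread_mgmt_se0 = CU_MASK_ALL & ~CU_MASK_RESERVED;
        lp_mqd->compute_static_thread_mgmt_se1 = CU_MASK_ALL & ~CU_MASK_RESERVED;
        lp_mqd->compute_static_thread_mgmt_se2 = CU_MASK_ALL & ~CU_MASK_RESERVED;
        lp_mqd->compute_static_thread_mgmt_se3 = CU_MASK_ALL & ~CU_MASK_RESERVED;
    }

    /* Give the CUs back once the high-priority queue drains, so that
     * graphics performance is not degraded forever. */
    static void release_reserved_cus(struct vi_mqd *lp_mqd)
    {
        lp_mqd->compute_static_thread_mgmt_se0 = CU_MASK_ALL;
        lp_mqd->compute_static_thread_mgmt_se1 = CU_MASK_ALL;
        lp_mqd->compute_static_thread_mgmt_se2 = CU_MASK_ALL;
        lp_mqd->compute_static_thread_mgmt_se3 = CU_MASK_ALL;
    }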
>
> 2) I would recommend dedicating the whole pipe to the high-priority
> queue and having nothing there except it.
>
> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
> to the MEC definition of pipe, which is a grouping of queues). I say
> this because amdgpu only has access to 1 pipe, and the rest are
> statically partitioned for amdkfd usage.
>
> [Serguei] No, I mean pipe :-) as the MEC defines it. As far as I
> understand (simplifying), some scheduling is per pipe. I know about
> the current allocation scheme, but I do not think it is ideal. I
> would assume that we need to switch to dynamic partitioning of
> resources based on the workload, otherwise we will have resource
> conflicts between Vulkan compute and OpenCL.
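
As an illustration of the dynamic partitioning idea, a purely
hypothetical sketch follows; the types, names, and policy are not
existing amdgpu/amdkfd code, and the pipe counts are illustrative.

    #include <stdbool.h>

    #define NUM_MEC_PIPES 4  /* pipes per MEC, illustrative */

    enum pipe_owner { PIPE_OWNER_AMDGPU, PIPE_OWNER_AMDKFD };

    struct pipe_partition {
        enum pipe_owner owner[NUM_MEC_PIPES];
    };

    /* Rebalance HQD pipe ownership based on workload: while a
     * latency-sensitive (VR) client is active, dedicate an extra pipe
     * to amdgpu high-priority work; hand it back to amdkfd
     * (OpenCL/ROCm) afterwards so compute clients are not starved
     * permanently. */
    static void rebalance_pipes(struct pipe_partition *p, bool vr_active)
    {
        p->owner[0] = PIPE_OWNER_AMDGPU;  /* amdgpu's existing pipe */
        p->owner[1] = vr_active ? PIPE_OWNER_AMDGPU : PIPE_OWNER_AMDKFD;
        p->owner[2] = PIPE_OWNER_AMDKFD;
        p->owner[3] = PIPE_OWNER_AMDKFD;
    }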
>
>
> BTW: Which user level API do you want to use for compute: Vulkan or
> OpenCL?
>
> [AR] Vulkan
>
> [Serguei] Vulkan works via amdgpu (kernel submissions), so amdkfd
> will not be involved. I would assume that in the case of VR we will
> have one main application ("console" mode(?)), so we could
> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>
>     we will not be able to provide a solution compatible with GFX
>     workloads.
>
> I assume that you are talking about graphics? Am I right?
>
> [AR] Yeah, my understanding is that pre-empting the currently running
> graphics job and scheduling in something else using mid-buffer
> pre-emption has some cases where it doesn't work well. But if it
> starts working well with polaris10, it might be a better solution for
> us (because the whole reprojection work uses the vulkan graphics stack
> at the moment, and porting it to compute is not trivial).
>
> [Serguei] The problem with pre-emption of graphics task: (a) it may
> take time so latency may suffer (b) to preempt we need to have
> different "context" - we want to guarantee that submissions from the
> same context will be executed in order.
> BTW: (a) Do you want "preempt" and later resume, or do you want
> "preempt" and "cancel/abort"? (b) Vulkan is a generic API and could
> be used for graphics as well as for plain compute tasks
> (VK_QUEUE_COMPUTE_BIT).
>
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
>
> From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
> on behalf of Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
> Sent: December 16, 2016 6:15 PM
> To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Everyone,
>
> This RFC is also available as a gist here:
> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>
> We are interested in feedback for a mechanism to effectively schedule high
> priority VR reprojection tasks (also referred to as time-warping) for
> Polaris10 running on the amdgpu kernel driver.
>
> Brief context:
> --------------
>
> The main objective of reprojection is to avoid motion sickness for VR users
> in scenarios where the game or application would fail to finish rendering a
> new frame in time for the next VBLANK. When this happens, the user's head
> movements are not reflected on the Head Mounted Display (HMD) for the
> duration of an extra frame. This extended mismatch between the inner ear and
> the eyes may cause the user to experience motion sickness.
>
> The VR compositor deals with this problem by fabricating a new frame using
> the user's updated head position in combination with the previous frames.
> This avoids a prolonged mismatch between the HMD output and the inner ear.
>
> Because of the adverse effects on the user, we require high confidence that
> the reprojection task will complete before the VBLANK interval, even if the
> GFX pipe is currently full of work from the game/application (which is most
> likely the case).
>
> For more details and illustrations, please refer to the following document:
> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>
> Requirements:
> -------------
>
> The mechanism must expose the following functionality:
>
>     * Job round trip time must be predictable, from submission to fence
>       signal
>
>     * The mechanism must support compute workloads.
>
> Goals:
> ------
>
>     * The mechanism should provide low submission latencies
>
> Test: submitting a NOP packet through the mechanism on busy hardware should
> be equivalent to submitting a NOP on idle hardware.
>
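To make the test concrete, here is a minimal user-space timing sketch. Note
that submit_nop_and_wait() is a hypothetical stand-in for an actual
submission through the proposed mechanism, not a real libdrm call:

    #include <time.h>

    /* Hypothetical helper: submits a NOP through the high priority
     * queue and blocks until its fence signals. */
    extern void submit_nop_and_wait(void);

    /* Round trip time of a single NOP submission, in microseconds. */
    static double nop_round_trip_us(void)
    {
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            submit_nop_and_wait();
            clock_gettime(CLOCK_MONOTONIC, &t1);

            return (t1.tv_sec - t0.tv_sec) * 1e6 +
                   (t1.tv_nsec - t0.tv_nsec) / 1e3;
    }

Comparing the distribution of nop_round_trip_us() on an idle GPU against one
fully loaded by a game would quantify how close we get to this goal.
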
> Nice to have:
> -------------
>
>     * The mechanism should also support GFX workloads.
>
> My understanding is that with the current hardware capabilities in Polaris10
> we will not be able to provide a solution compatible with GFX workloads.
>
> But I would love to hear otherwise. So if anyone has an idea, approach or
> suggestion that will also be compatible with the GFX ring, please let us
> know about it.
>
>     * The above guarantees should also be respected by amdkfd workloads
>
> Would be good to have for consistency, but not strictly necessary as users
> running games are not traditionally running HPC workloads in the background.
>
> Proposed approach:
> ------------------
>
> Similar to the Windows driver, we could expose a high priority compute queue
> to userspace.
>
> Submissions to this compute queue will be scheduled with high priority, and
> may acquire hardware resources previously in use by other queues.
>
> This can be achieved by taking advantage of the 'priority' field in the HQDs
> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
> register fields are:
>     * mmCP_HQD_PIPE_PRIORITY
>     * mmCP_HQD_QUEUE_PRIORITY
>
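For illustration, a rough sketch (not actual driver code) of what programming
those two fields for a single queue could look like, following the SRBM
selection pattern the driver already uses for HQD setup:

    /* Sketch: set the priority fields of one HQD. The priority
     * encoding is illustrative; vi_srbm_select() and srbm_mutex
     * follow the existing gfx v8 conventions. */
    static void set_hqd_priority(struct amdgpu_device *adev, u32 me,
                                 u32 pipe, u32 queue, u32 priority)
    {
            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, me, pipe, queue, 0);
            WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }
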
> Implementation approach 1 - static partitioning:
> ------------------------------------------------
>
> The amdgpu driver currently controls 8 compute queues from pipe0. We can
> statically partition these as follows:
>     * 7x regular
>     * 1x high priority
>
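A minimal sketch of how the static split could be expressed (the macro and
helper names here are made up for illustration):

    /* Hypothetical partitioning of the 8 pipe0 compute queues:
     * queues 0-6 stay regular, queue 7 is reserved for high
     * priority submissions. */
    #define AMDGPU_NUM_COMPUTE_RINGS       8
    #define AMDGPU_HIGH_PRIO_COMPUTE_RING  (AMDGPU_NUM_COMPUTE_RINGS - 1)

    static bool amdgpu_is_high_prio_compute_ring(u32 ring)
    {
            return ring == AMDGPU_HIGH_PRIO_COMPUTE_RING;
    }
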
> The relevant priorities can be set so that submissions to the high priority
> ring will starve the other compute rings and the GFX ring.
>
> The amdgpu scheduler will only place jobs into the high priority rings if
> the context is marked as high priority. And a corresponding priority should
> be added to keep track of this information:
>     * AMD_SCHED_PRIORITY_KERNEL
>     * -> AMD_SCHED_PRIORITY_HIGH
>     * AMD_SCHED_PRIORITY_NORMAL
>
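In code, the new level might slot into the scheduler's priority enum as
follows (only the _HIGH entry is new; the exact ordering is an assumption):

    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_KERNEL = 0, /* existing: kernel jobs */
            AMD_SCHED_PRIORITY_HIGH,       /* proposed: high prio contexts */
            AMD_SCHED_PRIORITY_NORMAL,     /* existing: default */
            AMD_SCHED_MAX_PRIORITY
    };
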
> The user will request a high priority context by setting an appropriate
> flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>
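A sketch of the uapi side, assuming the flag-based variant (the flag name and
bit value are placeholders):

    /* Hypothetical flag for the existing flags field */
    #define AMDGPU_CTX_HIGH_PRIORITY       (1 << 0)

    struct drm_amdgpu_ctx_in {
            __u32   op;     /* AMDGPU_CTX_OP_* */
            __u32   flags;  /* would carry AMDGPU_CTX_HIGH_PRIORITY */
            __u32   ctx_id;
            __u32   _pad;
    };
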
> The setting is at a per-context level so that we can:
>     * Maintain a consistent FIFO ordering of all submissions to a context
>     * Create high priority and non-high priority contexts in the same
>       process
>
> Implementation approach 2 - dynamic priority programming:
> ---------------------------------------------------------
>
> Similar to the above, but instead of programming the priorities at
> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
> dynamically when scheduling a task.
>
> This would involve having a hardware specific callback from the scheduler
> to set the appropriate queue priority:
>     set_priority(int ring, int index, int priority)
>
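One way this could be wired up (a sketch; the struct member is hypothetical)
is an optional per-ring callback that the scheduler invokes before pushing a
job to a queue whose priority differs from the job's:

    struct amdgpu_ring_funcs {
            /* ... existing callbacks ... */

            /* Optional; mirrors the set_priority(ring, index, priority)
             * shape above. NULL would mean the IP block does not
             * support dynamic queue priorities. */
            void (*set_priority)(struct amdgpu_ring *ring, int index,
                                 int priority);
    };
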
> During this callback we would have to grab the SRBM mutex to perform the
> appropriate HW programming, and I'm not really sure if that is something we
> should be doing from the scheduler.
>
> On the positive side, this approach would allow us to program a range of
> priorities for jobs instead of a single "high priority" value, achieving
> something similar to the niceness API available for CPU scheduling.
>
> I'm not sure if this flexibility is something that we would need for our
> use case, but it might be useful in other scenarios (multiple users sharing
> compute time on a server).
>
> This approach would require a new int field in drm_amdgpu_ctx_in, or
> repurposing of the flags field.
>
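If the int field route is taken, the uapi sketch from approach 1 would carry
a numeric value instead of a flag, e.g. (field name and range are
placeholders, loosely following CPU niceness):

    struct drm_amdgpu_ctx_in {
            __u32   op;
            __u32   flags;
            __u32   ctx_id;
            __s32   priority;  /* hypothetical: lower = more important */
    };
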
> Known current obstacles:
> ------------------------
>
> The SQ is currently programmed to disregard the HQD priorities, and instead
> it picks jobs at random. Settings from the shader itself are also
> disregarded as this is considered a privileged field.
>
> Effectively we can get our compute wavefront launched ASAP, but we might
> not get the time we need on the SQ.
>
> The current programming would have to be changed to allow priority
> propagation from the HQD into the SQ.
>
> Generic approach for all HW IPs:
> --------------------------------
>
> For consistency purposes, the high priority context can be enabled for
> all HW IPs with support of the SW scheduler. This will function
> similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where the
> job can jump ahead of anything not committed to the HW queue.
>
> The benefits of requesting a high priority context for a non-compute
> queue will be lesser (e.g. up to 10s of wait time if a GFX command is
> stuck in front of you), but having the API in place will allow us to
> easily improve the implementation in the future as new features become
> available in new hardware.
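
A minimal sketch (not part of the original mail) of the strict-priority
selection described above: the AMD_SCHED_PRIORITY_* names follow the
thread, while rq_select_entity() and the struct layouts are hypothetical.

    /* Illustrative only: drain per-priority run queues in strict order,
     * so a high priority job can jump ahead of anything not yet
     * committed to the HW queue. */
    enum amd_sched_priority {
            AMD_SCHED_PRIORITY_KERNEL = 0,  /* existing, highest */
            AMD_SCHED_PRIORITY_HIGH,        /* proposed in this RFC */
            AMD_SCHED_PRIORITY_NORMAL,
            AMD_SCHED_PRIORITY_MAX
    };

    struct amd_sched_entity *select_entity(struct amd_gpu_scheduler *sched)
    {
            int i;

            /* The first non-empty run queue wins; lower priorities only
             * run once everything above them has drained. */
            for (i = 0; i < AMD_SCHED_PRIORITY_MAX; i++) {
                    struct amd_sched_entity *e =
                            rq_select_entity(&sched->rq[i]); /* hypothetical */
                    if (e)
                            return e;
            }
            return NULL;
    }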
>
> Future steps:
> -------------
>
> Once we have an approach settled, I can take care of the
> implementation.
>
> Also, once the interface is mostly decided, we can start thinking
> about exposing the high priority queue through radv.
>
> Request for feedback:
> ---------------------
>
> We aren't married to any of the approaches outlined above. Our goal is
> to obtain a mechanism that will allow us to complete the reprojection
> job within a predictable amount of time. So if anyone has any
> suggestions for improvements or alternative strategies we are more
> than happy to hear them.
>
> If any of the technical information above is also incorrect, feel free
> to point out my misunderstandings.
>
> Looking forward to hearing from you.
>
> Regards,
> Andres
>



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                 ` <b853e4e3-0ba5-2bda-e129-d9253e7b098d-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
  2016-12-23 22:20                                                                                   ` Andres Rodriguez
@ 2017-01-02 14:09                                                                                   ` Christian König
  1 sibling, 0 replies; 36+ messages in thread
From: Christian König @ 2017-01-02 14:09 UTC (permalink / raw)
  To: Pierre-Loup A. Griffais, Christian König,
	Serguei Sagalovitch, Andres Rodriguez, zhoucm1, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Mao, David, Zhang,
	Hawking, Huan, Alvin

> I hate to keep bringing up display topics in an unrelated 
> conversation, but I'm not sure where you got "Application -> X server 
> -> compositor -> X server" from.
Sorry for that. It just sounded like you were assuming something about 
the current stack which looked like it wouldn't work currently.

But since you are willing to implement that stuff, I'm not really 
concerned any more. Especially when Dave is already involved.

Regards,
Christian.

On 23.12.2016 at 19:18, Pierre-Loup A. Griffais wrote:
> I hate to keep bringing up display topics in an unrelated 
> conversation, but I'm not sure where you got "Application -> X server 
> -> compositor -> X server" from. As I was saying before, we need to be 
> presenting directly to the HMD display as no display server can be in 
> the way, both for latency and for quality of service reasons (a buggy 
> application cannot be allowed to accidentally display undistorted 
> rendering into the HMD); we intend to do the necessary work for this, 
> and the extent of X's (or a Wayland implementation, or any other 
> display server) involvement will be to participate enough to know that 
> the HMD display is off-limits. If you have more questions on the 
> display aspect, or VR rendering in general, I'm happy to try to 
> address them out-of-band from this conversation.
>
> On 12/23/2016 02:54 AM, Christian König wrote:
>>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop
>>> or display server.
>> Assuming that the HMD is attached to the rendering device in some
>> way, you have the X server and the Compositor which both try to be
>> DRM master at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds
>> like it will simply not work. Or is this what Andres mentioned below
>> that Dave is working on?
>>
>> In addition to that, a compositor in combination with X is a bit
>> counterproductive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor
>> -> X server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>> On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>>> Display concerns are a separate issue, and as Andres said we have
>>> other plans to address. But yes, in general you don't want another
>>> compositor in the way, so we'll be acquiring the HMD display directly,
>>> separate from any desktop or display server. Same with security, we
>>> can have a separate conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>> Andres,
>>>>
>>>> Did you measure the latency, etc. impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor in root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want
>>>>> the HMD to be used as part of the Desktop environment when a VR app
>>>>> is not in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to
>>>>> respect kernel IOCTL compatibility guidelines with the most
>>>>> flexibility. Going from a restrictive to a more flexible permission
>>>>> model would be inclusive, but going from a general to a restrictive
>>>>> model may exclude some apps that used to work.
>>>>>
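
A minimal sketch (not from the original mail) of what that ROOT_ONLY
gate could look like at context allocation time; AMDGPU_CTX_HIGH_PRIORITY
is the flag name proposed in this thread, not an existing define:

    /* Sketch: reject the proposed high priority flag for unprivileged
     * callers. CAP_SYS_ADMIN stands in for whatever auth mechanism
     * (DRM_MASTER, a dedicated VR auth, ...) gets settled on later. */
    if ((args->in.flags & AMDGPU_CTX_HIGH_PRIORITY) &&
        !capable(CAP_SYS_ADMIN))
            return -EPERM;
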
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>> Hi Andres,
>>>>>>
>>>>>> well using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM
>>>>>> master, but for example with X the X server is master (and e.g.
>>>>>> can change resolutions etc.) and not the compositor.
>>>>>>
>>>>>> So another question is: what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return EPERM.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>>> BTW: If there is a non-VR application which will use the
>>>>>>>>> high-priority h/w queue then the VR application will suffer.
>>>>>>>>> Any ideas how to solve it?
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application
>>>>>>>> could use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>>> and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equivalent to setting the job queue to high priority, I think.
>>>>>>>>> The only risk is the situation where graphics takes all the
>>>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which will use the
>>>>>>>>> high-priority h/w queue then the VR application will suffer.
>>>>>>>>> Any ideas how to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>>>>> the current driver?
>>>>>>>>>>
>>>>>>>>>> If the compute queue is occupied only by you, the efficiency is
>>>>>>>>>> equivalent to setting the job queue to high priority, I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does the open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>> By the way, are you using the all-open driver or the
>>>>>>>>>>>>>> amdgpu-pro driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>>> amdgpu; see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>>> you do partitioning on your own? I would think that there
>>>>>>>>>>>>>>>> is a need to avoid overcommit in the VR case to prevent
>>>>>>>>>>>>>>>> any BO migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is
>>>>>>>>>>>>>>> setting up
>>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>>>>>>>> constraints. BTW: I am not sure if we are able to share
>>>>>>>>>>>>>>>> Vulkan sync objects across the process boundary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to 
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>>>>>> I would assume that this is a known problem (at least for
>>>>>>>>>>>>>>>> compute usage). It looks like amdgpu / kernel submission
>>>>>>>>>>>>>>>> is rather CPU intensive (at least in the default
>>>>>>>>>>>>>>>> configuration).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an
>>>>>>>>>>>>>>> issue. However, if there's a high degree of variance then
>>>>>>>>>>>>>>> that would be troublesome and we would need to account for
>>>>>>>>>>>>>>> the worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As
>>>>>>>>>>>>>>>>> far as I understand (by simplifying) some scheduling is
>>>>>>>>>>>>>>>>> per pipe. I know about the current allocation scheme but
>>>>>>>>>>>>>>>>> I do not think that it is ideal. I would assume that we
>>>>>>>>>>>>>>>>> need to switch to dynamic partitioning of resources based
>>>>>>>>>>>>>>>>> on the workload, otherwise we will have a resource
>>>>>>>>>>>>>>>>> conflict between Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm
>>>>>>>>>>>>>>>>> needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will
>>>>>>>>>>>>>>>>        require the high priority queue)
>>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm 
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics
>>>>>>>>>>>>>>>>> task: (a) it may take time, so latency may suffer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks 
>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) to preempt we need to have a different "context" - we
>>>>>>>>>>>>>>>>> want to guarantee that submissions from the same context
>>>>>>>>>>>>>>>>> will be executed in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding to the high-priority queue when it is
>>>>>>>>>>>>>>>> in use and "free" them later (we do not want to take CUs
>>>>>>>>>>>>>>>> away from e.g. a graphics task forever and degrade
>>>>>>>>>>>>>>>> graphics performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics
>>>>>>>>>>>>>>>> task (or low-priority compute) takes all (extra) CUs and
>>>>>>>>>>>>>>>> high-priority work waits for the needed resources. It will
>>>>>>>>>>>>>>>> not be visible with "NOP" but only when you submit a
>>>>>>>>>>>>>>>> "real" compute task, so I would recommend not using "NOP"
>>>>>>>>>>>>>>>> packets at all for testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>>>>>>>> everything goes via the kernel (e.g. as part of frame
>>>>>>>>>>>>>>>> submission) but I must admit that I am not sure about the
>>>>>>>>>>>>>>>> best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the
>>>>>>>>>>>>>>>> "scheduler", when deciding which queue to run, will check
>>>>>>>>>>>>>>>> if there are enough resources, and if not it will then
>>>>>>>>>>>>>>>> begin to check other queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this 
>>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As
>>>>>>>>>>>>>>>> far as I understand (by simplifying) some scheduling is
>>>>>>>>>>>>>>>> per pipe. I know about the current allocation scheme but
>>>>>>>>>>>>>>>> I do not think that it is ideal. I would assume that we
>>>>>>>>>>>>>>>> need to switch to dynamic partitioning of resources based
>>>>>>>>>>>>>>>> on the workload, otherwise we will have a resource
>>>>>>>>>>>>>>>> conflict between Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm
>>>>>>>>>>>>>>>> needs when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible
>>>>>>>>>>>>>>>>> with GFX workloads.
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>>> doesn't work well. But if it starts working well with
>>>>>>>>>>>>>>>> Polaris10, it might be a better
>>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>>> (a) it may take time, so latency may suffer; (b) to
>>>>>>>>>>>>>>>> preempt we need to have a different "context" - we want to
>>>>>>>>>>>>>>>> guarantee that submissions from the same context will be
>>>>>>>>>>>>>>>> executed in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be 
>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on
>>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>>> scenarios where the game or application would fail to 
>>>>>>>>>>>>>>>> finish
>>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for 
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require 
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>>>> reprojection task will complete before the VBLANK 
>>>>>>>>>>>>>>>> interval.
>>>>>>>>>>>>>>>> Even if
>>>>>>>>>>>>>>>> the GFX pipe
>>>>>>>>>>>>>>>> is currently full of work from the game/application (which
>>>>>>>>>>>>>>>> is most
>>>>>>>>>>>>>>>> likely the case).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should provide low submission 
>>>>>>>>>>>>>>>> latencies
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on 
>>>>>>>>>>>>>>>> busy
>>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>>>
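
A sketch (not from the original mail) of how that NOP round-trip test
could be timed; submit_nop_ib() and wait_for_fence() are hypothetical
helpers that would wrap the libdrm amdgpu_cs_submit() /
amdgpu_cs_query_fence_status() calls:

    /* Measure NOP round trip (submission to fence signal); run it on an
     * idle GPU and on a busy GPU and compare the two distributions. */
    #include <stdint.h>
    #include <time.h>

    struct test_ctx;                                     /* hypothetical */
    struct fence *submit_nop_ib(struct test_ctx *ctx);   /* hypothetical */
    void wait_for_fence(struct fence *f);                /* hypothetical */

    static uint64_t now_ns(void)
    {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    uint64_t nop_round_trip_ns(struct test_ctx *ctx)
    {
            uint64_t t0 = now_ns();

            wait_for_fence(submit_nop_ib(ctx));
            return now_ns() - t0;
    }
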
>>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>>> acquire hardware resources previously in use by other
>>>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>>
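
A sketch (not part of the original mail) of how the static 7x/1x split
could steer submissions; pick_compute_ring() and the round-robin
fallback are illustrative, and the enum value is the one proposed above:

    /* Sketch: ring 7 of pipe0 is assumed to be the one reserved as the
     * high priority compute ring. */
    #define NUM_COMPUTE_RINGS  8
    #define HIGH_PRIO_RING     (NUM_COMPUTE_RINGS - 1)

    struct amdgpu_ring *pick_compute_ring(struct amdgpu_device *adev,
                                          enum amd_sched_priority prio,
                                          unsigned int ring_hint)
    {
            if (prio == AMD_SCHED_PRIORITY_HIGH)
                    return &adev->gfx.compute_ring[HIGH_PRIO_RING];

            /* Regular contexts spread over the remaining 7 rings. */
            return &adev->gfx.compute_ring[ring_hint % HIGH_PRIO_RING];
    }
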
>>>>>>>>>>>>>>>> The user will request a high priority context by 
>>>>>>>>>>>>>>>> setting an
>>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or 
>>>>>>>>>>>>>>>> similar):
>>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setting is in a per context level so that we can:
>>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>>> priorities and
>>>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would involve having a hardware specific callback 
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>>> value",
>>>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we 
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched 
>>>>>>>>>>>>>>>> ASAP,
>>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>>>> commited to the HW queue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>>> predictable amount of time. So if anyone anyone has any
>>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>>>>>> To see the collection of prior postings to the list,
>>>>>>>>>>>>>>>> visit the
>>>>>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all
>>>>>>>>>>>>>>>> the list
>>>>>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>>>>>> To see the collection of prior postings to the list,
>>>>>>>>>>>>>>>> visit the
>>>>>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all
>>>>>>>>>>>>>>>> the list
>>>>>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
       [not found]                                                                                         ` <58607FDF.2080200-5C7GfCeVMHo@public.gmane.org>
@ 2017-01-02 15:43                                                                                           ` Christian König
  0 siblings, 0 replies; 36+ messages in thread
From: Christian König @ 2017-01-02 15:43 UTC (permalink / raw)
  To: zhoucm1, Andres Rodriguez, Pierre-Loup A. Griffais
  Cc: Huan, Alvin, Mao, David, Serguei Sagalovitch,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Andres Rodriguez,
	Zhang, Hawking


[-- Attachment #1.1: Type: text/plain, Size: 218261 bytes --]

Indeed a couple of nice numbers.

> but everything already committed
> to the HW queue is executed in strict FIFO order.
Well, actually, if we get a high priority submission we could in theory
preempt/abort everything ahead of it on the ring buffer.

Probably not at as fine a granularity as the hardware scheduler, but it
might be easier to get working.

Regards,
Christian.

On 26.12.2016 at 03:26, zhoucm1 wrote:
> Nice experiment, which shows exactly what the SW scheduler can provide.
> And as you said: "I.e. your context can be scheduled into the
> HW queue ahead of any other context, but everything already committed
> to the HW queue is executed in strict FIFO order."
>
> If you want to keep latency consistent, you will need to enable the HW
> priority queue feature.
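>
> For illustration, enabling it could look roughly like the following
> (sketch only, based on the set_priority() callback proposed earlier in
> this thread; assumes GFX8-style SRBM indexing via the existing
> srbm_mutex and vi_srbm_select() helpers, error handling omitted):
>
>     /* Sketch, not actual driver code: reprogram the HQD priority
>      * of one compute queue while it is selected through SRBM. */
>     static void set_hqd_priority(struct amdgpu_device *adev, u32 me,
>                                  u32 pipe, u32 queue, u32 priority)
>     {
>             mutex_lock(&adev->srbm_mutex);
>             vi_srbm_select(adev, me, pipe, queue, 0);
>             WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>             WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>             vi_srbm_select(adev, 0, 0, 0, 0);
>             mutex_unlock(&adev->srbm_mutex);
>     }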
>
> Regards,
> David Zhou
>
> On 2016-12-24 06:20, Andres Rodriguez wrote:
>> Hey John,
>>
>> I've collected bit of data using high priority SW scheduler queues,
>> thought you might be interested.
>>
>> Implementation as per the patch above.
>>
>> Control test 1
>> ==============
>>
>> Sascha Willems mesh sample running on its own at regular priority
>>
>> Results
>> -------
>>
>> Mesh: ~0.14ms per-frame latency
>>
>> Control test 2
>> ==============
>>
>> Two Sascha Willems mesh samples running simultaneously at regular priority
>>
>> Results
>> -------
>>
>> Mesh 1: ~0.26ms per-frame latency
>> Mesh 2: ~0.26ms per-frame latency
>>
>> Test 1
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high
>> priority and the other running in a regular priority graphics context.
>>
>> Results
>> -------
>>
>> Mesh High:    0.14 - 0.24ms per-frame latency
>> Mesh Regular: 0.24 - 0.40ms per-frame latency
>>
>> Test 2
>> ======
>>
>> Ten Sascha Willems mesh samples running simultaneously. One at high
>> priority and the others running in a regular priority graphics context.
>>
>> Results
>> -------
>>
>> Mesh High:    0.14 - 0.8ms per-frame latency
>> Mesh Regular: 1.10 - 2.05ms per-frame latency
>>
>> Test 3
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high
>> priority and the other running in a regular priority graphics context.
>>
>> Also running Unigine Heaven at Extreme preset @ 2560x1600
>>
>> Results
>> -------
>>
>> Mesh High:     7 - 100ms per-frame latency (Lots of fluctuation)
>> Mesh Regular: 40 - 130ms per-frame latency (Lots of fluctuation)
>> Unigine Heaven: 20-40 fps
>>
>>
>> Test 4
>> ======
>>
>> Two Sascha Willems mesh samples running simultaneously. One at high
>> priority and the other running in a regular priority graphics context.
>>
>> Also running Talos Principle @ 4K
>>
>> Results
>> -------
>>
>> Mesh High:    0.14 - 3.97ms per-frame latency (Mostly hovers around ~0.4ms)
>> Mesh Regular: 0.43 - 8.11ms per-frame latency (Lots of fluctuation)
>> Talos: 24.8 fps AVG
>>
>> Observations
>> ============
>>
>> The high priority queue based on the SW scheduler provides significant
>> gains when paired with tasks that submit short duration commands into
>> the queue. This can be observed in tests 1 and 2.
>>
>> When the pipe is full of long-running commands, the effects are dampened.
>> As observed in test 3, the per-frame latency suffers very large spikes,
>> and the latencies are very inconsistent.
>>
>> Talos seems to be a better-behaved game. It may be submitting shorter
>> draw commands and the SW scheduler is able to interleave the rest of
>> the work.
>>
>> The results seem consistent with the hypothetical advantages the SW
>> scheduler should provide. I.e. your context can be scheduled into the
>> HW queue ahead of any other context, but everything already committed
>> to the HW queue is executed in strict FIFO order.
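>>
>> As a rough illustration of why that ordering falls out, the SW
>> scheduler can simply service its run queues in strict priority order
>> (pseudo-C sketch, not the actual scheduler code; the helper and enum
>> names here are assumptions):
>>
>>     /* Sketch: always drain higher priority run queues first, so a
>>      * high priority context is picked ahead of normal ones. */
>>     static struct amd_sched_entity *
>>     select_entity(struct amd_gpu_scheduler *sched)
>>     {
>>             int i;
>>
>>             /* e.g. KERNEL > HIGH > NORMAL */
>>             for (i = 0; i < AMD_SCHED_PRIORITY_MAX; i++) {
>>                     struct amd_sched_entity *entity =
>>                             rq_select_entity(&sched->sched_rq[i]);
>>                     if (entity)
>>                             return entity;
>>             }
>>             return NULL;
>>     }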
>>
>> In order to deal with cases similar to Test 3, we will need to take
>> advantage of further features.
>>
>> Notes
>> =====
>>
>> - Tests were run multiple times, and reboots were performed between test runs.
>> - The mesh sample isn't really designed for benchmarking, but it should
>>   be decent for ballpark figures (see the timing sketch below)
>> - The high priority mesh app was run with default niceness and also 
>> niceness
>>   at -20. This had no effect on the results, so it was not added above.
>> - CPU usage was not saturated while running the tests
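>>
>> For ballpark context, one way to measure the kind of per-frame
>> submit-to-fence latency quoted above (minimal sketch; not necessarily
>> what the mesh sample does internally):
>>
>>     /* Sketch: time one submission from vkQueueSubmit until its
>>      * fence signals, returning milliseconds. */
>>     double submit_latency_ms(VkDevice dev, VkQueue queue,
>>                              const VkSubmitInfo *info, VkFence fence)
>>     {
>>             struct timespec t0, t1;
>>
>>             clock_gettime(CLOCK_MONOTONIC, &t0);
>>             vkQueueSubmit(queue, 1, info, fence);
>>             vkWaitForFences(dev, 1, &fence, VK_TRUE, UINT64_MAX);
>>             clock_gettime(CLOCK_MONOTONIC, &t1);
>>             vkResetFences(dev, 1, &fence);
>>
>>             return (t1.tv_sec - t0.tv_sec) * 1e3 +
>>                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
>>     }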
>>
>> Regards,
>> Andres
>>
>>
>> On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais 
>> <pgriffais-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org> wrote:
>>
>>     I hate to keep bringing up display topics in an unrelated
>>     conversation, but I'm not sure where you got "Application -> X
>>     server -> compositor -> X server" from. As I was saying before,
>>     we need to be presenting directly to the HMD display as no
>>     display server can be in the way, both for latency but also
>>     quality of service reasons (a buggy application cannot be allowed
>>     to accidentally display undistorted rendering into the HMD); we
>>     intend to do the necessary work for this, and the extent of X's
>>     (or a Wayland implementation, or any other display server)
>>     involvment will be to participate enough to know that the HMD
>>     display is off-limits. If you have more questions on the display
>>     aspect, or VR rendering in general, I'm happy to try to address
>>     them out-of-band from this conversation.
>>
>>
>>     On 12/23/2016 02:54 AM, Christian König wrote:
>>
>>             But yes, in general you don't want another compositor in
>>             the way, so
>>             we'll be acquiring the HMD display directly, separate
>>             from any desktop
>>             or display server.
>>
>>         Assuming that the HMD is attached to the rendering device in
>>         some way, you have the X server and the compositor which both
>>         try to be DRM master at the same time.
>>
>>         Please correct me if that was fixed in the meantime, but that
>>         sounds like it will simply not work. Or is this what Andres
>>         mentioned below that Dave is working on?
>>
>>         In addition, a compositor in combination with X is a bit
>>         counterproductive when you want to keep the latency low.
>>
>>         E.g. the "normal" flow of a GL or Vulkan surface filled with
>>         rendered
>>         data to be displayed is from the Application -> X server ->
>>         compositor
>>         -> X server.
>>
>>         The extra step between X server and compositor just means
>>         extra latency
>>         and for this use case you probably don't want that.
>>
>>         Targeting something like Wayland, with XWayland when you need
>>         X compatibility, sounds like the much better idea.
>>
>>         Regards,
>>         Christian.
>>
>>         On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>>
>>             Display concerns are a separate issue, and as Andres said
>>             we have
>>             other plans to address. But yes, in general you don't
>>             want another
>>             compositor in the way, so we'll be acquiring the HMD
>>             display directly,
>>             separate from any desktop or display server. Same with
>>             security, we
>>             can have a separate conversation about that when the time
>>             comes.
>>
>>             On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>
>>                 Andres,
>>
>>                 Did you measure the latency, etc., impact of __any__
>>                 compositor?
>>
>>                 My understanding is that VR has pretty strict
>>                 requirements related to
>>                 QoS.
>>
>>                 Sincerely yours,
>>                 Serguei Sagalovitch
>>
>>
>>                 On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>
>>                     Hey Christian,
>>
>>                     We are currently interested in X, but with some
>>                     distros switching to
>>                     other compositors by default, we also need to
>>                     consider those.
>>
>>                     We agree, running the full vrcompositor as root
>>                     isn't something that we want to do. Too many
>>                     security concerns. Having a small root helper
>>                     that does the privilege escalation for us is the
>>                     initial idea.
>>
>>                     For a long term approach, Pierre-Loup and Dave
>>                     are working on dealing
>>                     with the "two compositors" scenario a little
>>                     better in DRM+X.
>>                     Fullscreen isn't really a sufficient approach,
>>                     since we don't want the
>>                     HMD to be used as part of the Desktop environment
>>                     when a VR app is not
>>                     in use (this is extremely annoying).
>>
>>                     When the above is settled, we should have an auth
>>                     mechanism besides
>>                     DRM_MASTER or DRM_AUTH that allows the
>>                     vrcompositor to take over the
>>                     HMD permanently away from X. Re-using that auth
>>                     method to gate this
>>                     IOCTL is probably going to be the final solution.
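>>
>>                     On the userspace side, allocating such a context
>>                     would look roughly like this (sketch only; the
>>                     flag is the RFC's placeholder name and does not
>>                     exist yet):
>>
>>                         union drm_amdgpu_ctx ctx = {};
>>                         uint32_t ctx_id;
>>                         int r;
>>
>>                         ctx.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
>>                         /* proposed flag, final name TBD */
>>                         ctx.in.flags = AMDGPU_CTX_HIGH_PRIORITY;
>>                         r = drmCommandWriteRead(fd, DRM_AMDGPU_CTX,
>>                                                 &ctx, sizeof(ctx));
>>                         if (r == 0)
>>                                 ctx_id = ctx.out.alloc.ctx_id;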
>>
>>                     I propose to start with ROOT_ONLY since it should
>>                     allow us to respect
>>                     kernel IOCTL compatibility guidelines with the
>>                     most flexibility. Going
>>                     from a restrictive to a more flexible permission
>>                     model would be
>>                     inclusive, but going from a general to a
>>                     restrictive model may exclude
>>                     some apps that used to work.
>>
>>                     Regards,
>>                     Andres
>>
>>                     On 12/22/2016 6:42 AM, Christian König wrote:
>>
>>                         Hi Andres,
>>
>>                         well using root might cause stability and
>>                         security problems as well.
>>                         We worked quite hard to avoid exactly this for X.
>>
>>                         We could make this feature depend on the
>>                         compositor being DRM master, but for example
>>                         with X, the X server is master (and e.g. can
>>                         change resolutions, etc.) and not the
>>                         compositor.
>>
>>                         So another question is also: what windowing
>>                         system (if any) are you planning to use? X,
>>                         Wayland, Flinger, or something completely
>>                         different?
>>
>>                         Regards,
>>                         Christian.
>>
>>                         On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>
>>                             Hi Christian,
>>
>>                             That is definitely a concern. What we are
>>                             currently thinking is to
>>                             make the high priority queues accessible
>>                             to root only.
>>
>>                             Therefore if a non-root user attempts to
>>                             set the high priority flag on context
>>                             allocation, we would fail the call and
>>                             return -EPERM.
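>>
>>                             Roughly like this in the context alloc
>>                             path (sketch only; the flag name and the
>>                             exact root/capability check are TBD):
>>
>>                                 if ((args->in.flags &
>>                                      AMDGPU_CTX_HIGH_PRIORITY) &&
>>                                     !capable(CAP_SYS_ADMIN))
>>                                         return -EPERM;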
>>
>>                             Regards,
>>                             Andres
>>
>>
>>                             On 12/20/2016 7:56 AM, Christian König wrote:
>>
>>                                     BTW: If there is  non-VR
>>                                     application which will use
>>                                     high-priority
>>                                     h/w queue then VR application
>>                                     will suffer.  Any ideas how
>>                                     to solve it?
>>
>>                                 Yeah, that problem came to my mind as
>>                                 well.
>>
>>                                 Basically we need to restrict those
>>                                 high priority submissions to
>>                                 the VR compositor or otherwise any
>>                                 malfunctioning application could
>>                                 use it.
>>
>>                                 Just think about some WebGL suddenly
>>                                 taking all our rendering away
>>                                 and we won't get anything drawn any more.
>>
>>                                 Alex or Michel any ideas on that?
>>
>>                                 Regards,
>>                                 Christian.
>>
>>                                 Am 19.12.2016 um 15:48 schrieb
>>                                 Serguei Sagalovitch:
>>
>>                                     > If compute queue is occupied
>>                                     only by you, the efficiency
>>                                     > is equal with setting job queue
>>                                     to high priority I think.
>>                                     The only risk is the situation
>>                                     when graphics will take all
>>                                     needed CUs. But in any case it
>>                                     should be very good test.
>>
>>                                     Andres/Pierre-Loup,
>>
>>                                     Did you try to do it or it is a
>>                                     lot of work for you?
>>
>>
>>                                     BTW: If there is  non-VR
>>                                     application which will use
>>                                     high-priority
>>                                     h/w queue then VR application
>>                                     will suffer.  Any ideas how
>>                                     to solve it?
>>
>>                                     Sincerely yours,
>>                                     Serguei Sagalovitch
>>
>>                                     On 2016-12-19 12:50 AM, zhoucm1
>>                                     wrote:
>>
>>                                         Do you encounter the priority
>>                                         issue for compute queue with
>>                                         current driver?
>>
>>                                         If compute queue is occupied
>>                                         only by you, the efficiency
>>                                         is equal
>>                                         with setting job queue to
>>                                         high priority I think.
>>
>>                                         Regards,
>>                                         David Zhou
>>
>>                                         On 2016年12月19日 13:29, Andres
>>                                         Rodriguez wrote:
>>
>>                                             Yes, vulkan is available
>>                                             on all-open through the
>>                                             mesa radv UMD.
>>
>>                                             I'm not sure if I'm
>>                                             asking for too much, but
>>                                             if we can
>>                                             coordinate a similar
>>                                             interface in radv and
>>                                             amdgpu-pro at the
>>                                             vulkan level that would
>>                                             be great.
>>
>>                                             I'm not sure what that's
>>                                             going to be yet.
>>
>>                                             - Andres
>>
>>                                             On 12/19/2016 12:11 AM,
>>                                             zhoucm1 wrote:
>>
>>
>>
>>                                                 On 2016年12月19日 11:33,
>>                                                 Pierre-Loup A.
>>                                                 Griffais wrote:
>>
>>                                                     We're currently
>>                                                     working with the
>>                                                     open stack; I
>>                                                     assume that a
>>                                                     mechanism could
>>                                                     be exposed by
>>                                                     both open and Pro
>>                                                     Vulkan
>>                                                     userspace drivers
>>                                                     and that the
>>                                                     amdgpu kernel
>>                                                     interface
>>                                                     improvements we
>>                                                     would pursue
>>                                                     following this
>>                                                     discussion would
>>                                                     let both drivers
>>                                                     take advantage of
>>                                                     the feature, correct?
>>
>>                                                 Of course.
>>                                                 Does open stack have
>>                                                 Vulkan support?
>>
>>                                                 Regards,
>>                                                 David Zhou
>>
>>
>>                                                     On 12/18/2016
>>                                                     07:26 PM, zhoucm1
>>                                                     wrote:
>>
>>                                                         By the way,
>>                                                         are you using
>>                                                         all-open
>>                                                         driver or
>>                                                         amdgpu-pro
>>                                                         driver?
>>
>>                                                         +David Mao,
>>                                                         who is
>>                                                         working on
>>                                                         our Vulkan
>>                                                         driver.
>>
>>                                                         Regards,
>>                                                         David Zhou
>>
>>                                                         On
>>                                                         2016年12月18日
>>                                                         06:05,
>>                                                         Pierre-Loup
>>                                                         A. Griffais
>>                                                         wrote:
>>
>>                                                             Hi Serguei,
>>
>>                                                             I'm also
>>                                                             working
>>                                                             on the
>>                                                             bringing
>>                                                             up our VR
>>                                                             runtime
>>                                                             on top of
>>                                                             amgpu;
>>                                                             see
>>                                                             replies
>>                                                             inline.
>>
>>                                                             On
>>                                                             12/16/2016
>>                                                             09:05 PM,
>>                                                             Sagalovitch,
>>                                                             Serguei
>>                                                             wrote:
>>
>>                                                                 Andres,
>>
>>                                                                      For
>>                                                                     current
>>                                                                     VR
>>                                                                     workloads
>>                                                                     we
>>                                                                     have
>>                                                                     3
>>                                                                     separate
>>                                                                     processes
>>                                                                     running
>>                                                                     actually:
>>
>>                                                                 So we
>>                                                                 could
>>                                                                 have
>>                                                                 potential
>>                                                                 memory
>>                                                                 overcommit
>>                                                                 case
>>                                                                 or do
>>                                                                 you do
>>                                                                 partitioning
>>                                                                 on
>>                                                                 your
>>                                                                 own? 
>>                                                                 I
>>                                                                 would
>>                                                                 think
>>                                                                 that
>>                                                                 there
>>                                                                 is
>>                                                                 need
>>                                                                 to avoid
>>                                                                 overcomit
>>                                                                 in
>>                                                                 VR
>>                                                                 case to
>>                                                                 prevent
>>                                                                 any
>>                                                                 BO
>>                                                                 migration.
>>
>>
>>                                                             You're
>>                                                             entirely
>>                                                             correct;
>>                                                             currently
>>                                                             the VR
>>                                                             runtime is
>>                                                             setting up
>>                                                             prioritized
>>                                                             CPU
>>                                                             scheduling
>>                                                             for its
>>                                                             VR
>>                                                             compositor,
>>                                                             we're
>>                                                             working on
>>                                                             prioritized
>>                                                             GPU
>>                                                             scheduling
>>                                                             and
>>                                                             pre-emption
>>                                                             (eg. this
>>                                                             thread),
>>                                                             and in
>>                                                             the
>>                                                             future it
>>                                                             will make
>>                                                             sense to
>>                                                             do work
>>                                                             in order
>>                                                             to make
>>                                                             sure that
>>                                                             its
>>                                                             memory
>>                                                             allocations
>>                                                             do not
>>                                                             get
>>                                                             evicted,
>>                                                             to
>>                                                             prevent any
>>                                                             unwelcome
>>                                                             additional
>>                                                             latency
>>                                                             in the
>>                                                             event of
>>                                                             needing
>>                                                             to perform
>>                                                             just-in-time
>>                                                             reprojection.
>>
>>                                                                 BTW:
>>                                                                 Do
>>                                                                 you
>>                                                                 mean
>>                                                                 __real__
>>                                                                 processes
>>                                                                 or
>>                                                                 threads?
>>                                                                 Based
>>                                                                 on my
>>                                                                 understanding
>>                                                                 sharing
>>                                                                 BOs
>>                                                                 between
>>                                                                 different
>>                                                                 processes
>>                                                                 could
>>                                                                 introduce
>>                                                                 additional
>>                                                                 synchronization
>>                                                                 constrains.
>>                                                                 btw:
>>                                                                 I am not
>>                                                                 sure
>>                                                                 if we
>>                                                                 are
>>                                                                 able
>>                                                                 to
>>                                                                 share
>>                                                                 Vulkan
>>                                                                 sync.
>>                                                                 object
>>                                                                 cross-process
>>                                                                 boundary.
>>
>>
>>                                                             They are
>>                                                             different
>>                                                             processes;
>>                                                             it is
>>                                                             important
>>                                                             for the
>>                                                             compositor
>>                                                             that
>>                                                             is
>>                                                             responsible
>>                                                             for
>>                                                             quality-of-service
>>                                                             features
>>                                                             such as
>>                                                             consistently
>>                                                             presenting
>>                                                             distorted
>>                                                             frames
>>                                                             with the
>>                                                             right
>>                                                             latency,
>>                                                             reprojection,
>>                                                             etc,
>>                                                             to be
>>                                                             separate
>>                                                             from the
>>                                                             main
>>                                                             application.
>>
>>                                                             Currently
>>                                                             we are
>>                                                             using
>>                                                             unreleased
>>                                                             cross-process
>>                                                             memory and
>>                                                             semaphore
>>                                                             extensions
>>                                                             to fetch
>>                                                             updated
>>                                                             eye
>>                                                             images
>>                                                             from the
>>                                                             client
>>                                                             application,
>>                                                             but the
>>                                                             just-in-time
>>                                                             reprojection
>>                                                             discussed
>>                                                             here does not
>>                                                             actually
>>                                                             have any
>>                                                             direct
>>                                                             interactions
>>                                                             with
>>                                                             cross-process
>>                                                             resource
>>                                                             sharing,
>>                                                             since
>>                                                             it's
>>                                                             achieved
>>                                                             by using
>>                                                             whatever
>>                                                             is the
>>                                                             latest, most
>>                                                             up-to-date
>>                                                             eye
>>                                                             images
>>                                                             that have
>>                                                             already
>>                                                             been sent
>>                                                             by the client
>>                                                             application,
>>                                                             which are
>>                                                             already
>>                                                             available
>>                                                             to use
>>                                                             without
>>                                                             additional
>>                                                             synchronization.
>>
>>
>>                                                                      
>>                                                                      3)
>>                                                                     System
>>                                                                     compositor
>>                                                                     (we
>>                                                                     are
>>                                                                     looking
>>                                                                     at
>>                                                                     approaches
>>                                                                     to
>>                                                                     remove
>>                                                                     this
>>                                                                     overhead)
>>
>>                                                                 Yes, 
>>                                                                 IMHO
>>                                                                 the
>>                                                                 best
>>                                                                 is to
>>                                                                 run
>>                                                                 in 
>>                                                                 "full
>>                                                                 screen
>>                                                                 mode".
>>
>>
>>                                                             Yes, we
>>                                                             are
>>                                                             working
>>                                                             on
>>                                                             mechanisms
>>                                                             to
>>                                                             present
>>                                                             directly
>>                                                             to the
>>                                                             headset
>>                                                             display
>>                                                             without
>>                                                             any
>>                                                             intermediaries
>>                                                             as a
>>                                                             separate
>>                                                             effort.
>>
>>
>>                                                                      The
>>                                                                     latency
>>                                                                     is
>>                                                                     our
>>                                                                     main
>>                                                                     concern,
>>
>>                                                                 I
>>                                                                 would
>>                                                                 assume
>>                                                                 that
>>                                                                 this
>>                                                                 is
>>                                                                 the
>>                                                                 known
>>                                                                 problem
>>                                                                 (at
>>                                                                 least for
>>                                                                 compute
>>                                                                 usage).
>>                                                                 It
>>                                                                 looks
>>                                                                 like
>>                                                                 that
>>                                                                 amdgpu
>>                                                                 /
>>                                                                 kernel
>>                                                                 submission
>>                                                                 is
>>                                                                 rather
>>                                                                 CPU
>>                                                                 intensive
>>                                                                 (at least
>>                                                                 in
>>                                                                 the
>>                                                                 default
>>                                                                 configuration).
>>
>>
>>                                                             As long
>>                                                             as it's a
>>                                                             consistent
>>                                                             cost, it
>>                                                             shouldn't
>>                                                             an issue.
>>                                                             However, if
>>                                                             there's
>>                                                             high
>>                                                             degrees
>>                                                             of
>>                                                             variance
>>                                                             then that
>>                                                             would be
>>                                                             troublesome
>>                                                             and we
>>                                                             would
>>                                                             need to
>>                                                             account
>>                                                             for the
>>                                                             worst case.
>>
>>                                                             Hopefully
>>                                                             the
>>                                                             requirements
>>                                                             and
>>                                                             approach
>>                                                             we
>>                                                             described
>>                                                             make
>>                                                             sense, we're
>>                                                             looking
>>                                                             forward
>>                                                             to your
>>                                                             feedback
>>                                                             and
>>                                                             suggestions.
>>
>>                                                             Thanks!
>>                                                              -
>>                                                             Pierre-Loup
>>
>>
>>                                                                 Sincerely
>>                                                                 yours,
>>                                                                 Serguei
>>                                                                 Sagalovitch
>>
>>
>>                                                                 From:
>>                                                                 Andres
>>                                                                 Rodriguez
>>                                                                 <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org
>>                                                                 <mailto:andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>>
>>                                                                 Sent:
>>                                                                 December
>>                                                                 16,
>>                                                                 2016
>>                                                                 10:00 PM
>>                                                                 To:
>>                                                                 Sagalovitch,
>>                                                                 Serguei;
>>                                                                 amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>                                                                 <mailto:amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>>                                                                 Subject:
>>                                                                 RE:
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>
>>                                                                 Hey
>>                                                                 Serguei,
>>
>>     [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>>     understand (by simplifying) some scheduling is per pipe. I know about
>>     the current allocation scheme but I do not think that it is ideal. I
>>     would assume that we need to switch to dynamic partitioning of
>>     resources based on the workload, otherwise we will have resource
>>     conflicts between Vulkan compute and OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>> solution that assumes only pipe0 has any work and the other pipes are
>> idle (no HSA/ROCm running on the system).
>>
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider that a
>> separate task, because making it dynamic is not straightforward :P
>>
>>     [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
>>     not be involved. I would assume that in the case of VR we will have
>>     one main application ("console" mode(?)) so we could temporarily
>>     "ignore" OpenCL/ROCm needs when VR is running.
>>
>>
>> Correct, this is why we want to enable the high priority compute queue
>> through libdrm-amdgpu, so that we can expose it through Vulkan later.
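>>
>> As a strawman, the userspace side could look roughly like this (the
>> priority flag and the _create2 entry point are hypothetical, nothing
>> like this exists in libdrm-amdgpu yet):
>>
>>     #include <amdgpu.h>
>>
>>     /* Hypothetical flag: ask the kernel to back this context's compute
>>      * submissions with a high priority HQD. */
>>     #define AMDGPU_CTX_PRIORITY_HIGH 1
>>
>>     static int create_high_priority_ctx(amdgpu_device_handle dev,
>>                                         amdgpu_context_handle *ctx)
>>     {
>>         /* Hypothetical variant of amdgpu_cs_ctx_create() that takes a
>>          * priority; submissions on *ctx would then be scheduled ahead
>>          * of normal-priority work. */
>>         return amdgpu_cs_ctx_create2(dev, AMDGPU_CTX_PRIORITY_HIGH, ctx);
>>     }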
>>
>> For current VR workloads we actually have 3 separate processes running:
>>     1) Game process
>>     2) VR Compositor (this is the process that will require the high
>>        priority queue)
>>     3) System compositor (we are looking at approaches to remove this
>>        overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm is running
>> simultaneously, but I would also like to be able to address this case in
>> the future (cross-pipe priorities).
>>
>>     [Serguei] The problem with pre-emption of a graphics task: (a) it may
>>     take time, so latency may suffer.
>>
>>
>> The latency is our main concern; we want something that is predictable.
>> A good illustration of what the reprojection scheduling looks like can be
>> found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>
>>     (b) To preempt we need to have a different "context" - we want to
>>     guarantee that submissions from the same context will be executed in
>>     order.
>>
>>
>> This is okay, as the reprojection work doesn't have dependencies on the
>> game context, and it even happens in a separate process.
>>
>>     BTW: (a) Do you want to "preempt" and later resume, or do you want to
>>     "preempt" and "cancel/abort"?
>>
>>
>> Preempt the game with the compositor task and then resume it.
>>
>>     (b) Vulkan is a generic API and could be used for graphics as well as
>>     for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>>
>> Yeah, the plan is to use Vulkan compute. But if you figure out a way for
>> us to get a guaranteed execution time using Vulkan graphics, then I'll
>> take you out for a beer :)
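>>
>> For reference, picking out a suitable queue family on the compositor
>> side would look roughly like this (sketch only; the compute-without-
>> graphics check is my assumption about how the high priority queue would
>> be exposed):
>>
>>     #include <stdint.h>
>>     #include <vulkan/vulkan.h>
>>
>>     /* Find a queue family that supports compute but not graphics, i.e.
>>      * one that could be backed by the high priority compute ring. */
>>     static uint32_t find_compute_only_family(VkPhysicalDevice pdev)
>>     {
>>         uint32_t count = 0;
>>         vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, NULL);
>>
>>         VkQueueFamilyProperties props[count];
>>         vkGetPhysicalDeviceQueueFamilyProperties(pdev, &count, props);
>>
>>         for (uint32_t i = 0; i < count; i++) {
>>             if ((props[i].queueFlags & VK_QUEUE_COMPUTE_BIT) &&
>>                 !(props[i].queueFlags & VK_QUEUE_GRAPHICS_BIT))
>>                 return i;
>>         }
>>         return UINT32_MAX; /* no dedicated compute family found */
>>     }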
>>
>> Regards,
>> Andres
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
>> to the high-priority queue when it will be in use and "free" them later
>> (we do not want to take CUs away from e.g. a graphics task forever and
>> degrade graphics performance).
>>
>> Otherwise we could have a scenario where a long graphics task (or
>> low-priority compute) takes all the (extra) CUs and the high-priority
>> work has to wait for the needed resources. This will not be visible with
>> "NOP" packets but only when you submit "real" compute tasks, so I would
>> recommend not using "NOP" packets at all for testing.
>>
>> It (CU assignment) could be done relatively easily when everything goes
>> via the kernel (e.g. as part of frame submission), but I must admit that
>> I am not sure about the best way for user level submissions (amdkfd).
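>>
>> Roughly, the kernel side could wrap a high priority submission like this
>> (pure sketch - all the helpers here are made-up, not a real register
>> programming sequence):
>>
>>     /* Hypothetical amdgpu-side hook: grab extra CUs for the high
>>      * priority queue only while it has work queued, then give them
>>      * back so graphics is not starved forever. */
>>     static int submit_high_priority_job(struct amdgpu_device *adev,
>>                                         struct amdgpu_job *job)
>>     {
>>         int r;
>>
>>         /* Made-up helper: shrink the CU masks of the other queues so
>>          * the high priority queue has guaranteed wavefront slots. */
>>         reserve_cus_for_hp_queue(adev);
>>
>>         r = run_job(adev, job); /* made-up submission helper */
>>
>>         /* Made-up helper: restore the original CU masks. */
>>         release_cus_for_hp_queue(adev);
>>         return r;
>>     }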
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks for
>> the heads up! Is this similar to the CU masking programming?
>> [Serguei] Yes. To simplify: the problem is that the "scheduler", when
>> deciding which queue to run, will check if there are enough resources and
>> if not then it will begin to check other queues with lower priority.
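>>
>> In pseudo-code, my simplified mental model of that arbitration (not the
>> actual firmware logic) is:
>>
>>     struct queue {
>>         int has_work;
>>         int priority;     /* higher value = higher priority */
>>         int needed_cus;   /* CUs this queue's waves would occupy */
>>     };
>>
>>     extern int free_cus(void); /* CUs not currently in use */
>>
>>     /* Pick the highest priority queue whose resource needs can
>>      * currently be met - which is how a high priority queue can end
>>      * up waiting behind low priority work that already took the CUs. */
>>     struct queue *pick_next_queue(struct queue *queues, int n)
>>     {
>>         struct queue *best = NULL;
>>         for (int i = 0; i < n; i++) {
>>             struct queue *q = &queues[i];
>>             if (!q->has_work || q->needed_cus > free_cus())
>>                 continue;
>>             if (!best || q->priority > best->priority)
>>                 best = q;
>>         }
>>         return best; /* may be lower priority, or NULL if nothing fits */
>>     }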
>>
>> 2) I would recommend dedicating the whole pipe to the high-priority queue
>> and having nothing there except it.
>>
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed to
>> the MEC definition of pipe, which is a grouping of queues). I say this
>> because amdgpu only has access to 1 pipe, and the rest are statically
>> partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far as I
>> understand (simplifying somewhat), some scheduling is per pipe. I
>> know about the current allocation scheme, but I do not think it is
>> ideal. I would assume that we need to switch to dynamic partitioning
>> of resources based on the workload, otherwise we will have resource
>> conflicts between Vulkan compute and OpenCL.
>>
>>
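What dynamic partitioning could mean in practice is sketched below.
This is purely illustrative: none of these helpers exist in amdgpu,
and amdgpu_hw_ip_usage, amdgpu_kfd_queue_usage,
amdgpu_take_pipe_from_kfd and amdgpu_return_pipe_to_kfd are invented
names:

    /* Hypothetical: rebalance MEC pipe ownership between amdgpu and
     * amdkfd based on which side currently has demand.  Every helper
     * called here is invented for illustration only. */
    static void amdgpu_rebalance_compute_pipes(struct amdgpu_device *adev)
    {
            if (amdgpu_hw_ip_usage(adev, AMDGPU_HW_IP_COMPUTE) >
                amdgpu_kfd_queue_usage(adev))
                    amdgpu_take_pipe_from_kfd(adev);  /* Vulkan compute heavy */
            else if (amdgpu_kfd_queue_usage(adev) >
                     amdgpu_hw_ip_usage(adev, AMDGPU_HW_IP_COMPUTE))
                    amdgpu_return_pipe_to_kfd(adev);  /* OpenCL/ROCm heavy */
    }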
>> BTW: Which user-level API do you want to use for compute: Vulkan or
>> OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>> will not be involved. I would assume that in the case of VR we will
>> have one main application ("console" mode(?)), so we could
>> temporarily "ignore" OpenCL/ROCm needs while VR is running.
>>
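To make "Vulkan works via amdgpu (kernel submissions)" concrete:
user-space drivers funnel command buffers through the amdgpu CS ioctl,
e.g. via libdrm_amdgpu, while amdkfd clients submit through user-mode
queues and skip the kernel on the fast path. A minimal sketch; error
handling is omitted and the request is assumed to be filled in
elsewhere:

    /* Every amdgpu submission is an ioctl into the kernel, where ring
     * selection and scheduling happen -- this is the path a Vulkan
     * driver takes, unlike amdkfd's user-mode queues. */
    #include <amdgpu.h>

    static int submit_one_ib(amdgpu_context_handle ctx,
                             struct amdgpu_cs_request *req)
    {
            /* libdrm_amdgpu wrapper around the amdgpu CS ioctl */
            return amdgpu_cs_submit(ctx, 0 /* flags */, req, 1);
    }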
>>      we will not be able to provide a solution compatible with GFX
>>      workloads.
>>
>> [Serguei] I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently
>> running graphics job and scheduling in something else using
>> mid-buffer pre-emption has some cases where it doesn't work well.
>> But if it starts working well with Polaris10, it might be a better
>> solution for us (because the whole reprojection work uses the Vulkan
>> graphics stack at the moment, and porting it to compute is not
>> trivial).
>>
>> [Serguei] The problems with pre-empting a graphics task: (a) it may
>> take time, so latency may suffer; (b) to preempt we need to have a
>> different "context", since we want to guarantee that submissions
>> from the same context will be executed in order.
>>
>> BTW: (a) Do you want to "preempt" and later resume, or do you want
>> to "preempt" and "cancel/abort"? (b) Vulkan is a generic API and
>> could be used for graphics as well as for plain compute tasks
>> (VK_QUEUE_COMPUTE_BIT).
>>
>>
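Point (b) is visible directly in the standard Vulkan API: a queue
family that can run plain compute advertises VK_QUEUE_COMPUTE_BIT, and
compute-only families leave VK_QUEUE_GRAPHICS_BIT clear. A minimal,
driver-agnostic selection loop (plain Vulkan 1.0, nothing
amdgpu-specific):

    /* Find a compute-only queue family: VK_QUEUE_COMPUTE_BIT set,
     * VK_QUEUE_GRAPHICS_BIT clear.  Returns UINT32_MAX if none. */
    #include <vulkan/vulkan.h>
    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t find_compute_only_family(VkPhysicalDevice phys)
    {
            uint32_t count = 0, i, found = UINT32_MAX;
            VkQueueFamilyProperties *props;

            vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, NULL);
            props = calloc(count, sizeof(*props));
            if (!props)
                    return found;
            vkGetPhysicalDeviceQueueFamilyProperties(phys, &count, props);

            for (i = 0; i < count; i++) {
                    if ((props[i].queueFlags & VK_QUEUE_COMPUTE_BIT) &&
                        !(props[i].queueFlags & VK_QUEUE_GRAPHICS_BIT)) {
                            found = i;
                            break;
                    }
            }
            free(props);
            return found;
    }

A high-priority queue for reprojection would then be created from
whichever family this returns.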
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>>
>> From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
>> on behalf of Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>>                                                                 Hi
>>                                                                 Everyone,
>>
>>                                                                 This
>>                                                                 RFC
>>                                                                 is
>>                                                                 also
>>                                                                 available
>>                                                                 as a
>>                                                                 gist
>>                                                                 here:
>>                                                                 https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>                                                                 <https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249>
>>
>>
>>
>>
>>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>                                                                 gist.github.com
>>                                                                 <http://gist.github.com>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>
>>
>>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>                                                                 gist.github.com
>>                                                                 <http://gist.github.com>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>
>>
>>
>>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>                                                                 gist.github.com
>>                                                                 <http://gist.github.com>
>>                                                                 [RFC]
>>                                                                 Mechanism
>>                                                                 for
>>                                                                 high
>>                                                                 priority
>>                                                                 scheduling
>>                                                                 in amdgpu
>>
>>
>>                                                                 We
>>                                                                 are
>>                                                                 interested
>>                                                                 in
>>                                                                 feedback
>>                                                                 for a
>>                                                                 mechanism
>>                                                                 to
>>                                                                 effectively
>>                                                                 schedule
>>                                                                 high
>>                                                                 priority
>>                                                                 VR
>>                                                                 reprojection
>>                                                                 tasks
>>                                                                 (also
>>                                                                 referred
>>                                                                 to as
>>                                                                 time-warping)
>>                                                                 for
>>                                                                 Polaris10
>>                                                                 running
>>                                                                 on
>>                                                                 the
>>                                                                 amdgpu
>>                                                                 kernel
>>                                                                 driver.
>>
>>                                                                 Brief
>>                                                                 context:
>>                                                                 --------------
>>
>>                                                                 The
>>                                                                 main
>>                                                                 objective
>>                                                                 of
>>                                                                 reprojection
>>                                                                 is to
>>                                                                 avoid
>>                                                                 motion
>>                                                                 sickness
>>                                                                 for VR
>>                                                                 users in
>>                                                                 scenarios
>>                                                                 where
>>                                                                 the
>>                                                                 game
>>                                                                 or
>>                                                                 application
>>                                                                 would
>>                                                                 fail
>>                                                                 to finish
>>                                                                 rendering
>>                                                                 a new
>>                                                                 frame
>>                                                                 in
>>                                                                 time
>>                                                                 for
>>                                                                 the
>>                                                                 next
>>                                                                 VBLANK.
>>                                                                 When
>>                                                                 this
>>                                                                 happens,
>>                                                                 the
>>                                                                 user's
>>                                                                 head
>>                                                                 movements
>>                                                                 are
>>                                                                 not
>>                                                                 reflected
>>                                                                 on
>>                                                                 the
>>                                                                 Head
>>                                                                 Mounted
>>                                                                 Display
>>                                                                 (HMD)
>>                                                                 for the
>>                                                                 duration
>>                                                                 of an
>>                                                                 extra
>>                                                                 frame.
>>                                                                 This
>>                                                                 extended
>>                                                                 mismatch
>>                                                                 between
>>                                                                 the
>>                                                                 inner ear
>>                                                                 and the
>>                                                                 eyes may
>>                                                                 cause
>>                                                                 the
>>                                                                 user
>>                                                                 to
>>                                                                 experience
>>                                                                 motion
>>                                                                 sickness.
>>
>>                                                                 The
>>                                                                 VR
>>                                                                 compositor
>>                                                                 deals
>>                                                                 with
>>                                                                 this
>>                                                                 problem
>>                                                                 by
>>                                                                 fabricating
>>                                                                 a
>>                                                                 new frame
>>                                                                 using the
>>                                                                 user's
>>                                                                 updated
>>                                                                 head
>>                                                                 position
>>                                                                 in
>>                                                                 combination
>>                                                                 with the
>>                                                                 previous
>>                                                                 frames.
>>                                                                 This
>>                                                                 avoids
>>                                                                 a
>>                                                                 prolonged
>>                                                                 mismatch
>>                                                                 between
>>                                                                 the
>>                                                                 HMD
>>                                                                 output
>>                                                                 and the
>>                                                                 inner
>>                                                                 ear.
>>
>>                                                                 Because
>>                                                                 of
>>                                                                 the
>>                                                                 adverse
>>                                                                 effects
>>                                                                 on
>>                                                                 the
>>                                                                 user,
>>                                                                 we
>>                                                                 require
>>                                                                 high
>>                                                                 confidence
>>                                                                 that the
>>                                                                 reprojection
>>                                                                 task
>>                                                                 will
>>                                                                 complete
>>                                                                 before
>>                                                                 the
>>                                                                 VBLANK
>>                                                                 interval.
>>                                                                 Even if
>>                                                                 the
>>                                                                 GFX pipe
>>                                                                 is
>>                                                                 currently
>>                                                                 full
>>                                                                 of
>>                                                                 work
>>                                                                 from
>>                                                                 the
>>                                                                 game/application
>>                                                                 (which
>>                                                                 is most
>>                                                                 likely
>>                                                                 the
>>                                                                 case).
>>
>>                                                                 For
>>                                                                 more
>>                                                                 details
>>                                                                 and
>>                                                                 illustrations,
>>                                                                 please
>>                                                                 refer
>>                                                                 to the
>>                                                                 following
>>                                                                 document:
>>                                                                 https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>                                                                 <https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved>
>>
>>
>>
>>
>>
>>                                                                 Gaming:
>>                                                                 Asynchronous
>>                                                                 Shaders
>>                                                                 Evolved
>>                                                                 |
>>                                                                 Community
>>                                                                 community.amd.com
>>                                                                 <http://community.amd.com>
>>                                                                 One
>>                                                                 of
>>                                                                 the
>>                                                                 most
>>                                                                 exciting
>>                                                                 new
>>                                                                 developments
>>                                                                 in
>>                                                                 GPU
>>                                                                 technology
>>                                                                 over the
>>                                                                 past
>>                                                                 year
>>                                                                 has
>>                                                                 been
>>                                                                 the
>>                                                                 adoption
>>                                                                 of
>>                                                                 asynchronous
>>                                                                 shaders,
>>                                                                 which can
>>                                                                 make
>>                                                                 more
>>                                                                 efficient
>>                                                                 use
>>                                                                 of ...
>>
>>
>>
>>                                                                 Gaming:
>>                                                                 Asynchronous
>>                                                                 Shaders
>>                                                                 Evolved
>>                                                                 |
>>                                                                 Community
>>                                                                 community.amd.com
>>                                                                 <http://community.amd.com>
>>                                                                 One
>>                                                                 of
>>                                                                 the
>>                                                                 most
>>                                                                 exciting
>>                                                                 new
>>                                                                 developments
>>                                                                 in
>>                                                                 GPU
>>                                                                 technology
>>                                                                 over the
>>                                                                 past
>>                                                                 year
>>                                                                 has
>>                                                                 been
>>                                                                 the
>>                                                                 adoption
>>                                                                 of
>>                                                                 asynchronous
>>                                                                 shaders,
>>                                                                 which can
>>                                                                 make
>>                                                                 more
>>                                                                 efficient
>>                                                                 use
>>                                                                 of ...
>>
>>
>>
>>                                                                 Gaming:
>>                                                                 Asynchronous
>>                                                                 Shaders
>>                                                                 Evolved
>>                                                                 |
>>                                                                 Community
>>                                                                 community.amd.com
>>                                                                 <http://community.amd.com>
>>                                                                 One
>>                                                                 of
>>                                                                 the
>>                                                                 most
>>                                                                 exciting
>>                                                                 new
>>                                                                 developments
>>                                                                 in
>>                                                                 GPU
>>                                                                 technology
>>                                                                 over the
>>                                                                 past
>>                                                                 year
>>                                                                 has
>>                                                                 been
>>                                                                 the
>>                                                                 adoption
>>                                                                 of
>>                                                                 asynchronous
>>                                                                 shaders,
>>                                                                 which can
>>                                                                 make
>>                                                                 more
>>                                                                 efficient
>>                                                                 use
>>                                                                 of ...
>>
>>
>>                                                                 Requirements:
>>                                                                 -------------
>>
>>                                                                 The
>>                                                                 mechanism
>>                                                                 must
>>                                                                 expose
>>                                                                 the
>>                                                                 following
>>                                                                 functionaility:
>>
>>                                                                     *
>>                                                                 Job
>>                                                                 round
>>                                                                 trip
>>                                                                 time
>>                                                                 must
>>                                                                 be
>>                                                                 predictable,
>>                                                                 from
>>                                                                 submission
>>                                                                 to
>>                                                                 fence
>>                                                                 signal
>>
>>                                                                     *
>>                                                                 The
>>                                                                 mechanism
>>                                                                 must
>>                                                                 support
>>                                                                 compute
>>                                                                 workloads.
>>
>>                                                                 Goals:
>>                                                                 ------
>>
>>                                                                     *
>>                                                                 The
>>                                                                 mechanism
>>                                                                 should
>>                                                                 provide
>>                                                                 low
>>                                                                 submission
>>                                                                 latencies
>>
>>                                                                 Test:
>>                                                                 submitting
>>                                                                 a NOP
>>                                                                 packet
>>                                                                 through
>>                                                                 the
>>                                                                 mechanism
>>                                                                 on busy
>>                                                                 hardware
>>                                                                 should
>>                                                                 be
>>                                                                 equivalent
>>                                                                 to
>>                                                                 submitting
>>                                                                 a NOP
>>                                                                 on
>>                                                                 idle
>>                                                                 hardware.
>>
>>                                                                 Nice
>>                                                                 to have:
>>                                                                 -------------
>>
>>                                                                     *
>>                                                                 The
>>                                                                 mechanism
>>                                                                 should
>>                                                                 also
>>                                                                 support
>>                                                                 GFX
>>                                                                 workloads.
>>
>>                                                                 My
>>                                                                 understanding
>>                                                                 is
>>                                                                 that
>>                                                                 with
>>                                                                 the
>>                                                                 current
>>                                                                 hardware
>>                                                                 capabilities
>>                                                                 in
>>                                                                 Polaris10
>>                                                                 we
>>                                                                 will
>>                                                                 not
>>                                                                 be
>>                                                                 able
>>                                                                 to
>> provide a solution compatible with GFX workloads.
>>
>> But I would love to hear otherwise. So if anyone has an idea, approach
>> or suggestion that will also be compatible with the GFX ring, please
>> let us know about it.
>>
>>     * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as
>> users running games are not traditionally running HPC workloads in the
>> background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the windows driver, we could expose a high priority compute
>> queue to userspace.
>>
>> Submissions to this compute queue will be scheduled with high priority,
>> and may acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in the
>> HQDs and could be programmed by amdgpu or the amdgpu scheduler. The
>> relevant register fields are:
>>         * mmCP_HQD_PIPE_PRIORITY
>>         * mmCP_HQD_QUEUE_PRIORITY
>>
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We
>> can statically partition these as follows:
>>         * 7x regular
>>         * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high
>> priority ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority rings
>> if the context is marked as high priority. And a corresponding priority
>> should be added to keep track of this information:
>>      * AMD_SCHED_PRIORITY_KERNEL
>>      * -> AMD_SCHED_PRIORITY_HIGH
>>      * AMD_SCHED_PRIORITY_NORMAL
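>>
>> As a rough sketch (the ordering and the new name here are assumptions,
>> not final code), the extended enum could look like:
>>
>>     enum amd_sched_priority {
>>         AMD_SCHED_PRIORITY_KERNEL = 0, /* existing: kernel jobs first */
>>         AMD_SCHED_PRIORITY_HIGH,       /* proposed: high priority ctxs */
>>         AMD_SCHED_PRIORITY_NORMAL,     /* existing: default */
>>         AMD_SCHED_MAX_PRIORITY
>>     };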
>>
>> The user will request a high priority context by setting an appropriate
>> flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>
>> The setting is at a per context level so that we can:
>>     * Maintain a consistent FIFO ordering of all submissions to a
>>       context
>>     * Create high priority and non-high priority contexts in the same
>>       process
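>>
>> For illustration, context creation could then map the flag onto the
>> new scheduler priority along these lines (the flag name and helper are
>> placeholders, not final code):
>>
>>     #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0) /* placeholder uAPI flag */
>>
>>     static enum amd_sched_priority
>>     amdgpu_ctx_flags_to_priority(u32 flags)
>>     {
>>         if (flags & AMDGPU_CTX_HIGH_PRIORITY)
>>             return AMD_SCHED_PRIORITY_HIGH;
>>         return AMD_SCHED_PRIORITY_NORMAL;
>>     }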
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities at
>> amdgpu_init() time, the SW scheduler will reprogram the queue
>> priorities dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the
>> scheduler to set the appropriate queue priority:
>>     set_priority(int ring, int index, int priority)
>>
>> During this callback we would have to grab the SRBM mutex to perform
>> the appropriate HW programming, and I'm not really sure if that is
>> something we should be doing from the scheduler.
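>>
>> To make that concrete, the callback body could look roughly like the
>> sketch below, assuming the usual srbm_select()/WREG32 pattern applies
>> (helper and function names follow gfx8-era conventions but are
>> assumptions, not final code):
>>
>>     static void gfx_v8_0_hqd_set_priority(struct amdgpu_device *adev,
>>                                           u32 me, u32 pipe, u32 queue,
>>                                           u32 priority)
>>     {
>>         mutex_lock(&adev->srbm_mutex);
>>         vi_srbm_select(adev, me, pipe, queue, 0); /* target the HQD */
>>         WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>>         WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>>         vi_srbm_select(adev, 0, 0, 0, 0);         /* restore default */
>>         mutex_unlock(&adev->srbm_mutex);
>>     }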
>>
>> On the positive side, this approach would allow us to program a range
>> of priorities for jobs instead of a single "high priority" value,
>> achieving something similar to the niceness API available for CPU
>> scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for
>> our use case, but it might be useful in other scenarios (multiple users
>> sharing compute time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>> repurposing of the flags field.
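>>
>> From userspace the request could then look like this (sketch only; the
>> 'priority' member below is the hypothetical new int field, it does not
>> exist in drm_amdgpu_ctx_in today):
>>
>>     union drm_amdgpu_ctx args;
>>     int ret;
>>
>>     memset(&args, 0, sizeof(args));
>>     args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
>>     args.in.priority = 15; /* hypothetical niceness-style value */
>>     ret = drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args, sizeof(args));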
>>
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and
>> instead it picks jobs at random. Settings from the shader itself are
>> also disregarded as this is considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we
>> might not get the time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority
>> propagation from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled for
>> all HW IPs with support of the SW scheduler. This will function
>> similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where the
>> job can jump ahead of anything not committed to the HW queue.
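>>
>> This would plug into the existing entity selection in the SW scheduler,
>> roughly as follows (a sketch from memory, not the actual function):
>>
>>     static struct amd_sched_entity *
>>     amd_sched_select_entity(struct amd_gpu_scheduler *sched)
>>     {
>>         struct amd_sched_entity *entity = NULL;
>>         int i;
>>
>>         /* Drain higher priority run queues first, so a high priority
>>          * job jumps ahead of anything not yet in the HW queue. */
>>         for (i = 0; i < AMD_SCHED_MAX_PRIORITY; i++) {
>>             entity = amd_sched_rq_select_entity(&sched->sched_rq[i]);
>>             if (entity)
>>                 break;
>>         }
>>         return entity;
>>     }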
>>
>> The benefits of requesting a high priority context for a non-compute
>> queue will be lesser (e.g. up to 10s of wait time if a GFX command is
>> stuck in front of you), but having the API in place will allow us to
>> easily improve the implementation in the future as new features become
>> available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the
>> implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about
>> exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal is
>> to obtain a mechanism that will allow us to complete the reprojection
>> job within a predictable amount of time. So if anyone has any
>> suggestions for improvements or alternative strategies we are more than
>> happy to hear them.
>>
>> If any of the technical information above is also incorrect, feel free
>> to point out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>


[-- Attachment #1.2: Type: text/html, Size: 129159 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-19 15:49 Pierre-Loup Griffais
  0 siblings, 0 replies; 36+ messages in thread
From: Pierre-Loup Griffais @ 2016-12-19 15:49 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: zhoucm1, Mao, David, Andres Rodriguez, Andres Rodriguez,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Huan, Alvin, Zhang,
	Hawking


[-- Attachment #1.1: Type: text/plain, Size: 28638 bytes --]

On Dec 19, 2016 6:48 AM, Serguei Sagalovitch <serguei.sagalovitch@amd.com> wrote:
>> If compute queue is occupied only by you, the efficiency
>> is equal with setting job queue to high priority I think.
> The only risk is the situation when graphics will take all
> needed CUs. But in any case it should be a very good test.
>
> Andres/Pierre-Loup,
>
> Did you try to do it, or is it a lot of work for you?

The system will be fully loaded by the VR client application when this feature needs to be used, with hopefully both a graphics and a compute job in flight using 100% of the CU capacity.

Let me try to succinctly sum up requirements since you asked in the other branch:

On a fully loaded system (optimal occupancy by the VR client app), we would like the VR runtime to be able to submit a task (graphics or compute, but we realize only compute might be possible for best results) and get results in a consistent amount of time. Ideally that time would be close to the time it would take to complete the same task on an otherwise idle system, but it's assumed there would be a fixed cost added to it due to winding down in-flight CUs. The quality of service provided by the feature would depend on how predictably small such a cost would be. 11ms would be a current upper limit, but not really a useful number for the purpose of discussion, as the feature would be beyond useless at that point. Being able to intervene 1ms before vblank of the HMD and consistently get our task completed in time would be good.


> BTW: If there is a non-VR application which uses the high-priority
> h/w queue then the VR application will suffer. Any ideas how
> to solve it?

The intent is that this interface will require some sort of privilege that only the VR compositor would have on a well-configured VR system; higher-than-average niceness would be one idea. If you have any suggestions there as well, that would be good discussion.
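
As an illustration of the shape such a gate could take in the context-creation path (a minimal sketch only; the use of CAP_SYS_NICE and the -10 nice threshold are assumptions, not a settled policy):

#include <linux/capability.h>
#include <linux/sched.h>

/* Sketch: allow high priority contexts only for privileged tasks or
 * tasks that were already granted elevated CPU priority. */
static bool amdgpu_ctx_high_prio_permitted(void)
{
        if (capable(CAP_SYS_NICE))
                return true;

        /* "Higher-than-average niceness": nice values below the
         * default of 0 mean elevated CPU priority; -10 is an
         * arbitrary illustrative cutoff. */
        return task_nice(current) <= -10;
}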

Thanks,
 - Pierre-Loup


> Sincerely yours,
> Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:
> Do you encounter the priority issue for compute queue with current
> driver?
>
> If compute queue is occupied only by you, the efficiency is equal with
> setting job queue to high priority I think.
>
> Regards,
> David Zhou
>
> On 2016年12月19日 13:29, Andres Rodriguez wrote:
>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>
>> I'm not sure if I'm asking for too much, but if we can coordinate a
>> similar interface in radv and amdgpu-pro at the vulkan level that
>> would be great.
>>
>> I'm not sure what that's going to be yet.
>>
>> - Andres
>>
>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>
>>>
>>> On 2016年12月19日 11:33, Pierre-Loup A. Griffais wrote:
>>>> We're currently working with the open stack; I assume that a
>>>> mechanism could be exposed by both open and Pro Vulkan userspace
>>>> drivers and that the amdgpu kernel interface improvements we would
>>>> pursue following this discussion would let both drivers take
>>>> advantage of the feature, correct?
>>> Of course.
>>> Does open stack have Vulkan support?
>>>
>>> Regards,
>>> David Zhou
>>>>
>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>
>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>
>>>>> Regards,
>>>>> David Zhou
>>>>>
>>>>> On 2016年12月18日 06:05, Pierre-Loup A. Griffais wrote:
>>>>>> Hi Serguei,
>>>>>>
>>>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>>>> see replies inline.
>>>>>>
>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>> Andres,
>>>>>>>
>>>>>>>>  For current VR workloads we have 3 separate processes running
>>>>>>>> actually:
>>>>>>> So we could have a potential memory overcommit case, or do you do
>>>>>>> partitioning on your own?  I would think that there is a need to
>>>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>>>
>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>> prioritized CPU scheduling for its VR compositor, we're working on
>>>>>> prioritized GPU scheduling and pre-emption (eg. this thread), and in
>>>>>> the future it will make sense to do work in order to make sure that
>>>>>> its memory allocations do not get evicted, to prevent any unwelcome
>>>>>> additional latency in the event of needing to perform just-in-time
>>>>>> reprojection.
>>>>>>
>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>> Based on my understanding, sharing BOs between different processes
>>>>>>> could introduce additional synchronization constraints. BTW: I am
>>>>>>> not sure if we are able to share Vulkan sync objects across a
>>>>>>> process boundary.
>>>>>>
>>>>>> They are different processes; it is important for the compositor
>>>>>> that
>>>>>> is responsible for quality-of-service features such as consistently
>>>>>> presenting distorted frames with the right latency, reprojection,
>>>>>> etc,
>>>>>> to be separate from the main application.
>>>>>>
>>>>>> Currently we are using unreleased cross-process memory and semaphore
>>>>>> extensions to fetch updated eye images from the client application,
>>>>>> but the just-in-time reprojection discussed here does not actually
>>>>>> have any direct interactions with cross-process resource sharing,
>>>>>> since it's achieved by using whatever is the latest, most up-to-date
>>>>>> eye images that have already been sent by the client application,
>>>>>> which are already available to use without additional
>>>>>> synchronization.
>>>>>>
>>>>>>>
>>>>>>>>    3) System compositor (we are looking at approaches to remove
>>>>>>>> this
>>>>>>>> overhead)
>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>
>>>>>> Yes, we are working on mechanisms to present directly to the headset
>>>>>> display without any intermediaries as a separate effort.
>>>>>>
>>>>>>>
>>>>>>>>  The latency is our main concern,
>>>>>>> I would assume that this is a known problem (at least for compute
>>>>>>> usage).
>>>>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>>>>> (at least in the default configuration).
>>>>>>
>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>> However, if there are high degrees of variance then that would be
>>>>>> troublesome and we would need to account for the worst case.
>>>>>>
>>>>>> Hopefully the requirements and approach we described make sense,
>>>>>> we're
>>>>>> looking forward to your feedback and suggestions.
>>>>>>
>>>>>> Thanks!
>>>>>>  - Pierre-Loup
>>>>>>
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hey Serguei,
>>>>>>>
>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it.  As far as I
>>>>>>>> understand (by simplifying)
>>>>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>>>>> scheme but I do not think
>>>>>>>> that it is ideal.  I would assume that we need to switch to
>>>>>>>> dynamic partitioning
>>>>>>>> of resources based on the workload, otherwise we will have
>>>>>>>> resource
>>>>>>>> conflict
>>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>
>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start
>>>>>>> with a
>>>>>>> solution that assumes that
>>>>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>>>>> running on the system).
>>>>>>>
>>>>>>> This should be more or less the use case we expect from VR users.
>>>>>>>
>>>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>>>> that a separate task, because
>>>>>>> making it dynamic is not straightforward :P
>>>>>>>
>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>> will not be
>>>>>>>> involved.  I would assume that in the case of VR we will have
>>>>>>>> one main
>>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>
>>>>>>> Correct, this is why we want to enable the high priority compute
>>>>>>> queue through
>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>
>>>>>>> For current VR workloads we have 3 separate processes running
>>>>>>> actually:
>>>>>>>     1) Game process
>>>>>>>     2) VR Compositor (this is the process that will require high
>>>>>>> priority queue)
>>>>>>>     3) System compositor (we are looking at approaches to remove
>>>>>>> this
>>>>>>> overhead)
>>>>>>>
>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>> simultaneously, but
>>>>>>> I would also like to be able to address this case in the future
>>>>>>> (cross-pipe priorities).
>>>>>>>
>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it
>>>>>>>> may take time so
>>>>>>>> latency may suffer
>>>>>>>
>>>>>>> The latency is our main concern, we want something that is
>>>>>>> predictable. A good
>>>>>>> illustration of what the reprojection scheduling looks like can be
>>>>>>> found here:
>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>> executed
>>>>>>>> in order.
>>>>>>>
>>>>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>>>>> the game context, and it
>>>>>>> even happens in a separate process.
>>>>>>>
>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>>>>>> "preempt" and
>>>>>>>> "cancel/abort"
>>>>>>>
>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>
>>>>>>>> (b) Vulkan is a generic API and could be used for graphics as
>>>>>>>> well as
>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>
>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure out a
>>>>>>> way
>>>>>>> for us to get
>>>>>>> a guaranteed execution time using vulkan graphics, then I'll
>>>>>>> take you
>>>>>>> out for a beer :)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>> ________________________________________
>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hi Andres,
>>>>>>>
>>>>>>> Please see inline (as [Serguei])
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>> From: Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>> To: Sagalovitch, Serguei; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hi Serguei,
>>>>>>>
>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch@amd.com]
>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>> To: Andres Rodriguez; amd-gfx@lists.freedesktop.org
>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Andres,
>>>>>>>
>>>>>>>
>>>>>>> Quick comments:
>>>>>>>
>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>> assignments/binding to the high-priority queue when it is in use
>>>>>>> and "free" them later (we do not want to take CUs away from e.g. a
>>>>>>> graphics task forever and degrade graphics performance).
>>>>>>>
>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>> low-priority compute) takes all (extra) CUs and high-priority work
>>>>>>> waits for needed resources.
>>>>>>> It will not be visible with "NOP" but only when you submit a "real"
>>>>>>> compute task, so I would recommend not using "NOP" packets at all
>>>>>>> for testing.
>>>>>>>
>>>>>>> It (CU assignment) could be done relatively easily when everything
>>>>>>> goes via the kernel (e.g. as part of frame submission), but I must
>>>>>>> admit that I am not sure about the best way for user level
>>>>>>> submissions (amdkfd).
>>>>>>>
>>>>>>> [AR] I wasn't aware of this part of the programming sequence.
>>>>>>> Thanks
>>>>>>> for the heads up!
>>>>>>> Is this similar to the CU masking programming?
>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>>>>> when deciding which queue to run, will check if there are enough
>>>>>>> resources, and if not it will begin to check other queues with
>>>>>>> lower priority.
>>>>>>>
>>>>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>>>>> queue and having nothing there except it.
>>>>>>>
>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as
>>>>>>> opposed
>>>>>>> to the MEC definition
>>>>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>>>>> only has access to 1 pipe,
>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>
>>>>>>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I
>>>>>>> understand (by simplifying)
>>>>>>> some scheduling is per pipe.  I know about the current allocation
>>>>>>> scheme but I do not think
>>>>>>> that it is ideal.  I would assume that we need to switch to
>>>>>>> dynamic partitioning
>>>>>>> of resources based on the workload, otherwise we will have resource
>>>>>>> conflict
>>>>>>> between Vulkan compute and  OpenCL.
>>>>>>>
>>>>>>>
>>>>>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>>>>>> OpenCL?
>>>>>>>
>>>>>>> [AR] Vulkan
>>>>>>>
>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>> will
>>>>>>> not be
>>>>>>> involved.  I would assume that in the case of VR we will have
>>>>>>> one main
>>>>>>> application ("console" mode(?)) so we could temporarily "ignore"
>>>>>>> OpenCL/ROCm needs when VR is running.
>>>>>>>
>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>> workloads.
>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>
>>>>>>> [AR] Yeah, my understanding is that pre-empting the currently
>>>>>>> running
>>>>>>> graphics job and scheduling in
>>>>>>> something else using mid-buffer pre-emption has some cases where it
>>>>>>> doesn't work well. But if with
>>>>>>> polaris10 it starts working well, it might be a better solution for
>>>>>>> us (because the whole reprojection
>>>>>>> work uses the vulkan graphics stack at the moment, and porting
>>>>>>> it to
>>>>>>> compute is not trivial).
>>>>>>>
>>>>>>> [Serguei]  The problem with pre-emption of graphics task: (a) it
>>>>>>> may
>>>>>>> take time so
>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>> "context"
>>>>>>> - we want
>>>>>>> to guarantee that submissions from the same context will be
>>>>>>> executed
>>>>>>> in order.
>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>>>> "preempt" and
>>>>>>> "cancel/abort"?  (b) Vulkan is a generic API and could be used
>>>>>>> for graphics as well as for plain compute tasks
>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>
>>>>>>>
>>>>>>> Sincerely yours,
>>>>>>> Serguei Sagalovitch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of
>>>>>>> Andres Rodriguez <andresr@valvesoftware.com>
>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> This RFC is also available as a gist here:
>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>
>>>>>>>
>>>>>>> We are interested in feedback for a mechanism to effectively
>>>>>>> schedule
>>>>>>> high
>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>> time-warping) for
>>>>>>> Polaris10
>>>>>>> running on the amdgpu kernel driver.
>>>>>>>
>>>>>>> Brief context:
>>>>>>> --------------
>>>>>>>
>>>>>>> The main objective of reprojection is to avoid motion sickness
>>>>>>> for VR
>>>>>>> users in
>>>>>>> scenarios where the game or application would fail to finish
>>>>>>> rendering a new
>>>>>>> frame in time for the next VBLANK. When this happens, the user's
>>>>>>> head
>>>>>>> movements
>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>> duration
>>>>>>> of an
>>>>>>> extra frame. This extended mismatch between the inner ear and the
>>>>>>> eyes may
>>>>>>> cause the user to experience motion sickness.
>>>>>>>
>>>>>>> The VR compositor deals with this problem by fabricating a new
>>>>>>> frame
>>>>>>> using the
>>>>>>> user's updated head position in combination with the previous
>>>>>>> frames.
>>>>>>> This
>>>>>>> avoids a prolonged mismatch between the HMD output and the inner
>>>>>>> ear.
>>>>>>>
>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>> confidence that the
>>>>>>> reprojection task will complete before the VBLANK interval. Even if
>>>>>>> the GFX pipe
>>>>>>> is currently full of work from the game/application (which is most
>>>>>>> likely the case).
>>>>>>>
>>>>>>> For more details and illustrations, please refer to the following
>>>>>>> document:
>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>
>>>>>>>
>>>>>>> Requirements:
>>>>>>> -------------
>>>>>>>
>>>>>>> The mechanism must expose the following functionality:
>>>>>>>
>>>>>>>     * Job round trip time must be predictable, from submission to
>>>>>>> fence signal
>>>>>>>
>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>
>>>>>>> Goals:
>>>>>>> ------
>>>>>>>
>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>
>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>> hardware
>>>>>>> should
>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>
>>>>>>> Nice to have:
>>>>>>> -------------
>>>>>>>
>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>
>>>>>>> My understanding is that with the current hardware capabilities in
>>>>>>> Polaris10 we
>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>> workloads.
>>>>>>>
>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>> approach or
>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>> please let
>>>>>>> us know
>>>>>>> about it.
>>>>>>>
>>>>>>>     * The above guarantees should also be respected by amdkfd
>>>>>>> workloads
>>>>>>>
>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>> necessary as
>>>>>>> users running
>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>> background.
>>>>>>>
>>>>>>> Proposed approach:
>>>>>>> ------------------
>>>>>>>
>>>>>>> Similar to the Windows driver, we could expose a high priority
>>>>>>> compute queue to
>>>>>>> userspace.
>>>>>>>
>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>> priority, and may
>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>
>>>>>>> This can be achieved by taking advantage of the 'priority' field in
>>>>>>> the HQDs
>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. The
>>>>>>> relevant
>>>>>>> register fields are:
>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>
>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>> pipe0. We can
>>>>>>> statically partition these as follows:
>>>>>>>         * 7x regular
>>>>>>>         * 1x high priority
>>>>>>>
>>>>>>> The relevant priorities can be set so that submissions to the high
>>>>>>> priority
>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>
>>>>>>> The amdgpu scheduler will only place jobs into the high priority
>>>>>>> rings if the
>>>>>>> context is marked as high priority. And a corresponding priority
>>>>>>> should be
>>>>>>> added to keep track of this information:
>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>
>>>>>>> The user will request a high priority context by setting an
>>>>>>> appropriate flag
>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>>>>> context
>>>>>>>     * Create high priority and non-high priority contexts in the
>>>>>>> same
>>>>>>> process
>>>>>>>
>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>> ---------------------------------------------------------
>>>>>>>
>>>>>>> Similar to the above, but instead of programming the priorities at
>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>>>> priorities
>>>>>>> dynamically when scheduling a task.
>>>>>>>
>>>>>>> This would involve having a hardware specific callback from the
>>>>>>> scheduler to
>>>>>>> set the appropriate queue priority: set_priority(int ring, int
>>>>>>> index,
>>>>>>> int priority)
>>>>>>>
>>>>>>> During this callback we would have to grab the SRBM mutex to
>>>>>>> perform
>>>>>>> the appropriate
>>>>>>> HW programming, and I'm not really sure if that is something we
>>>>>>> should be doing from
>>>>>>> the scheduler.
>>>>>>>
>>>>>>> On the positive side, this approach would allow us to program a
>>>>>>> range of
>>>>>>> priorities for jobs instead of a single "high priority" value,
>>>>>>> achieving
>>>>>>> something similar to the niceness API available for CPU scheduling.
>>>>>>>
>>>>>>> I'm not sure if this flexibility is something that we would need
>>>>>>> for
>>>>>>> our use
>>>>>>> case, but it might be useful in other scenarios (multiple users
>>>>>>> sharing compute
>>>>>>> time on a server).
>>>>>>>
>>>>>>> This approach would require a new int field in
>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>> repurposing
>>>>>>> of the flags field.
>>>>>>>
>>>>>>> Known current obstacles:
>>>>>>> ------------------------
>>>>>>>
>>>>>>> The SQ is currently programmed to disregard the HQD priorities, and
>>>>>>> instead it picks
>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>> disregarded
>>>>>>> as this is
>>>>>>> considered a privileged field.
>>>>>>>
>>>>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>>>>> might not get the
>>>>>>> time we need on the SQ.
>>>>>>>
>>>>>>> The current programming would have to be changed to allow priority
>>>>>>> propagation
>>>>>>> from the HQD into the SQ.
>>>>>>>
>>>>>>> Generic approach for all HW IPs:
>>>>>>> --------------------------------
>>>>>>>
>>>>>>> For consistency purposes, the high priority context can be enabled
>>>>>>> for all HW IPs
>>>>>>> with support of the SW scheduler. This will function similarly
>>>>>>> to the
>>>>>>> current
>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
>>>>>>> anything not
>>>>>>> committed to the HW queue.
>>>>>>>
>>>>>>> The benefits of requesting a high priority context for a
>>>>>>> non-compute
>>>>>>> queue will
>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in
>>>>>>> front of
>>>>>>> you), but having the API in place will allow us to easily
>>>>>>> improve the
>>>>>>> implementation
>>>>>>> in the future as new features become available in new hardware.
>>>>>>>
>>>>>>> Future steps:
>>>>>>> -------------
>>>>>>>
>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>> implementation.
>>>>>>>
>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>> thinking about
>>>>>>> exposing the high priority queue through radv.
>>>>>>>
>>>>>>> Request for feedback:
>>>>>>> ---------------------
>>>>>>>
>>>>>>> We aren't married to any of the approaches outlined above. Our goal
>>>>>>> is to
>>>>>>> obtain a mechanism that will allow us to complete the reprojection
>>>>>>> job within a
>>>>>>> predictable amount of time. So if anyone has any
>>>>>>> suggestions for
>>>>>>> improvements or alternative strategies we are more than happy to
>>>>>>> hear
>>>>>>> them.
>>>>>>>
>>>>>>> If any of the technical information above is also incorrect, feel
>>>>>>> free to point
>>>>>>> out my misunderstandings.
>>>>>>>
>>>>>>> Looking forward to hearing from you.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx@lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>

> Sincerely yours,
> Serguei Sagalovitch



[-- Attachment #1.2: Type: text/html, Size: 49192 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC] Mechanism for high priority scheduling in amdgpu
@ 2016-12-16 23:24 Andres Rodriguez
  0 siblings, 0 replies; 36+ messages in thread
From: Andres Rodriguez @ 2016-12-16 23:24 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Everyone,

We are interested in feedback for a mechanism to effectively schedule high
priority VR reprojection tasks (also referred to as time-warping) for Polaris10
running on the amdgpu kernel driver.

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR users in
scenarios where the game or application would fail to finish rendering a new
frame in time for the next VBLANK. When this happens, the user's head movements
are not reflected on the Head Mounted Display (HMD) for the duration of an
extra frame. This extended mismatch between the inner ear and the eyes may
cause the user to experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame using the
user's updated head position in combination with the previous frames. This
avoids a prolonged mismatch between the HMD output and the inner ear.

Because of the adverse effects on the user, we require high confidence that the
reprojection task will complete before the VBLANK interval. Even if the GFX pipe
is currently full of work from the game/application (which is most likely the case).

For more details and illustrations, please refer to the following document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Requirements:
-------------

The mechanism must expose the following functionality:

    * Job round trip time must be predictable, from submission to fence signal

    * The mechanism must support compute workloads.

Goals:
------

    * The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware should
be equivalent to submitting a NOP on idle hardware.
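
To make the test concrete, the measurement could look roughly like the sketch below; submit_nop_and_wait() is a hypothetical stand-in for the libdrm submit-and-wait sequence (amdgpu_cs_submit() followed by a fence wait), not a real API:

#include <stdio.h>
#include <time.h>

int submit_nop_and_wait(void);  /* hypothetical helper */

int main(void)
{
        struct timespec t0, t1;
        long us;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (submit_nop_and_wait())      /* submit NOP, wait for fence */
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        us = (t1.tv_sec - t0.tv_sec) * 1000000 +
             (t1.tv_nsec - t0.tv_nsec) / 1000;
        printf("NOP round trip: %ld us\n", us);
        return 0;
}

Run once on an idle system and once under full GFX load; the mechanism meets this goal if the two numbers are close.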

Nice to have:
-------------

    * The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in Polaris10 we
will not be able to provide a solution compatible with GFX workloads.

But I would love to hear otherwise. So if anyone has an idea, approach or
suggestion that will also be compatible with the GFX ring, please let us know
about it.

    * The above guarantees should also be respected by amdkfd workloads

Would be good to have for consistency, but not strictly necessary as users running
games are not traditionally running HPC workloads in the background.

Proposed approach:
------------------

Similar to the Windows driver, we could expose a high priority compute queue to
userspace.

Submissions to this compute queue will be scheduled with high priority, and may
acquire hardware resources previously in use by other queues.

This can be achieved by taking advantage of the 'priority' field in the HQDs
and could be programmed by amdgpu or the amdgpu scheduler. The relevant
register fields are:
        * mmCP_HQD_PIPE_PRIORITY
        * mmCP_HQD_QUEUE_PRIORITY
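
For illustration, the per-queue programming could look roughly like the sketch below. The two register names are the ones listed above; the SRBM select/lock helpers follow the existing gfx8 code, but the priority encodings and the function as a whole are assumptions:

/* Sketch: bump one HQD to high priority. The HQD registers are
 * indexed via SRBM, so the queue must be selected under the SRBM
 * mutex. */
#define HQD_QUEUE_PRIORITY_HIGH 0xf     /* assumed encoding */
#define HQD_PIPE_PRIORITY_HIGH  0x2     /* assumed encoding */

static void gfx_v8_0_set_hqd_high_priority(struct amdgpu_device *adev,
                                           u32 me, u32 pipe, u32 queue)
{
        mutex_lock(&adev->srbm_mutex);
        vi_srbm_select(adev, me, pipe, queue, 0);

        WREG32(mmCP_HQD_PIPE_PRIORITY, HQD_PIPE_PRIORITY_HIGH);
        WREG32(mmCP_HQD_QUEUE_PRIORITY, HQD_QUEUE_PRIORITY_HIGH);

        vi_srbm_select(adev, 0, 0, 0, 0);
        mutex_unlock(&adev->srbm_mutex);
}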

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We can
statically partition these as follows:
        * 7x regular
        * 1x high priority

The relevant priorities can be set so that submissions to the high priority
ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority rings if the
context is marked as high priority. And a corresponding priority should be
added to keep track of this information:
     * AMD_SCHED_PRIORITY_KERNEL
     * -> AMD_SCHED_PRIORITY_HIGH
     * AMD_SCHED_PRIORITY_NORMAL

The user will request a high priority context by setting an appropriate flag
in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is at a per-context level so that we can:
    * Maintain a consistent FIFO ordering of all submissions to a context
    * Create high priority and non-high priority contexts in the same process
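
A sketch of what the userspace side could look like; AMDGPU_CTX_HIGH_PRIORITY is the proposed flag (it does not exist yet, and its value here is a placeholder), while the ioctl and the drm_amdgpu_ctx structures are the existing interface:

#include <sys/ioctl.h>
#include "amdgpu_drm.h"         /* from libdrm */

#define AMDGPU_CTX_HIGH_PRIORITY (1 << 0)       /* proposed flag */

/* Sketch: allocate a context and ask for high priority scheduling. */
static int alloc_high_prio_ctx(int fd, unsigned int *ctx_id)
{
        union drm_amdgpu_ctx args = {
                .in.op = AMDGPU_CTX_OP_ALLOC_CTX,
                .in.flags = AMDGPU_CTX_HIGH_PRIORITY,
        };
        int r = ioctl(fd, DRM_IOCTL_AMDGPU_CTX, &args);

        if (r == 0)
                *ctx_id = args.out.alloc.ctx_id;
        return r;
}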

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue priorities
dynamically when scheduling a task.

This would involve having a hardware specific callback from the scheduler to
set the appropriate queue priority: set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform the appropriate
HW programming, and I'm not really sure if that is something we should be doing from
the scheduler.
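
To make the callback shape concrete, it could be hung off the scheduler's backend ops roughly as sketched below; the placement in amd_sched_backend_ops and the call site are assumptions about where such a hook would live:

/* Sketch: optional per-backend priority hook, invoked by the SW
 * scheduler before it emits a job from a high priority context. */
struct amd_sched_backend_ops {
        /* ... existing callbacks elided ... */
        void (*set_priority)(int ring, int index, int priority);
};

static void sched_prepare_job(struct amd_sched_backend_ops *ops,
                              int ring, int index, int priority)
{
        /* The backend implementation would do the SRBM-protected
         * register writes sketched earlier. */
        if (ops->set_priority)
                ops->set_priority(ring, index, priority);
}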

On the positive side, this approach would allow us to program a range of
priorities for jobs instead of a single "high priority" value, achieving
something similar to the niceness API available for CPU scheduling.

I'm not sure if this flexibility is something that we would need for our use
case, but it might be useful in other scenarios (multiple users sharing compute
time on a server).

This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing
of the flags field.

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and instead it picks
jobs at random. Settings from the shader itself are also disregarded as this is
considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we might not get the
time we need on the SQ.

The current programming would have to be changed to allow priority propagation
from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled for all HW IPs
with support of the SW scheduler. This will function similarly to the current
AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not
committed to the HW queue.

The benefits of requesting a high priority context for a non-compute queue will
be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of
you), but having the API in place will allow us to easily improve the implementation
in the future as new features become available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the implementation.

Also, once the interface is mostly decided, we can start thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above. Our goal is to
obtain a mechanism that will allow us to complete the reprojection job within a
predictable amount of time. So if anyone has any suggestions for
improvements or alternative strategies we are more than happy to hear them.

If any of the technical information above is also incorrect, feel free to point
out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2017-01-02 15:43 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-16 23:15 [RFC] Mechanism for high priority scheduling in amdgpu Andres Rodriguez
     [not found] ` <544E607D03B20249AA404517E498FC4699EBD3-Lp/cVzEoVyaisxZYEgh0i620KmCxYQEWVpNB7YpNyf8@public.gmane.org>
2016-12-17  1:15   ` Sagalovitch, Serguei
     [not found]     ` <SN1PR12MB070363D0810B783234322C46FE9F0-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-17  1:29       ` Andres Rodriguez
     [not found]         ` <544E607D03B20249AA404517E498FC4699EC41-Lp/cVzEoVyaisxZYEgh0i620KmCxYQEWVpNB7YpNyf8@public.gmane.org>
2016-12-17  2:13           ` Sagalovitch, Serguei
     [not found]             ` <SN1PR12MB070320C32E73A8FC1E3102F1FE9F0-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-17  3:00               ` Andres Rodriguez
     [not found]                 ` <544E607D03B20249AA404517E498FC4699EC70-Lp/cVzEoVyaisxZYEgh0i620KmCxYQEWVpNB7YpNyf8@public.gmane.org>
2016-12-17  5:05                   ` Sagalovitch, Serguei
     [not found]                     ` <SN1PR12MB0703173C7AD623F6C5AECE7DFE9F0-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-17 22:05                       ` Pierre-Loup A. Griffais
     [not found]                         ` <bd0ba668-3d13-6343-a1c6-de5d0b7b3be3-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
2016-12-19  3:26                           ` zhoucm1
     [not found]                             ` <58575362.2030100-5C7GfCeVMHo@public.gmane.org>
2016-12-19  3:33                               ` Pierre-Loup A. Griffais
     [not found]                                 ` <361f177c-bf55-1525-4f35-86708e4f8d9f-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
2016-12-19  5:11                                   ` zhoucm1
     [not found]                                     ` <58576C15.1070909-5C7GfCeVMHo@public.gmane.org>
2016-12-19  5:29                                       ` Andres Rodriguez
     [not found]                                         ` <2bf5afce-d5b8-4eaf-0fcd-a7ebfe85f92e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-12-19  5:50                                           ` zhoucm1
     [not found]                                             ` <5857751C.5060409-5C7GfCeVMHo@public.gmane.org>
2016-12-19 14:48                                               ` Serguei Sagalovitch
     [not found]                                                 ` <d8cf437e-af88-c76d-428f-53912bc43d2b-5C7GfCeVMHo@public.gmane.org>
2016-12-20 12:56                                                   ` Christian König
     [not found]                                                     ` <5068f779-50ad-5e17-6d7e-8493e8fdd78a-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2016-12-20 15:51                                                       ` Andres Rodriguez
     [not found]                                                         ` <afc51505-7f86-a963-5d3a-be9df538019e-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-12-20 17:20                                                           ` Pierre-Loup A. Griffais
2016-12-22 11:42                                                           ` Christian König
     [not found]                                                             ` <76892a0d-677b-f0cb-d4e7-74d29b4a0aa7-5C7GfCeVMHo@public.gmane.org>
2016-12-22 16:35                                                               ` Andres Rodriguez
     [not found]                                                                 ` <8ab5bb4d-f331-d991-f208-ec7c0a25662a-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-12-22 16:41                                                                   ` Serguei Sagalovitch
     [not found]                                                                     ` <fd1f1a6f-f72a-3e65-bb6f-17671d8b1d6b-5C7GfCeVMHo@public.gmane.org>
2016-12-22 19:54                                                                       ` Pierre-Loup A. Griffais
     [not found]                                                                         ` <2e8051cb-09b1-c5cb-cb5a-b7ca30f65e89-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
2016-12-23 10:54                                                                           ` Christian König
     [not found]                                                                             ` <1c3ea5aa-36ee-5031-5f32-d860e9e0bf7c-5C7GfCeVMHo@public.gmane.org>
2016-12-23 16:13                                                                               ` Andres Rodriguez
     [not found]                                                                                 ` <CAFQ_0eFRaCKKk9BaMyahBARzFEdXP9gQWbK+61R0snDz08qGdw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-23 16:20                                                                                   ` Bridgman, John
     [not found]                                                                                     ` <BN6PR12MB13485DCB60A2308A3C28A62CE8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-23 16:30                                                                                       ` Andres Rodriguez
     [not found]                                                                                         ` <CAFQ_0eGgYpb-d+OBG-q2S=Ha90GrNGBrTRvfvY64B_ya7Pvyzw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-23 16:49                                                                                           ` Bridgman, John
     [not found]                                                                                             ` <BN6PR12MB1348A8E9B5AAC0A1DC66B2E8E8950-/b2+HYfkarQX0pEhCR5T8QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-23 17:10                                                                                               ` Sagalovitch, Serguei
     [not found]                                                                                                 ` <SN1PR12MB070348C8435374C0C463E0FDFE950-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2016-12-23 17:20                                                                                                   ` Bridgman, John
2016-12-23 18:18                                                                               ` Pierre-Loup A. Griffais
     [not found]                                                                                 ` <b853e4e3-0ba5-2bda-e129-d9253e7b098d-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
2016-12-23 22:20                                                                                   ` Andres Rodriguez
     [not found]                                                                                     ` <CAFQ_0eHg=Kf5qV50cgm51m6bTcMYdkgRXkT-sykJnYNzu3Zzsg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-26  2:26                                                                                       ` zhoucm1
     [not found]                                                                                         ` <58607FDF.2080200-5C7GfCeVMHo@public.gmane.org>
2017-01-02 15:43                                                                                           ` Christian König
2017-01-02 14:09                                                                                   ` Christian König
2016-12-19 14:37                           ` Serguei Sagalovitch
2016-12-19 19:29   ` Andres Rodriguez
2016-12-16 23:24 Andres Rodriguez
2016-12-19 15:49 Pierre-Loup Griffais

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.