On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais <pgriffais-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org> wrote:

I hate to keep bringing up display topics in an unrelated conversation, but I'm not sure where you got "Application -> X server -> compositor -> X server" from. As I was saying before, we need to present directly to the HMD display with no display server in the way, for both latency and quality-of-service reasons (a buggy application cannot be allowed to accidentally display undistorted rendering on the HMD). We intend to do the necessary work for this, and the extent of X's (or a Wayland implementation's, or any other display server's) involvement will be to participate just enough to know that the HMD display is off-limits. If you have more questions on the display aspect, or VR rendering in general, I'm happy to try to address them out-of-band from this conversation.

On 12/23/2016 02:54 AM, Christian König wrote:

> But yes, in general you don't want another compositor in the way, so
> we'll be acquiring the HMD display directly, separate from any desktop
> or display server.

Assuming that the HMD is attached to the rendering device in some
way, you have the X server and the compositor both trying to be DRM
master at the same time.

Please correct me if that was fixed in the meantime, but that sounds
like it will simply not work. Or is this what Andres mentions below
that Dave is working on?

In addition to that, a compositor in combination with X is a bit
counterproductive when you want to keep the latency low.

E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
data to be displayed is from the Application -> X server -> compositor
-> X server.

The extra step between X server and compositor just means extra latency
and for this use case you probably don't want that.

Targeting something like Wayland, with XWayland when you need X
compatibility, sounds like the much better idea.

Regards,
Christian.

On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:

Display concerns are a separate issue, and as Andres said we have
other plans to address them. But yes, in general you don't want another
compositor in the way, so we'll be acquiring the HMD display directly,
separate from any desktop or display server. Same with security, we
can have a separate conversation about that when the time comes.

On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:

Andres,

Did you measure the impact on latency, etc. of __any__ compositor?

My understanding is that VR has pretty strict requirements related to
QoS.

Sincerely yours,
Serguei Sagalovitch

On 2016-12-22 11:35 AM, Andres Rodriguez wrote:

Hey Christian,

We are currently interested in X, but with some distros switching to
other compositors by default, we also need to consider those.

We agree, running the full vrcompositor as root isn't something that
we want to do. Too many security concerns. Having a small root helper
that does the privilege escalation for us is the initial idea.

For a long term approach, Pierre-Loup and Dave are working on dealing
with the "two compositors" scenario a little better in DRM+X.
Fullscreen isn't really a sufficient approach, since we don't want the
HMD to be used as part of the Desktop environment when a VR app is not
in use (this is extremely annoying).

When the above is settled, we should have an auth mechanism besides
DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
HMD permanently away from X. Re-using that auth method to gate this
IOCTL is probably going to be the final solution.

I propose to start with ROOT_ONLY since it should allow us to respect
kernel IOCTL compatibility guidelines with the most flexibility. Going
from a restrictive to a more flexible permission model would be
inclusive, but going from a general to a restrictive model may exclude
some apps that used to work.

Regards,
Andres

On 12/22/2016 6:42 AM, Christian König wrote:

Hi Andres,

Well, using root might cause stability and security problems as well.
We worked quite hard to avoid exactly this for X.

We could make this feature depend on the compositor being DRM master,
but for example with X the X server is master (and e.g. can change
resolutions etc..) and not the compositor.

So another question is also what windowing system (if any) are you
planning to use? X, Wayland, Flinger or something completely
different?

Regards,
Christian.

On 20.12.2016 at 16:51, Andres Rodriguez wrote:

Hi Christian,

That is definitely a concern. What we are currently thinking is to
make the high priority queues accessible to root only.

Therefore, if a non-root user attempts to set the high priority flag
on context allocation, we would fail the call and return ENOPERM.
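
To make the check concrete, here is a minimal Python model of it (not
kernel code). The flag value and the function name are hypothetical,
and since "ENOPERM" is not a standard errno name the sketch assumes
EPERM was meant:

```python
import errno

# Hypothetical flag value, for illustration only.
CTX_FLAG_HIGH_PRIORITY = 1 << 0


def check_ctx_alloc(flags: int, euid: int) -> int:
    """Return 0 if the context may be created, or a negative errno."""
    wants_high = bool(flags & CTX_FLAG_HIGH_PRIORITY)
    if wants_high and euid != 0:
        # Assumption: EPERM stands in for the "ENOPERM" mentioned above.
        return -errno.EPERM
    return 0
```

A non-root caller that leaves the flag clear is unaffected, so existing
applications would keep working under this model.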

Regards,
Andres

On 12/20/2016 7:56 AM, Christian König wrote:

> BTW: If there is a non-VR application that uses the high-priority
> h/w queue, then the VR application will suffer. Any ideas how
> to solve it?

Yeah, that problem came to my mind as well.

Basically we need to restrict those high priority submissions to
the VR compositor or otherwise any malfunctioning application could
use it.

Just think about some WebGL suddenly taking all our rendering away
and we won't get anything drawn any more.

Alex or Michel any ideas on that?

Regards,
Christian.

On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:

> If compute queue is occupied only by you, the efficiency
> is equal with setting job queue to high priority I think.
The only risk is the situation where graphics takes all the
needed CUs. But in any case it should be a very good test.

Andres/Pierre-Loup,

Did you try it, or is it a lot of work for you?

BTW: If there is a non-VR application that uses the high-priority
h/w queue, then the VR application will suffer. Any ideas how
to solve it?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-19 12:50 AM, zhoucm1 wrote:

Do you encounter the priority issue for the compute queue with
the current driver?

If the compute queue is occupied only by you, the efficiency is equal
to setting the job queue to high priority, I think.

Regards,
David Zhou

On 2016-12-19 13:29, Andres Rodriguez wrote:

Yes, Vulkan is available on all-open through the Mesa radv UMD.

I'm not sure if I'm asking for too much, but if we can
coordinate a similar interface in radv and amdgpu-pro at the
Vulkan level that would be great.

I'm not sure what that's going to be yet.

- Andres

On 12/19/2016 12:11 AM, zhoucm1 wrote:

On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:

We're currently working with the open stack; I assume that a
mechanism could be exposed by both open and Pro Vulkan
userspace drivers and that the amdgpu kernel interface
improvements we would pursue following this discussion would
let both drivers take advantage of the feature, correct?

Of course.
Does the open stack have Vulkan support?

Regards,
David Zhou

On 12/18/2016 07:26 PM, zhoucm1 wrote:

By the way, are you using the all-open driver or the amdgpu-pro
driver?

+David Mao, who is working on our Vulkan driver.

Regards,
David Zhou

On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:

Hi Serguei,

I'm also working on bringing up our VR runtime on top of amdgpu;
see replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:

Andres,

For current VR workloads we have 3 separate processes running
actually:

So we could have a potential memory overcommit case, or do you do
partitioning on your own? I would think that there is a need to avoid
overcommit in the VR case to prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up
prioritized CPU scheduling for its VR compositor, we're working on
prioritized GPU scheduling and pre-emption (e.g. this thread), and in
the future it will make sense to do work in order to make sure that
its memory allocations do not get evicted, to prevent any unwelcome
additional latency in the event of needing to perform just-in-time
reprojection.

BTW: Do you mean __real__ processes or threads?
Based on my understanding, sharing BOs between different processes
could introduce additional synchronization constraints. BTW: I am not
sure if we are able to share Vulkan sync objects across process
boundaries.

They are different processes; it is important for the compositor that
is responsible for quality-of-service features such as consistently
presenting distorted frames with the right latency, reprojection, etc.,
to be separate from the main application.

Currently we are using unreleased cross-process memory and semaphore
extensions to fetch updated eye images from the client application,
but the just-in-time reprojection discussed here does not actually
have any direct interactions with cross-process resource sharing,
since it's achieved by using whatever is the latest, most up-to-date
eye images that have already been sent by the client application,
which are already available to use without additional synchronization.

3) System compositor (we are looking at approaches to remove this
overhead)

Yes, IMHO the best is to run in "full screen mode".

Yes, we are working on mechanisms to present directly to the headset
display without any intermediaries as a separate effort.

The latency is our main concern,

I would assume that this is a known problem (at least for compute
usage). It looks like amdgpu / kernel submission is rather CPU
intensive (at least in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue. However,
if there are high degrees of variance then that would be troublesome
and we would need to account for the worst case.

Hopefully the requirements and approach we described make sense, we're
looking forward to your feedback and suggestions.

Thanks!
- Pierre-Loup

Sincerely yours,
Serguei Sagalovitch

From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
Sent: December 16, 2016 10:00 PM
To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu

Hey Serguei,

[Serguei] No. I mean pipe :-) as MEC defines it. As far as I
understand (by simplifying), some scheduling is per pipe. I know about
the current allocation scheme but I do not think that it is ideal. I
would assume that we need to switch to dynamic partitioning of
resources based on the workload, otherwise we will have resource
conflicts between Vulkan compute and OpenCL.

I agree the partitioning isn't ideal. I'm hoping we can start with a
solution that assumes that only pipe0 has any work and the other pipes
are idle (no HSA/ROCm running on the system).

This should be more or less the use case we expect from VR users.

I agree the split is currently not ideal, but I'd like to consider
that a separate task, because making it dynamic is not
straightforward :P

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
not be involved. I would assume that in the case of VR we will have
one main application ("console" mode(?)) so we could temporarily
"ignore" OpenCL/ROCm needs when VR is running.

Correct, this is why we want to enable the high priority compute queue
through libdrm-amdgpu, so that we can expose it through Vulkan later.

For current VR workloads we have 3 separate processes running
actually:
1) Game process
2) VR Compositor (this is the process that will require the high
   priority queue)
3) System compositor (we are looking at approaches to remove this
   overhead)

For now I think it is okay to assume no OpenCL/ROCm running
simultaneously, but I would also like to be able to address this case
in the future (cross-pipe priorities).

[Serguei] The problem with pre-emption of graphics tasks: (a) it may
take time so latency may suffer

The latency is our main concern, we want something that is
predictable. A good illustration of what the reprojection scheduling
looks like can be found here:
https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png

(b) to preempt we need to have a different "context" - we want to
guarantee that submissions from the same context will be executed in
order.

This is okay, as the reprojection work doesn't have dependencies on
the game context, and it even happens in a separate process.

BTW: (a) Do you want "preempt" and later resume, or do you want
"preempt" and "cancel/abort"?

Preempt the game with the compositor task and then resume it.

(b) Vulkan is a generic API and could be used for graphics as well as
for plain compute tasks (VK_QUEUE_COMPUTE_BIT).

Yeah, the plan is to use Vulkan compute. But if you figure out a way
for us to get a guaranteed execution time using Vulkan graphics, then
I'll take you out for a beer :)

Regards,
Andres
________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
Sent: Friday, December 16, 2016 9:13 PM
To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Andres,

Please see inline (as [Serguei])

Sincerely yours,
Serguei Sagalovitch

From: Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
Sent: December 16, 2016 8:29 PM
To: Sagalovitch, Serguei; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Serguei,

Thanks for the feedback. Answers inline as [AR].

Regards,
Andres

________________________________________
From: Sagalovitch, Serguei [Serguei.Sagalovitch-5C7GfCeVMHo@public.gmane.org]
Sent: Friday, December 16, 2016 8:15 PM
To: Andres Rodriguez; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu

Andres,

Quick comments:

1) To minimize "bubbles", etc. we need to "force" CU
assignments/binding to the high-priority queue when it is in use and
"free" them later (we do not want to take CUs away from e.g. a graphics
task forever and degrade graphics performance).

Otherwise we could have a scenario where a long graphics task (or
low-priority compute) takes all the (extra) CUs and the high-priority
queue waits for the resources it needs. This will not be visible with
"NOP" packets, only when you submit a "real" compute task, so I would
recommend not using "NOP" packets at all for testing.

It (CU assignment) could be done relatively easily when everything is
going via the kernel (e.g. as part of frame submission), but I must
admit that I am not sure about the best way for user level submissions
(amdkfd).

[AR] I wasn't aware of this part of the programming sequence. Thanks
for the heads up!
Is this similar to the CU masking programming?
[Serguei] Yes. To simplify: the problem is that the "scheduler", when
deciding which queue to run, will check if there are enough resources,
and if not it will begin to check other queues with lower priority.
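
The fallback behaviour described above can be modelled in a few lines
of Python. This is only a sketch of the idea, not how the hardware
scheduler is implemented; the Queue type, its fields and the "larger
number means higher priority" convention are all assumptions made for
the example:

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    priority: int      # larger means more important (assumed convention)
    cus_needed: int    # compute units the queue's next job needs


def pick_queue(queues, free_cus):
    """Pick the highest-priority queue whose next job fits in free_cus."""
    for q in sorted(queues, key=lambda q: q.priority, reverse=True):
        if q.cus_needed <= free_cus:
            return q
    # Nothing fits, so nothing is launched this round.
    return None
```

When too few CUs are free for the high-priority job, the model falls
through to a lower-priority queue, which is the case that reserving
CUs for the high-priority queue is meant to avoid.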

2) I would recommend dedicating the whole pipe to the high-priority
queue and having nothing there except it.

[AR] I'm guessing in this context you mean pipe = queue? (as opposed
to the MEC definition of pipe, which is a grouping of queues). I say
this because amdgpu only has access to 1 pipe, and the rest are
statically partitioned for amdkfd usage.

[Serguei] No. I mean pipe :-) as MEC defines it. As far as I
understand (by simplifying), some scheduling is per pipe. I know about
the current allocation scheme but I do not think that it is ideal. I
would assume that we need to switch to dynamic partitioning of
resources based on the workload, otherwise we will have resource
conflicts between Vulkan compute and OpenCL.

BTW: Which user level API do you want to use for compute: Vulkan or
OpenCL?

[AR] Vulkan

[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
not be involved. I would assume that in the case of VR we will have
one main application ("console" mode(?)) so we could temporarily
"ignore" OpenCL/ROCm needs when VR is running.

we will not be able to provide a solution compatible with GFX
workloads.

I assume that you are talking about graphics? Am I right?

[AR] Yeah, my understanding is that pre-empting the currently running
graphics job and scheduling in something else using mid-buffer
pre-emption has some cases where it doesn't work well. But if it
starts working well with polaris10, it might be a better solution for
us (because the whole reprojection work uses the Vulkan graphics stack
at the moment, and porting it to compute is not trivial).

[Serguei] The problem with pre-emption of graphics tasks: (a) it may
take time so latency may suffer (b) to preempt we need to have a
different "context" - we want to guarantee that submissions from the
same context will be executed in order.
BTW: (a) Do you want "preempt" and later resume, or do you want
"preempt" and "cancel/abort"? (b) Vulkan is a generic API and could be
used for graphics as well as for plain compute tasks
(VK_QUEUE_COMPUTE_BIT).

Sincerely yours,
Serguei Sagalovitch

From: amd-gfx <amd-gfx-bounces-PD4FTy7X32mqWrfYKbYh0A@public.gmane.org> on behalf of Andres Rodriguez <andresr-38hxoXRICFZx67MzidHQgQC/G2K4zDHf@public.gmane.org>
Sent: December 16, 2016 6:15 PM
To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: [RFC] Mechanism for high priority scheduling in amdgpu

Hi Everyone,

This RFC is also available as a gist here:
https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249

[RFC] Mechanism for high priority scheduling in amdgpu
gist.github.com
[RFC] Mechanism for high priority scheduling in amdgpu

We are interested in feedback for a mechanism to effectively schedule
high priority VR reprojection tasks (also referred to as time-warping)
for Polaris10 running on the amdgpu kernel driver.

Brief context:
--------------

The main objective of reprojection is to avoid motion sickness for VR
users in scenarios where the game or application would fail to finish
rendering a new frame in time for the next VBLANK. When this happens,
the user's head movements are not reflected on the Head Mounted
Display (HMD) for the duration of an extra frame. This extended
mismatch between the inner ear and the eyes may cause the user to
experience motion sickness.

The VR compositor deals with this problem by fabricating a new frame
using the user's updated head position in combination with the
previous frames. This avoids a prolonged mismatch between the HMD
output and the inner ear.

Because of the adverse effects on the user, we require high confidence
that the reprojection task will complete before the VBLANK interval,
even if the GFX pipe is currently full of work from the
game/application (which is most likely the case).

For more details and illustrations, please refer to the following
document:
https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Gaming: Asynchronous Shaders Evolved | Community
community.amd.com
One of the most exciting new developments in GPU technology
over the
past year has been the adoption of asynchronous shaders,
which can
make more efficient use of ...

Requirements:
-------------

The mechanism must expose the following functionality:

* Job round trip time must be predictable, from submission to
  fence signal

* The mechanism must support compute workloads.

Goals:
------

* The mechanism should provide low submission latencies

Test: submitting a NOP packet through the mechanism on busy hardware
should be equivalent to submitting a NOP on idle hardware.
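
One way such a test could decide "equivalent" is sketched below in
Python. The function name, the use of worst-case samples and the 10%
default tolerance are assumptions for illustration; how the round-trip
times are actually collected is left out:

```python
def nop_latency_ok(idle_us, busy_us, tolerance=0.10):
    """Compare worst-case NOP round-trip times (non-empty sample lists).

    Returns True if the worst busy-hardware sample is within
    `tolerance` (a fraction) of the worst idle-hardware sample.
    """
    worst_idle = max(idle_us)
    worst_busy = max(busy_us)
    return worst_busy <= worst_idle * (1.0 + tolerance)
```

Comparing worst cases rather than averages matches the predictability
requirement above, since a single late submission is what would make
reprojection miss its deadline.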

Nice to have:
-------------

* The mechanism should also support GFX workloads.

My understanding is that with the current hardware capabilities in
Polaris10 we will not be able to provide a solution compatible with
GFX workloads.

But I would love to hear otherwise. So if anyone has an idea, approach
or suggestion that will also be compatible with the GFX ring, please
let us know about it.

* The above guarantees should also be respected by amdkfd workloads

Would be good to have for consistency, but not strictly necessary as
users running games are not traditionally running HPC workloads in the
background.

Proposed approach:
------------------

Similar to the Windows driver, we could expose a high priority compute
queue to userspace.

Submissions to this compute queue will be scheduled with high
priority, and may acquire hardware resources previously in use by
other queues.

This can be achieved by taking advantage of the 'priority' field in
the HQDs and could be programmed by amdgpu or the amdgpu scheduler.
The relevant register fields are:
* mmCP_HQD_PIPE_PRIORITY
* mmCP_HQD_QUEUE_PRIORITY

Implementation approach 1 - static partitioning:
------------------------------------------------

The amdgpu driver currently controls 8 compute queues from pipe0. We
can statically partition these as follows:
* 7x regular
* 1x high priority

The relevant priorities can be set so that submissions to the high
priority ring will starve the other compute rings and the GFX ring.

The amdgpu scheduler will only place jobs into the high priority ring
if the context is marked as high priority. A corresponding priority
should be added to keep track of this information:
* AMD_SCHED_PRIORITY_KERNEL
* -> AMD_SCHED_PRIORITY_HIGH
* AMD_SCHED_PRIORITY_NORMAL
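
As a rough illustration of the static split, the ring selection could
be modelled like this in Python. The enum values, the choice of ring 7
as the reserved ring and the round-robin policy for regular contexts
are assumptions for the sketch, not what the driver does:

```python
from enum import IntEnum


class SchedPriority(IntEnum):
    # Names mirror the list above; the numeric values are assumptions.
    KERNEL = 0
    HIGH = 1
    NORMAL = 2


NUM_COMPUTE_RINGS = 8      # compute queues amdgpu controls on pipe0
HIGH_PRIORITY_RING = 7     # assumed: the last ring is the reserved one
REGULAR_RINGS = list(range(NUM_COMPUTE_RINGS - 1))


def select_ring(ctx_priority, submission_index):
    """Map a context's priority to a compute ring index."""
    if ctx_priority == SchedPriority.HIGH:
        return HIGH_PRIORITY_RING
    # Everything else round-robins over the seven regular rings.
    return REGULAR_RINGS[submission_index % len(REGULAR_RINGS)]
```

Only contexts marked HIGH ever reach the reserved ring, so regular
submissions cannot queue up in front of the reprojection work there.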

The user will request a high priority context by setting an
appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
similar):
https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163

The setting is at a per-context level so that we can:
* Maintain a consistent FIFO ordering of all submissions to a
  context
* Create high priority and non-high priority contexts in the same
  process

Implementation approach 2 - dynamic priority programming:
---------------------------------------------------------

Similar to the above, but instead of programming the priorities at
amdgpu_init() time, the SW scheduler will reprogram the queue
priorities dynamically when scheduling a task.

This would involve having a hardware specific callback from the
scheduler to set the appropriate queue priority:
set_priority(int ring, int index, int priority)

During this callback we would have to grab the SRBM mutex to perform
the appropriate HW programming, and I'm not really sure if that is
something we should be doing from the scheduler.

On the positive side, this approach would allow us to program a range
of priorities for jobs instead of a single "high priority" value,
achieving something similar to the niceness API available for CPU
scheduling.

I'm not sure if this flexibility is something that we would need for
our use case, but it might be useful in other scenarios (multiple
users sharing compute time on a server).

This approach would require a new int field in drm_amdgpu_ctx_in, or
repurposing of the flags field.
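
The shape of the callback can be sketched in Python as follows. FakeHw
and schedule_job are hypothetical stand-ins, the lock only models the
SRBM mutex, and a dictionary takes the place of the register writes:

```python
import threading


class FakeHw:
    """Stand-in for the hardware-specific side; not the amdgpu code."""

    def __init__(self):
        self.srbm_mutex = threading.Lock()   # models the SRBM mutex
        self.queue_priority = {}             # (ring, index) -> priority

    def set_priority(self, ring, index, priority):
        # Register programming would happen here, under the mutex.
        with self.srbm_mutex:
            self.queue_priority[(ring, index)] = priority


def schedule_job(hw, ring, index, ctx_priority):
    """Reprogram the queue priority right before running a job."""
    hw.set_priority(ring, index, ctx_priority)
    return hw.queue_priority[(ring, index)]
```

The sketch makes the cost visible: every scheduling decision takes the
mutex once, which is the part that may not belong in the scheduler
path.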

Known current obstacles:
------------------------

The SQ is currently programmed to disregard the HQD priorities, and
instead it picks jobs at random. Settings from the shader itself are
also disregarded as this is considered a privileged field.

Effectively we can get our compute wavefront launched ASAP, but we
might not get the time we need on the SQ.

The current programming would have to be changed to allow priority
propagation from the HQD into the SQ.

Generic approach for all HW IPs:
--------------------------------

For consistency purposes, the high priority context can be enabled for
all HW IPs with support of the SW scheduler. This will function
similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where the
job can jump ahead of anything not committed to the HW queue.
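
A minimal Python model of that "jump ahead" behaviour is shown below.
The SwScheduler class and its methods are made up for the example; the
only things taken from the proposal are the three priority names and
their order:

```python
from collections import deque

# Highest priority first, following the ordering proposed above.
PRIORITIES = ("KERNEL", "HIGH", "NORMAL")


class SwScheduler:
    def __init__(self):
        self.run_queues = {p: deque() for p in PRIORITIES}
        self.hw_queue = []   # jobs already committed to the hardware

    def submit(self, job, priority="NORMAL"):
        self.run_queues[priority].append(job)

    def commit_next(self):
        """Commit the next job; already committed jobs are never reordered."""
        for p in PRIORITIES:
            if self.run_queues[p]:
                job = self.run_queues[p].popleft()
                self.hw_queue.append(job)
                return job
        return None
```

A HIGH job submitted late still overtakes NORMAL jobs that are waiting
in software, but it has to wait behind whatever is already in
hw_queue.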

The benefits of requesting a high priority context for a non-compute
queue will be smaller (e.g. up to 10s of wait time if a GFX command is
stuck in front of you), but having the API in place will allow us to
easily improve the implementation in the future as new features become
available in new hardware.

Future steps:
-------------

Once we have an approach settled, I can take care of the
implementation.

Also, once the interface is mostly decided, we can start
thinking about
exposing the high priority queue through radv.

Request for feedback:
---------------------

We aren't married to any of the approaches outlined above. Our goal is
to obtain a mechanism that will allow us to complete the reprojection
job within a predictable amount of time. So if anyone has any
suggestions for improvements or alternative strategies we are more
than happy to hear them.

If any of the technical information above is incorrect, feel free to
point out my misunderstandings.

Looking forward to hearing from you.

Regards,
Andres

_______________________________________________
amd-gfx mailing list
amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

amd-gfx Info Page - lists.freedesktop.org
lists.freedesktop.org
To see the collection of prior postings to the list,
visit the
amd-gfx Archives. Using amd-gfx: To post a message to all
the list
members, send email ...
