* A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
@ 2012-08-30 17:31 Aaron Plattner
  2012-08-30 17:34 ` [Linaro-mm-sig] " Aaron Plattner
  2012-09-01  3:00 ` Dave Airlie
  0 siblings, 2 replies; 6+ messages in thread
From: Aaron Plattner @ 2012-08-30 17:31 UTC (permalink / raw)
  To: linaro-mm-sig, dri-devel

So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider
object interface, so that Optimus-based laptops can use our driver to drive the
discrete GPU and display on the integrated GPU.  The good news is that I've got
a proof of concept working.

During a review of the current code, we came up with a few concerns:

1. The output source is responsible for allocating the shared memory

Right now, the X server calls CreatePixmap on the output source screen and then
expects the output sink screen to be able to display from whatever memory the
source allocates.  The source has no mechanism for asking the sink what its
requirements are for the surface.  I'm using our own internal pitch alignment
requirements and that seems to be good enough for the Intel device to scan out,
but that could be pure luck.

Does it make sense to add a mechanism for drivers to negotiate this with each
other, or is it sufficient to just define a lowest common denominator format and
if your hardware can't deal with that format, you just don't get to share
buffers?

One of my coworkers pointed out that Tegra requires a specific pitch alignment
and cannot accommodate larger pitches.  If other SoC designs have similar
restrictions, we might need to add a handshake mechanism.
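
To make the pitch concern concrete, here's a sketch of the kind of mismatch a
handshake would have to catch.  The alignment values below are made up for
illustration; they aren't any driver's real requirements:

    #define ALIGN(x, a)  (((x) + (a) - 1) & ~((a) - 1))

    unsigned int pitch_mismatch_example(void)
    {
        unsigned int width = 1366, cpp = 4;                /* 1366x768, XRGB8888   */
        unsigned int src_pitch  = ALIGN(width * cpp, 256); /* 5632: source's pick  */
        unsigned int sink_pitch = ALIGN(width * cpp, 64);  /* 5504: sink's maximum */

        /* If the sink can't accommodate the larger pitch, the shared
         * surface can't be scanned out, and only a handshake would catch
         * that before the modeset. */
        return src_pitch - sink_pitch;
    }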

2. There's no fallback mechanism if sharing can't be negotiated

If RandR fails to share a pixmap with the output sink screen, the whole modeset
fails.  This means you'll end up not seeing anything on the screen and you'll
probably think your computer locked up.  Should there be some sort of software
copy fallback to ensure that something at least shows up on the display?

3. How should the memory be allocated?

In the prototype I threw together, I'm allocating the shared memory using
shm_open and then exporting that as a dma-buf file descriptor using an ioctl I
added to the kernel, and then importing that memory back into our driver through
dma_buf_attach & dma_buf_map_attachment.  Does it make sense for user-space
programs to be able to export shmfs files like that?  Should that interface go
in DRM / GEM / PRIME instead?  Something else?  I'm pretty unfamiliar with this
kernel code so any suggestions would be appreciated.
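
For reference, the import half of that prototype looks roughly like this.
This is a minimal sketch with error unwinding trimmed; the shmfs-export ioctl
is the part I added and isn't shown:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/err.h>

    /* Import a dma-buf fd (e.g. the one exported from the shmfs file) and
     * map it for our device. */
    static struct sg_table *import_buf(int fd, struct device *dev,
                                       struct dma_buf_attachment **att_out)
    {
        struct dma_buf *buf = dma_buf_get(fd);
        struct dma_buf_attachment *att;

        if (IS_ERR(buf))
            return ERR_CAST(buf);

        att = dma_buf_attach(buf, dev);
        if (IS_ERR(att)) {
            dma_buf_put(buf);
            return ERR_CAST(att);
        }

        *att_out = att;
        return dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
    }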

-- Aaron


P.S. for those unfamiliar with PRIME:
Dave Airlie added support to the X Resize and Rotate extension version 1.4
for offloading display and rendering to different drivers.  PRIME is the
kernel-side DRM support, layered on top of DMA-BUF, that implements the
actual sharing of buffers between drivers.

http://cgit.freedesktop.org/xorg/proto/randrproto/tree/randrproto.txt?id=randrproto-1.4.0#n122
http://airlied.livejournal.com/75555.html - update on hotplug server
http://airlied.livejournal.com/76078.html - randr 1.5 demo videos



* Re: [Linaro-mm-sig] A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
  2012-08-30 17:31 A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing Aaron Plattner
@ 2012-08-30 17:34 ` Aaron Plattner
  2012-09-01 11:28   ` Daniel Vetter
  2012-09-01  3:00 ` Dave Airlie
  1 sibling, 1 reply; 6+ messages in thread
From: Aaron Plattner @ 2012-08-30 17:34 UTC (permalink / raw)
  To: dri-devel

On 08/30/2012 10:31 AM, Aaron Plattner wrote:
> So I've been experimenting with support for Dave Airlie's new RandR 1.4 provider
> object interface, so that Optimus-based laptops can use our driver to drive the
> discrete GPU and display on the integrated GPU.  The good news is that I've got
> a proof of concept working.
>
> During a review of the current code, we came up with a few concerns:
>
> 1. The output source is responsible for allocating the shared memory
>
> Right now, the X server calls CreatePixmap on the output source screen and then
> expects the output sink screen to be able to display from whatever memory the
> source allocates.  The source has no mechanism for asking the sink what its
> requirements are for the surface.  I'm using our own internal pitch alignment
> requirements and that seems to be good enough for the Intel device to scan out,
> but that could be pure luck.
>
> Does it make sense to add a mechanism for drivers to negotiate this with each
> other, or is it sufficient to just define a lowest common denominator format and
> if your hardware can't deal with that format, you just don't get to share
> buffers?
>
> One of my coworkers pointed out that Tegra requires a specific pitch alignment
> and cannot accommodate larger pitches.  If other SoC designs have similar
> restrictions, we might need to add a handshake mechanism.
>
> 2. There's no fallback mechanism if sharing can't be negotiated
>
> If RandR fails to share a pixmap with the output sink screen, the whole modeset
> fails.  This means you'll end up not seeing anything on the screen and you'll
> probably think your computer locked up.  Should there be some sort of software
> copy fallback to ensure that something at least shows up on the display?
>
> 3. How should the memory be allocated?
>
> In the prototype I threw together, I'm allocating the shared memory using
> shm_open and then exporting that as a dma-buf file descriptor using an ioctl I
> added to the kernel, and then importing that memory back into our driver through
> dma_buf_attach & dma_buf_map_attachment.  Does it make sense for user-space
> programs to be able to export shmfs files like that?  Should that interface go
> in DRM / GEM / PRIME instead?  Something else?  I'm pretty unfamiliar with this
> kernel code so any suggestions would be appreciated.

There's also a #4 that didn't seem worth cross-posting to linaro-mm-sig:

4. There's no mechanism for double buffering the output sink

RandR allocates one pixmap on the output source screen and sets up tracking so
the source driver can copy the screen into the shared pixmap.  However, the sink 
driver scans out from the shared pixmap directly.  There's no mechanism to
prevent tearing on the sink side of the pipeline.

It seems like it would be nice if the source could trigger the sink device to
flip between front and back buffers when the copy is finished, and get back a
fence to indicate when the flip has occurred and the source can start the next copy.
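
In rough pseudocode, the handshake I'm imagining would look like this; every
name here is hypothetical, since no such interface exists yet:

    /* Hypothetical source-side loop; none of these calls exist today. */
    for (;;) {
        copy_screen_to(shared_back);        /* source blits the screen     */
        fence = sink_flip_to(shared_back);  /* ask the sink to flip        */
        wait_on(fence);                     /* flip done on the sink side  */
        swap(&shared_front, &shared_back);  /* safe to start the next copy */
    }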

-- Aaron


* Re: A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
  2012-08-30 17:31 A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing Aaron Plattner
  2012-08-30 17:34 ` [Linaro-mm-sig] " Aaron Plattner
@ 2012-09-01  3:00 ` Dave Airlie
  2012-09-04 20:57   ` Aaron Plattner
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Airlie @ 2012-09-01  3:00 UTC (permalink / raw)
  To: Aaron Plattner; +Cc: linaro-mm-sig, dri-devel

> object interface, so that Optimus-based laptops can use our driver to drive
> the discrete GPU and display on the integrated GPU.  The good news is that
> I've got a proof of concept working.

Don't suppose you'll be interested in adding the other method at some point
as well? Since saving power is probably important to a lot of people :-)

>
> During a review of the current code, we came up with a few concerns:
>
> 1. The output source is responsible for allocating the shared memory
>
> Right now, the X server calls CreatePixmap on the output source screen and
> then expects the output sink screen to be able to display from whatever
> memory the source allocates.  The source has no mechanism for asking the
> sink what its requirements are for the surface.  I'm using our own internal
> pitch alignment requirements and that seems to be good enough for the Intel
> device to scan out, but that could be pure luck.

Well, in theory it might be nice, but it would have been premature: so far
the only interactions for prime are combinations of intel, nvidia and AMD,
and I think everyone has fairly similar pitch alignment requirements. I'd be
interested in adding such an interface, but I don't think it's something I
personally would be working on.

> other, or is it sufficient to just define a lowest common denominator format
> and if your hardware can't deal with that format, you just don't get to
> share buffers?

At the moment I'm happy to just go with linear, minimum pitch alignment 64 or
something as a base standard, but yeah, I'm happy for it to work either way;
I just don't have enough evidence that it's worth it yet. I've not looked at
ARM stuff, so patches welcome if people consider they need this for SoC
devices.

> 2. There's no fallback mechanism if sharing can't be negotiated
>
> If RandR fails to share a pixmap with the output sink screen, the whole
> modeset fails.  This means you'll end up not seeing anything on the screen
> and you'll probably think your computer locked up.  Should there be some
> sort of software copy fallback to ensure that something at least shows up
> on the display?

Uggh, it would be fairly slow and unusable; I'd rather they saw nothing. But
again, I'm open to suggestions on how to make this work, since it might fail
for other reasons, and in that case there is still nothing a sw copy can do.
What happens if the slave intel device just fails to allocate a pixmap?  But
yeah, I'm willing to think about it a bit more when we have some reference
implementations.
>
> 3. How should the memory be allocated?
>
> In the prototype I threw together, I'm allocating the shared memory using
> shm_open and then exporting that as a dma-buf file descriptor using an ioctl
> I added to the kernel, and then importing that memory back into our driver
> through dma_buf_attach & dma_buf_map_attachment.  Does it make sense for
> user-space programs to be able to export shmfs files like that?  Should that
> interface go in DRM / GEM / PRIME instead?  Something else?  I'm pretty
> unfamiliar with this kernel code so any suggestions would be appreciated.

Your kernel driver should in theory be doing it all: if you allocate shared
pixmaps in GTT-accessible memory, then you need an ioctl to tell your kernel
driver to export the dma-buf to an fd handle (assuming we get rid of the
_GPL, which people have mentioned they are open to doing). We have handle->fd
and fd->handle interfaces on DRM; you'd need something similar on the nvidia
kernel driver interface.
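
For reference, those DRM interfaces look roughly like this from userspace via
the libdrm wrappers (sketch only; error handling trimmed):

    #include <stdint.h>
    #include <xf86drm.h>

    /* GEM handle -> dma-buf fd, to hand across to the other driver. */
    int export_bo(int drm_fd, uint32_t handle)
    {
        int prime_fd = -1;
        drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC, &prime_fd);
        return prime_fd;
    }

    /* dma-buf fd -> GEM handle in the importing driver. */
    uint32_t import_bo(int drm_fd, int prime_fd)
    {
        uint32_t handle = 0;
        drmPrimeFDToHandle(drm_fd, prime_fd, &handle);
        return handle;
    }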

Yes, for 4, some sort of fencing is being worked on by Maarten for other
stuff, but it would be a prerequisite for doing this. Also, some devices,
like USB, don't want fullscreen updates, so doing flipped updates would have
to be optional or negotiated. It makes sense for us as well, since things
like gnome-shell can do full-screen pageflips and we have to do full-screen
dirty updates.

Dave.


* Re: [Linaro-mm-sig] A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
  2012-08-30 17:34 ` [Linaro-mm-sig] " Aaron Plattner
@ 2012-09-01 11:28   ` Daniel Vetter
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Vetter @ 2012-09-01 11:28 UTC (permalink / raw)
  To: Aaron Plattner; +Cc: dri-devel

On Thu, Aug 30, 2012 at 10:34:23AM -0700, Aaron Plattner wrote:
> 4. There's no mechanism for double buffering the output sink
> 
> RandR allocates one pixmap on the output source screen and sets up tracking
> so the source driver can copy the screen into the shared pixmap.  However,
> the sink driver scans out from the shared pixmap directly.  There's no
> mechanism to prevent tearing on the sink side of the pipeline.
> 
> It seems like it would be nice if the source could trigger the sink device
> to flip between front and back buffers when the copy is finished, and get
> back a fence to indicate when the flip has occurred and the source can
> start the next copy.

Dave already answered your other questions; I'll chip in with some more
details on the in-kernel fencing that Maarten Lankhorst is working on.
Current work-in-progress code is available at

http://cgit.freedesktop.org/~mlankhorst/linux/

The v10-wip branch is the latest, with some experimental code to port ttm
over to the new core reservation code. The fencing itself would be all
kernel-internal, with no explicit fence objects exposed to userspace: i.e.,
the driver doing the pageflip would just wait for any write fences (=
exclusive fences) before doing the pageflip, and then attach a read (=
shared) fence to the to-be-swapped-out fb that signals once the pageflip
completes (to avoid other gpus rendering into the new backbuffer while it's
still being displayed).
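
In pseudocode, the rule is roughly the following (the names are illustrative,
not the actual WIP interface):

    /* Illustrative pseudocode only; not the real reservation/fence API. */
    void pageflip_with_fences(struct fb *new_fb, struct fb *old_fb)
    {
        struct fence *f;

        wait_all(new_fb->exclusive_fences); /* writers must finish first */
        issue_flip(new_fb);

        f = fence_on_flip_complete();       /* signals when flip is done */
        attach_shared_fence(old_fb, f);     /* scanout keeps reading     */
                                            /* old_fb until f signals    */
    }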

People involved in the discussion of this are mostly Maarten, Rob Clark
(from TI) and me, and most of the discussion happens on #dri-devel, but at
least Maarten and I should be at XDC in Nürnberg. Comments/discussion of
this code, especially whether the fence/reservation code would suit you for
shared buffers, are highly welcome. I'll definitely annoy any nvidia guy who
shows up at XDC about this ;-)

The longer-term idea is to port all drm/* drivers over to the new
reservation framework (hence the experimental ttm port), which would allow
us to dynamically evict even shared buffers (which should be rather useful
on highly constrained SoC IOMMUs shared among a few graphics IP blocks).

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
  2012-09-01  3:00 ` Dave Airlie
@ 2012-09-04 20:57   ` Aaron Plattner
  2012-09-04 21:22     ` [Linaro-mm-sig] " Daniel Vetter
  0 siblings, 1 reply; 6+ messages in thread
From: Aaron Plattner @ 2012-09-04 20:57 UTC (permalink / raw)
  To: Dave Airlie; +Cc: linaro-mm-sig, dri-devel

On 08/31/2012 08:00 PM, Dave Airlie wrote:
>> object interface, so that Optimus-based laptops can use our driver to drive
>> the discrete GPU and display on the integrated GPU.  The good news is that
>> I've got a proof of concept working.
>
> Don't suppose you'll be interested in adding the other method at some point
> as well? Since saving power is probably important to a lot of people

That's milestone 2.  I'm focusing on display offload to start because it's
easier to implement and lays the groundwork for the kernel pieces.  I have to
emphasize that I'm just doing a feasibility study right now and I can't promise
that we're going to officially support this stuff.

>> During a review of the current code, we came up with a few concerns:
>>
>> 1. The output source is responsible for allocating the shared memory
>>
>> Right now, the X server calls CreatePixmap on the output source screen and
>> then expects the output sink screen to be able to display from whatever
>> memory the source allocates.  The source has no mechanism for asking the
>> sink what its requirements are for the surface.  I'm using our own internal
>> pitch alignment requirements and that seems to be good enough for the Intel
>> device to scan out, but that could be pure luck.
>
> Well, in theory it might be nice, but it would have been premature: so far
> the only interactions for prime are combinations of intel, nvidia and AMD,
> and I think everyone has fairly similar pitch alignment requirements. I'd
> be interested in adding such an interface, but I don't think it's something
> I personally would be working on.

Okay.  Hopefully that won't be too painful to add if we ever need it in the
future.

>> other, or is it sufficient to just define a lowest common denominator format
>> and if your hardware can't deal with that format, you just don't get to share
>> buffers?
>
> At the moment I'm happy to just go with linear, minimum pitch alignment 64 or

256, for us.

> something as a base standard, but yeah, I'm happy for it to work either
> way; I just don't have enough evidence that it's worth it yet. I've not
> looked at ARM stuff, so patches welcome if people consider they need this
> for SoC devices.

We can always hack it to whatever is necessary if we see that the sink-side
driver is Tegra, but I was hoping for something more general.

>> 2. There's no fallback mechanism if sharing can't be negotiated
>>
>> If RandR fails to share a pixmap with the output sink screen, the whole
>> modeset fails.  This means you'll end up not seeing anything on the screen and
>> you'll probably think your computer locked up.  Should there be some sort of
>> software copy fallback to ensure that something at least shows up on the
>> display?
>
> Uggh, it would be fairly slow and unusable; I'd rather they saw nothing. But
> again, I'm open to suggestions on how to make this work, since it might fail
> for other reasons, and in that case there is still nothing a sw copy can do.
> What happens if the slave intel device just fails to allocate a pixmap?  But
> yeah, I'm willing to think about it a bit more when we have some reference
> implementations.

Just rolling back the modeset operation to whatever was working before would be
a good start.

It's worse than that on my current laptop, though, since our driver sees a
phantom CRT output and we happily start driving pixels to it that end up going
nowhere.  I'll need to think about what the right behavior is there since I
don't know if we want to rely on an X client to make that configuration work.

>> 3. How should the memory be allocated?
>>
>> In the prototype I threw together, I'm allocating the shared memory using
>> shm_open and then exporting that as a dma-buf file descriptor using an ioctl I
>> added to the kernel, and then importing that memory back into our driver
>> through dma_buf_attach & dma_buf_map_attachment.  Does it make sense for
>> user-space programs to be able to export shmfs files like that?  Should that
>> interface go in DRM / GEM / PRIME instead?  Something else?  I'm pretty
>> unfamiliar with this kernel code so any suggestions would be appreciated.
>
> Your kernel driver should in theory be doing it all: if you allocate shared
> pixmaps in GTT-accessible memory, then you need an ioctl to tell your kernel
> driver to export the dma-buf to an fd handle (assuming we get rid of the
> _GPL, which people have mentioned they are open to doing). We have handle->fd
> and fd->handle interfaces on DRM; you'd need something similar on the nvidia
> kernel driver interface.

Okay, I can do that.  We already have a mechanism for importing buffers
allocated elsewhere so reusing that for shmfs and/or dma-buf seemed like a
natural extension.  I don't think adding a separate ioctl for exporting our own
allocations will add too much extra code.

> Yes, for 4, some sort of fencing is being worked on by Maarten for other
> stuff, but it would be a prerequisite for doing this. Also, some devices,
> like USB, don't want fullscreen updates, so doing flipped updates would
> have to be optional or negotiated. It makes sense for us as well, since
> things like gnome-shell can do full-screen pageflips and we have to do
> full-screen dirty updates.

Right now my implementation has two sources of tearing:

1. The dGPU reads the vidmem primary surface asynchronously from its own
    rendering to it.

2. The iGPU fetches the shared surface for display asynchronously from the dGPU
    writing into it.

#1 I can fix within our driver.  For #2, I don't want to rely on the dGPU being
able to push complete frames over the bus during vblank in response to an iGPU
fence trigger so I was thinking we would want double-buffering all the time.
Also, I was hoping to set up a proper flip chain between the dGPU, the dGPU's
DMA engine, and the Intel display engine so that for full-screen applications,
glXSwapBuffers is stalled properly without relying on the CPU to schedule
things.  Maybe that's overly ambitious for now?

-- Aaron


* Re: [Linaro-mm-sig] A few questions about the best way to implement RandR 1.4 / PRIME buffer sharing
  2012-09-04 20:57   ` Aaron Plattner
@ 2012-09-04 21:22     ` Daniel Vetter
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Vetter @ 2012-09-04 21:22 UTC (permalink / raw)
  To: Aaron Plattner; +Cc: linaro-mm-sig, dri-devel

On Tue, Sep 04, 2012 at 01:57:32PM -0700, Aaron Plattner wrote:
> On 08/31/2012 08:00 PM, Dave Airlie wrote:
> >Yes, for 4, some sort of fencing is being worked on by Maarten for other
> >stuff, but it would be a prerequisite for doing this. Also, some devices,
> >like USB, don't want fullscreen updates, so doing flipped updates would
> >have to be optional or negotiated. It makes sense for us as well, since
> >things like gnome-shell can do full-screen pageflips and we have to do
> >full-screen dirty updates.
> 
> Right now my implementation has two sources of tearing:
> 
> 1. The dGPU reads the vidmem primary surface asynchronously from its own
>    rendering to it.
> 
> 2. The iGPU fetches the shared surface for display asynchronously from the dGPU
>    writing into it.
> 
> #1 I can fix within our driver.  For #2, I don't want to rely on the dGPU being
> able to push complete frames over the bus during vblank in response to an iGPU
> fence trigger so I was thinking we would want double-buffering all the time.
> Also, I was hoping to set up a proper flip chain between the dGPU, the dGPU's
> DMA engine, and the Intel display engine so that for full-screen applications,
> glXSwapBuffers is stalled properly without relying on the CPU to schedule
> things.  Maybe that's overly ambitious for now?

For the frontbuffer tearing, Chris Wilson added a special mode to the SNA
intel driver that uses pageflips for all buffer updates (like windowed Xv or
DRI2 CopyBuffers), mostly because vsync'ed blits are busted on snb (and not
yet proven to be fixed on ivb). So we could use that mode for an optimus
platform.

Wrt the full flip chain, that's what Maarten Lankhorst has running in his
proof of concept (but only for a second or so, since nouveau is totally
busted on his machine). The only place he wakes up the cpu is to sync from
nv to intel, but even there we can kick off the intel gpu directly from the
nv irq handler (with a simple register write). intel -> nv sync uses
memory-based sequence numbers. It's only a proof of concept for rendering
though; IIRC the fence support isn't wired up with the pageflipping on the
intel side yet.
-Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


