Re: [RFC] Async flips

From: Mario Kleiner <mario.kleiner@tuebingen.mpg.de>
To: "Ville Syrjälä" <ville.syrjala@linux.intel.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [RFC] Async flips
Date: Mon, 12 Nov 2012 04:53:10 +0100
Message-ID: <50A072A6.6060000@tuebingen.mpg.de>
In-Reply-To: <20121102092938.GT3791@intel.com>

On 02.11.12 10:29, Ville Syrjälä wrote:
> On Fri, Nov 02, 2012 at 05:45:29AM +0100, Mario Kleiner wrote:
>>
>>
>> On 31.10.12 19:51, Ville Syrjälä wrote:
>>> On Wed, Oct 31, 2012 at 10:44:47AM -0700, Eric Anholt wrote:
>>>> Ville Syrjälä <ville.syrjala@linux.intel.com> writes:
>>>>
>>>>> On Tue, Oct 30, 2012 at 01:33:47PM -0500, Jesse Barnes wrote:
>>>>>> The hw supports async flips through the render ring, so why not expose it?
>>>>>> It gives us one more "tear me harder" option we can use in the DDX and
>>>>>> for other cases where simply flipping to the latest buffer is more
>>>>>> important than visual quality.
>>>>>
>>>>> The only reason I can see why anyone would really want async flips is
>>>>> when you're restricted to double buffering. With triple buffering you
>>>>> should be able to override the previous flip w/o tearing.
>>>>>
>>>>> Well, actually if you use the ring based flips, then you can't do the
>>>>> override. My atomic page flip code can do it because it's using mmio
>>>>> flips. There were also other reasons favoring mmio over ring.
>>>>>
>>>>> Once the atomic code is deemed ready, I would suggest we just nuke the
>>>>> ring based flip code (pun intended).
>>>>
>>>> Can you outline what exactly your plan is for doing faster-than-vblank
>>>> page flipping without tearing, and how it gets synchronized with
>>>> rendering?
>>>
>>> The faster than vrefresh flipping simply involves overwriting the
>>> display plane registers before they've been latched by the hardware.
>>> This appears to work fine already.
>>>
>>> As far as the synchronization goes, I basically just want a callback
>>> from the GPU when it's done with the buffer. I'm expecting to find
>>> some kind of GPU progress interrupt that I can enable while I'm waiting
>>> for the GPU to catch up. So I also need a FIFO to store the flip
>>> requests in the meantime. Once the GPU tells me it's ready, I pull the
>>> flip request from the queue and proceed with the display plane
>>> programming.
>>>
>>> So the synchronization part is still quite handwavy, and I need
>>> to study the hardware/driver in more detail to figure out the
>>> specifics.
>>>
>>
>> That's cool. But please make sure that the behaviour will be somehow
>> controllable by OpenGL applications, via some OpenGL extension. I can
>> see use for different modes:
>>
>> a) Normal double-buffering: For deterministic, well controlled timing -
>> That's what my type of applications need. Maximum control over what to
>> show next, based on precise and reliable flip completion timestamps.
>>
>> b) Triple buffering with FIFO queueing of frames ahead, what the intel
>> ddx currently does, unfortunately for me with totally broken
>> timestamping, so all my users have to disable it in the xorg.conf -
>> quite a challenge for many Apple converts, who have trouble with the
>> concept of editing configuration files. It's useful if an app manages to
>> render at full refresh rate on average to smooth out occasional stalls,
>> because the gpu has one frame of completed rendering queued up in
>> advance. Maybe this also allows for some power saving if an app can
>> render and queue frames ahead of time as fast as possible (race to
>> completion) and then the cpu/gpu can go to some deeper sleep state earlier?
>>
>> c) Your LIFO triple-buffering, as far as i understand, with dropping
>> late frames, to reduce latency/lag for things like video games.
>>
>
> Right. I've been occasionally thinking about pushing the swap interval
> handling to the kernel.
>
> Currently user space needs to do the wait for vblank trick before
> scheduling the swap, and then hope that the GPU will catch up fast
> enough so that the swap will happen on the next vblank. If the kernel
> handled it, it could actually guarantee the OML_sync_control remainder
> behaviour (well assuming kernel threads get scheduled in a timely
> fashion), whereas the user space solution can't give such guarantees.

Yes. You could even do much of it from the vblank irq for robustness of
timing. The downside would probably be the complexity of
error/special-case handling. E.g., an app schedules a swap 10 seconds
into the future, but then the app dies/quits or a fullscreen window gets
switched back to windowed mode, so something that was meant to be
page-flipped suddenly can't be page-flipped anymore, or the window goes
away during those 10 secs.
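
For reference, the OML_sync_control target/divisor/remainder rule that a
kernel-side scheduler would have to honour can be sketched roughly as
below. This is only an illustration based on a reading of the extension
spec, not driver code: the function name is made up, it assumes
remainder < divisor, and it models "as soon as possible" as the next
vblank count.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pick the vblank count (MSC) at which a swap
 * should happen, given the current MSC and the OML_sync_control
 * parameters. Assumes remainder < divisor whenever divisor != 0.
 */
uint64_t sketch_swap_target_msc(uint64_t current_msc, uint64_t target_msc,
                                uint64_t divisor, uint64_t remainder)
{
    /* Target still lies in the future: swap exactly at target_msc. */
    if (current_msc < target_msc)
        return target_msc;

    /* Target already reached and no divisor: swap as soon as possible
     * (assumption: modelled here as the very next vblank). */
    if (divisor == 0)
        return current_msc + 1;

    /* Otherwise: first MSC after current_msc with MSC % divisor == remainder. */
    uint64_t next = current_msc - (current_msc % divisor) + remainder;
    if (next <= current_msc)
        next += divisor;
    return next;
}
```

A scheduler living in the vblank irq would compare its counter against
this value on every vblank, and would still need the teardown handling
for windows or clients that disappear before the target is reached.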

> But even w/o that extra kernel feature, my code should be no worse in
> that regard than the current code. You can still do the wait for vblank
> trick in user space to get similar swap interval behaviour, and you can
> still use as many buffers as you want. The only real difference to the
> current situation is that if you schedule the flip too soon, you won't
> get the EBUSY from the kernel, but instead you drop the previous flip.
> But assuming the user space code is well behaved it won't try to flip
> too soon, so essentially nothing will change.
>

Yes. My remark wrt. application control was just because I assumed you
would be responsible for the whole stack when implementing this feature,
including how this gets exposed to apps via the ddx / glx / mesa etc.
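
To make the difference described in the quoted paragraph concrete, here
is a toy model of the two behaviours for a flip that arrives while an
earlier one is still pending: rejecting it with EBUSY versus letting it
replace the pending one. All names are hypothetical, and this is a
simplified single-plane model, not the actual i915 or atomic page flip
code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one display plane; not real driver state. */
struct sketch_plane {
    bool     pending;     /* a flip is queued but not yet latched */
    uint32_t pending_fb;  /* framebuffer waiting for the next vblank */
    uint32_t current_fb;  /* framebuffer currently being scanned out */
    unsigned dropped;     /* queued frames replaced before being shown */
};

/* Existing behaviour: a second flip before the vblank is refused. */
int sketch_flip_ebusy(struct sketch_plane *p, uint32_t fb)
{
    if (p->pending)
        return -EBUSY;
    p->pending = true;
    p->pending_fb = fb;
    return 0;
}

/* Override behaviour: the newer flip replaces the pending one. */
int sketch_flip_override(struct sketch_plane *p, uint32_t fb)
{
    if (p->pending)
        p->dropped++;   /* the previously queued frame is never shown */
    p->pending = true;
    p->pending_fb = fb;
    return 0;
}

/* At vblank the hardware latches whatever was programmed last. */
void sketch_vblank(struct sketch_plane *p)
{
    if (p->pending) {
        p->current_fb = p->pending_fb;
        p->pending = false;
    }
}
```

A client that waits for the previous flip to complete before queueing
the next one never reaches either special case, which matches the point
that well-behaved user space should see no change.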

>> d) Flipping without vsync = tearing. I think this is at least useful for
>> benchmarks, although not for anything else.
>
> This one I don't support currently. It would be possible to support it
> (assuming the HW allows it). The simplest way would be to just add a
> new flag to the ioctl to control this behaviour.
>

I think that's what Jesse's patches are supposed to add.
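
As a rough idea of what such a flag could look like on the kernel side,
assuming a flags word on the flip request and a per-device capability
bit: the flag names and values below are invented for this sketch and
are not taken from Jesse's patches or from any kernel header.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a page flip request. */
#define SKETCH_FLIP_EVENT 0x01u  /* deliver a completion event */
#define SKETCH_FLIP_ASYNC 0x02u  /* flip immediately, tearing allowed */
#define SKETCH_FLIP_FLAGS (SKETCH_FLIP_EVENT | SKETCH_FLIP_ASYNC)

/* Reject unknown flags and async requests the hardware cannot do. */
int sketch_check_flip_flags(uint32_t flags, bool hw_can_async)
{
    if (flags & ~SKETCH_FLIP_FLAGS)
        return -EINVAL;
    if ((flags & SKETCH_FLIP_ASYNC) && !hw_can_async)
        return -EINVAL;
    return 0;
}

/* Without the async flag the flip stays synchronized to vblank. */
bool sketch_flip_waits_for_vblank(uint32_t flags)
{
    return !(flags & SKETCH_FLIP_ASYNC);
}
```

Keeping the default (no flag) vsynced means existing clients keep their
current behaviour, and only callers that explicitly ask for tearing get
it.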

thanks,
-mario