Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

From: "Michel Dänzer" <michel@daenzer.net>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>,
	Dongwon Kim <dongwon.kim@intel.com>,
	dri-devel@lists.freedesktop.org,
	Tina Zhang <tina.zhang@intel.com>,
	Gerd Hoffmann <kraxel@redhat.com>
Subject: Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Date: Mon, 2 Aug 2021 11:19:01 +0200	[thread overview]
Message-ID: <ab6300d6-1cc8-8084-81bc-118156ad8c52@daenzer.net> (raw)
In-Reply-To: <YQe1eUxNvJZ9G200@phenom.ffwll.local>

On 2021-08-02 11:06 a.m., Daniel Vetter wrote:
> On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
>> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
>>> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
>>>> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
>>>>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>>>>>> By separating the OUT_FENCE signalling from pageflip completion allows
>>>>>> a Guest compositor to start a new repaint cycle with a new buffer
>>>>>> instead of waiting for the old buffer to be free. 
>>>>>>
>>>>>> This work is based on the idea/suggestion from Simon and Pekka.
>>>>>>
>>>>>> This capability can be a solution for this issue:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>>>>>
>>>>>> Corresponding Weston MR:
>>>>>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>>>>>
>>>>> Uh I kinda wanted to discuss this a bit more before we jump into typing
>>>>> code, but well I guess not that much work yet.
>>>>>
>>>>> So maybe I'm not understanding the problem, but I think the fundamental
>>>>> underlying issue is that with KMS you can have at most 2 buffers
>>>>> in-flight, due to our queue depth limit of 1 pending flip.
>>>>>
>>>>> Unfortunately that means for virtual hw where it takes a few more
>>>>> steps/vblanks until the framebuffer actually shows up on screen and is
>>>>> scanned out, we suffer deeply. The usual fix for that is to drop the
>>>>> latency and increase throughput, and have more buffers in-flight. Which
>>>>> this patch tries to do.
>>>>
>>>> Per
>>>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
>>>> IMO the underlying issue is actually that the guest compositor repaint
>>>> cycle is not aligned with the host compositor one. If they were aligned,
>>>> the problem would not occur even without allowing multiple page flips in
>>>> flight, and latency would be lower.
>>>
>>> Yeah my proposal here is under the premise that we do actually need to fix
>>> this with a deeper queue depth.
>>>
>>>>> Now I think where we go wrong here is that we're trying to hack this up by
>>>>> defining different semantics for the out-fence and for the drm-event. Imo
>>>>> that's wrong, they're both meant to show eactly the same thing:
>>>>> - when is the new frame actually visible to the user (as in, eyeballs in a
>>>>>   human head, preferrably, not the time when we've handed the buffer off
>>>>>   to the virtual hw)
>>>>> - when is the previous buffer no longer being used by the scanout hw
>>>>>
>>>>> We do cheat a bit right now in so far that we assume they're both the
>>>>> same, as in, panel-side latency is currently the compositor's problem to
>>>>> figure out.
>>>>>
>>>>> So for virtual hw I think the timestamp and even completion really need to
>>>>> happen only when the buffer has been pushed through the entire
>>>>> virtualization chain, i.e. ideally we get the timestamp from the kms
>>>>> driver from the host side. Currently that's not done, so this is most
>>>>> likely quite broken already (virtio relies on the no-vblank auto event
>>>>> sending, which definitely doesn't wait for anything, or I'm completely
>>>>> missing something).
>>>>>
>>>>> I think instead of hacking up some ill-defined 1.5 queue depth support,
>>>>> what we should do is support queue depth > 1 properly. So:
>>>>>
>>>>> - Change atomic to support queue depth > 1, this needs to be a per-driver
>>>>>   thing due to a bunch of issues in driver code. Essentially drivers must
>>>>>   never look at obj->state pointers, and only ever look up state through
>>>>>   the passed-in drm_atomic_state * update container.
>>>>>
>>>>> - Aside: virtio should loose all it's empty hooks, there's no point in
>>>>>   that.
>>>>>
>>>>> - We fix virtio to send out the completion event at the end of this entire
>>>>>   pipeline, i.e. virtio code needs to take care of sending out the
>>>>>   crtc_state->event correctly.
>>>>>
>>>>> - We probably also want some kind of (maybe per-crtc) recommended queue
>>>>>   depth property so compositors know how many buffers to keep in flight.
>>>>>   Not sure about that.
>>>>
>>>> I'd say there would definitely need to be some kind of signal for the
>>>> display server that it should queue multiple flips, since this is
>>>> normally not desirable for latency. In other words, this wouldn't really
>>>> be useful on bare metal (in contrast to the ability to replace a pending
>>>> flip with a newer one).
>>>
>>> Hm I was thinking that the compositor can tune this. If the round-trip
>>> latency (as measured by events) is too long to get full refresh rate, it
>>> can add more buffers to the queue. That's kinda why I think the returned
>>> event really must be accurate wrt actual display time (and old buffer
>>> release time), so that this computation in the compositor because a pretty
>>> simple
>>>
>>> num_buffers = (flip_time - submit_time) / frame_time
>>>
>>> With maybe some rounding up and averaging. You can also hit this when your
>>> 3d engine has an extremely deep pipeline (like some of the tiling
>>> renders have), where rendering just takes forever, but as long as you keep
>>> 2 frames in the renderer in-flight you can achieve full refresh rate (at a
>>> latency cost).
>>
>> As long as a page flip submitted after vblank N can complete during
>> vblank N+1, full frame rate can be sustained[0]. User space can use as
>> many buffers as needed to keep the rendering pipeline busy.
>>
>> [0] This is broken by the mis-aligned compositor repaint cycles: The
>> flip from the guest compositor misses the host compositor's cycle, so it
>> takes more than one display refresh cycle to complete.
>>
>>
>>> So kernel can't really tell you in all cases how many buffers you should
>>> have.
>>
>> That's not exactly what I mean. Right now, KMS user space has to wait
>> for a flip to complete before submitting another one, or it gets EBUSY.
>> So if the kernel wants to allow multiple flips to be submitted, it has
>> to somehow tell user space that this is possible, or it'll never happen.
>> And the kernel should never advertise this for bare metal, since it's
>> not needed there (and undesirable).
> 
> Oh the existence of the deep queue needs a getcap ofc.
> 
> Also, deep queues do exist in hw (including scheduling to the right
> frame), the benefit is reduced battery usage for e.g. video playback if
> you do the rendering for an entire set of frames and then just let the
> display show them.
> 
> It costs latency ofc (or we need a cancellable queue, which once it's in
> hw is tricky).

And if it's not in HW, it can be handled by the user-space display server instead of the kernel (except for cancelling pending flips).

(Note that this is currently not possible either way with Wayland, since it uses mailbox semantics; there are proposals for a Wayland extension which allows queuing multiple frames though. Meanwhile, this would need to be handled in the Wayland clients.)

-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer