RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

From: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
To: "Michel Dänzer" <michel@daenzer.net>, "Daniel Vetter" <daniel@ffwll.ch>
Cc: "dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"Gerd Hoffmann" <kraxel@redhat.com>,
	Pekka Paalanen <ppaalanen@gmail.com>,
	"Simon Ser" <contact@emersion.fr>,
	"Zhang, Tina" <tina.zhang@intel.com>,
	"Kim, Dongwon" <dongwon.kim@intel.com>,
	"Singh, Satyeshwar" <satyeshwar.singh@intel.com>
Subject: RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Date: Wed, 11 Aug 2021 07:25:19 +0000	[thread overview]
Message-ID: <6f1e8eb2e5964fdcad5fe940a1b0bc34@intel.com> (raw)
In-Reply-To: <a565666e-818d-b85b-b570-c30b57979d01@daenzer.net>

Hi Michel,

> On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 08:21:09AM +0000, Kasireddy, Vivek wrote:
> >>> On Fri, Aug 06, 2021 at 07:27:13AM +0000, Kasireddy, Vivek wrote:
> >>>>>>>
> >>>>>>> Hence my gut feeling reaction that first we need to get these two
> >>>>>>> compositors aligned in their timings, which propobably needs
> >>>>>>> consistent vblank periods/timestamps across them (plus/minux
> >>>>>>> guest/host clocksource fun ofc). Without this any of the next steps
> >>>>>>> will simply not work because there's too much jitter by the time the
> >>>>>>> guest compositor gets the flip completion events.
> >>>>>> [Kasireddy, Vivek] Timings are not a problem and do not significantly
> >>>>>> affect the repaint cycles from what I have seen so far.
> >>>>>>
> >>>>>>>
> >>>>>>> Once we have solid events I think we should look into statically
> >>>>>>> tuning guest/host compositor deadlines (like you've suggested in a
> >>>>>>> bunch of places) to consisently make that deadline and hit 60 fps.
> >>>>>>> With that we can then look into tuning this automatically and what to
> >>>>>>> do when e.g. switching between copying and zero-copy on the host side
> >>>>>>> (which might be needed in some cases) and how to handle all that.
> >>>>>> [Kasireddy, Vivek] As I confirm here:
> >>> https://gitlab.freedesktop.org/wayland/weston/-
> >>>>> /issues/514#note_984065
> >>>>>> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
> >>>>>> I feel that this zero-copy solution I am trying to create should be independent
> >>>>>> of compositors' deadlines, delays or other scheduling parameters.
> >>>>>
> >>>>> That's not how compositors work nowadays. Your problem is that you don't
> >>>>> have the guest/host compositor in sync. zero-copy only changes the timing,
> >>>>> so it changes things from "rendering way too many frames" to "rendering
> >>>>> way too few frames".
> >>>>>
> >>>>> We need to fix the timing/sync issue here first, not paper over it with
> >>>>> hacks.
> >>>> [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be
> >>>> independent of the scheduling policies to ensure that it works with all compositors.
> >>>>  IIUC, Weston for example uses the vblank/pageflip completion timestamp, the
> >>>> configurable repaint-window value, refresh-rate, etc to determine when to start
> >>>> its next repaint -- if there is any damage:
> >>>> timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
> >>>> timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor-
> >>>> repaint_msec);
> >>>>
> >>>> And, in the case of VKMS, since there is no real hardware, the timestamp is always:
> >>>> now = ktime_get();
> >>>> send_vblank_event(dev, e, seq, now);
> >>>
> >>> vkms has been fixed since a while to fake high-precision timestamps like
> >>> from a real display.
> >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does not need
> >> to have the same timestamp as that of the Host -- to work as expected.
> >>
> >>>
> >>>> When you say that the Guest/Host compositor need to stay in sync, are you
> >>>> suggesting that we need to ensure that the vblank timestamp on the Host
> >>>> needs to be shared and be the same on the Guest and a vblank/pageflip
> >>>> completion for the Guest needs to be sent at exactly the same time it is sent
> >>>> on the Host? If yes, I'd say that we do send the pageflip completion to Guest
> >>>> around the same time a vblank is generated on the Host but it does not help
> >>>> because the Guest compositor would only have 9 ms to submit a new frame
> >>>> and if the Host is running Mutter, the Guest would only have 2 ms.
> >>>> (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> >>>
> >>> Not at the same time, but the same timestamp. And yes there is some fun
> >>> there, which is I think the fundamental issue. Or at least some of the
> >>> compositor experts seem to think so, and it makes sense to me.
> >> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed up, then
> >> the Guest repaint cycle would be affected. However, I do not believe that is the case
> >> here given the debug and instrumentation data we collected and scrutinized. Hopefully,
> >> compositor experts could chime in to shed some light on this matter.
> >>
> >>>
> >>>>>
> >>>>> Only, and I really mean only, when that shows that it's simply impossible
> >>>>> to hit 60fps with zero-copy and the guest/host fully aligned should we
> >>>>> look into making the overall pipeline deeper.
> >>>> [Kasireddy, Vivek] From all the experiments conducted so far and given the
> >>>> discussion associated with https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514
> >>>> I think we have already established that in order for a zero-copy solution to work
> >>>> reliably, the Guest compositor needs to start its repaint cycle when the Host
> >>>> compositor sends a frame callback event to its clients.
> >>>>
> >>>>>
> >>>>>>> Only when that all shows that we just can't hit 60fps consistently and
> >>>>>>> really need 3 buffers in flight should we look at deeper kms queues.
> >>>>>>> And then we really need to implement them properly and not with a
> >>>>>>> mismatch between drm_event an out-fence signalling. These quick hacks
> >>>>>>> are good for experiments, but there's a pile of other things we need
> >>>>>>> to do first. At least that's how I understand the problem here right
> >>>>>>> now.
> >>>>>> [Kasireddy, Vivek] Experiments done so far indicate that we can hit 59 FPS
> >>> consistently
> >>>>>> -- in a zero-copy way independent of compositors' delays/deadlines -- with this
> >>>>>> patch series + the Weston MR I linked in the cover letter. The main reason why
> this
> >>>>>> works is because we relax the assumption that when the Guest compositor gets a
> >>>>>> pageflip completion event that it could reuse the old FB it submitted in the
> previous
> >>>>>> atomic flip and instead force it to use a new one. And, we send the pageflip
> >>> completion
> >>>>>> event to the Guest when the Host compositor sends a frame callback event.
> Lastly,
> >>>>>> we use the (deferred) out_fence as just a mechanism to tell the Guest compositor
> >>> when
> >>>>>> it can release references on old FBs so that they can be reused again.
> >>>>>>
> >>>>>> With that being said, the only question is how can we accomplish the above in an
> >>>>> upstream
> >>>>>> acceptable way without regressing anything particularly on bare-metal. Its not
> clear
> >>> if
> >>>>> just
> >>>>>> increasing the queue depth would work or not but I think the Guest compositor
> has to
> >>> be
> >>>>> told
> >>>>>> when it can start its repaint cycle and when it can assume the old FB is no longer
> in
> >>> use.
> >>>>>> On bare-metal -- and also with VKMS as of today -- a pageflip completion
> indicates
> >>>>> both.
> >>>>>> In other words, Vblank event is the same as Flip done, which makes sense on
> bare-
> >>> metal.
> >>>>>> But if we were to have two events at-least for VKMS: vblank to indicate to Guest
> to
> >>> start
> >>>>>> repaint and flip_done to indicate to drop references on old FBs, I think this
> problem
> >>> can
> >>>>>> be solved even without increasing the queue depth. Can this be acceptable?
> >>>>>
> >>>>> That's just another flavour of your "increase queue depth without
> >>>>> increasing the atomic queue depth" approach. I still think the underlying
> >>>>> fundamental issue is a timing confusion, and the fact that adjusting the
> >>>>> timings fixes things too kinda proves that. So we need to fix that in a
> >>>>> clean way, not by shuffling things around semi-randomly until the specific
> >>>>> config we tests works.
> >>>> [Kasireddy, Vivek] This issue is not due to a timing or timestamp mismatch. We
> >>>> have carefully instrumented both the Host and Guest compositors and measured
> >>>> the latencies at each step. The relevant debug data only points to the scheduling
> >>>> policy -- of both Host and Guest compositors -- playing a role in Guest rendering
> >>>> at 30 FPS.
> >>>
> >>> Hm but that essentially means that the events your passing around have an
> >>> even more ad-hoc implementation specific meaning: Essentially it's the
> >>> kick-off for the guest's repaint loop? That sounds even worse for a kms
> >>> uapi extension.
> >> [Kasireddy, Vivek] The pageflip completion event/vblank event indeed serves as the
> >> kick-off for a compositor's (both Guest and Host) repaint loop. AFAICT, Weston
> >> works that way and even if we increase the queue depth to solve this problem, I don't
> >> think it'll help because the arrival of this event always indicates to a compositor to
> >> start its repaint cycle again and assume that the previous buffers are all free.
> >
> > I thought this is how simple compositors work, and weston has since a
> > while it's own timer, which is based on the timestamp it gets (at on
> > drivers with vblank support), so that it starts the repaint loop a few ms
> > before the next vblank. And not immediately when it receives the old page
> > flip completion event.
> 
> As long as it's a fixed timer, there's still a risk that the guest compositor repaint cycle runs
> too late for the host one (unless the guest cycle happens to be scheduled significantly
> earlier than the host one).
> 
> Note that current mutter Git main (to become the 41 release this autumn) uses dynamic
> scheduling of its repaint cycle based on how long the last 16 frames took to draw and
> present. In theory, this could automatically schedule the guest cycle early enough for the
> host one.
[Kasireddy, Vivek] I'd like to try it out soon; it'd be very interesting to see how Mutter
works in both Guest and Host with this new scheduling policy. Having said that, I think
there is still a need to come up with a comprehensive solution that is independent of
compositors' scheduling policies. To that end, I am thinking of splitting the pageflip
completion event into two events: vblank event (to indicate to compositor to start repaint)
and flip_done event (to indicate to release references on old FBs). Or, introduce two new
signals/fences along similar lines. Thoughts?

Thanks,
Vivek

> 
> 
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer