Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)

From: Mario Kleiner <mario.kleiner.de@gmail.com>
To: "Michel Dänzer" <michel@daenzer.net>
Cc: "Leo \(Sunpeng\) Li" <sunpeng.li@amd.com>,
	Maling list - DRI developers <dri-devel@lists.freedesktop.org>,
	"Deucher, Alexander" <alexander.deucher@amd.com>,
	amd-gfx list <amd-gfx@lists.freedesktop.org>,
	Harry Wentland <hwentlan@amd.com>,
	"Kazlauskas, Nicholas" <nicholas.kazlauskas@amd.com>
Subject: Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Date: Fri, 13 Mar 2020 22:52:59 +0100	[thread overview]
Message-ID: <CAEsyxyhzWFgqzWBC-==WUgbWtT4MAEVt_B0HmS=75_OJevm3tw@mail.gmail.com> (raw)
In-Reply-To: <41ab0520-e29a-b6ed-bf5e-fbdf1eec0ceb@daenzer.net>

[-- Attachment #1.1: Type: text/plain, Size: 6603 bytes --]

On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer <michel@daenzer.net> wrote:

> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> > On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
> >> <mario.kleiner.de@gmail.com> wrote:
> >>>
> >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> >>> events at vsartup for DCN")' introduces a new way of pageflip
> >>> completion handling for DCN, and some trouble.
> >>>
> >>> The current implementation introduces a race condition, which
> >>> can cause pageflip completion events to be sent out one vblank
> >>> too early, thereby confusing userspace and causing flicker:
> >>>
> >>> prepare_flip_isr():
> >>>
> >>> 1. Pageflip programming takes the ddev->event_lock.
> >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> >>> 3. Releases ddev->event_lock.
> >>>
> >>> --> Deadline for surface address regs double-buffering passes on
> >>>      target pipe.
> >>>
> >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
> >>>     into hw, but too late for current vblank.
> >>>
> >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
> >>>     in current vblank due to missing the double-buffering deadline
> >>>     by a tiny bit.
> >>>
> >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
> >>>     dm_dcn_crtc_high_irq() gets called.
> >>>
> >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
> >>>     pageflip has been completed/will complete in this vblank and
> >>>     sends out pageflip completion event to userspace and resets
> >>>     pflip_status = AMDGPU_FLIP_NONE.
> >>>
> >>> => Flip completion event sent out one vblank too early.
> >>>
> >>> This behaviour has been observed during my testing with measurement
> >>> hardware a couple of time.
> >>>
> >>> The commit message says that the extra flip event code was added to
> >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> >>> is clock gated and doesn't fire pflip irqs in that state. Also that
> >>> this clock gating may happen if no planes are active. According to
> >>> Nicholas, the clock gating can also happen if psr is active, and the
> >>> gating is controlled independently by the hardware, so difficult to
> >>> detect if and when the completion code in above commit is needed.
> >>>
> >>> This patch tries the following solution: It only executes the extra
> >>> pflip
> >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> >>> that there aren't any surface updated pending in the double-buffered
> >>> surface scanout address registers. Otherwise it leaves pflip completion
> >>> to the pflip irq handler, for a more race-free experience.
> >>>
> >>> This would only guard against the order of events mentioned above.
> >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> >>> at all, because 1-3 + 5 might happen even without the hw being
> >>> programmed
> >>> at all, ie. no surface update pending because none yet programmed
> >>> into hw.
> >>>
> >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> >>> under event_lock protection within the same critical section.
> >>>
> >>> v2: Take Nicholas comments into account, try a different solution.
> >>>
> >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> >>> Seems to work without causing obvious new trouble.
> >>
> >> Nick, any comments on this?  Can we get this committed or do you think
> >> it needs additional rework?
> >>
> >> Thanks,
> >>
> >> Alex
> >
> > Hi Alex, Mario,
> >
> > This might be a little strange, but if we want to get this in as a fix
> > for regressions caused by the original vblank and user events at
> > vstartup patch then I'm actually going to give my reviewed by on the
> > *v1* of this patch (but not this v2):
> >
> > Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
> >
> > You can feel free to apply that one.
> >
> > Reason 1: After having thought about it some more I don't think we
> > enable anything today that has hubp powered down at the same time we
> > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> > entry. Static screen interrupt should happen after that flip finishes I
> > think.
> >
> > The CRTC can still be powered on with zero planes, and I don't think any
> > userspace explicitly asks for vblank events in this case but it doesn't
> > hurt to have the check.
> >
>

Ok, so the original commit that causes the races currently solves a
non-existing - but potential future - problem?

I guess then my v1 patch makes sense for the moment and fixes the immediate
problem for Linux 5.6-rc.

Maybe there's a way to ask the hw if hubp is clock-gated and so if that
code needs to run or not in the future?

As mentioned before, it would be helpful at least for me to get a better
overview about which hw events happen when in vblank, which irq's fire in
response etc., how this relates to things like power management actions,
psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years,
and my guideline are still the public pdf files with the DCE register specs
for DCE-1'ish hw from sometime around the year 2007.

> Reason 2: This new patch will need much more thorough testing from side
> > to fully understand the consequences of locking the entire DC commit
> > sequence. For just a page flip that sounds fine, but for anything more
> > than (eg. full updates, modesets, etc) I don't think we want to be
> > disabling interrupts for potentially many milliseconds.
>
> Ah! I was wondering where the attached splat comes from, but I think
> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> pcrtc->dev->event_lock spinlock locked while calling
> dc_commit_updates_for_stream, which ends up calling
> smu_set_display_count, which tries to lock a mutex.
>
>
Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and
that function only gets called on Navi /DCN2 afaics. All my testing is
limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm.

thanks,
-mario

-- 
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer
>

[-- Attachment #1.2: Type: text/html, Size: 8714 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel