On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer wrote: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: > > On 2020-03-12 10:32 a.m., Alex Deucher wrote: > >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner > >> wrote: > >>> > >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > >>> events at vsartup for DCN")' introduces a new way of pageflip > >>> completion handling for DCN, and some trouble. > >>> > >>> The current implementation introduces a race condition, which > >>> can cause pageflip completion events to be sent out one vblank > >>> too early, thereby confusing userspace and causing flicker: > >>> > >>> prepare_flip_isr(): > >>> > >>> 1. Pageflip programming takes the ddev->event_lock. > >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > >>> 3. Releases ddev->event_lock. > >>> > >>> --> Deadline for surface address regs double-buffering passes on > >>> target pipe. > >>> > >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > >>> into hw, but too late for current vblank. > >>> > >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete > >>> in current vblank due to missing the double-buffering deadline > >>> by a tiny bit. > >>> > >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > >>> dm_dcn_crtc_high_irq() gets called. > >>> > >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the > >>> pageflip has been completed/will complete in this vblank and > >>> sends out pageflip completion event to userspace and resets > >>> pflip_status = AMDGPU_FLIP_NONE. > >>> > >>> => Flip completion event sent out one vblank too early. > >>> > >>> This behaviour has been observed during my testing with measurement > >>> hardware a couple of time. > >>> > >>> The commit message says that the extra flip event code was added to > >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component > >>> is clock gated and doesn't fire pflip irqs in that state. Also that > >>> this clock gating may happen if no planes are active. According to > >>> Nicholas, the clock gating can also happen if psr is active, and the > >>> gating is controlled independently by the hardware, so difficult to > >>> detect if and when the completion code in above commit is needed. > >>> > >>> This patch tries the following solution: It only executes the extra > >>> pflip > >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports > >>> that there aren't any surface updated pending in the double-buffered > >>> surface scanout address registers. Otherwise it leaves pflip completion > >>> to the pflip irq handler, for a more race-free experience. > >>> > >>> This would only guard against the order of events mentioned above. > >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help > >>> at all, because 1-3 + 5 might happen even without the hw being > >>> programmed > >>> at all, ie. no surface update pending because none yet programmed > >>> into hw. > >>> > >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(), > >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done > >>> under event_lock protection within the same critical section. > >>> > >>> v2: Take Nicholas comments into account, try a different solution. > >>> > >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). > >>> Seems to work without causing obvious new trouble. > >> > >> Nick, any comments on this? Can we get this committed or do you think > >> it needs additional rework? > >> > >> Thanks, > >> > >> Alex > > > > Hi Alex, Mario, > > > > This might be a little strange, but if we want to get this in as a fix > > for regressions caused by the original vblank and user events at > > vstartup patch then I'm actually going to give my reviewed by on the > > *v1* of this patch (but not this v2): > > > > Reviewed-by: Nicholas Kazlauskas > > > > You can feel free to apply that one. > > > > Reason 1: After having thought about it some more I don't think we > > enable anything today that has hubp powered down at the same time we > > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR > > entry. Static screen interrupt should happen after that flip finishes I > > think. > > > > The CRTC can still be powered on with zero planes, and I don't think any > > userspace explicitly asks for vblank events in this case but it doesn't > > hurt to have the check. > > > Ok, so the original commit that causes the races currently solves a non-existing - but potential future - problem? I guess then my v1 patch makes sense for the moment and fixes the immediate problem for Linux 5.6-rc. Maybe there's a way to ask the hw if hubp is clock-gated and so if that code needs to run or not in the future? As mentioned before, it would be helpful at least for me to get a better overview about which hw events happen when in vblank, which irq's fire in response etc., how this relates to things like power management actions, psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years, and my guideline are still the public pdf files with the DCE register specs for DCE-1'ish hw from sometime around the year 2007. > Reason 2: This new patch will need much more thorough testing from side > > to fully understand the consequences of locking the entire DC commit > > sequence. For just a page flip that sounds fine, but for anything more > > than (eg. full updates, modesets, etc) I don't think we want to be > > disabling interrupts for potentially many milliseconds. > > Ah! I was wondering where the attached splat comes from, but I think > this explains it: With this patch amdgpu_dm_commit_planes keeps the > pcrtc->dev->event_lock spinlock locked while calling > dc_commit_updates_for_stream, which ends up calling > smu_set_display_count, which tries to lock a mutex. > > Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and that function only gets called on Navi /DCN2 afaics. All my testing is limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm. thanks, -mario -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and X developer >