From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: ** X-Spam-Status: No, score=2.5 required=3.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABCBBC2BB1D for ; Fri, 13 Mar 2020 21:53:15 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 824FA2074E for ; Fri, 13 Mar 2020 21:53:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AnFDMk78" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 824FA2074E Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id D2CDC6E054; Fri, 13 Mar 2020 21:53:13 +0000 (UTC) Received: from mail-wr1-x431.google.com (mail-wr1-x431.google.com [IPv6:2a00:1450:4864:20::431]) by gabe.freedesktop.org (Postfix) with ESMTPS id 9DBC36E04B; Fri, 13 Mar 2020 21:53:12 +0000 (UTC) Received: by mail-wr1-x431.google.com with SMTP id 6so13954467wre.4; Fri, 13 Mar 2020 14:53:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=F1ggqG8zLpfklIUhtsKqqhmAK1SrCt0QqulQEoEWP5g=; b=AnFDMk782pNuCdpamDK3FbcquUHjgajeTbfyDydJi4bbkdcBXyjozJHZSoxk4gKGx0 r70UO5pj7igzYLiBxKhUsEFhOn0jhdT9Tx2GiNe8wu617dANSrwaLvU/7rnKaBUt3raY HeOzthqGEdtrLGqBezym0vVJuzZateFipd6GnWZqh1SYg/DoAoUeGZbiOBK6Sm3XzdhO CQm2RsmJXDNJX46yBoQX4bCYmjbIE5KNUC4rMna16G0hoEKpHZG7Pt5nPz65+Kc9WXD6 tyZjbW69jcbg8q3B+AzIMcdB32gvxu/XiCFE2x3eudvKZ+tyCNcJRRZFu5U0TbFqjnxm C98A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=F1ggqG8zLpfklIUhtsKqqhmAK1SrCt0QqulQEoEWP5g=; b=YhGmvwNG8qktqGQFwJmxzZmIlovAXRdELSD66qEWovAVEecsjXGEdaNwAKwuWwlFe1 lMejxADJyqpvmv2zgAnspWMIG0K0SEG9CRxtc+We9karkOzFQL97FWXNqZCs9/RnIMkZ C4t0HgGZ+HW679dYyunxMBoBsu3OolLEwE/zgyBjSPi73FECf1w3VhlE+tDIju0DkECN /ib6PNzq9i8kudPGNZEvoLrL7hDGgkb7iIIeZCmg2v4xcL1D+v/OLirjDvkJhZJWd152 4RNoNfufNopCEHgawQlvMzLfSREuOPuQ4lzk0pc4dtqMAxQxjFsT6SMqFHlRDFs2BfMN tF9A== X-Gm-Message-State: ANhLgQ0S/3spGDZWVguF5PRkan1Jl9NvnQ7ehTn23oXQpPKuwmXhc6oM WA/m43RGcHPtXm9fvbQNuBNSsraYN7fE5gKW93u9NA== X-Google-Smtp-Source: ADFU+vu/Rfqh3OZs9grJOy4ux64pc/ZSQfz5oSjxV62odZ3LRQ9OEfomtYG3tE/gU8thbmwhJVTEWhLVytwROTVzzHQ= X-Received: by 2002:a5d:6146:: with SMTP id y6mr20163557wrt.107.1584136391311; Fri, 13 Mar 2020 14:53:11 -0700 (PDT) MIME-Version: 1.0 References: <20200305212044.3857-1-mario.kleiner.de@gmail.com> <41ab0520-e29a-b6ed-bf5e-fbdf1eec0ceb@daenzer.net> In-Reply-To: <41ab0520-e29a-b6ed-bf5e-fbdf1eec0ceb@daenzer.net> From: Mario Kleiner Date: Fri, 13 Mar 2020 22:52:59 +0100 Message-ID: Subject: Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2) To: =?UTF-8?Q?Michel_D=C3=A4nzer?= X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "Leo \(Sunpeng\) Li" , Maling list - DRI developers , "Deucher, Alexander" , amd-gfx list , Harry Wentland , "Kazlauskas, Nicholas" Content-Type: multipart/mixed; boundary="===============0933230698==" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" --===============0933230698== Content-Type: multipart/alternative; boundary="000000000000e146a805a0c37e53" --000000000000e146a805a0c37e53 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Fri, Mar 13, 2020 at 5:02 PM Michel D=C3=A4nzer wro= te: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: > > On 2020-03-12 10:32 a.m., Alex Deucher wrote: > >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner > >> wrote: > >>> > >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > >>> events at vsartup for DCN")' introduces a new way of pageflip > >>> completion handling for DCN, and some trouble. > >>> > >>> The current implementation introduces a race condition, which > >>> can cause pageflip completion events to be sent out one vblank > >>> too early, thereby confusing userspace and causing flicker: > >>> > >>> prepare_flip_isr(): > >>> > >>> 1. Pageflip programming takes the ddev->event_lock. > >>> 2. Sets acrtc->pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED > >>> 3. Releases ddev->event_lock. > >>> > >>> --> Deadline for surface address regs double-buffering passes on > >>> target pipe. > >>> > >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > >>> into hw, but too late for current vblank. > >>> > >>> =3D> pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED, but flip won't comple= te > >>> in current vblank due to missing the double-buffering deadline > >>> by a tiny bit. > >>> > >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > >>> dm_dcn_crtc_high_irq() gets called. > >>> > >>> 6. Detects pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED and assumes the > >>> pageflip has been completed/will complete in this vblank and > >>> sends out pageflip completion event to userspace and resets > >>> pflip_status =3D AMDGPU_FLIP_NONE. > >>> > >>> =3D> Flip completion event sent out one vblank too early. > >>> > >>> This behaviour has been observed during my testing with measurement > >>> hardware a couple of time. > >>> > >>> The commit message says that the extra flip event code was added to > >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component > >>> is clock gated and doesn't fire pflip irqs in that state. Also that > >>> this clock gating may happen if no planes are active. According to > >>> Nicholas, the clock gating can also happen if psr is active, and the > >>> gating is controlled independently by the hardware, so difficult to > >>> detect if and when the completion code in above commit is needed. > >>> > >>> This patch tries the following solution: It only executes the extra > >>> pflip > >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports > >>> that there aren't any surface updated pending in the double-buffered > >>> surface scanout address registers. Otherwise it leaves pflip completi= on > >>> to the pflip irq handler, for a more race-free experience. > >>> > >>> This would only guard against the order of events mentioned above. > >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't he= lp > >>> at all, because 1-3 + 5 might happen even without the hw being > >>> programmed > >>> at all, ie. no surface update pending because none yet programmed > >>> into hw. > >>> > >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(= ), > >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are don= e > >>> under event_lock protection within the same critical section. > >>> > >>> v2: Take Nicholas comments into account, try a different solution. > >>> > >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). > >>> Seems to work without causing obvious new trouble. > >> > >> Nick, any comments on this? Can we get this committed or do you think > >> it needs additional rework? > >> > >> Thanks, > >> > >> Alex > > > > Hi Alex, Mario, > > > > This might be a little strange, but if we want to get this in as a fix > > for regressions caused by the original vblank and user events at > > vstartup patch then I'm actually going to give my reviewed by on the > > *v1* of this patch (but not this v2): > > > > Reviewed-by: Nicholas Kazlauskas > > > > You can feel free to apply that one. > > > > Reason 1: After having thought about it some more I don't think we > > enable anything today that has hubp powered down at the same time we > > expect to be waiting for a flip - eg. DMCU powering down HUBP during PS= R > > entry. Static screen interrupt should happen after that flip finishes I > > think. > > > > The CRTC can still be powered on with zero planes, and I don't think an= y > > userspace explicitly asks for vblank events in this case but it doesn't > > hurt to have the check. > > > Ok, so the original commit that causes the races currently solves a non-existing - but potential future - problem? I guess then my v1 patch makes sense for the moment and fixes the immediate problem for Linux 5.6-rc. Maybe there's a way to ask the hw if hubp is clock-gated and so if that code needs to run or not in the future? As mentioned before, it would be helpful at least for me to get a better overview about which hw events happen when in vblank, which irq's fire in response etc., how this relates to things like power management actions, psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years, and my guideline are still the public pdf files with the DCE register specs for DCE-1'ish hw from sometime around the year 2007. > Reason 2: This new patch will need much more thorough testing from side > > to fully understand the consequences of locking the entire DC commit > > sequence. For just a page flip that sounds fine, but for anything more > > than (eg. full updates, modesets, etc) I don't think we want to be > > disabling interrupts for potentially many milliseconds. > > Ah! I was wondering where the attached splat comes from, but I think > this explains it: With this patch amdgpu_dm_commit_planes keeps the > pcrtc->dev->event_lock spinlock locked while calling > dc_commit_updates_for_stream, which ends up calling > smu_set_display_count, which tries to lock a mutex. > > Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and that function only gets called on Navi /DCN2 afaics. All my testing is limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm. thanks, -mario --=20 > Earthling Michel D=C3=A4nzer | https://redhat= .com > Libre software enthusiast | Mesa and X developer > --000000000000e146a805a0c37e53 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Fri, Mar 13, 2020 at 5:02 PM Miche= l D=C3=A4nzer <michel@daenzer.net<= /a>> wrote:
O= n 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>> <
mario.kleiner.de@gmail.com> wrote:
>>>
>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank a= nd user
>>> events at vsartup for DCN")' introduces a new way of = pageflip
>>> completion handling for DCN, and some trouble.
>>>
>>> The current implementation introduces a race condition, which<= br> >>> can cause pageflip completion events to be sent out one vblank=
>>> too early, thereby confusing userspace and causing flicker: >>>
>>> prepare_flip_isr():
>>>
>>> 1. Pageflip programming takes the ddev->event_lock.
>>> 2. Sets acrtc->pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED >>> 3. Releases ddev->event_lock.
>>>
>>> --> Deadline for surface address regs double-buffering pass= es on
>>> =C2=A0=C2=A0=C2=A0=C2=A0 target pipe.
>>>
>>> 4. dc_commit_updates_for_stream() MMIO programs the new pagefl= ip
>>> =C2=A0=C2=A0=C2=A0 into hw, but too late for current vblank. >>>
>>> =3D> pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED, but flip wo= n't complete
>>> =C2=A0=C2=A0=C2=A0 in current vblank due to missing the double= -buffering deadline
>>> =C2=A0=C2=A0=C2=A0 by a tiny bit.
>>>
>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq f= ires,
>>> =C2=A0=C2=A0=C2=A0 dm_dcn_crtc_high_irq() gets called.
>>>
>>> 6. Detects pflip_status =3D=3D AMDGPU_FLIP_SUBMITTED and assum= es the
>>> =C2=A0=C2=A0=C2=A0 pageflip has been completed/will complete i= n this vblank and
>>> =C2=A0=C2=A0=C2=A0 sends out pageflip completion event to user= space and resets
>>> =C2=A0=C2=A0=C2=A0 pflip_status =3D AMDGPU_FLIP_NONE.
>>>
>>> =3D> Flip completion event sent out one vblank too early. >>>
>>> This behaviour has been observed during my testing with measur= ement
>>> hardware a couple of time.
>>>
>>> The commit message says that the extra flip event code was add= ed to
>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip= events
>>> in case the pflip irq doesn't fire, because the "DCH = HUBP" component
>>> is clock gated and doesn't fire pflip irqs in that state. = Also that
>>> this clock gating may happen if no planes are active. Accordin= g to
>>> Nicholas, the clock gating can also happen if psr is active, a= nd the
>>> gating is controlled independently by the hardware, so difficu= lt to
>>> detect if and when the completion code in above commit is need= ed.
>>>
>>> This patch tries the following solution: It only executes the = extra
>>> pflip
>>> completion code in dm_dcn_crtc_high_irq() iff the hardware rep= orts
>>> that there aren't any surface updated pending in the doubl= e-buffered
>>> surface scanout address registers. Otherwise it leaves pflip c= ompletion
>>> to the pflip irq handler, for a more race-free experience.
>>>
>>> This would only guard against the order of events mentioned ab= ove.
>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this w= on't help
>>> at all, because 1-3 + 5 might happen even without the hw being=
>>> programmed
>>> at all, ie. no surface update pending because none yet program= med
>>> into hw.
>>>
>>> Therefore this patch also changes locking in amdgpu_dm_commit_= planes(),
>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() = are done
>>> under event_lock protection within the same critical section.<= br> >>>
>>> v2: Take Nicholas comments into account, try a different solut= ion.
>>>
>>> Lightly tested on Polaris (locking) and Raven (the whole DCN s= tuff).
>>> Seems to work without causing obvious new trouble.
>>
>> Nick, any comments on this?=C2=A0 Can we get this committed or do = you think
>> it needs additional rework?
>>
>> Thanks,
>>
>> Alex
>
> Hi Alex, Mario,
>
> This might be a little strange, but if we want to get this in as a fix=
> for regressions caused by the original vblank and user events at
> vstartup patch then I'm actually going to give my reviewed by on t= he
> *v1* of this patch (but not this v2):
>
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>
> You can feel free to apply that one.
>
> Reason 1: After having thought about it some more I don't think we=
> enable anything today that has hubp powered down at the same time we > expect to be waiting for a flip - eg. DMCU powering down HUBP during P= SR
> entry. Static screen interrupt should happen after that flip finishes = I
> think.
>
> The CRTC can still be powered on with zero planes, and I don't thi= nk any
> userspace explicitly asks for vblank events in this case but it doesn&= #39;t
> hurt to have the check.
>

Ok, so the original commit that ca= uses the races currently solves a non-existing - but potential future - pro= blem?
=C2=A0
I guess then my v1 patch makes sense f= or the moment and fixes the immediate problem for Linux 5.6-rc.

Maybe there's a way to ask the hw if hubp is clock-ga= ted and so if that code needs to run or not in the future?

As mentioned before, it would be helpful at least for me to get a = better overview about which hw events happen when in vblank, which irq'= s fire in response etc., how this relates to things like power management a= ctions, psr, vrr / drr, etc. A lot has changed in the hw during the last 10= years, and my guideline are still the public pdf files with the DCE regist= er specs for DCE-1'ish hw from sometime around the year 2007.
=

> Reason 2: This new patch will need much more thorough testing from sid= e
> to fully understand the consequences of locking the entire DC commit > sequence. For just a page flip that sounds fine, but for anything more=
> than (eg. full updates, modesets, etc) I don't think we want to be=
> disabling interrupts for potentially many milliseconds.

Ah! I was wondering where the attached splat comes from, but I think
this explains it: With this patch amdgpu_dm_commit_planes keeps the
pcrtc->dev->event_lock spinlock locked while calling
dc_commit_updates_for_stream, which ends up calling
smu_set_display_count, which tries to lock a mutex.


Yep, sorry about that. My most modern = machine has Raven Ridge / DCN1, and that function only gets called on Navi = /DCN2 afaics. All my testing is limited to Polaris / DCE11 and Raven DCN1, = my most modern hw atm.

thanks,
-= mario

--
Earthling Michel D=C3=A4nzer=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0|=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0https://redhat= .com
Libre software enthusiast=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0|= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Mesa and X developer
--000000000000e146a805a0c37e53-- --===============0933230698== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel --===============0933230698==--