dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
@ 2020-03-05 21:20 Mario Kleiner
  2020-03-12 14:32 ` Alex Deucher
  0 siblings, 1 reply; 11+ messages in thread
From: Mario Kleiner @ 2020-03-05 21:20 UTC (permalink / raw)
  To: amd-gfx, dri-devel
  Cc: sunpeng.li, alexander.deucher, hwentlan, nicholas.kazlauskas

Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
events at vsartup for DCN")' introduces a new way of pageflip
completion handling for DCN, and some trouble.

The current implementation introduces a race condition, which
can cause pageflip completion events to be sent out one vblank
too early, thereby confusing userspace and causing flicker:

prepare_flip_isr():

1. Pageflip programming takes the ddev->event_lock.
2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
3. Releases ddev->event_lock.

--> Deadline for surface address regs double-buffering passes on
    target pipe.

4. dc_commit_updates_for_stream() MMIO programs the new pageflip
   into hw, but too late for current vblank.

=> pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
   in current vblank due to missing the double-buffering deadline
   by a tiny bit.

5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
   dm_dcn_crtc_high_irq() gets called.

6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
   pageflip has been completed/will complete in this vblank and
   sends out pageflip completion event to userspace and resets
   pflip_status = AMDGPU_FLIP_NONE.

=> Flip completion event sent out one vblank too early.

This behaviour has been observed during my testing with measurement
hardware a couple of time.

The commit message says that the extra flip event code was added to
dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
in case the pflip irq doesn't fire, because the "DCH HUBP" component
is clock gated and doesn't fire pflip irqs in that state. Also that
this clock gating may happen if no planes are active. According to
Nicholas, the clock gating can also happen if psr is active, and the
gating is controlled independently by the hardware, so difficult to
detect if and when the completion code in above commit is needed.

This patch tries the following solution: It only executes the extra pflip
completion code in dm_dcn_crtc_high_irq() iff the hardware reports
that there aren't any surface updated pending in the double-buffered
surface scanout address registers. Otherwise it leaves pflip completion
to the pflip irq handler, for a more race-free experience.

This would only guard against the order of events mentioned above.
If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
at all, because 1-3 + 5 might happen even without the hw being programmed
at all, ie. no surface update pending because none yet programmed into hw.

Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
under event_lock protection within the same critical section.

v2: Take Nicholas comments into account, try a different solution.

Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
Seems to work without causing obvious new trouble.

Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 ++++++++++++++++---
 1 file changed, 67 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d7df1a85e72f..aa4e941b276f 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -287,6 +287,28 @@ static inline bool amdgpu_dm_vrr_active(struct dm_crtc_state *dm_state)
 	       dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED;
 }
 
+/**
+ * dm_crtc_is_flip_pending() - Is a pageflip pending on this crtc?
+ *
+ * Returns true if any plane on the crtc has a flip pending, false otherwise.
+ */
+static bool dm_crtc_is_flip_pending(struct dm_crtc_state *acrtc_state)
+{
+	struct dc_stream_status *status = dc_stream_get_status(acrtc_state->stream);
+	const struct dc_plane_status *plane_status;
+	int i;
+	bool pending = false;
+
+	for (i = 0; i < status->plane_count; i++) {
+		plane_status = dc_plane_get_status(status->plane_states[i]);
+		pending |= plane_status->is_flip_pending;
+		DRM_DEBUG_DRIVER("plane:%d, flip_pending=%d\n",
+				 i, plane_status->is_flip_pending);
+	}
+
+	return pending;
+}
+
 /**
  * dm_pflip_high_irq() - Handle pageflip interrupt
  * @interrupt_params: ignored
@@ -435,6 +457,11 @@ static void dm_vupdate_high_irq(void *interrupt_params)
 				spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
 			}
 		}
+
+		if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
+			DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
+					    acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
+		}
 	}
 }
 
@@ -489,6 +516,11 @@ static void dm_crtc_high_irq(void *interrupt_params)
 				&acrtc_state->vrr_params.adjust);
 			spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
 		}
+
+		if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
+			DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
+					 acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
+		}
 	}
 }
 
@@ -543,13 +575,29 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params)
 			&acrtc_state->vrr_params.adjust);
 	}
 
-	if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
+	/*
+	 * If there aren't any active_planes, or PSR is active, then DCH HUBP
+	 * may be clock-gated. In that state, pageflip completion interrupts
+	 * won't fire and pageflip completion events won't get delivered.
+	 *
+	 * Prevent this by sending pending pageflip events from here if a flip
+	 * has been submitted, but is no longer pending in hw, ie. has already
+	 * completed.
+	 *
+	 * If the flip is still pending in hw, then use dm_pflip_high_irq()
+	 * instead, handling completion as usual by pflip irq.
+	 */
+	if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED &&
+	    !dm_crtc_is_flip_pending(acrtc_state)) {
 		if (acrtc->event) {
 			drm_crtc_send_vblank_event(&acrtc->base, acrtc->event);
 			acrtc->event = NULL;
 			drm_crtc_vblank_put(&acrtc->base);
 		}
 		acrtc->pflip_status = AMDGPU_FLIP_NONE;
+
+		DRM_DEBUG_DRIVER("crtc:%d, pflip_stat:AMDGPU_FLIP_NONE\n",
+				 acrtc->crtc_id);
 	}
 
 	spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
@@ -6325,7 +6373,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 			to_dm_crtc_state(drm_atomic_get_old_crtc_state(state, pcrtc));
 	int planes_count = 0, vpos, hpos;
 	long r;
-	unsigned long flags;
+	unsigned long flags = 0;
 	struct amdgpu_bo *abo;
 	uint64_t tiling_flags;
 	uint32_t target_vblank, last_flip_vblank;
@@ -6513,17 +6561,6 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 			usleep_range(1000, 1100);
 		}
 
-		if (acrtc_attach->base.state->event) {
-			drm_crtc_vblank_get(pcrtc);
-
-			spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
-
-			WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
-			prepare_flip_isr(acrtc_attach);
-
-			spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
-		}
-
 		if (acrtc_state->stream) {
 			if (acrtc_state->freesync_vrr_info_changed)
 				bundle->stream_update.vrr_infopacket =
@@ -6575,6 +6612,15 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 				acrtc_state->stream->link->psr_allow_active)
 			amdgpu_dm_psr_disable(acrtc_state->stream);
 
+		if (pflip_present && acrtc_attach->base.state->event) {
+			drm_crtc_vblank_get(pcrtc);
+
+			spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
+
+			WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
+			prepare_flip_isr(acrtc_attach);
+		}
+
 		dc_commit_updates_for_stream(dm->dc,
 						     bundle->surface_updates,
 						     planes_count,
@@ -6582,6 +6628,14 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 						     &bundle->stream_update,
 						     dc_state);
 
+		/*
+		 * Must event_lock protect prepare_flip_isr() above and
+		 * dc_commit_updates_for_stream within same critical section,
+		 * or pageflip completion will suffer bad races on DCN.
+		 */
+		if (pflip_present && acrtc_attach->pflip_status == AMDGPU_FLIP_SUBMITTED)
+			spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
+
 		if ((acrtc_state->update_type > UPDATE_TYPE_FAST) &&
 						acrtc_state->stream->psr_version &&
 						!acrtc_state->stream->link->psr_feature_enabled)
-- 
2.20.1

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-03-05 21:20 [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2) Mario Kleiner
@ 2020-03-12 14:32 ` Alex Deucher
  2020-03-13 12:35   ` Kazlauskas, Nicholas
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Deucher @ 2020-03-12 14:32 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Leo (Sunpeng) Li, Maling list - DRI developers, amd-gfx list,
	Deucher, Alexander, Harry Wentland, Kazlauskas, Nicholas

On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner <mario.kleiner.de@gmail.com> wrote:
>
> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> events at vsartup for DCN")' introduces a new way of pageflip
> completion handling for DCN, and some trouble.
>
> The current implementation introduces a race condition, which
> can cause pageflip completion events to be sent out one vblank
> too early, thereby confusing userspace and causing flicker:
>
> prepare_flip_isr():
>
> 1. Pageflip programming takes the ddev->event_lock.
> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> 3. Releases ddev->event_lock.
>
> --> Deadline for surface address regs double-buffering passes on
>     target pipe.
>
> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>    into hw, but too late for current vblank.
>
> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>    in current vblank due to missing the double-buffering deadline
>    by a tiny bit.
>
> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>    dm_dcn_crtc_high_irq() gets called.
>
> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>    pageflip has been completed/will complete in this vblank and
>    sends out pageflip completion event to userspace and resets
>    pflip_status = AMDGPU_FLIP_NONE.
>
> => Flip completion event sent out one vblank too early.
>
> This behaviour has been observed during my testing with measurement
> hardware a couple of time.
>
> The commit message says that the extra flip event code was added to
> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> is clock gated and doesn't fire pflip irqs in that state. Also that
> this clock gating may happen if no planes are active. According to
> Nicholas, the clock gating can also happen if psr is active, and the
> gating is controlled independently by the hardware, so difficult to
> detect if and when the completion code in above commit is needed.
>
> This patch tries the following solution: It only executes the extra pflip
> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> that there aren't any surface updated pending in the double-buffered
> surface scanout address registers. Otherwise it leaves pflip completion
> to the pflip irq handler, for a more race-free experience.
>
> This would only guard against the order of events mentioned above.
> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> at all, because 1-3 + 5 might happen even without the hw being programmed
> at all, ie. no surface update pending because none yet programmed into hw.
>
> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> under event_lock protection within the same critical section.
>
> v2: Take Nicholas comments into account, try a different solution.
>
> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> Seems to work without causing obvious new trouble.

Nick, any comments on this?  Can we get this committed or do you think
it needs additional rework?

Thanks,

Alex

>
> Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")
> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> ---
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 ++++++++++++++++---
>  1 file changed, 67 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index d7df1a85e72f..aa4e941b276f 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -287,6 +287,28 @@ static inline bool amdgpu_dm_vrr_active(struct dm_crtc_state *dm_state)
>                dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED;
>  }
>
> +/**
> + * dm_crtc_is_flip_pending() - Is a pageflip pending on this crtc?
> + *
> + * Returns true if any plane on the crtc has a flip pending, false otherwise.
> + */
> +static bool dm_crtc_is_flip_pending(struct dm_crtc_state *acrtc_state)
> +{
> +       struct dc_stream_status *status = dc_stream_get_status(acrtc_state->stream);
> +       const struct dc_plane_status *plane_status;
> +       int i;
> +       bool pending = false;
> +
> +       for (i = 0; i < status->plane_count; i++) {
> +               plane_status = dc_plane_get_status(status->plane_states[i]);
> +               pending |= plane_status->is_flip_pending;
> +               DRM_DEBUG_DRIVER("plane:%d, flip_pending=%d\n",
> +                                i, plane_status->is_flip_pending);
> +       }
> +
> +       return pending;
> +}
> +
>  /**
>   * dm_pflip_high_irq() - Handle pageflip interrupt
>   * @interrupt_params: ignored
> @@ -435,6 +457,11 @@ static void dm_vupdate_high_irq(void *interrupt_params)
>                                 spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
>                         }
>                 }
> +
> +               if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
> +                       DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
> +                                           acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
> +               }
>         }
>  }
>
> @@ -489,6 +516,11 @@ static void dm_crtc_high_irq(void *interrupt_params)
>                                 &acrtc_state->vrr_params.adjust);
>                         spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
>                 }
> +
> +               if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
> +                       DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
> +                                        acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
> +               }
>         }
>  }
>
> @@ -543,13 +575,29 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params)
>                         &acrtc_state->vrr_params.adjust);
>         }
>
> -       if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
> +       /*
> +        * If there aren't any active_planes, or PSR is active, then DCH HUBP
> +        * may be clock-gated. In that state, pageflip completion interrupts
> +        * won't fire and pageflip completion events won't get delivered.
> +        *
> +        * Prevent this by sending pending pageflip events from here if a flip
> +        * has been submitted, but is no longer pending in hw, ie. has already
> +        * completed.
> +        *
> +        * If the flip is still pending in hw, then use dm_pflip_high_irq()
> +        * instead, handling completion as usual by pflip irq.
> +        */
> +       if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED &&
> +           !dm_crtc_is_flip_pending(acrtc_state)) {
>                 if (acrtc->event) {
>                         drm_crtc_send_vblank_event(&acrtc->base, acrtc->event);
>                         acrtc->event = NULL;
>                         drm_crtc_vblank_put(&acrtc->base);
>                 }
>                 acrtc->pflip_status = AMDGPU_FLIP_NONE;
> +
> +               DRM_DEBUG_DRIVER("crtc:%d, pflip_stat:AMDGPU_FLIP_NONE\n",
> +                                acrtc->crtc_id);
>         }
>
>         spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
> @@ -6325,7 +6373,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>                         to_dm_crtc_state(drm_atomic_get_old_crtc_state(state, pcrtc));
>         int planes_count = 0, vpos, hpos;
>         long r;
> -       unsigned long flags;
> +       unsigned long flags = 0;
>         struct amdgpu_bo *abo;
>         uint64_t tiling_flags;
>         uint32_t target_vblank, last_flip_vblank;
> @@ -6513,17 +6561,6 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>                         usleep_range(1000, 1100);
>                 }
>
> -               if (acrtc_attach->base.state->event) {
> -                       drm_crtc_vblank_get(pcrtc);
> -
> -                       spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> -
> -                       WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
> -                       prepare_flip_isr(acrtc_attach);
> -
> -                       spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
> -               }
> -
>                 if (acrtc_state->stream) {
>                         if (acrtc_state->freesync_vrr_info_changed)
>                                 bundle->stream_update.vrr_infopacket =
> @@ -6575,6 +6612,15 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>                                 acrtc_state->stream->link->psr_allow_active)
>                         amdgpu_dm_psr_disable(acrtc_state->stream);
>
> +               if (pflip_present && acrtc_attach->base.state->event) {
> +                       drm_crtc_vblank_get(pcrtc);
> +
> +                       spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
> +
> +                       WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
> +                       prepare_flip_isr(acrtc_attach);
> +               }
> +
>                 dc_commit_updates_for_stream(dm->dc,
>                                                      bundle->surface_updates,
>                                                      planes_count,
> @@ -6582,6 +6628,14 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>                                                      &bundle->stream_update,
>                                                      dc_state);
>
> +               /*
> +                * Must event_lock protect prepare_flip_isr() above and
> +                * dc_commit_updates_for_stream within same critical section,
> +                * or pageflip completion will suffer bad races on DCN.
> +                */
> +               if (pflip_present && acrtc_attach->pflip_status == AMDGPU_FLIP_SUBMITTED)
> +                       spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
> +
>                 if ((acrtc_state->update_type > UPDATE_TYPE_FAST) &&
>                                                 acrtc_state->stream->psr_version &&
>                                                 !acrtc_state->stream->link->psr_feature_enabled)
> --
> 2.20.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-03-12 14:32 ` Alex Deucher
@ 2020-03-13 12:35   ` Kazlauskas, Nicholas
  2020-03-13 14:42     ` Michel Dänzer
  0 siblings, 1 reply; 11+ messages in thread
From: Kazlauskas, Nicholas @ 2020-03-13 12:35 UTC (permalink / raw)
  To: Alex Deucher, Mario Kleiner
  Cc: Leo (Sunpeng) Li, Deucher, Alexander, Harry Wentland,
	Maling list - DRI developers, amd-gfx list

On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner <mario.kleiner.de@gmail.com> wrote:
>>
>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>> events at vsartup for DCN")' introduces a new way of pageflip
>> completion handling for DCN, and some trouble.
>>
>> The current implementation introduces a race condition, which
>> can cause pageflip completion events to be sent out one vblank
>> too early, thereby confusing userspace and causing flicker:
>>
>> prepare_flip_isr():
>>
>> 1. Pageflip programming takes the ddev->event_lock.
>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>> 3. Releases ddev->event_lock.
>>
>> --> Deadline for surface address regs double-buffering passes on
>>      target pipe.
>>
>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>     into hw, but too late for current vblank.
>>
>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>     in current vblank due to missing the double-buffering deadline
>>     by a tiny bit.
>>
>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>     dm_dcn_crtc_high_irq() gets called.
>>
>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>     pageflip has been completed/will complete in this vblank and
>>     sends out pageflip completion event to userspace and resets
>>     pflip_status = AMDGPU_FLIP_NONE.
>>
>> => Flip completion event sent out one vblank too early.
>>
>> This behaviour has been observed during my testing with measurement
>> hardware a couple of time.
>>
>> The commit message says that the extra flip event code was added to
>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>> is clock gated and doesn't fire pflip irqs in that state. Also that
>> this clock gating may happen if no planes are active. According to
>> Nicholas, the clock gating can also happen if psr is active, and the
>> gating is controlled independently by the hardware, so difficult to
>> detect if and when the completion code in above commit is needed.
>>
>> This patch tries the following solution: It only executes the extra pflip
>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>> that there aren't any surface updated pending in the double-buffered
>> surface scanout address registers. Otherwise it leaves pflip completion
>> to the pflip irq handler, for a more race-free experience.
>>
>> This would only guard against the order of events mentioned above.
>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>> at all, because 1-3 + 5 might happen even without the hw being programmed
>> at all, ie. no surface update pending because none yet programmed into hw.
>>
>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>> under event_lock protection within the same critical section.
>>
>> v2: Take Nicholas comments into account, try a different solution.
>>
>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>> Seems to work without causing obvious new trouble.
> 
> Nick, any comments on this?  Can we get this committed or do you think
> it needs additional rework?
> 
> Thanks,
> 
> Alex

Hi Alex, Mario,

This might be a little strange, but if we want to get this in as a fix 
for regressions caused by the original vblank and user events at 
vstartup patch then I'm actually going to give my reviewed by on the 
*v1* of this patch (but not this v2):

Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>

You can feel free to apply that one.

Reason 1: After having thought about it some more I don't think we 
enable anything today that has hubp powered down at the same time we 
expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR 
entry. Static screen interrupt should happen after that flip finishes I 
think.

The CRTC can still be powered on with zero planes, and I don't think any 
userspace explicitly asks for vblank events in this case but it doesn't 
hurt to have the check.

Reason 2: This new patch will need much more thorough testing from side 
to fully understand the consequences of locking the entire DC commit 
sequence. For just a page flip that sounds fine, but for anything more 
than (eg. full updates, modesets, etc) I don't think we want to be 
disabling interrupts for potentially many milliseconds.

Maybe we could just use a lock-free queue here for the events. It's 
single producer/single consumer per CRTC.

Regards,
Nicholas Kazlauskas

> 
>>
>> Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")
>> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
>> ---
>>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 ++++++++++++++++---
>>   1 file changed, 67 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> index d7df1a85e72f..aa4e941b276f 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>> @@ -287,6 +287,28 @@ static inline bool amdgpu_dm_vrr_active(struct dm_crtc_state *dm_state)
>>                 dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED;
>>   }
>>
>> +/**
>> + * dm_crtc_is_flip_pending() - Is a pageflip pending on this crtc?
>> + *
>> + * Returns true if any plane on the crtc has a flip pending, false otherwise.
>> + */
>> +static bool dm_crtc_is_flip_pending(struct dm_crtc_state *acrtc_state)
>> +{
>> +       struct dc_stream_status *status = dc_stream_get_status(acrtc_state->stream);
>> +       const struct dc_plane_status *plane_status;
>> +       int i;
>> +       bool pending = false;
>> +
>> +       for (i = 0; i < status->plane_count; i++) {
>> +               plane_status = dc_plane_get_status(status->plane_states[i]);
>> +               pending |= plane_status->is_flip_pending;
>> +               DRM_DEBUG_DRIVER("plane:%d, flip_pending=%d\n",
>> +                                i, plane_status->is_flip_pending);
>> +       }
>> +
>> +       return pending;
>> +}
>> +
>>   /**
>>    * dm_pflip_high_irq() - Handle pageflip interrupt
>>    * @interrupt_params: ignored
>> @@ -435,6 +457,11 @@ static void dm_vupdate_high_irq(void *interrupt_params)
>>                                  spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
>>                          }
>>                  }
>> +
>> +               if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
>> +                       DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
>> +                                           acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
>> +               }
>>          }
>>   }
>>
>> @@ -489,6 +516,11 @@ static void dm_crtc_high_irq(void *interrupt_params)
>>                                  &acrtc_state->vrr_params.adjust);
>>                          spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
>>                  }
>> +
>> +               if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
>> +                       DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__,
>> +                                        acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state));
>> +               }
>>          }
>>   }
>>
>> @@ -543,13 +575,29 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params)
>>                          &acrtc_state->vrr_params.adjust);
>>          }
>>
>> -       if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) {
>> +       /*
>> +        * If there aren't any active_planes, or PSR is active, then DCH HUBP
>> +        * may be clock-gated. In that state, pageflip completion interrupts
>> +        * won't fire and pageflip completion events won't get delivered.
>> +        *
>> +        * Prevent this by sending pending pageflip events from here if a flip
>> +        * has been submitted, but is no longer pending in hw, ie. has already
>> +        * completed.
>> +        *
>> +        * If the flip is still pending in hw, then use dm_pflip_high_irq()
>> +        * instead, handling completion as usual by pflip irq.
>> +        */
>> +       if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED &&
>> +           !dm_crtc_is_flip_pending(acrtc_state)) {
>>                  if (acrtc->event) {
>>                          drm_crtc_send_vblank_event(&acrtc->base, acrtc->event);
>>                          acrtc->event = NULL;
>>                          drm_crtc_vblank_put(&acrtc->base);
>>                  }
>>                  acrtc->pflip_status = AMDGPU_FLIP_NONE;
>> +
>> +               DRM_DEBUG_DRIVER("crtc:%d, pflip_stat:AMDGPU_FLIP_NONE\n",
>> +                                acrtc->crtc_id);
>>          }
>>
>>          spin_unlock_irqrestore(&adev->ddev->event_lock, flags);
>> @@ -6325,7 +6373,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>                          to_dm_crtc_state(drm_atomic_get_old_crtc_state(state, pcrtc));
>>          int planes_count = 0, vpos, hpos;
>>          long r;
>> -       unsigned long flags;
>> +       unsigned long flags = 0;
>>          struct amdgpu_bo *abo;
>>          uint64_t tiling_flags;
>>          uint32_t target_vblank, last_flip_vblank;
>> @@ -6513,17 +6561,6 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>                          usleep_range(1000, 1100);
>>                  }
>>
>> -               if (acrtc_attach->base.state->event) {
>> -                       drm_crtc_vblank_get(pcrtc);
>> -
>> -                       spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>> -
>> -                       WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
>> -                       prepare_flip_isr(acrtc_attach);
>> -
>> -                       spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
>> -               }
>> -
>>                  if (acrtc_state->stream) {
>>                          if (acrtc_state->freesync_vrr_info_changed)
>>                                  bundle->stream_update.vrr_infopacket =
>> @@ -6575,6 +6612,15 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>                                  acrtc_state->stream->link->psr_allow_active)
>>                          amdgpu_dm_psr_disable(acrtc_state->stream);
>>
>> +               if (pflip_present && acrtc_attach->base.state->event) {
>> +                       drm_crtc_vblank_get(pcrtc);
>> +
>> +                       spin_lock_irqsave(&pcrtc->dev->event_lock, flags);
>> +
>> +                       WARN_ON(acrtc_attach->pflip_status != AMDGPU_FLIP_NONE);
>> +                       prepare_flip_isr(acrtc_attach);
>> +               }
>> +
>>                  dc_commit_updates_for_stream(dm->dc,
>>                                                       bundle->surface_updates,
>>                                                       planes_count,
>> @@ -6582,6 +6628,14 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>                                                       &bundle->stream_update,
>>                                                       dc_state);
>>
>> +               /*
>> +                * Must event_lock protect prepare_flip_isr() above and
>> +                * dc_commit_updates_for_stream within same critical section,
>> +                * or pageflip completion will suffer bad races on DCN.
>> +                */
>> +               if (pflip_present && acrtc_attach->pflip_status == AMDGPU_FLIP_SUBMITTED)
>> +                       spin_unlock_irqrestore(&pcrtc->dev->event_lock, flags);
>> +
>>                  if ((acrtc_state->update_type > UPDATE_TYPE_FAST) &&
>>                                                  acrtc_state->stream->psr_version &&
>>                                                  !acrtc_state->stream->link->psr_feature_enabled)
>> --
>> 2.20.1
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-03-13 12:35   ` Kazlauskas, Nicholas
@ 2020-03-13 14:42     ` Michel Dänzer
  2020-03-13 21:52       ` Mario Kleiner
  2020-04-14 23:32       ` Matt Coffin
  0 siblings, 2 replies; 11+ messages in thread
From: Michel Dänzer @ 2020-03-13 14:42 UTC (permalink / raw)
  To: Kazlauskas, Nicholas, Alex Deucher, Mario Kleiner
  Cc: Leo (Sunpeng) Li, Deucher, Alexander, Harry Wentland,
	amd-gfx list, Maling list - DRI developers

[-- Attachment #1: Type: text/plain, Size: 5195 bytes --]

On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>> <mario.kleiner.de@gmail.com> wrote:
>>>
>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>> events at vsartup for DCN")' introduces a new way of pageflip
>>> completion handling for DCN, and some trouble.
>>>
>>> The current implementation introduces a race condition, which
>>> can cause pageflip completion events to be sent out one vblank
>>> too early, thereby confusing userspace and causing flicker:
>>>
>>> prepare_flip_isr():
>>>
>>> 1. Pageflip programming takes the ddev->event_lock.
>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>> 3. Releases ddev->event_lock.
>>>
>>> --> Deadline for surface address regs double-buffering passes on
>>>      target pipe.
>>>
>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>     into hw, but too late for current vblank.
>>>
>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>     in current vblank due to missing the double-buffering deadline
>>>     by a tiny bit.
>>>
>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>     dm_dcn_crtc_high_irq() gets called.
>>>
>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>     pageflip has been completed/will complete in this vblank and
>>>     sends out pageflip completion event to userspace and resets
>>>     pflip_status = AMDGPU_FLIP_NONE.
>>>
>>> => Flip completion event sent out one vblank too early.
>>>
>>> This behaviour has been observed during my testing with measurement
>>> hardware a couple of time.
>>>
>>> The commit message says that the extra flip event code was added to
>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>> this clock gating may happen if no planes are active. According to
>>> Nicholas, the clock gating can also happen if psr is active, and the
>>> gating is controlled independently by the hardware, so difficult to
>>> detect if and when the completion code in above commit is needed.
>>>
>>> This patch tries the following solution: It only executes the extra
>>> pflip
>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>> that there aren't any surface updated pending in the double-buffered
>>> surface scanout address registers. Otherwise it leaves pflip completion
>>> to the pflip irq handler, for a more race-free experience.
>>>
>>> This would only guard against the order of events mentioned above.
>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>> at all, because 1-3 + 5 might happen even without the hw being
>>> programmed
>>> at all, ie. no surface update pending because none yet programmed
>>> into hw.
>>>
>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>> under event_lock protection within the same critical section.
>>>
>>> v2: Take Nicholas comments into account, try a different solution.
>>>
>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>> Seems to work without causing obvious new trouble.
>>
>> Nick, any comments on this?  Can we get this committed or do you think
>> it needs additional rework?
>>
>> Thanks,
>>
>> Alex
> 
> Hi Alex, Mario,
> 
> This might be a little strange, but if we want to get this in as a fix
> for regressions caused by the original vblank and user events at
> vstartup patch then I'm actually going to give my reviewed by on the
> *v1* of this patch (but not this v2):
> 
> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
> 
> You can feel free to apply that one.
> 
> Reason 1: After having thought about it some more I don't think we
> enable anything today that has hubp powered down at the same time we
> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> entry. Static screen interrupt should happen after that flip finishes I
> think.
> 
> The CRTC can still be powered on with zero planes, and I don't think any
> userspace explicitly asks for vblank events in this case but it doesn't
> hurt to have the check.
> 
> Reason 2: This new patch will need much more thorough testing from side
> to fully understand the consequences of locking the entire DC commit
> sequence. For just a page flip that sounds fine, but for anything more
> than (eg. full updates, modesets, etc) I don't think we want to be
> disabling interrupts for potentially many milliseconds.

Ah! I was wondering where the attached splat comes from, but I think
this explains it: With this patch amdgpu_dm_commit_planes keeps the
pcrtc->dev->event_lock spinlock locked while calling
dc_commit_updates_for_stream, which ends up calling
smu_set_display_count, which tries to lock a mutex.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

[-- Attachment #2: kern.txt --]
[-- Type: text/plain, Size: 12277 bytes --]

Mar 13 12:44:49 kaveri kernel: [16728.466999][T347601] BUG: scheduling while atomic: Xorg/347601/0x00000002
Mar 13 12:44:49 kaveri kernel: [16728.467004][T347601] Modules linked in: fuse(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) nft_counter(E) xt_addrtype(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nf_tables(E) nfnetlink(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) rfkill(E) cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_conservative(E) edac_mce_amd(E) kvm_amd(E) binfmt_misc(E) snd_hda_codec_realtek(E) kvm(E) snd_hda_codec_generic(E) irqbypass(E) ppdev(E) amdgpu(E) wmi_bmof(E) ledtrig_audio(E) nls_ascii(E) nls_cp437(E) snd_hda_codec_hdmi(E) vfat(E) gpu_sched(E) fat(E) crct10dif_pclmul(E) snd_hda_intel(E) ttm(E) snd_intel_dspcfg(E) crc32_pclmul(E) snd_hda_codec(E) drm_kms_helper(E) snd_hda_core(E) cec(E) snd_hwdep(E) ghash_clmulni_intel(E) snd_pcm(E) rc_core(E) r8169(E) snd_timer(E) aesni_intel(E) drm(E) efi_pstore(E) sp5100_tco(E) ccp(E) crypto_simd(E) realtek(E) snd(E) i2c_algo_bit(E) watchdog(E)
Mar 13 12:44:49 kaveri kernel: [16728.467043][T347601]  cryptd(E) mfd_core(E) glue_helper(E) pcspkr(E) efivars(E) sg(E) parport_pc(E) libphy(E) k10temp(E) soundcore(E) rng_core(E) i2c_piix4(E) parport(E) wmi(E) button(E) acpi_cpufreq(E) tcp_bbr(E) sch_fq(E) nct6775(E) hwmon_vid(E) sunrpc(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) raid10(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) evdev(E) hid_generic(E) usbhid(E) hid(E) xhci_pci(E) ahci(E) libahci(E) xhci_hcd(E) crc32c_intel(E) libata(E) usbcore(E) scsi_mod(E) gpio_amdpt(E) gpio_generic(E)
Mar 13 12:44:49 kaveri kernel: [16728.467077][T347601] Preemption disabled at:
Mar 13 12:44:49 kaveri kernel: [16728.467080][T347601] [<0000000000000000>] 0x0
Mar 13 12:44:49 kaveri kernel: [16728.467084][T347601] CPU: 7 PID: 347601 Comm: Xorg Tainted: G            E     5.5.8+ #28
Mar 13 12:44:49 kaveri kernel: [16728.467086][T347601] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.OR 11/29/2019
Mar 13 12:44:49 kaveri kernel: [16728.467088][T347601] Call Trace:
Mar 13 12:44:49 kaveri kernel: [16728.467096][T347601]  dump_stack+0x66/0x90
Mar 13 12:44:49 kaveri kernel: [16728.467101][T347601]  __schedule_bug.cold+0x8e/0x9b
Mar 13 12:44:49 kaveri kernel: [16728.467106][T347601]  __schedule+0x63c/0x790
Mar 13 12:44:49 kaveri kernel: [16728.467110][T347601]  schedule+0x46/0xf0
Mar 13 12:44:49 kaveri kernel: [16728.467112][T347601]  schedule_preempt_disabled+0x14/0x20
Mar 13 12:44:49 kaveri kernel: [16728.467115][T347601]  __mutex_lock.isra.0+0x178/0x520
Mar 13 12:44:49 kaveri kernel: [16728.467207][T347601]  smu_set_display_count+0x21/0x60 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467291][T347601]  pp_nv_set_display_count+0x23/0x40 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467375][T347601]  dcn2_update_clocks+0x597/0x650 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467460][T347601]  dcn20_prepare_bandwidth+0x32/0x70 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467541][T347601]  dc_commit_updates_for_stream+0xee4/0x15d0 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467623][T347601]  amdgpu_dm_atomic_commit_tail+0x1210/0x2130 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467632][T347601]  ? kfree+0x196/0x2b0
Mar 13 12:44:49 kaveri kernel: [16728.467649][T347601]  commit_tail+0x94/0x130 [drm_kms_helper]
Mar 13 12:44:49 kaveri kernel: [16728.467658][T347601]  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Mar 13 12:44:49 kaveri kernel: [16728.467682][T347601]  drm_mode_obj_set_property_ioctl+0x115/0x2d0 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.467699][T347601]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.467711][T347601]  drm_ioctl_kernel+0xaa/0xf0 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.467724][T347601]  drm_ioctl+0x208/0x390 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.467737][T347601]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.467742][T347601]  ? preempt_count_sub+0x9b/0xd0
Mar 13 12:44:49 kaveri kernel: [16728.467793][T347601]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.467799][T347601]  do_vfs_ioctl+0x45f/0x6d0
Mar 13 12:44:49 kaveri kernel: [16728.467803][T347601]  ksys_ioctl+0x5e/0x90
Mar 13 12:44:49 kaveri kernel: [16728.467806][T347601]  __x64_sys_ioctl+0x16/0x20
Mar 13 12:44:49 kaveri kernel: [16728.467811][T347601]  do_syscall_64+0x5f/0x200
Mar 13 12:44:49 kaveri kernel: [16728.467816][T347601]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 13 12:44:49 kaveri kernel: [16728.467819][T347601] RIP: 0033:0x7f023d792497
Mar 13 12:44:49 kaveri kernel: [16728.467822][T347601] Code: 00 00 90 48 8b 05 f9 79 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 79 0c 00 f7 d8 64 89 01 48
Mar 13 12:44:49 kaveri kernel: [16728.467824][T347601] RSP: 002b:00007ffd773e0348 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Mar 13 12:44:49 kaveri kernel: [16728.467827][T347601] RAX: ffffffffffffffda RBX: 00007ffd773e0380 RCX: 00007f023d792497
Mar 13 12:44:49 kaveri kernel: [16728.467829][T347601] RDX: 00007ffd773e0380 RSI: 00000000c01864ba RDI: 000000000000000d
Mar 13 12:44:49 kaveri kernel: [16728.467830][T347601] RBP: 00000000c01864ba R08: 000000000000005c R09: 00000000cccccccc
Mar 13 12:44:49 kaveri kernel: [16728.467832][T347601] R10: 000055fbcc8a6bd4 R11: 0000000000000246 R12: 000055fbcb7199b0
Mar 13 12:44:49 kaveri kernel: [16728.467834][T347601] R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000fff
Mar 13 12:44:49 kaveri kernel: [16728.477723][T347601] ------------[ cut here ]------------
Mar 13 12:44:49 kaveri kernel: [16728.477728][T347601] DEBUG_LOCKS_WARN_ON(val > preempt_count())
Mar 13 12:44:49 kaveri kernel: [16728.477739][T347601] WARNING: CPU: 7 PID: 347601 at kernel/sched/core.c:3816 preempt_count_sub+0x70/0xd0
Mar 13 12:44:49 kaveri kernel: [16728.477743][T347601] Modules linked in: fuse(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) nft_counter(E) xt_addrtype(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nf_tables(E) nfnetlink(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) rfkill(E) cpufreq_powersave(E) cpufreq_userspace(E) cpufreq_conservative(E) edac_mce_amd(E) kvm_amd(E) binfmt_misc(E) snd_hda_codec_realtek(E) kvm(E) snd_hda_codec_generic(E) irqbypass(E) ppdev(E) amdgpu(E) wmi_bmof(E) ledtrig_audio(E) nls_ascii(E) nls_cp437(E) snd_hda_codec_hdmi(E) vfat(E) gpu_sched(E) fat(E) crct10dif_pclmul(E) snd_hda_intel(E) ttm(E) snd_intel_dspcfg(E) crc32_pclmul(E) snd_hda_codec(E) drm_kms_helper(E) snd_hda_core(E) cec(E) snd_hwdep(E) ghash_clmulni_intel(E) snd_pcm(E) rc_core(E) r8169(E) snd_timer(E) aesni_intel(E) drm(E) efi_pstore(E) sp5100_tco(E) ccp(E) crypto_simd(E) realtek(E) snd(E) i2c_algo_bit(E) watchdog(E)
Mar 13 12:44:49 kaveri kernel: [16728.477780][T347601]  cryptd(E) mfd_core(E) glue_helper(E) pcspkr(E) efivars(E) sg(E) parport_pc(E) libphy(E) k10temp(E) soundcore(E) rng_core(E) i2c_piix4(E) parport(E) wmi(E) button(E) acpi_cpufreq(E) tcp_bbr(E) sch_fq(E) nct6775(E) hwmon_vid(E) sunrpc(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) raid10(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) sd_mod(E) evdev(E) hid_generic(E) usbhid(E) hid(E) xhci_pci(E) ahci(E) libahci(E) xhci_hcd(E) crc32c_intel(E) libata(E) usbcore(E) scsi_mod(E) gpio_amdpt(E) gpio_generic(E)
Mar 13 12:44:49 kaveri kernel: [16728.477809][T347601] CPU: 7 PID: 347601 Comm: Xorg Tainted: G        W   E     5.5.8+ #28
Mar 13 12:44:49 kaveri kernel: [16728.477811][T347601] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK (MS-7A34), BIOS 1.OR 11/29/2019
Mar 13 12:44:49 kaveri kernel: [16728.477814][T347601] RIP: 0010:preempt_count_sub+0x70/0xd0
Mar 13 12:44:49 kaveri kernel: [16728.477817][T347601] Code: 6c 5b 5d c3 e8 11 c4 37 00 85 c0 74 f4 8b 15 f7 d9 27 01 85 d2 75 ea 48 c7 c6 a5 f9 cb 94 48 c7 c7 e5 b7 ca 94 e8 22 fc fc ff <0f> 0b eb d3 48 8b 6c 24 10 48 89 ef e8 df e4 02 00 85 c0 74 09 31
Mar 13 12:44:49 kaveri kernel: [16728.477819][T347601] RSP: 0018:ffffa393a61d7928 EFLAGS: 00010282
Mar 13 12:44:49 kaveri kernel: [16728.477821][T347601] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Mar 13 12:44:49 kaveri kernel: [16728.477823][T347601] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000246
Mar 13 12:44:49 kaveri kernel: [16728.477824][T347601] RBP: ffffa393a61d7c38 R08: 0000000000000001 R09: 000000000000002a
Mar 13 12:44:49 kaveri kernel: [16728.477826][T347601] R10: ffffffff95465690 R11: 00000000954652d3 R12: ffff8ab13baf2918
Mar 13 12:44:49 kaveri kernel: [16728.477828][T347601] R13: ffff8aae13ea6400 R14: ffff8ab13bae0000 R15: ffff8ab13e14c800
Mar 13 12:44:49 kaveri kernel: [16728.477830][T347601] FS:  00007f023d1cb940(0000) GS:ffff8ab14e9c0000(0000) knlGS:0000000000000000
Mar 13 12:44:49 kaveri kernel: [16728.477832][T347601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 12:44:49 kaveri kernel: [16728.477833][T347601] CR2: 00007ffa2900b54c CR3: 00000003a4340000 CR4: 00000000003406e0
Mar 13 12:44:49 kaveri kernel: [16728.477835][T347601] Call Trace:
Mar 13 12:44:49 kaveri kernel: [16728.477842][T347601]  _raw_spin_unlock_irqrestore+0x20/0x40
Mar 13 12:44:49 kaveri kernel: [16728.477935][T347601]  amdgpu_dm_atomic_commit_tail+0x1ee3/0x2130 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.477944][T347601]  ? kfree+0x196/0x2b0
Mar 13 12:44:49 kaveri kernel: [16728.477961][T347601]  commit_tail+0x94/0x130 [drm_kms_helper]
Mar 13 12:44:49 kaveri kernel: [16728.477970][T347601]  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Mar 13 12:44:49 kaveri kernel: [16728.477990][T347601]  drm_mode_obj_set_property_ioctl+0x115/0x2d0 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.478008][T347601]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.478020][T347601]  drm_ioctl_kernel+0xaa/0xf0 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.478033][T347601]  drm_ioctl+0x208/0x390 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.478047][T347601]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
Mar 13 12:44:49 kaveri kernel: [16728.478051][T347601]  ? preempt_count_sub+0x9b/0xd0
Mar 13 12:44:49 kaveri kernel: [16728.478102][T347601]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Mar 13 12:44:49 kaveri kernel: [16728.478108][T347601]  do_vfs_ioctl+0x45f/0x6d0
Mar 13 12:44:49 kaveri kernel: [16728.478112][T347601]  ksys_ioctl+0x5e/0x90
Mar 13 12:44:49 kaveri kernel: [16728.478115][T347601]  __x64_sys_ioctl+0x16/0x20
Mar 13 12:44:49 kaveri kernel: [16728.478119][T347601]  do_syscall_64+0x5f/0x200
Mar 13 12:44:49 kaveri kernel: [16728.478122][T347601]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 13 12:44:49 kaveri kernel: [16728.478125][T347601] RIP: 0033:0x7f023d792497
Mar 13 12:44:49 kaveri kernel: [16728.478128][T347601] Code: 00 00 90 48 8b 05 f9 79 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 79 0c 00 f7 d8 64 89 01 48
Mar 13 12:44:49 kaveri kernel: [16728.478130][T347601] RSP: 002b:00007ffd773e0348 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Mar 13 12:44:49 kaveri kernel: [16728.478132][T347601] RAX: ffffffffffffffda RBX: 00007ffd773e0380 RCX: 00007f023d792497
Mar 13 12:44:49 kaveri kernel: [16728.478134][T347601] RDX: 00007ffd773e0380 RSI: 00000000c01864ba RDI: 000000000000000d
Mar 13 12:44:49 kaveri kernel: [16728.478136][T347601] RBP: 00000000c01864ba R08: 000000000000005c R09: 00000000cccccccc
Mar 13 12:44:49 kaveri kernel: [16728.478137][T347601] R10: 000055fbcc8a6bd4 R11: 0000000000000246 R12: 000055fbcb7199b0
Mar 13 12:44:49 kaveri kernel: [16728.478139][T347601] R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000fff
Mar 13 12:44:49 kaveri kernel: [16728.478143][T347601] ---[ end trace e1909b5aa94da59e ]---

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-03-13 14:42     ` Michel Dänzer
@ 2020-03-13 21:52       ` Mario Kleiner
  2020-04-14 23:32       ` Matt Coffin
  1 sibling, 0 replies; 11+ messages in thread
From: Mario Kleiner @ 2020-03-13 21:52 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Leo (Sunpeng) Li, Maling list - DRI developers, Deucher,
	Alexander, amd-gfx list, Harry Wentland, Kazlauskas, Nicholas


[-- Attachment #1.1: Type: text/plain, Size: 6603 bytes --]

On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer <michel@daenzer.net> wrote:

> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> > On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
> >> <mario.kleiner.de@gmail.com> wrote:
> >>>
> >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> >>> events at vsartup for DCN")' introduces a new way of pageflip
> >>> completion handling for DCN, and some trouble.
> >>>
> >>> The current implementation introduces a race condition, which
> >>> can cause pageflip completion events to be sent out one vblank
> >>> too early, thereby confusing userspace and causing flicker:
> >>>
> >>> prepare_flip_isr():
> >>>
> >>> 1. Pageflip programming takes the ddev->event_lock.
> >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> >>> 3. Releases ddev->event_lock.
> >>>
> >>> --> Deadline for surface address regs double-buffering passes on
> >>>      target pipe.
> >>>
> >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
> >>>     into hw, but too late for current vblank.
> >>>
> >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
> >>>     in current vblank due to missing the double-buffering deadline
> >>>     by a tiny bit.
> >>>
> >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
> >>>     dm_dcn_crtc_high_irq() gets called.
> >>>
> >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
> >>>     pageflip has been completed/will complete in this vblank and
> >>>     sends out pageflip completion event to userspace and resets
> >>>     pflip_status = AMDGPU_FLIP_NONE.
> >>>
> >>> => Flip completion event sent out one vblank too early.
> >>>
> >>> This behaviour has been observed during my testing with measurement
> >>> hardware a couple of time.
> >>>
> >>> The commit message says that the extra flip event code was added to
> >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> >>> is clock gated and doesn't fire pflip irqs in that state. Also that
> >>> this clock gating may happen if no planes are active. According to
> >>> Nicholas, the clock gating can also happen if psr is active, and the
> >>> gating is controlled independently by the hardware, so difficult to
> >>> detect if and when the completion code in above commit is needed.
> >>>
> >>> This patch tries the following solution: It only executes the extra
> >>> pflip
> >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> >>> that there aren't any surface updated pending in the double-buffered
> >>> surface scanout address registers. Otherwise it leaves pflip completion
> >>> to the pflip irq handler, for a more race-free experience.
> >>>
> >>> This would only guard against the order of events mentioned above.
> >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> >>> at all, because 1-3 + 5 might happen even without the hw being
> >>> programmed
> >>> at all, ie. no surface update pending because none yet programmed
> >>> into hw.
> >>>
> >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> >>> under event_lock protection within the same critical section.
> >>>
> >>> v2: Take Nicholas comments into account, try a different solution.
> >>>
> >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> >>> Seems to work without causing obvious new trouble.
> >>
> >> Nick, any comments on this?  Can we get this committed or do you think
> >> it needs additional rework?
> >>
> >> Thanks,
> >>
> >> Alex
> >
> > Hi Alex, Mario,
> >
> > This might be a little strange, but if we want to get this in as a fix
> > for regressions caused by the original vblank and user events at
> > vstartup patch then I'm actually going to give my reviewed by on the
> > *v1* of this patch (but not this v2):
> >
> > Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
> >
> > You can feel free to apply that one.
> >
> > Reason 1: After having thought about it some more I don't think we
> > enable anything today that has hubp powered down at the same time we
> > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> > entry. Static screen interrupt should happen after that flip finishes I
> > think.
> >
> > The CRTC can still be powered on with zero planes, and I don't think any
> > userspace explicitly asks for vblank events in this case but it doesn't
> > hurt to have the check.
> >
>

Ok, so the original commit that causes the races currently solves a
non-existing - but potential future - problem?

I guess then my v1 patch makes sense for the moment and fixes the immediate
problem for Linux 5.6-rc.

Maybe there's a way to ask the hw if hubp is clock-gated and so if that
code needs to run or not in the future?

As mentioned before, it would be helpful at least for me to get a better
overview about which hw events happen when in vblank, which irq's fire in
response etc., how this relates to things like power management actions,
psr, vrr / drr, etc. A lot has changed in the hw during the last 10 years,
and my guideline are still the public pdf files with the DCE register specs
for DCE-1'ish hw from sometime around the year 2007.

> Reason 2: This new patch will need much more thorough testing from side
> > to fully understand the consequences of locking the entire DC commit
> > sequence. For just a page flip that sounds fine, but for anything more
> > than (eg. full updates, modesets, etc) I don't think we want to be
> > disabling interrupts for potentially many milliseconds.
>
> Ah! I was wondering where the attached splat comes from, but I think
> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> pcrtc->dev->event_lock spinlock locked while calling
> dc_commit_updates_for_stream, which ends up calling
> smu_set_display_count, which tries to lock a mutex.
>
>
Yep, sorry about that. My most modern machine has Raven Ridge / DCN1, and
that function only gets called on Navi /DCN2 afaics. All my testing is
limited to Polaris / DCE11 and Raven DCN1, my most modern hw atm.

thanks,
-mario

-- 
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer
>

[-- Attachment #1.2: Type: text/html, Size: 8714 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-03-13 14:42     ` Michel Dänzer
  2020-03-13 21:52       ` Mario Kleiner
@ 2020-04-14 23:32       ` Matt Coffin
  2020-05-04 17:35         ` Matt Coffin
  1 sibling, 1 reply; 11+ messages in thread
From: Matt Coffin @ 2020-04-14 23:32 UTC (permalink / raw)
  To: Michel Dänzer, Kazlauskas, Nicholas, Alex Deucher, Mario Kleiner
  Cc: Leo (Sunpeng) Li, Deucher, Alexander, Harry Wentland,
	Maling list - DRI developers, amd-gfx list

Hey everyone,

This patch broke variable refresh rate in games (all that I've tried so
far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
well as a simple freesync tester application.

FreeSync tester I've been using: https://github.com/Nixola/VRRTest

I'm not at all familiar with the page flipping code, so it would take me
a long time to find the *right* way to fix it, but does someone else see
why it would do that?

The symptom is that the refresh rate of the display constantly bounces
between the two ends of the FreeSync range (for me 40 -> 144), and the
game stutters like a madman.

Any help on where to start, ideas on how to fix it (other than just
revert this commit, which I've done in the interim), or alternative
patches would be appreciated.

Thanks in advance for the work/help,
Matt

On 3/13/20 8:42 AM, Michel Dänzer wrote:
> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>> <mario.kleiner.de@gmail.com> wrote:
>>>>
>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>> completion handling for DCN, and some trouble.
>>>>
>>>> The current implementation introduces a race condition, which
>>>> can cause pageflip completion events to be sent out one vblank
>>>> too early, thereby confusing userspace and causing flicker:
>>>>
>>>> prepare_flip_isr():
>>>>
>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>> 3. Releases ddev->event_lock.
>>>>
>>>> --> Deadline for surface address regs double-buffering passes on
>>>>      target pipe.
>>>>
>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>     into hw, but too late for current vblank.
>>>>
>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>     in current vblank due to missing the double-buffering deadline
>>>>     by a tiny bit.
>>>>
>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>     dm_dcn_crtc_high_irq() gets called.
>>>>
>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>     pageflip has been completed/will complete in this vblank and
>>>>     sends out pageflip completion event to userspace and resets
>>>>     pflip_status = AMDGPU_FLIP_NONE.
>>>>
>>>> => Flip completion event sent out one vblank too early.
>>>>
>>>> This behaviour has been observed during my testing with measurement
>>>> hardware a couple of time.
>>>>
>>>> The commit message says that the extra flip event code was added to
>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>> this clock gating may happen if no planes are active. According to
>>>> Nicholas, the clock gating can also happen if psr is active, and the
>>>> gating is controlled independently by the hardware, so difficult to
>>>> detect if and when the completion code in above commit is needed.
>>>>
>>>> This patch tries the following solution: It only executes the extra
>>>> pflip
>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>> that there aren't any surface updated pending in the double-buffered
>>>> surface scanout address registers. Otherwise it leaves pflip completion
>>>> to the pflip irq handler, for a more race-free experience.
>>>>
>>>> This would only guard against the order of events mentioned above.
>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>> programmed
>>>> at all, ie. no surface update pending because none yet programmed
>>>> into hw.
>>>>
>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>>> under event_lock protection within the same critical section.
>>>>
>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>
>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>>> Seems to work without causing obvious new trouble.
>>>
>>> Nick, any comments on this?  Can we get this committed or do you think
>>> it needs additional rework?
>>>
>>> Thanks,
>>>
>>> Alex
>>
>> Hi Alex, Mario,
>>
>> This might be a little strange, but if we want to get this in as a fix
>> for regressions caused by the original vblank and user events at
>> vstartup patch then I'm actually going to give my reviewed by on the
>> *v1* of this patch (but not this v2):
>>
>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>>
>> You can feel free to apply that one.
>>
>> Reason 1: After having thought about it some more I don't think we
>> enable anything today that has hubp powered down at the same time we
>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
>> entry. Static screen interrupt should happen after that flip finishes I
>> think.
>>
>> The CRTC can still be powered on with zero planes, and I don't think any
>> userspace explicitly asks for vblank events in this case but it doesn't
>> hurt to have the check.
>>
>> Reason 2: This new patch will need much more thorough testing from side
>> to fully understand the consequences of locking the entire DC commit
>> sequence. For just a page flip that sounds fine, but for anything more
>> than (eg. full updates, modesets, etc) I don't think we want to be
>> disabling interrupts for potentially many milliseconds.
> 
> Ah! I was wondering where the attached splat comes from, but I think
> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> pcrtc->dev->event_lock spinlock locked while calling
> dc_commit_updates_for_stream, which ends up calling
> smu_set_display_count, which tries to lock a mutex.
> 
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-04-14 23:32       ` Matt Coffin
@ 2020-05-04 17:35         ` Matt Coffin
  2020-05-05 17:03           ` Alex Deucher
  0 siblings, 1 reply; 11+ messages in thread
From: Matt Coffin @ 2020-05-04 17:35 UTC (permalink / raw)
  To: Michel Dänzer, Kazlauskas, Nicholas, Alex Deucher, Mario Kleiner
  Cc: Leo (Sunpeng) Li, Deucher, Alexander, Harry Wentland,
	Maling list - DRI developers, amd-gfx list

Hey guys,

This is still an issue for me, and I'm still having to run a patch to
revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
setups in 5.7, is there any news on this? Has anyone else at the very
least been able to reproduce the problem?

It happens for me in every single program that mesa allows to utilize
variable refresh rates, and reverting it "fixes" the issue.

Cheers, and sorry for the extra email, just making sure this is still on
someone's radar,
Matt

On 4/14/20 5:32 PM, Matt Coffin wrote:
> Hey everyone,
> 
> This patch broke variable refresh rate in games (all that I've tried so
> far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
> well as a simple freesync tester application.
> 
> FreeSync tester I've been using: https://github.com/Nixola/VRRTest
> 
> I'm not at all familiar with the page flipping code, so it would take me
> a long time to find the *right* way to fix it, but does someone else see
> why it would do that?
> 
> The symptom is that the refresh rate of the display constantly bounces
> between the two ends of the FreeSync range (for me 40 -> 144), and the
> game stutters like a madman.
> 
> Any help on where to start, ideas on how to fix it (other than just
> revert this commit, which I've done in the interim), or alternative
> patches would be appreciated.
> 
> Thanks in advance for the work/help,
> Matt
> 
> On 3/13/20 8:42 AM, Michel Dänzer wrote:
>> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>>> <mario.kleiner.de@gmail.com> wrote:
>>>>>
>>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>>> completion handling for DCN, and some trouble.
>>>>>
>>>>> The current implementation introduces a race condition, which
>>>>> can cause pageflip completion events to be sent out one vblank
>>>>> too early, thereby confusing userspace and causing flicker:
>>>>>
>>>>> prepare_flip_isr():
>>>>>
>>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>>> 3. Releases ddev->event_lock.
>>>>>
>>>>> --> Deadline for surface address regs double-buffering passes on
>>>>>      target pipe.
>>>>>
>>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>>     into hw, but too late for current vblank.
>>>>>
>>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>>     in current vblank due to missing the double-buffering deadline
>>>>>     by a tiny bit.
>>>>>
>>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>>     dm_dcn_crtc_high_irq() gets called.
>>>>>
>>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>>     pageflip has been completed/will complete in this vblank and
>>>>>     sends out pageflip completion event to userspace and resets
>>>>>     pflip_status = AMDGPU_FLIP_NONE.
>>>>>
>>>>> => Flip completion event sent out one vblank too early.
>>>>>
>>>>> This behaviour has been observed during my testing with measurement
>>>>> hardware a couple of time.
>>>>>
>>>>> The commit message says that the extra flip event code was added to
>>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>>> this clock gating may happen if no planes are active. According to
>>>>> Nicholas, the clock gating can also happen if psr is active, and the
>>>>> gating is controlled independently by the hardware, so difficult to
>>>>> detect if and when the completion code in above commit is needed.
>>>>>
>>>>> This patch tries the following solution: It only executes the extra
>>>>> pflip
>>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>>> that there aren't any surface updated pending in the double-buffered
>>>>> surface scanout address registers. Otherwise it leaves pflip completion
>>>>> to the pflip irq handler, for a more race-free experience.
>>>>>
>>>>> This would only guard against the order of events mentioned above.
>>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>>> programmed
>>>>> at all, ie. no surface update pending because none yet programmed
>>>>> into hw.
>>>>>
>>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>>>> under event_lock protection within the same critical section.
>>>>>
>>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>>
>>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>>>> Seems to work without causing obvious new trouble.
>>>>
>>>> Nick, any comments on this?  Can we get this committed or do you think
>>>> it needs additional rework?
>>>>
>>>> Thanks,
>>>>
>>>> Alex
>>>
>>> Hi Alex, Mario,
>>>
>>> This might be a little strange, but if we want to get this in as a fix
>>> for regressions caused by the original vblank and user events at
>>> vstartup patch then I'm actually going to give my reviewed by on the
>>> *v1* of this patch (but not this v2):
>>>
>>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>>>
>>> You can feel free to apply that one.
>>>
>>> Reason 1: After having thought about it some more I don't think we
>>> enable anything today that has hubp powered down at the same time we
>>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
>>> entry. Static screen interrupt should happen after that flip finishes I
>>> think.
>>>
>>> The CRTC can still be powered on with zero planes, and I don't think any
>>> userspace explicitly asks for vblank events in this case but it doesn't
>>> hurt to have the check.
>>>
>>> Reason 2: This new patch will need much more thorough testing from side
>>> to fully understand the consequences of locking the entire DC commit
>>> sequence. For just a page flip that sounds fine, but for anything more
>>> than (eg. full updates, modesets, etc) I don't think we want to be
>>> disabling interrupts for potentially many milliseconds.
>>
>> Ah! I was wondering where the attached splat comes from, but I think
>> this explains it: With this patch amdgpu_dm_commit_planes keeps the
>> pcrtc->dev->event_lock spinlock locked while calling
>> dc_commit_updates_for_stream, which ends up calling
>> smu_set_display_count, which tries to lock a mutex.
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-05-04 17:35         ` Matt Coffin
@ 2020-05-05 17:03           ` Alex Deucher
  2020-05-05 17:59             ` Kazlauskas, Nicholas
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Deucher @ 2020-05-05 17:03 UTC (permalink / raw)
  To: Matt Coffin
  Cc: Leo (Sunpeng) Li, Michel Dänzer, amd-gfx list,
	Maling list - DRI developers, Deucher, Alexander, Harry Wentland,
	Kazlauskas, Nicholas

Mario or Nick any thoughts?

Alex

On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13@gmail.com> wrote:
>
> Hey guys,
>
> This is still an issue for me, and I'm still having to run a patch to
> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
> setups in 5.7, is there any news on this? Has anyone else at the very
> least been able to reproduce the problem?
>
> It happens for me in every single program that mesa allows to utilize
> variable refresh rates, and reverting it "fixes" the issue.
>
> Cheers, and sorry for the extra email, just making sure this is still on
> someone's radar,
> Matt
>
> On 4/14/20 5:32 PM, Matt Coffin wrote:
> > Hey everyone,
> >
> > This patch broke variable refresh rate in games (all that I've tried so
> > far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
> > well as a simple freesync tester application.
> >
> > FreeSync tester I've been using: https://github.com/Nixola/VRRTest
> >
> > I'm not at all familiar with the page flipping code, so it would take me
> > a long time to find the *right* way to fix it, but does someone else see
> > why it would do that?
> >
> > The symptom is that the refresh rate of the display constantly bounces
> > between the two ends of the FreeSync range (for me 40 -> 144), and the
> > game stutters like a madman.
> >
> > Any help on where to start, ideas on how to fix it (other than just
> > revert this commit, which I've done in the interim), or alternative
> > patches would be appreciated.
> >
> > Thanks in advance for the work/help,
> > Matt
> >
> > On 3/13/20 8:42 AM, Michel Dänzer wrote:
> >> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
> >>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
> >>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
> >>>> <mario.kleiner.de@gmail.com> wrote:
> >>>>>
> >>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
> >>>>> events at vsartup for DCN")' introduces a new way of pageflip
> >>>>> completion handling for DCN, and some trouble.
> >>>>>
> >>>>> The current implementation introduces a race condition, which
> >>>>> can cause pageflip completion events to be sent out one vblank
> >>>>> too early, thereby confusing userspace and causing flicker:
> >>>>>
> >>>>> prepare_flip_isr():
> >>>>>
> >>>>> 1. Pageflip programming takes the ddev->event_lock.
> >>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
> >>>>> 3. Releases ddev->event_lock.
> >>>>>
> >>>>> --> Deadline for surface address regs double-buffering passes on
> >>>>>      target pipe.
> >>>>>
> >>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
> >>>>>     into hw, but too late for current vblank.
> >>>>>
> >>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
> >>>>>     in current vblank due to missing the double-buffering deadline
> >>>>>     by a tiny bit.
> >>>>>
> >>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
> >>>>>     dm_dcn_crtc_high_irq() gets called.
> >>>>>
> >>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
> >>>>>     pageflip has been completed/will complete in this vblank and
> >>>>>     sends out pageflip completion event to userspace and resets
> >>>>>     pflip_status = AMDGPU_FLIP_NONE.
> >>>>>
> >>>>> => Flip completion event sent out one vblank too early.
> >>>>>
> >>>>> This behaviour has been observed during my testing with measurement
> >>>>> hardware a couple of time.
> >>>>>
> >>>>> The commit message says that the extra flip event code was added to
> >>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
> >>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
> >>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
> >>>>> this clock gating may happen if no planes are active. According to
> >>>>> Nicholas, the clock gating can also happen if psr is active, and the
> >>>>> gating is controlled independently by the hardware, so difficult to
> >>>>> detect if and when the completion code in above commit is needed.
> >>>>>
> >>>>> This patch tries the following solution: It only executes the extra
> >>>>> pflip
> >>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
> >>>>> that there aren't any surface updated pending in the double-buffered
> >>>>> surface scanout address registers. Otherwise it leaves pflip completion
> >>>>> to the pflip irq handler, for a more race-free experience.
> >>>>>
> >>>>> This would only guard against the order of events mentioned above.
> >>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
> >>>>> at all, because 1-3 + 5 might happen even without the hw being
> >>>>> programmed
> >>>>> at all, ie. no surface update pending because none yet programmed
> >>>>> into hw.
> >>>>>
> >>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
> >>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
> >>>>> under event_lock protection within the same critical section.
> >>>>>
> >>>>> v2: Take Nicholas comments into account, try a different solution.
> >>>>>
> >>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
> >>>>> Seems to work without causing obvious new trouble.
> >>>>
> >>>> Nick, any comments on this?  Can we get this committed or do you think
> >>>> it needs additional rework?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Alex
> >>>
> >>> Hi Alex, Mario,
> >>>
> >>> This might be a little strange, but if we want to get this in as a fix
> >>> for regressions caused by the original vblank and user events at
> >>> vstartup patch then I'm actually going to give my reviewed by on the
> >>> *v1* of this patch (but not this v2):
> >>>
> >>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
> >>>
> >>> You can feel free to apply that one.
> >>>
> >>> Reason 1: After having thought about it some more I don't think we
> >>> enable anything today that has hubp powered down at the same time we
> >>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
> >>> entry. Static screen interrupt should happen after that flip finishes I
> >>> think.
> >>>
> >>> The CRTC can still be powered on with zero planes, and I don't think any
> >>> userspace explicitly asks for vblank events in this case but it doesn't
> >>> hurt to have the check.
> >>>
> >>> Reason 2: This new patch will need much more thorough testing from side
> >>> to fully understand the consequences of locking the entire DC commit
> >>> sequence. For just a page flip that sounds fine, but for anything more
> >>> than (eg. full updates, modesets, etc) I don't think we want to be
> >>> disabling interrupts for potentially many milliseconds.
> >>
> >> Ah! I was wondering where the attached splat comes from, but I think
> >> this explains it: With this patch amdgpu_dm_commit_planes keeps the
> >> pcrtc->dev->event_lock spinlock locked while calling
> >> dc_commit_updates_for_stream, which ends up calling
> >> smu_set_display_count, which tries to lock a mutex.
> >>
> >>
> >>
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-05-05 17:03           ` Alex Deucher
@ 2020-05-05 17:59             ` Kazlauskas, Nicholas
  2020-05-05 18:03               ` Matt Coffin
  2020-05-08 15:54               ` Matt Coffin
  0 siblings, 2 replies; 11+ messages in thread
From: Kazlauskas, Nicholas @ 2020-05-05 17:59 UTC (permalink / raw)
  To: Alex Deucher, Matt Coffin
  Cc: Leo (Sunpeng) Li, Michel Dänzer, amd-gfx list,
	Maling list - DRI developers, Deucher, Alexander, Harry Wentland

Can you file a full bug report on the gitlab tracker?

FreeSync is still working on my Navi setups with this patch applied, and 
this patch is essentially just a revert of another patch already (where 
FreeSync worked before).

I can understand the v2 of this series causing issues, but the v1 
shouldn't be - so I'd like to understand more about the setup where this 
is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc.

Regards,
Nicholas Kazlauskas

On 2020-05-05 1:03 p.m., Alex Deucher wrote:
> Mario or Nick any thoughts?
> 
> Alex
> 
> On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13@gmail.com> wrote:
>>
>> Hey guys,
>>
>> This is still an issue for me, and I'm still having to run a patch to
>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
>> setups in 5.7, is there any news on this? Has anyone else at the very
>> least been able to reproduce the problem?
>>
>> It happens for me in every single program that mesa allows to utilize
>> variable refresh rates, and reverting it "fixes" the issue.
>>
>> Cheers, and sorry for the extra email, just making sure this is still on
>> someone's radar,
>> Matt
>>
>> On 4/14/20 5:32 PM, Matt Coffin wrote:
>>> Hey everyone,
>>>
>>> This patch broke variable refresh rate in games (all that I've tried so
>>> far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
>>> well as a simple freesync tester application.
>>>
>>> FreeSync tester I've been using: https://github.com/Nixola/VRRTest
>>>
>>> I'm not at all familiar with the page flipping code, so it would take me
>>> a long time to find the *right* way to fix it, but does someone else see
>>> why it would do that?
>>>
>>> The symptom is that the refresh rate of the display constantly bounces
>>> between the two ends of the FreeSync range (for me 40 -> 144), and the
>>> game stutters like a madman.
>>>
>>> Any help on where to start, ideas on how to fix it (other than just
>>> revert this commit, which I've done in the interim), or alternative
>>> patches would be appreciated.
>>>
>>> Thanks in advance for the work/help,
>>> Matt
>>>
>>> On 3/13/20 8:42 AM, Michel Dänzer wrote:
>>>> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>>>>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>>>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>>>>> <mario.kleiner.de@gmail.com> wrote:
>>>>>>>
>>>>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>>>>> completion handling for DCN, and some trouble.
>>>>>>>
>>>>>>> The current implementation introduces a race condition, which
>>>>>>> can cause pageflip completion events to be sent out one vblank
>>>>>>> too early, thereby confusing userspace and causing flicker:
>>>>>>>
>>>>>>> prepare_flip_isr():
>>>>>>>
>>>>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>>>>> 3. Releases ddev->event_lock.
>>>>>>>
>>>>>>> --> Deadline for surface address regs double-buffering passes on
>>>>>>>       target pipe.
>>>>>>>
>>>>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>>>>      into hw, but too late for current vblank.
>>>>>>>
>>>>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>>>>      in current vblank due to missing the double-buffering deadline
>>>>>>>      by a tiny bit.
>>>>>>>
>>>>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>>>>      dm_dcn_crtc_high_irq() gets called.
>>>>>>>
>>>>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>>>>      pageflip has been completed/will complete in this vblank and
>>>>>>>      sends out pageflip completion event to userspace and resets
>>>>>>>      pflip_status = AMDGPU_FLIP_NONE.
>>>>>>>
>>>>>>> => Flip completion event sent out one vblank too early.
>>>>>>>
>>>>>>> This behaviour has been observed during my testing with measurement
>>>>>>> hardware a couple of time.
>>>>>>>
>>>>>>> The commit message says that the extra flip event code was added to
>>>>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events
>>>>>>> in case the pflip irq doesn't fire, because the "DCH HUBP" component
>>>>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>>>>> this clock gating may happen if no planes are active. According to
>>>>>>> Nicholas, the clock gating can also happen if psr is active, and the
>>>>>>> gating is controlled independently by the hardware, so difficult to
>>>>>>> detect if and when the completion code in above commit is needed.
>>>>>>>
>>>>>>> This patch tries the following solution: It only executes the extra
>>>>>>> pflip
>>>>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>>>>> that there aren't any surface updated pending in the double-buffered
>>>>>>> surface scanout address registers. Otherwise it leaves pflip completion
>>>>>>> to the pflip irq handler, for a more race-free experience.
>>>>>>>
>>>>>>> This would only guard against the order of events mentioned above.
>>>>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help
>>>>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>>>>> programmed
>>>>>>> at all, ie. no surface update pending because none yet programmed
>>>>>>> into hw.
>>>>>>>
>>>>>>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(),
>>>>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done
>>>>>>> under event_lock protection within the same critical section.
>>>>>>>
>>>>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>>>>
>>>>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff).
>>>>>>> Seems to work without causing obvious new trouble.
>>>>>>
>>>>>> Nick, any comments on this?  Can we get this committed or do you think
>>>>>> it needs additional rework?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex
>>>>>
>>>>> Hi Alex, Mario,
>>>>>
>>>>> This might be a little strange, but if we want to get this in as a fix
>>>>> for regressions caused by the original vblank and user events at
>>>>> vstartup patch then I'm actually going to give my reviewed by on the
>>>>> *v1* of this patch (but not this v2):
>>>>>
>>>>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>>>>>
>>>>> You can feel free to apply that one.
>>>>>
>>>>> Reason 1: After having thought about it some more I don't think we
>>>>> enable anything today that has hubp powered down at the same time we
>>>>> expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR
>>>>> entry. Static screen interrupt should happen after that flip finishes I
>>>>> think.
>>>>>
>>>>> The CRTC can still be powered on with zero planes, and I don't think any
>>>>> userspace explicitly asks for vblank events in this case but it doesn't
>>>>> hurt to have the check.
>>>>>
>>>>> Reason 2: This new patch will need much more thorough testing from side
>>>>> to fully understand the consequences of locking the entire DC commit
>>>>> sequence. For just a page flip that sounds fine, but for anything more
>>>>> than (eg. full updates, modesets, etc) I don't think we want to be
>>>>> disabling interrupts for potentially many milliseconds.
>>>>
>>>> Ah! I was wondering where the attached splat comes from, but I think
>>>> this explains it: With this patch amdgpu_dm_commit_planes keeps the
>>>> pcrtc->dev->event_lock spinlock locked while calling
>>>> dc_commit_updates_for_stream, which ends up calling
>>>> smu_set_display_count, which tries to lock a mutex.
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-05-05 17:59             ` Kazlauskas, Nicholas
@ 2020-05-05 18:03               ` Matt Coffin
  2020-05-08 15:54               ` Matt Coffin
  1 sibling, 0 replies; 11+ messages in thread
From: Matt Coffin @ 2020-05-05 18:03 UTC (permalink / raw)
  To: Kazlauskas, Nicholas, Alex Deucher
  Cc: Leo (Sunpeng) Li, Michel Dänzer, amd-gfx list,
	Maling list - DRI developers, Deucher, Alexander, Harry Wentland

Sure, I'll file one after work today.

To clarify though, FreeSync still "works" as in the monitor refresh rate
is updating, but it constantly bounces between the maximum and minimum
freesync refresh rates, causing it to look VERY stuttery.

Thanks for the attention, I'll file the but tonight. If you want another
reference person, it got a little discussion in this bug report...

https://gitlab.freedesktop.org/drm/amd/-/issues/1002#note_486494

Cheers,
Matt


On 5/5/20 11:59 AM, Kazlauskas, Nicholas wrote:
> Can you file a full bug report on the gitlab tracker?
> 
> FreeSync is still working on my Navi setups with this patch applied, and
> this patch is essentially just a revert of another patch already (where
> FreeSync worked before).
> 
> I can understand the v2 of this series causing issues, but the v1
> shouldn't be - so I'd like to understand more about the setup where this
> is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc.
> 
> Regards,
> Nicholas Kazlauskas
> 
> On 2020-05-05 1:03 p.m., Alex Deucher wrote:
>> Mario or Nick any thoughts?
>>
>> Alex
>>
>> On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13@gmail.com> wrote:
>>>
>>> Hey guys,
>>>
>>> This is still an issue for me, and I'm still having to run a patch to
>>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
>>> setups in 5.7, is there any news on this? Has anyone else at the very
>>> least been able to reproduce the problem?
>>>
>>> It happens for me in every single program that mesa allows to utilize
>>> variable refresh rates, and reverting it "fixes" the issue.
>>>
>>> Cheers, and sorry for the extra email, just making sure this is still on
>>> someone's radar,
>>> Matt
>>>
>>> On 4/14/20 5:32 PM, Matt Coffin wrote:
>>>> Hey everyone,
>>>>
>>>> This patch broke variable refresh rate in games (all that I've tried so
>>>> far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
>>>> well as a simple freesync tester application.
>>>>
>>>> FreeSync tester I've been using: https://github.com/Nixola/VRRTest
>>>>
>>>> I'm not at all familiar with the page flipping code, so it would
>>>> take me
>>>> a long time to find the *right* way to fix it, but does someone else
>>>> see
>>>> why it would do that?
>>>>
>>>> The symptom is that the refresh rate of the display constantly bounces
>>>> between the two ends of the FreeSync range (for me 40 -> 144), and the
>>>> game stutters like a madman.
>>>>
>>>> Any help on where to start, ideas on how to fix it (other than just
>>>> revert this commit, which I've done in the interim), or alternative
>>>> patches would be appreciated.
>>>>
>>>> Thanks in advance for the work/help,
>>>> Matt
>>>>
>>>> On 3/13/20 8:42 AM, Michel Dänzer wrote:
>>>>> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>>>>>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>>>>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>>>>>> <mario.kleiner.de@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>>>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>>>>>> completion handling for DCN, and some trouble.
>>>>>>>>
>>>>>>>> The current implementation introduces a race condition, which
>>>>>>>> can cause pageflip completion events to be sent out one vblank
>>>>>>>> too early, thereby confusing userspace and causing flicker:
>>>>>>>>
>>>>>>>> prepare_flip_isr():
>>>>>>>>
>>>>>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>>>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>>>>>> 3. Releases ddev->event_lock.
>>>>>>>>
>>>>>>>> --> Deadline for surface address regs double-buffering passes on
>>>>>>>>       target pipe.
>>>>>>>>
>>>>>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>>>>>      into hw, but too late for current vblank.
>>>>>>>>
>>>>>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>>>>>      in current vblank due to missing the double-buffering deadline
>>>>>>>>      by a tiny bit.
>>>>>>>>
>>>>>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>>>>>      dm_dcn_crtc_high_irq() gets called.
>>>>>>>>
>>>>>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>>>>>      pageflip has been completed/will complete in this vblank and
>>>>>>>>      sends out pageflip completion event to userspace and resets
>>>>>>>>      pflip_status = AMDGPU_FLIP_NONE.
>>>>>>>>
>>>>>>>> => Flip completion event sent out one vblank too early.
>>>>>>>>
>>>>>>>> This behaviour has been observed during my testing with measurement
>>>>>>>> hardware a couple of time.
>>>>>>>>
>>>>>>>> The commit message says that the extra flip event code was added to
>>>>>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip
>>>>>>>> events
>>>>>>>> in case the pflip irq doesn't fire, because the "DCH HUBP"
>>>>>>>> component
>>>>>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>>>>>> this clock gating may happen if no planes are active. According to
>>>>>>>> Nicholas, the clock gating can also happen if psr is active, and
>>>>>>>> the
>>>>>>>> gating is controlled independently by the hardware, so difficult to
>>>>>>>> detect if and when the completion code in above commit is needed.
>>>>>>>>
>>>>>>>> This patch tries the following solution: It only executes the extra
>>>>>>>> pflip
>>>>>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>>>>>> that there aren't any surface updated pending in the
>>>>>>>> double-buffered
>>>>>>>> surface scanout address registers. Otherwise it leaves pflip
>>>>>>>> completion
>>>>>>>> to the pflip irq handler, for a more race-free experience.
>>>>>>>>
>>>>>>>> This would only guard against the order of events mentioned above.
>>>>>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this
>>>>>>>> won't help
>>>>>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>>>>>> programmed
>>>>>>>> at all, ie. no surface update pending because none yet programmed
>>>>>>>> into hw.
>>>>>>>>
>>>>>>>> Therefore this patch also changes locking in
>>>>>>>> amdgpu_dm_commit_planes(),
>>>>>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream()
>>>>>>>> are done
>>>>>>>> under event_lock protection within the same critical section.
>>>>>>>>
>>>>>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>>>>>
>>>>>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN
>>>>>>>> stuff).
>>>>>>>> Seems to work without causing obvious new trouble.
>>>>>>>
>>>>>>> Nick, any comments on this?  Can we get this committed or do you
>>>>>>> think
>>>>>>> it needs additional rework?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Alex
>>>>>>
>>>>>> Hi Alex, Mario,
>>>>>>
>>>>>> This might be a little strange, but if we want to get this in as a
>>>>>> fix
>>>>>> for regressions caused by the original vblank and user events at
>>>>>> vstartup patch then I'm actually going to give my reviewed by on the
>>>>>> *v1* of this patch (but not this v2):
>>>>>>
>>>>>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>>>>>>
>>>>>> You can feel free to apply that one.
>>>>>>
>>>>>> Reason 1: After having thought about it some more I don't think we
>>>>>> enable anything today that has hubp powered down at the same time we
>>>>>> expect to be waiting for a flip - eg. DMCU powering down HUBP
>>>>>> during PSR
>>>>>> entry. Static screen interrupt should happen after that flip
>>>>>> finishes I
>>>>>> think.
>>>>>>
>>>>>> The CRTC can still be powered on with zero planes, and I don't
>>>>>> think any
>>>>>> userspace explicitly asks for vblank events in this case but it
>>>>>> doesn't
>>>>>> hurt to have the check.
>>>>>>
>>>>>> Reason 2: This new patch will need much more thorough testing from
>>>>>> side
>>>>>> to fully understand the consequences of locking the entire DC commit
>>>>>> sequence. For just a page flip that sounds fine, but for anything
>>>>>> more
>>>>>> than (eg. full updates, modesets, etc) I don't think we want to be
>>>>>> disabling interrupts for potentially many milliseconds.
>>>>>
>>>>> Ah! I was wondering where the attached splat comes from, but I think
>>>>> this explains it: With this patch amdgpu_dm_commit_planes keeps the
>>>>> pcrtc->dev->event_lock spinlock locked while calling
>>>>> dc_commit_updates_for_stream, which ends up calling
>>>>> smu_set_display_count, which tries to lock a mutex.
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
  2020-05-05 17:59             ` Kazlauskas, Nicholas
  2020-05-05 18:03               ` Matt Coffin
@ 2020-05-08 15:54               ` Matt Coffin
  1 sibling, 0 replies; 11+ messages in thread
From: Matt Coffin @ 2020-05-08 15:54 UTC (permalink / raw)
  To: Kazlauskas, Nicholas, Alex Deucher
  Cc: Leo (Sunpeng) Li, Michel Dänzer, amd-gfx list,
	Maling list - DRI developers, Deucher, Alexander, Harry Wentland

Just to follow up on this, I decided not to file the report, since
Nick's latest patch that "Fixes:" this has fixed my issue. Thanks for
the good work on that one Nick.

Might want to make sure that makes it in to the 5.7 fixes :)

Thanks again guys.

~Matt

On 5/5/20 11:59 AM, Kazlauskas, Nicholas wrote:
> Can you file a full bug report on the gitlab tracker?
> 
> FreeSync is still working on my Navi setups with this patch applied, and
> this patch is essentially just a revert of another patch already (where
> FreeSync worked before).
> 
> I can understand the v2 of this series causing issues, but the v1
> shouldn't be - so I'd like to understand more about the setup where this
> is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc.
> 
> Regards,
> Nicholas Kazlauskas
> 
> On 2020-05-05 1:03 p.m., Alex Deucher wrote:
>> Mario or Nick any thoughts?
>>
>> Alex
>>
>> On Mon, May 4, 2020 at 1:35 PM Matt Coffin <mcoffin13@gmail.com> wrote:
>>>
>>> Hey guys,
>>>
>>> This is still an issue for me, and I'm still having to run a patch to
>>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi
>>> setups in 5.7, is there any news on this? Has anyone else at the very
>>> least been able to reproduce the problem?
>>>
>>> It happens for me in every single program that mesa allows to utilize
>>> variable refresh rates, and reverting it "fixes" the issue.
>>>
>>> Cheers, and sorry for the extra email, just making sure this is still on
>>> someone's radar,
>>> Matt
>>>
>>> On 4/14/20 5:32 PM, Matt Coffin wrote:
>>>> Hey everyone,
>>>>
>>>> This patch broke variable refresh rate in games (all that I've tried so
>>>> far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as
>>>> well as a simple freesync tester application.
>>>>
>>>> FreeSync tester I've been using: https://github.com/Nixola/VRRTest
>>>>
>>>> I'm not at all familiar with the page flipping code, so it would
>>>> take me
>>>> a long time to find the *right* way to fix it, but does someone else
>>>> see
>>>> why it would do that?
>>>>
>>>> The symptom is that the refresh rate of the display constantly bounces
>>>> between the two ends of the FreeSync range (for me 40 -> 144), and the
>>>> game stutters like a madman.
>>>>
>>>> Any help on where to start, ideas on how to fix it (other than just
>>>> revert this commit, which I've done in the interim), or alternative
>>>> patches would be appreciated.
>>>>
>>>> Thanks in advance for the work/help,
>>>> Matt
>>>>
>>>> On 3/13/20 8:42 AM, Michel Dänzer wrote:
>>>>> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote:
>>>>>> On 2020-03-12 10:32 a.m., Alex Deucher wrote:
>>>>>>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner
>>>>>>> <mario.kleiner.de@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user
>>>>>>>> events at vsartup for DCN")' introduces a new way of pageflip
>>>>>>>> completion handling for DCN, and some trouble.
>>>>>>>>
>>>>>>>> The current implementation introduces a race condition, which
>>>>>>>> can cause pageflip completion events to be sent out one vblank
>>>>>>>> too early, thereby confusing userspace and causing flicker:
>>>>>>>>
>>>>>>>> prepare_flip_isr():
>>>>>>>>
>>>>>>>> 1. Pageflip programming takes the ddev->event_lock.
>>>>>>>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED
>>>>>>>> 3. Releases ddev->event_lock.
>>>>>>>>
>>>>>>>> --> Deadline for surface address regs double-buffering passes on
>>>>>>>>       target pipe.
>>>>>>>>
>>>>>>>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip
>>>>>>>>      into hw, but too late for current vblank.
>>>>>>>>
>>>>>>>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete
>>>>>>>>      in current vblank due to missing the double-buffering deadline
>>>>>>>>      by a tiny bit.
>>>>>>>>
>>>>>>>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires,
>>>>>>>>      dm_dcn_crtc_high_irq() gets called.
>>>>>>>>
>>>>>>>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the
>>>>>>>>      pageflip has been completed/will complete in this vblank and
>>>>>>>>      sends out pageflip completion event to userspace and resets
>>>>>>>>      pflip_status = AMDGPU_FLIP_NONE.
>>>>>>>>
>>>>>>>> => Flip completion event sent out one vblank too early.
>>>>>>>>
>>>>>>>> This behaviour has been observed during my testing with measurement
>>>>>>>> hardware a couple of time.
>>>>>>>>
>>>>>>>> The commit message says that the extra flip event code was added to
>>>>>>>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip
>>>>>>>> events
>>>>>>>> in case the pflip irq doesn't fire, because the "DCH HUBP"
>>>>>>>> component
>>>>>>>> is clock gated and doesn't fire pflip irqs in that state. Also that
>>>>>>>> this clock gating may happen if no planes are active. According to
>>>>>>>> Nicholas, the clock gating can also happen if psr is active, and
>>>>>>>> the
>>>>>>>> gating is controlled independently by the hardware, so difficult to
>>>>>>>> detect if and when the completion code in above commit is needed.
>>>>>>>>
>>>>>>>> This patch tries the following solution: It only executes the extra
>>>>>>>> pflip
>>>>>>>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports
>>>>>>>> that there aren't any surface updated pending in the
>>>>>>>> double-buffered
>>>>>>>> surface scanout address registers. Otherwise it leaves pflip
>>>>>>>> completion
>>>>>>>> to the pflip irq handler, for a more race-free experience.
>>>>>>>>
>>>>>>>> This would only guard against the order of events mentioned above.
>>>>>>>> If Step 5 (VSTARTUP trigger) happens before step 4 then this
>>>>>>>> won't help
>>>>>>>> at all, because 1-3 + 5 might happen even without the hw being
>>>>>>>> programmed
>>>>>>>> at all, ie. no surface update pending because none yet programmed
>>>>>>>> into hw.
>>>>>>>>
>>>>>>>> Therefore this patch also changes locking in
>>>>>>>> amdgpu_dm_commit_planes(),
>>>>>>>> so that prepare_flip_isr() and dc_commit_updates_for_stream()
>>>>>>>> are done
>>>>>>>> under event_lock protection within the same critical section.
>>>>>>>>
>>>>>>>> v2: Take Nicholas comments into account, try a different solution.
>>>>>>>>
>>>>>>>> Lightly tested on Polaris (locking) and Raven (the whole DCN
>>>>>>>> stuff).
>>>>>>>> Seems to work without causing obvious new trouble.
>>>>>>>
>>>>>>> Nick, any comments on this?  Can we get this committed or do you
>>>>>>> think
>>>>>>> it needs additional rework?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Alex
>>>>>>
>>>>>> Hi Alex, Mario,
>>>>>>
>>>>>> This might be a little strange, but if we want to get this in as a
>>>>>> fix
>>>>>> for regressions caused by the original vblank and user events at
>>>>>> vstartup patch then I'm actually going to give my reviewed by on the
>>>>>> *v1* of this patch (but not this v2):
>>>>>>
>>>>>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
>>>>>>
>>>>>> You can feel free to apply that one.
>>>>>>
>>>>>> Reason 1: After having thought about it some more I don't think we
>>>>>> enable anything today that has hubp powered down at the same time we
>>>>>> expect to be waiting for a flip - eg. DMCU powering down HUBP
>>>>>> during PSR
>>>>>> entry. Static screen interrupt should happen after that flip
>>>>>> finishes I
>>>>>> think.
>>>>>>
>>>>>> The CRTC can still be powered on with zero planes, and I don't
>>>>>> think any
>>>>>> userspace explicitly asks for vblank events in this case but it
>>>>>> doesn't
>>>>>> hurt to have the check.
>>>>>>
>>>>>> Reason 2: This new patch will need much more thorough testing from
>>>>>> side
>>>>>> to fully understand the consequences of locking the entire DC commit
>>>>>> sequence. For just a page flip that sounds fine, but for anything
>>>>>> more
>>>>>> than (eg. full updates, modesets, etc) I don't think we want to be
>>>>>> disabling interrupts for potentially many milliseconds.
>>>>>
>>>>> Ah! I was wondering where the attached splat comes from, but I think
>>>>> this explains it: With this patch amdgpu_dm_commit_planes keeps the
>>>>> pcrtc->dev->event_lock spinlock locked while calling
>>>>> dc_commit_updates_for_stream, which ends up calling
>>>>> smu_set_display_count, which tries to lock a mutex.
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
> 
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-05-08 15:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-05 21:20 [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2) Mario Kleiner
2020-03-12 14:32 ` Alex Deucher
2020-03-13 12:35   ` Kazlauskas, Nicholas
2020-03-13 14:42     ` Michel Dänzer
2020-03-13 21:52       ` Mario Kleiner
2020-04-14 23:32       ` Matt Coffin
2020-05-04 17:35         ` Matt Coffin
2020-05-05 17:03           ` Alex Deucher
2020-05-05 17:59             ` Kazlauskas, Nicholas
2020-05-05 18:03               ` Matt Coffin
2020-05-08 15:54               ` Matt Coffin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).