All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 21:10 ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 21:10 UTC (permalink / raw)
  To: linux-kernel, amd-gfx, dri-devel
  Cc: akpm, christian.koenig, harry.wentland, mnrzk,
	nicholas.kazlauskas, sunpeng.li, keescook, alexander.deucher,
	1i5t5.duncan, mphantomx, regressions, anthony.ruhier, pmenzel

When amdgpu_dm_atomic_commit_tail is running in the workqueue,
drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
running, causing a race condition where state (and then dm_state) is
sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
occurred since 5.7-rc1 and is well documented among polaris11 users [1].

Prior to 5.7, this was not a noticeable issue since the freelist pointer
was stored at the beginning of dm_state (base), which was unused. After
changing the freelist pointer to be stored in the middle of the struct, the
freelist pointer overwrote the context, causing dc_state to become garbage
data and made the call to dm_enable_per_frame_crtc_master_sync dereference
a freelist pointer.

This patch fixes the aforementioned issue by calling drm_atomic_state_get
in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.

According to my testing on 5.8.0-rc6, this should fix bug 207383 on
Bugzilla [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86ffa0c2880f..86d6652872f2 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
 	 * unset legacy_cursor_update
 	 */

+	drm_atomic_state_get(state);
 	return drm_atomic_helper_commit(dev, state, nonblock);

 	/*TODO Handle EINTR, reenable IRQ*/
@@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)

 	if (dc_state_temp)
 		dc_release_state(dc_state_temp);
+
+	drm_atomic_state_put(state);
 }


--
2.27.0


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 21:10 ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 21:10 UTC (permalink / raw)
  To: linux-kernel, amd-gfx, dri-devel
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	mnrzk, nicholas.kazlauskas, regressions, alexander.deucher, akpm,
	mphantomx, christian.koenig

When amdgpu_dm_atomic_commit_tail is running in the workqueue,
drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
running, causing a race condition where state (and then dm_state) is
sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
occurred since 5.7-rc1 and is well documented among polaris11 users [1].

Prior to 5.7, this was not a noticeable issue since the freelist pointer
was stored at the beginning of dm_state (base), which was unused. After
changing the freelist pointer to be stored in the middle of the struct, the
freelist pointer overwrote the context, causing dc_state to become garbage
data and made the call to dm_enable_per_frame_crtc_master_sync dereference
a freelist pointer.

This patch fixes the aforementioned issue by calling drm_atomic_state_get
in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.

According to my testing on 5.8.0-rc6, this should fix bug 207383 on
Bugzilla [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86ffa0c2880f..86d6652872f2 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
 	 * unset legacy_cursor_update
 	 */

+	drm_atomic_state_get(state);
 	return drm_atomic_helper_commit(dev, state, nonblock);

 	/*TODO Handle EINTR, reenable IRQ*/
@@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)

 	if (dc_state_temp)
 		dc_release_state(dc_state_temp);
+
+	drm_atomic_state_put(state);
 }


--
2.27.0

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 21:10 ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 21:10 UTC (permalink / raw)
  To: linux-kernel, amd-gfx, dri-devel
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	mnrzk, nicholas.kazlauskas, regressions, alexander.deucher, akpm,
	mphantomx, harry.wentland, christian.koenig

When amdgpu_dm_atomic_commit_tail is running in the workqueue,
drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
running, causing a race condition where state (and then dm_state) is
sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
occurred since 5.7-rc1 and is well documented among polaris11 users [1].

Prior to 5.7, this was not a noticeable issue since the freelist pointer
was stored at the beginning of dm_state (base), which was unused. After
changing the freelist pointer to be stored in the middle of the struct, the
freelist pointer overwrote the context, causing dc_state to become garbage
data and made the call to dm_enable_per_frame_crtc_master_sync dereference
a freelist pointer.

This patch fixes the aforementioned issue by calling drm_atomic_state_get
in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.

According to my testing on 5.8.0-rc6, this should fix bug 207383 on
Bugzilla [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86ffa0c2880f..86d6652872f2 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
 	 * unset legacy_cursor_update
 	 */

+	drm_atomic_state_get(state);
 	return drm_atomic_helper_commit(dev, state, nonblock);

 	/*TODO Handle EINTR, reenable IRQ*/
@@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)

 	if (dc_state_temp)
 		dc_release_state(dc_state_temp);
+
+	drm_atomic_state_put(state);
 }


--
2.27.0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 21:10 ` Mazin Rezk
  (?)
@ 2020-07-23 22:16   ` Kazlauskas, Nicholas
  -1 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-23 22:16 UTC (permalink / raw)
  To: Mazin Rezk, linux-kernel, amd-gfx, dri-devel
  Cc: akpm, christian.koenig, harry.wentland, sunpeng.li, keescook,
	alexander.deucher, 1i5t5.duncan, mphantomx, regressions,
	anthony.ruhier, pmenzel

On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> Reported-by: Duncan <1i5t5.duncan@cox.net>
> Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>

Thanks for the investigation and your patch. I appreciate the help in 
trying to narrow down the root cause as this issue has been difficult to 
reproduce on my setups.

Though I'm not sure this really resolves the issue - we make use of the 
drm_atomic_helper_commit helper function from DRM which internally does 
what you're doing with this patch:

	drm_atomic_state_get(state);
	if (nonblock)
		queue_work(system_unbound_wq, &state->commit_work);
	else
		commit_tail(state);

So even when it gets queued off to the unbound workqueue we still have a 
reference on the state.

That reference gets dropped as part of commit tail helper in DRM as well:

if (funcs && funcs->atomic_commit_tail)
		funcs->atomic_commit_tail(old_state);
	else
		drm_atomic_helper_commit_tail(old_state);

	commit_time_ms = ktime_ms_delta(ktime_get(), start);
	if (commit_time_ms > 0)
		drm_self_refresh_helper_update_avg_times(old_state,
						 (unsigned long)commit_time_ms,
						 new_self_refresh_mask);

	drm_atomic_helper_commit_cleanup_done(old_state);

	drm_atomic_state_put(old_state);

So instead of a use after free happening when we access the state we get 
a double-free happening later at the end of commit tail in DRM.

What I think would be the right next step here is to actually determine 
what sequence of IOCTLs and atomic commits are happening under your 
setup with a very verbose dmesg log. You can set a debug level for DRM 
in your kernel parameters with something like:

drm.debug=0x54

I don't see anything in amdgpu_dm.c that looks like it would be freeing 
the state so I suspect something in the core is this doing this.

> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 86ffa0c2880f..86d6652872f2 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
>   	 * unset legacy_cursor_update
>   	 */
> 
> +	drm_atomic_state_get(state);

Also note that if the drm_atomic_helper_commit() call fails here then 
we're going to never free this structure. So we should really be 
checking the return code here below before trying to do this, if at all.

Regards,
Nicholas Kazlauskas

>   	return drm_atomic_helper_commit(dev, state, nonblock);
> 
>   	/*TODO Handle EINTR, reenable IRQ*/
> @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> 
>   	if (dc_state_temp)
>   		dc_release_state(dc_state_temp);
> +
> +	drm_atomic_state_put(state);
>   }
> 
> 
> --
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:16   ` Kazlauskas, Nicholas
  0 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-23 22:16 UTC (permalink / raw)
  To: Mazin Rezk, linux-kernel, amd-gfx, dri-devel
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	regressions, alexander.deucher, akpm, mphantomx,
	christian.koenig

On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> Reported-by: Duncan <1i5t5.duncan@cox.net>
> Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>

Thanks for the investigation and your patch. I appreciate the help in 
trying to narrow down the root cause as this issue has been difficult to 
reproduce on my setups.

Though I'm not sure this really resolves the issue - we make use of the 
drm_atomic_helper_commit helper function from DRM which internally does 
what you're doing with this patch:

	drm_atomic_state_get(state);
	if (nonblock)
		queue_work(system_unbound_wq, &state->commit_work);
	else
		commit_tail(state);

So even when it gets queued off to the unbound workqueue we still have a 
reference on the state.

That reference gets dropped as part of commit tail helper in DRM as well:

if (funcs && funcs->atomic_commit_tail)
		funcs->atomic_commit_tail(old_state);
	else
		drm_atomic_helper_commit_tail(old_state);

	commit_time_ms = ktime_ms_delta(ktime_get(), start);
	if (commit_time_ms > 0)
		drm_self_refresh_helper_update_avg_times(old_state,
						 (unsigned long)commit_time_ms,
						 new_self_refresh_mask);

	drm_atomic_helper_commit_cleanup_done(old_state);

	drm_atomic_state_put(old_state);

So instead of a use after free happening when we access the state we get 
a double-free happening later at the end of commit tail in DRM.

What I think would be the right next step here is to actually determine 
what sequence of IOCTLs and atomic commits are happening under your 
setup with a very verbose dmesg log. You can set a debug level for DRM 
in your kernel parameters with something like:

drm.debug=0x54

I don't see anything in amdgpu_dm.c that looks like it would be freeing 
the state so I suspect something in the core is this doing this.

> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 86ffa0c2880f..86d6652872f2 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
>   	 * unset legacy_cursor_update
>   	 */
> 
> +	drm_atomic_state_get(state);

Also note that if the drm_atomic_helper_commit() call fails here then 
we're going to never free this structure. So we should really be 
checking the return code here below before trying to do this, if at all.

Regards,
Nicholas Kazlauskas

>   	return drm_atomic_helper_commit(dev, state, nonblock);
> 
>   	/*TODO Handle EINTR, reenable IRQ*/
> @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> 
>   	if (dc_state_temp)
>   		dc_release_state(dc_state_temp);
> +
> +	drm_atomic_state_put(state);
>   }
> 
> 
> --
> 2.27.0
> 

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:16   ` Kazlauskas, Nicholas
  0 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-23 22:16 UTC (permalink / raw)
  To: Mazin Rezk, linux-kernel, amd-gfx, dri-devel
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	regressions, alexander.deucher, akpm, mphantomx, harry.wentland,
	christian.koenig

On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> Reported-by: Duncan <1i5t5.duncan@cox.net>
> Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>

Thanks for the investigation and your patch. I appreciate the help in 
trying to narrow down the root cause as this issue has been difficult to 
reproduce on my setups.

Though I'm not sure this really resolves the issue - we make use of the 
drm_atomic_helper_commit helper function from DRM which internally does 
what you're doing with this patch:

	drm_atomic_state_get(state);
	if (nonblock)
		queue_work(system_unbound_wq, &state->commit_work);
	else
		commit_tail(state);

So even when it gets queued off to the unbound workqueue we still have a 
reference on the state.

That reference gets dropped as part of commit tail helper in DRM as well:

if (funcs && funcs->atomic_commit_tail)
		funcs->atomic_commit_tail(old_state);
	else
		drm_atomic_helper_commit_tail(old_state);

	commit_time_ms = ktime_ms_delta(ktime_get(), start);
	if (commit_time_ms > 0)
		drm_self_refresh_helper_update_avg_times(old_state,
						 (unsigned long)commit_time_ms,
						 new_self_refresh_mask);

	drm_atomic_helper_commit_cleanup_done(old_state);

	drm_atomic_state_put(old_state);

So instead of a use after free happening when we access the state we get 
a double-free happening later at the end of commit tail in DRM.

What I think would be the right next step here is to actually determine 
what sequence of IOCTLs and atomic commits are happening under your 
setup with a very verbose dmesg log. You can set a debug level for DRM 
in your kernel parameters with something like:

drm.debug=0x54

I don't see anything in amdgpu_dm.c that looks like it would be freeing 
the state so I suspect something in the core is this doing this.

> ---
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 86ffa0c2880f..86d6652872f2 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
>   	 * unset legacy_cursor_update
>   	 */
> 
> +	drm_atomic_state_get(state);

Also note that if the drm_atomic_helper_commit() call fails here then 
we're going to never free this structure. So we should really be 
checking the return code here below before trying to do this, if at all.

Regards,
Nicholas Kazlauskas

>   	return drm_atomic_helper_commit(dev, state, nonblock);
> 
>   	/*TODO Handle EINTR, reenable IRQ*/
> @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> 
>   	if (dc_state_temp)
>   		dc_release_state(dc_state_temp);
> +
> +	drm_atomic_state_put(state);
>   }
> 
> 
> --
> 2.27.0
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 21:10 ` Mazin Rezk
  (?)
@ 2020-07-23 22:32   ` Kees Cook
  -1 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-23 22:32 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: linux-kernel, amd-gfx, dri-devel, akpm, christian.koenig,
	harry.wentland, nicholas.kazlauskas, sunpeng.li,
	alexander.deucher, 1i5t5.duncan, mphantomx, regressions,
	anthony.ruhier, pmenzel

On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Nice work tracking this down!

> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")

I do, however, object to this Fixes tag. :) The flaw appears to have
been with amdgpu_dm's reference tracking of "state" in the nonblocking
case. (How this reference counting is supposed to work correctly, though,
I'm not sure.) If I look at where the drm helper was split from being
the default callback, it looks like this was what introduced the bug:

da5c47f682ab ("drm/amd/display: Remove acrtc->stream")

? 3202fa62f certainly exposed it much more quickly, but there was a race
even without 3202fa62f where something could have realloced the memory
and written over it.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:32   ` Kees Cook
  0 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-23 22:32 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, nicholas.kazlauskas, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, christian.koenig

On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Nice work tracking this down!

> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")

I do, however, object to this Fixes tag. :) The flaw appears to have
been with amdgpu_dm's reference tracking of "state" in the nonblocking
case. (How this reference counting is supposed to work correctly, though,
I'm not sure.) If I look at where the drm helper was split from being
the default callback, it looks like this was what introduced the bug:

da5c47f682ab ("drm/amd/display: Remove acrtc->stream")

? 3202fa62f certainly exposed it much more quickly, but there was a race
even without 3202fa62f where something could have realloced the memory
and written over it.

-- 
Kees Cook
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:32   ` Kees Cook
  0 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-23 22:32 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, nicholas.kazlauskas, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, harry.wentland,
	christian.koenig

On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> running, causing a race condition where state (and then dm_state) is
> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> 
> Prior to 5.7, this was not a noticeable issue since the freelist pointer
> was stored at the beginning of dm_state (base), which was unused. After
> changing the freelist pointer to be stored in the middle of the struct, the
> freelist pointer overwrote the context, causing dc_state to become garbage
> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> a freelist pointer.
> 
> This patch fixes the aforementioned issue by calling drm_atomic_state_get
> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> 
> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> Bugzilla [1].
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Nice work tracking this down!

> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")

I do, however, object to this Fixes tag. :) The flaw appears to have
been with amdgpu_dm's reference tracking of "state" in the nonblocking
case. (How this reference counting is supposed to work correctly, though,
I'm not sure.) If I look at where the drm helper was split from being
the default callback, it looks like this was what introduced the bug:

da5c47f682ab ("drm/amd/display: Remove acrtc->stream")

? 3202fa62f certainly exposed it much more quickly, but there was a race
even without 3202fa62f where something could have realloced the memory
and written over it.

-- 
Kees Cook
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 22:16   ` Kazlauskas, Nicholas
  (?)
@ 2020-07-23 22:57     ` Mazin Rezk
  -1 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:57 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: Mazin Rezk, linux-kernel, amd-gfx, dri-devel, akpm,
	christian.koenig, harry.wentland, sunpeng.li, keescook,
	alexander.deucher, 1i5t5.duncan, mphantomx, regressions,
	anthony.ruhier, pmenzel

It seems that I spoke too soon. I ran the system for another hour after
submitting the patch and the bug just occurred. :/

Sadly, that means the bug isn't really fixed and that I have to go
investigate further.

At the very least, this patch seems to delay the occurrence of the bug
significantly which may help in further discovering the cause.

On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas <nicholas.kazlauskas@amd.com> wrote:

> On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > Reported-by: Duncan 1i5t5.duncan@cox.net
> > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
>
> Thanks for the investigation and your patch. I appreciate the help in
> trying to narrow down the root cause as this issue has been difficult to
> reproduce on my setups.
>
> Though I'm not sure this really resolves the issue - we make use of the
> drm_atomic_helper_commit helper function from DRM which internally does
> what you're doing with this patch:
>
> drm_atomic_state_get(state);
> if (nonblock)
> queue_work(system_unbound_wq, &state->commit_work);
>
>     else
>     	commit_tail(state);
>
>
> So even when it gets queued off to the unbound workqueue we still have a
> reference on the state.
>
> That reference gets dropped as part of commit tail helper in DRM as well:
>
> if (funcs && funcs->atomic_commit_tail)
>
>     	funcs->atomic_commit_tail(old_state);
>
>     else
>     	drm_atomic_helper_commit_tail(old_state);
>
>
> commit_time_ms = ktime_ms_delta(ktime_get(), start);
> if (commit_time_ms > 0)
>
>     	drm_self_refresh_helper_update_avg_times(old_state,
>     					 (unsigned long)commit_time_ms,
>     					 new_self_refresh_mask);
>
>
> drm_atomic_helper_commit_cleanup_done(old_state);
>
> drm_atomic_state_put(old_state);
>

I initially noticed that right after I wrote this patch so I was expecting
the patch to fail. However, after several hours of testing, the crash just
didn't occur so I believed the bug was fixed.

> So instead of a use after free happening when we access the state we get
> a double-free happening later at the end of commit tail in DRM.
>
> What I think would be the right next step here is to actually determine
> what sequence of IOCTLs and atomic commits are happening under your
> setup with a very verbose dmesg log. You can set a debug level for DRM
> in your kernel parameters with something like:
>
> drm.debug=0x54
>
> I don't see anything in amdgpu_dm.c that looks like it would be freeing
> the state so I suspect something in the core is this doing this.

Going through the KASAN use-after-free bug report in the Bugzilla
attachments, it appears that the state is being freed at the end of
commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
the same old state twice? I can't quite think of any other possible
explanation as to why that happens.

>
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > 1 file changed, 3 insertions(+)
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 86ffa0c2880f..86d6652872f2 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > * unset legacy_cursor_update
> > */
> >
> > -   drm_atomic_state_get(state);
>
> Also note that if the drm_atomic_helper_commit() call fails here then
> we're going to never free this structure. So we should really be
> checking the return code here below before trying to do this, if at all.

Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
see any return statements in there, so I thought it was safe.

>
> Regards,
> Nicholas Kazlauskas
>
> >       return drm_atomic_helper_commit(dev, state, nonblock);
> >
> >       /*TODO Handle EINTR, reenable IRQ*/
> >
> >
> > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> >
> >       if (dc_state_temp)
> >       	dc_release_state(dc_state_temp);
> >
> >
> > -
> > -   drm_atomic_state_put(state);
> >     }
> >
> >
> > --
> > 2.27.0

Thanks,
Mazin Rezk

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:57     ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:57 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, christian.koenig

It seems that I spoke too soon. I ran the system for another hour after
submitting the patch and the bug just occurred. :/

Sadly, that means the bug isn't really fixed and that I have to go
investigate further.

At the very least, this patch seems to delay the occurrence of the bug
significantly which may help in further discovering the cause.

On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas <nicholas.kazlauskas@amd.com> wrote:

> On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > Reported-by: Duncan 1i5t5.duncan@cox.net
> > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
>
> Thanks for the investigation and your patch. I appreciate the help in
> trying to narrow down the root cause as this issue has been difficult to
> reproduce on my setups.
>
> Though I'm not sure this really resolves the issue - we make use of the
> drm_atomic_helper_commit helper function from DRM which internally does
> what you're doing with this patch:
>
> drm_atomic_state_get(state);
> if (nonblock)
> queue_work(system_unbound_wq, &state->commit_work);
>
>     else
>     	commit_tail(state);
>
>
> So even when it gets queued off to the unbound workqueue we still have a
> reference on the state.
>
> That reference gets dropped as part of commit tail helper in DRM as well:
>
> if (funcs && funcs->atomic_commit_tail)
>
>     	funcs->atomic_commit_tail(old_state);
>
>     else
>     	drm_atomic_helper_commit_tail(old_state);
>
>
> commit_time_ms = ktime_ms_delta(ktime_get(), start);
> if (commit_time_ms > 0)
>
>     	drm_self_refresh_helper_update_avg_times(old_state,
>     					 (unsigned long)commit_time_ms,
>     					 new_self_refresh_mask);
>
>
> drm_atomic_helper_commit_cleanup_done(old_state);
>
> drm_atomic_state_put(old_state);
>

I initially noticed that right after I wrote this patch so I was expecting
the patch to fail. However, after several hours of testing, the crash just
didn't occur so I believed the bug was fixed.

> So instead of a use after free happening when we access the state we get
> a double-free happening later at the end of commit tail in DRM.
>
> What I think would be the right next step here is to actually determine
> what sequence of IOCTLs and atomic commits are happening under your
> setup with a very verbose dmesg log. You can set a debug level for DRM
> in your kernel parameters with something like:
>
> drm.debug=0x54
>
> I don't see anything in amdgpu_dm.c that looks like it would be freeing
> the state so I suspect something in the core is this doing this.

Going through the KASAN use-after-free bug report in the Bugzilla
attachments, it appears that the state is being freed at the end of
commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
the same old state twice? I can't quite think of any other possible
explanation as to why that happens.

>
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > 1 file changed, 3 insertions(+)
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 86ffa0c2880f..86d6652872f2 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > * unset legacy_cursor_update
> > */
> >
> > -   drm_atomic_state_get(state);
>
> Also note that if the drm_atomic_helper_commit() call fails here then
> we're going to never free this structure. So we should really be
> checking the return code here below before trying to do this, if at all.

Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
see any return statements in there, so I thought it was safe.

>
> Regards,
> Nicholas Kazlauskas
>
> >       return drm_atomic_helper_commit(dev, state, nonblock);
> >
> >       /*TODO Handle EINTR, reenable IRQ*/
> >
> >
> > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> >
> >       if (dc_state_temp)
> >       	dc_release_state(dc_state_temp);
> >
> >
> > -
> > -   drm_atomic_state_put(state);
> >     }
> >
> >
> > --
> > 2.27.0

Thanks,
Mazin Rezk
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:57     ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:57 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, harry.wentland,
	christian.koenig

It seems that I spoke too soon. I ran the system for another hour after
submitting the patch and the bug just occurred. :/

Sadly, that means the bug isn't really fixed and that I have to go
investigate further.

At the very least, this patch seems to delay the occurrence of the bug
significantly which may help in further discovering the cause.

On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas <nicholas.kazlauskas@amd.com> wrote:

> On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > Reported-by: Duncan 1i5t5.duncan@cox.net
> > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
>
> Thanks for the investigation and your patch. I appreciate the help in
> trying to narrow down the root cause as this issue has been difficult to
> reproduce on my setups.
>
> Though I'm not sure this really resolves the issue - we make use of the
> drm_atomic_helper_commit helper function from DRM which internally does
> what you're doing with this patch:
>
> drm_atomic_state_get(state);
> if (nonblock)
> queue_work(system_unbound_wq, &state->commit_work);
>
>     else
>     	commit_tail(state);
>
>
> So even when it gets queued off to the unbound workqueue we still have a
> reference on the state.
>
> That reference gets dropped as part of commit tail helper in DRM as well:
>
> if (funcs && funcs->atomic_commit_tail)
>
>     	funcs->atomic_commit_tail(old_state);
>
>     else
>     	drm_atomic_helper_commit_tail(old_state);
>
>
> commit_time_ms = ktime_ms_delta(ktime_get(), start);
> if (commit_time_ms > 0)
>
>     	drm_self_refresh_helper_update_avg_times(old_state,
>     					 (unsigned long)commit_time_ms,
>     					 new_self_refresh_mask);
>
>
> drm_atomic_helper_commit_cleanup_done(old_state);
>
> drm_atomic_state_put(old_state);
>

I initially noticed that right after I wrote this patch so I was expecting
the patch to fail. However, after several hours of testing, the crash just
didn't occur so I believed the bug was fixed.

> So instead of a use after free happening when we access the state we get
> a double-free happening later at the end of commit tail in DRM.
>
> What I think would be the right next step here is to actually determine
> what sequence of IOCTLs and atomic commits are happening under your
> setup with a very verbose dmesg log. You can set a debug level for DRM
> in your kernel parameters with something like:
>
> drm.debug=0x54
>
> I don't see anything in amdgpu_dm.c that looks like it would be freeing
> the state so I suspect something in the core is this doing this.

Going through the KASAN use-after-free bug report in the Bugzilla
attachments, it appears that the state is being freed at the end of
commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
the same old state twice? I can't quite think of any other possible
explanation as to why that happens.

>
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > 1 file changed, 3 insertions(+)
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 86ffa0c2880f..86d6652872f2 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > * unset legacy_cursor_update
> > */
> >
> > -   drm_atomic_state_get(state);
>
> Also note that if the drm_atomic_helper_commit() call fails here then
> we're going to never free this structure. So we should really be
> checking the return code here below before trying to do this, if at all.

Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
see any return statements in there, so I thought it was safe.

>
> Regards,
> Nicholas Kazlauskas
>
> >       return drm_atomic_helper_commit(dev, state, nonblock);
> >
> >       /*TODO Handle EINTR, reenable IRQ*/
> >
> >
> > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> >
> >       if (dc_state_temp)
> >       	dc_release_state(dc_state_temp);
> >
> >
> > -
> > -   drm_atomic_state_put(state);
> >     }
> >
> >
> > --
> > 2.27.0

Thanks,
Mazin Rezk
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 22:32   ` Kees Cook
  (?)
@ 2020-07-23 22:58     ` Mazin Rezk
  -1 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: Mazin Rezk, linux-kernel, amd-gfx, dri-devel, akpm,
	christian.koenig, harry.wentland, nicholas.kazlauskas,
	sunpeng.li, alexander.deucher, 1i5t5.duncan, mphantomx,
	regressions, anthony.ruhier, pmenzel

On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:

> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
>
> Nice work tracking this down!
>
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
>
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Kees Cook


Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
I just thought to do that because it was what made the use-after-free cause
a noticeable bug.

Also, by the way, I just realised the patch didn't completely solve the bug.
Sorry about that, making an LKML thread on this was hasty on my part. Should
I get further confirmation from the Bugzilla thread before submitting a patch
for this bug in the future?

Thanks,
Mazin Rezk

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:58     ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, nicholas.kazlauskas, regressions,
	amd-gfx, alexander.deucher, akpm, mphantomx, christian.koenig

On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:

> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
>
> Nice work tracking this down!
>
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
>
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Kees Cook


Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
I just thought to do that because it was what made the use-after-free cause
a noticeable bug.

Also, by the way, I just realised the patch didn't completely solve the bug.
Sorry about that, making an LKML thread on this was hasty on my part. Should
I get further confirmation from the Bugzilla thread before submitting a patch
for this bug in the future?

Thanks,
Mazin Rezk
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-23 22:58     ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-23 22:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, nicholas.kazlauskas, regressions,
	amd-gfx, alexander.deucher, akpm, mphantomx, harry.wentland,
	christian.koenig

On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:

> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
>
> Nice work tracking this down!
>
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
>
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Kees Cook


Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
I just thought to do that because it was what made the use-after-free cause
a noticeable bug.

Also, by the way, I just realised the patch didn't completely solve the bug.
Sorry about that, making an LKML thread on this was hasty on my part. Should
I get further confirmation from the Bugzilla thread before submitting a patch
for this bug in the future?

Thanks,
Mazin Rezk
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 22:58     ` Mazin Rezk
  (?)
@ 2020-07-24  7:26       ` Christian König
  -1 siblings, 0 replies; 48+ messages in thread
From: Christian König @ 2020-07-24  7:26 UTC (permalink / raw)
  To: Mazin Rezk, Kees Cook
  Cc: linux-kernel, amd-gfx, dri-devel, akpm, harry.wentland,
	nicholas.kazlauskas, sunpeng.li, alexander.deucher, 1i5t5.duncan,
	mphantomx, regressions, anthony.ruhier, pmenzel

Am 24.07.20 um 00:58 schrieb Mazin Rezk:
> On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:
>
>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>>
>>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>>> running, causing a race condition where state (and then dm_state) is
>>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>>> was stored at the beginning of dm_state (base), which was unused. After
>>> changing the freelist pointer to be stored in the middle of the struct, the
>>> freelist pointer overwrote the context, causing dc_state to become garbage
>>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>>> a freelist pointer.
>>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>>> Bugzilla [1].
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D207383&amp;data=02%7C01%7Charry.wentland%40amd.com%7C53cc9cffb1d244d7b43508d82f5bed1b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637311419153032496&amp;sdata=t45vmEJ80UXOmRfndGfe69AOedtkFUwDqvWgGDrSuOk%3D&amp;reserved=0
>> Nice work tracking this down!
>>
>>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>> I do, however, object to this Fixes tag. :) The flaw appears to have
>> been with amdgpu_dm's reference tracking of "state" in the nonblocking
>> case. (How this reference counting is supposed to work correctly, though,
>> I'm not sure.) If I look at where the drm helper was split from being
>> the default callback, it looks like this was what introduced the bug:
>>
>> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>>
>> ? 3202fa62f certainly exposed it much more quickly, but there was a race
>> even without 3202fa62f where something could have realloced the memory
>> and written over it.
>>
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Kees Cook
>
> Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
> I just thought to do that because it was what made the use-after-free cause
> a noticeable bug.
>
> Also, by the way, I just realised the patch didn't completely solve the bug.
> Sorry about that, making an LKML thread on this was hasty on my part. Should
> I get further confirmation from the Bugzilla thread before submitting a patch
> for this bug in the future?

Submitting stuff as early as possible is mostly a good idea. Just if the 
code is utterly broken or completely unreadable you should probably 
expect a harsh response :)

Maybe ask for more testing in the commit message if you are not 100% 
sure if that really fixes a bug or not.

Regards,
Christian.

>
> Thanks,
> Mazin Rezk


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24  7:26       ` Christian König
  0 siblings, 0 replies; 48+ messages in thread
From: Christian König @ 2020-07-24  7:26 UTC (permalink / raw)
  To: Mazin Rezk, Kees Cook
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, regressions, amd-gfx, alexander.deucher, akpm,
	mphantomx, nicholas.kazlauskas

Am 24.07.20 um 00:58 schrieb Mazin Rezk:
> On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:
>
>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>>
>>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>>> running, causing a race condition where state (and then dm_state) is
>>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>>> was stored at the beginning of dm_state (base), which was unused. After
>>> changing the freelist pointer to be stored in the middle of the struct, the
>>> freelist pointer overwrote the context, causing dc_state to become garbage
>>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>>> a freelist pointer.
>>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>>> Bugzilla [1].
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D207383&amp;data=02%7C01%7Charry.wentland%40amd.com%7C53cc9cffb1d244d7b43508d82f5bed1b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637311419153032496&amp;sdata=t45vmEJ80UXOmRfndGfe69AOedtkFUwDqvWgGDrSuOk%3D&amp;reserved=0
>> Nice work tracking this down!
>>
>>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>> I do, however, object to this Fixes tag. :) The flaw appears to have
>> been with amdgpu_dm's reference tracking of "state" in the nonblocking
>> case. (How this reference counting is supposed to work correctly, though,
>> I'm not sure.) If I look at where the drm helper was split from being
>> the default callback, it looks like this was what introduced the bug:
>>
>> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>>
>> ? 3202fa62f certainly exposed it much more quickly, but there was a race
>> even without 3202fa62f where something could have realloced the memory
>> and written over it.
>>
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Kees Cook
>
> Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
> I just thought to do that because it was what made the use-after-free cause
> a noticeable bug.
>
> Also, by the way, I just realised the patch didn't completely solve the bug.
> Sorry about that, making an LKML thread on this was hasty on my part. Should
> I get further confirmation from the Bugzilla thread before submitting a patch
> for this bug in the future?

Submitting stuff as early as possible is mostly a good idea. Just if the 
code is utterly broken or completely unreadable you should probably 
expect a harsh response :)

Maybe ask for more testing in the commit message if you are not 100% 
sure if that really fixes a bug or not.

Regards,
Christian.

>
> Thanks,
> Mazin Rezk

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24  7:26       ` Christian König
  0 siblings, 0 replies; 48+ messages in thread
From: Christian König @ 2020-07-24  7:26 UTC (permalink / raw)
  To: Mazin Rezk, Kees Cook
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, regressions, amd-gfx, alexander.deucher, akpm,
	mphantomx, harry.wentland, nicholas.kazlauskas

Am 24.07.20 um 00:58 schrieb Mazin Rezk:
> On Thursday, July 23, 2020 6:32 PM, Kees Cook <keescook@chromium.org> wrote:
>
>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>>
>>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>>> running, causing a race condition where state (and then dm_state) is
>>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>>> was stored at the beginning of dm_state (base), which was unused. After
>>> changing the freelist pointer to be stored in the middle of the struct, the
>>> freelist pointer overwrote the context, causing dc_state to become garbage
>>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>>> a freelist pointer.
>>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>>> Bugzilla [1].
>>> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D207383&amp;data=02%7C01%7Charry.wentland%40amd.com%7C53cc9cffb1d244d7b43508d82f5bed1b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637311419153032496&amp;sdata=t45vmEJ80UXOmRfndGfe69AOedtkFUwDqvWgGDrSuOk%3D&amp;reserved=0
>> Nice work tracking this down!
>>
>>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>> I do, however, object to this Fixes tag. :) The flaw appears to have
>> been with amdgpu_dm's reference tracking of "state" in the nonblocking
>> case. (How this reference counting is supposed to work correctly, though,
>> I'm not sure.) If I look at where the drm helper was split from being
>> the default callback, it looks like this was what introduced the bug:
>>
>> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>>
>> ? 3202fa62f certainly exposed it much more quickly, but there was a race
>> even without 3202fa62f where something could have realloced the memory
>> and written over it.
>>
>> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Kees Cook
>
> Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
> I just thought to do that because it was what made the use-after-free cause
> a noticeable bug.
>
> Also, by the way, I just realised the patch didn't completely solve the bug.
> Sorry about that, making an LKML thread on this was hasty on my part. Should
> I get further confirmation from the Bugzilla thread before submitting a patch
> for this bug in the future?

Submitting stuff as early as possible is mostly a good idea. Just if the 
code is utterly broken or completely unreadable you should probably 
expect a harsh response :)

Maybe ask for more testing in the commit message if you are not 100% 
sure if that really fixes a bug or not.

Regards,
Christian.

>
> Thanks,
> Mazin Rezk

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 22:32   ` Kees Cook
  (?)
@ 2020-07-24  7:45     ` Paul Menzel
  -1 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24  7:45 UTC (permalink / raw)
  To: Kees Cook, Mazin Rezk
  Cc: linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, Nicholas Kazlauskas,
	sunpeng.li, Alexander Deucher, 1i5t5.duncan, mphantomx,
	regressions, anthony.ruhier

Dear Kees,


Am 24.07.20 um 00:32 schrieb Kees Cook:
> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>> running, causing a race condition where state (and then dm_state) is
>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>
>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>> was stored at the beginning of dm_state (base), which was unused. After
>> changing the freelist pointer to be stored in the middle of the struct, the
>> freelist pointer overwrote the context, causing dc_state to become garbage
>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>> a freelist pointer.
>>
>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>
>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>> Bugzilla [1].
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Nice work tracking this down!
> 
>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> 
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
> 
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
> 
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.

I understand the Fixes tag mainly a help when backporting commits.

As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if 
commit 3202fa62f ("slub: relocate freelist pointer to middle of object") 
should be reverted for now to fix the regression for the users according 
to Linux’ no regression policy. Once the AMDGPU/DRM driver issue is 
fixed, it can be reapplied. I know it’s not optimal, but as some testing 
is going to be involved for the fix, I’d argue it’s the best option for 
the users.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24  7:45     ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24  7:45 UTC (permalink / raw)
  To: Kees Cook, Mazin Rezk
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

Dear Kees,


Am 24.07.20 um 00:32 schrieb Kees Cook:
> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>> running, causing a race condition where state (and then dm_state) is
>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>
>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>> was stored at the beginning of dm_state (base), which was unused. After
>> changing the freelist pointer to be stored in the middle of the struct, the
>> freelist pointer overwrote the context, causing dc_state to become garbage
>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>> a freelist pointer.
>>
>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>
>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>> Bugzilla [1].
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Nice work tracking this down!
> 
>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> 
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
> 
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
> 
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.

I understand the Fixes tag mainly a help when backporting commits.

As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if 
commit 3202fa62f ("slub: relocate freelist pointer to middle of object") 
should be reverted for now to fix the regression for the users according 
to Linux’ no regression policy. Once the AMDGPU/DRM driver issue is 
fixed, it can be reapplied. I know it’s not optimal, but as some testing 
is going to be involved for the fix, I’d argue it’s the best option for 
the users.


Kind regards,

Paul
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24  7:45     ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24  7:45 UTC (permalink / raw)
  To: Kees Cook, Mazin Rezk
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx, Harry Wentland,
	Christian König

Dear Kees,


Am 24.07.20 um 00:32 schrieb Kees Cook:
> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> When amdgpu_dm_atomic_commit_tail is running in the workqueue,
>> drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
>> running, causing a race condition where state (and then dm_state) is
>> sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
>> occurred since 5.7-rc1 and is well documented among polaris11 users [1].
>>
>> Prior to 5.7, this was not a noticeable issue since the freelist pointer
>> was stored at the beginning of dm_state (base), which was unused. After
>> changing the freelist pointer to be stored in the middle of the struct, the
>> freelist pointer overwrote the context, causing dc_state to become garbage
>> data and made the call to dm_enable_per_frame_crtc_master_sync dereference
>> a freelist pointer.
>>
>> This patch fixes the aforementioned issue by calling drm_atomic_state_get
>> in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
>> drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
>>
>> According to my testing on 5.8.0-rc6, this should fix bug 207383 on
>> Bugzilla [1].
>>
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> Nice work tracking this down!
> 
>> Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> 
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
> 
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
> 
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.

I understand the Fixes tag mainly a help when backporting commits.

As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if 
commit 3202fa62f ("slub: relocate freelist pointer to middle of object") 
should be reverted for now to fix the regression for the users according 
to Linux’ no regression policy. Once the AMDGPU/DRM driver issue is 
fixed, it can be reapplied. I know it’s not optimal, but as some testing 
is going to be involved for the fix, I’d argue it’s the best option for 
the users.


Kind regards,

Paul
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-24  7:45     ` Paul Menzel
  (?)
@ 2020-07-24 17:33       ` Kees Cook
  -1 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-24 17:33 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Mazin Rezk, linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, Nicholas Kazlauskas,
	sunpeng.li, Alexander Deucher, 1i5t5.duncan, mphantomx,
	regressions, anthony.ruhier

On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> Am 24.07.20 um 00:32 schrieb Kees Cook:
> > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> reverted for now to fix the regression for the users according to Linux’ no
> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> reapplied. I know it’s not optimal, but as some testing is going to be
> involved for the fix, I’d argue it’s the best option for the users.

Well, the SLUB defense was already released in v5.7, so I'm not sure it
really helps for amdgpu_dm users seeing it there too. There was a fix to
disable the async path for this driver that worked around the bug too,
yes? That seems like a safer and more focused change that doesn't revert
the SLUB defense for all users, and would actually provide a complete,
I think, workaround whereas reverting the SLUB change means the race
still exists. For example, it would be hit with slab poisoning, etc.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 17:33       ` Kees Cook
  0 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-24 17:33 UTC (permalink / raw)
  To: Paul Menzel
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> Am 24.07.20 um 00:32 schrieb Kees Cook:
> > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> reverted for now to fix the regression for the users according to Linux’ no
> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> reapplied. I know it’s not optimal, but as some testing is going to be
> involved for the fix, I’d argue it’s the best option for the users.

Well, the SLUB defense was already released in v5.7, so I'm not sure it
really helps for amdgpu_dm users seeing it there too. There was a fix to
disable the async path for this driver that worked around the bug too,
yes? That seems like a safer and more focused change that doesn't revert
the SLUB defense for all users, and would actually provide a complete,
I think, workaround whereas reverting the SLUB change means the race
still exists. For example, it would be hit with slab poisoning, etc.

-- 
Kees Cook
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 17:33       ` Kees Cook
  0 siblings, 0 replies; 48+ messages in thread
From: Kees Cook @ 2020-07-24 17:33 UTC (permalink / raw)
  To: Paul Menzel
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Harry Wentland, Christian König

On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> Am 24.07.20 um 00:32 schrieb Kees Cook:
> > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> reverted for now to fix the regression for the users according to Linux’ no
> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> reapplied. I know it’s not optimal, but as some testing is going to be
> involved for the fix, I’d argue it’s the best option for the users.

Well, the SLUB defense was already released in v5.7, so I'm not sure it
really helps for amdgpu_dm users seeing it there too. There was a fix to
disable the async path for this driver that worked around the bug too,
yes? That seems like a safer and more focused change that doesn't revert
the SLUB defense for all users, and would actually provide a complete,
I think, workaround whereas reverting the SLUB change means the race
still exists. For example, it would be hit with slab poisoning, etc.

-- 
Kees Cook
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-23 22:57     ` Mazin Rezk
  (?)
@ 2020-07-24 21:09       ` Mazin Rezk
  -1 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-24 21:09 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: Mazin Rezk, linux-kernel, amd-gfx, dri-devel, akpm,
	christian.koenig, harry.wentland, sunpeng.li, keescook,
	alexander.deucher, 1i5t5.duncan, mphantomx, regressions,
	anthony.ruhier, pmenzel

On Thursday, July 23, 2020 6:57 PM, Mazin Rezk <mnrzk@protonmail.com> wrote:

> It seems that I spoke too soon. I ran the system for another hour after
> submitting the patch and the bug just occurred. :/
>
> Sadly, that means the bug isn't really fixed and that I have to go
> investigate further.
>
> At the very least, this patch seems to delay the occurrence of the bug
> significantly which may help in further discovering the cause.
>
> On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas nicholas.kazlauskas@amd.com wrote:
>
> > On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> >
> > > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > > running, causing a race condition where state (and then dm_state) is
> > > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > > was stored at the beginning of dm_state (base), which was unused. After
> > > changing the freelist pointer to be stored in the middle of the struct, the
> > > freelist pointer overwrote the context, causing dc_state to become garbage
> > > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > > a freelist pointer.
> > > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > > Bugzilla [1].
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > > Reported-by: Duncan 1i5t5.duncan@cox.net
> > > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
> >
> > Thanks for the investigation and your patch. I appreciate the help in
> > trying to narrow down the root cause as this issue has been difficult to
> > reproduce on my setups.
> > Though I'm not sure this really resolves the issue - we make use of the
> > drm_atomic_helper_commit helper function from DRM which internally does
> > what you're doing with this patch:
> > drm_atomic_state_get(state);
> > if (nonblock)
> > queue_work(system_unbound_wq, &state->commit_work);
> >
> >     else
> >     	commit_tail(state);
> >
> >
> > So even when it gets queued off to the unbound workqueue we still have a
> > reference on the state.
> > That reference gets dropped as part of commit tail helper in DRM as well:
> > if (funcs && funcs->atomic_commit_tail)
> >
> >     	funcs->atomic_commit_tail(old_state);
> >
> >     else
> >     	drm_atomic_helper_commit_tail(old_state);
> >
> >
> > commit_time_ms = ktime_ms_delta(ktime_get(), start);
> > if (commit_time_ms > 0)
> >
> >     	drm_self_refresh_helper_update_avg_times(old_state,
> >     					 (unsigned long)commit_time_ms,
> >     					 new_self_refresh_mask);
> >
> >
> > drm_atomic_helper_commit_cleanup_done(old_state);
> > drm_atomic_state_put(old_state);
>
> I initially noticed that right after I wrote this patch so I was expecting
> the patch to fail. However, after several hours of testing, the crash just
> didn't occur so I believed the bug was fixed.
>
> > So instead of a use after free happening when we access the state we get
> > a double-free happening later at the end of commit tail in DRM.
> > What I think would be the right next step here is to actually determine
> > what sequence of IOCTLs and atomic commits are happening under your
> > setup with a very verbose dmesg log. You can set a debug level for DRM
> > in your kernel parameters with something like:
> > drm.debug=0x54
> > I don't see anything in amdgpu_dm.c that looks like it would be freeing
> > the state so I suspect something in the core is this doing this.
>
> Going through the KASAN use-after-free bug report in the Bugzilla
> attachments, it appears that the state is being freed at the end of
> commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
> the same old state twice? I can't quite think of any other possible
> explanation as to why that happens.

I think I've more or less confirmed that this is the case.

I created two padding variables, one to store debug magic numbers and one
to store the freelist pointer. I had magic numbers for initialised,
preuse, and used states. When the dm_atomic_state is initialised, the
padding is set to the init magic number. Right before commit_tail is
called, the padding is set to the preuse magic number. During
dm_atomic_get_new_state checks the magic number to confirm that it
was in the preuse state and then set it to used. If it failed that check
and it was already in a used state, there was a breakpoint set so I could
gather further information.

At one point (presumably where the crash would have occurred), the debug
padding variable was set to the used state during the call to commit_tail
which I believe confirms my guess that amdgpu_dm_atomic_commit_tail is
being called on the same state twice.

What's weird, however, is that dmesg (w/ drm.debug=0x54) says this right
before amdgpu_dm_atomic_commit_tail is called:

[ 3277.580205] [drm:drm_atomic_state_init [drm]] Allocated atomic state 00000000a06f4024
[ 3277.580262] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:49:crtc-1] 000000003b9da5c1 state to 00000000a06f4024
[ 3277.580316] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:44:plane-4] 000000003488c027 state to 00000000a06f4024
[ 3277.580366] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:103] for [PLANE:44:plane-4] state 000000003488c027
[ 3277.580417] [drm:drm_atomic_check_only [drm]] checking 00000000a06f4024
[ 3277.580519] [drm:drm_atomic_get_private_obj_state [drm]] Added new private object 0000000002a633ab state 00000000695dff15 to 00000000a06f4024
[ 3277.580579] [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000a06f4024 nonblocking
[ 3277.582325] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000a06f4024
[ 3277.582393] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000a06f4024

From the log, I'm noticing that drm_atomic_nonblocking_commit is only
called once and that whatever is calling the second non-blocking
commit_tail on the same state doesn't seem to be using
drm_atomic_nonblocking_commit.

Perhaps someone with more knowledge of the code can give a possible
explanation as to why that's happening.

Thanks,
Mazin Rezk

>
> > > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > index 86ffa0c2880f..86d6652872f2 100644
> > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > >
> > > -   unset legacy_cursor_update
> > >     */
> > >
> > >
> > > -   drm_atomic_state_get(state);
> >
> > Also note that if the drm_atomic_helper_commit() call fails here then
> > we're going to never free this structure. So we should really be
> > checking the return code here below before trying to do this, if at all.
>
> Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
> see any return statements in there, so I thought it was safe.
>
> > Regards,
> > Nicholas Kazlauskas
> >
> > >       return drm_atomic_helper_commit(dev, state, nonblock);
> > >
> > >       /*TODO Handle EINTR, reenable IRQ*/
> > >
> > >
> > > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> > >
> > >       if (dc_state_temp)
> > >       	dc_release_state(dc_state_temp);
> > >
> > >
> > > -
> > > -   drm_atomic_state_put(state);
> > >     }
> > >
> > >
> > > --
> > > 2.27.0
>
> Thanks,
> Mazin Rezk



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 21:09       ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-24 21:09 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, christian.koenig

On Thursday, July 23, 2020 6:57 PM, Mazin Rezk <mnrzk@protonmail.com> wrote:

> It seems that I spoke too soon. I ran the system for another hour after
> submitting the patch and the bug just occurred. :/
>
> Sadly, that means the bug isn't really fixed and that I have to go
> investigate further.
>
> At the very least, this patch seems to delay the occurrence of the bug
> significantly which may help in further discovering the cause.
>
> On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas nicholas.kazlauskas@amd.com wrote:
>
> > On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> >
> > > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > > running, causing a race condition where state (and then dm_state) is
> > > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > > was stored at the beginning of dm_state (base), which was unused. After
> > > changing the freelist pointer to be stored in the middle of the struct, the
> > > freelist pointer overwrote the context, causing dc_state to become garbage
> > > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > > a freelist pointer.
> > > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > > Bugzilla [1].
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > > Reported-by: Duncan 1i5t5.duncan@cox.net
> > > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
> >
> > Thanks for the investigation and your patch. I appreciate the help in
> > trying to narrow down the root cause as this issue has been difficult to
> > reproduce on my setups.
> > Though I'm not sure this really resolves the issue - we make use of the
> > drm_atomic_helper_commit helper function from DRM which internally does
> > what you're doing with this patch:
> > drm_atomic_state_get(state);
> > if (nonblock)
> > queue_work(system_unbound_wq, &state->commit_work);
> >
> >     else
> >     	commit_tail(state);
> >
> >
> > So even when it gets queued off to the unbound workqueue we still have a
> > reference on the state.
> > That reference gets dropped as part of commit tail helper in DRM as well:
> > if (funcs && funcs->atomic_commit_tail)
> >
> >     	funcs->atomic_commit_tail(old_state);
> >
> >     else
> >     	drm_atomic_helper_commit_tail(old_state);
> >
> >
> > commit_time_ms = ktime_ms_delta(ktime_get(), start);
> > if (commit_time_ms > 0)
> >
> >     	drm_self_refresh_helper_update_avg_times(old_state,
> >     					 (unsigned long)commit_time_ms,
> >     					 new_self_refresh_mask);
> >
> >
> > drm_atomic_helper_commit_cleanup_done(old_state);
> > drm_atomic_state_put(old_state);
>
> I initially noticed that right after I wrote this patch so I was expecting
> the patch to fail. However, after several hours of testing, the crash just
> didn't occur so I believed the bug was fixed.
>
> > So instead of a use after free happening when we access the state we get
> > a double-free happening later at the end of commit tail in DRM.
> > What I think would be the right next step here is to actually determine
> > what sequence of IOCTLs and atomic commits are happening under your
> > setup with a very verbose dmesg log. You can set a debug level for DRM
> > in your kernel parameters with something like:
> > drm.debug=0x54
> > I don't see anything in amdgpu_dm.c that looks like it would be freeing
> > the state so I suspect something in the core is this doing this.
>
> Going through the KASAN use-after-free bug report in the Bugzilla
> attachments, it appears that the state is being freed at the end of
> commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
> the same old state twice? I can't quite think of any other possible
> explanation as to why that happens.

I think I've more or less confirmed that this is the case.

I created two padding variables, one to store debug magic numbers and one
to store the freelist pointer. I had magic numbers for initialised,
preuse, and used states. When the dm_atomic_state is initialised, the
padding is set to the init magic number. Right before commit_tail is
called, the padding is set to the preuse magic number. During
dm_atomic_get_new_state checks the magic number to confirm that it
was in the preuse state and then set it to used. If it failed that check
and it was already in a used state, there was a breakpoint set so I could
gather further information.

At one point (presumably where the crash would have occurred), the debug
padding variable was set to the used state during the call to commit_tail
which I believe confirms my guess that amdgpu_dm_atomic_commit_tail is
being called on the same state twice.

What's weird, however, is that dmesg (w/ drm.debug=0x54) says this right
before amdgpu_dm_atomic_commit_tail is called:

[ 3277.580205] [drm:drm_atomic_state_init [drm]] Allocated atomic state 00000000a06f4024
[ 3277.580262] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:49:crtc-1] 000000003b9da5c1 state to 00000000a06f4024
[ 3277.580316] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:44:plane-4] 000000003488c027 state to 00000000a06f4024
[ 3277.580366] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:103] for [PLANE:44:plane-4] state 000000003488c027
[ 3277.580417] [drm:drm_atomic_check_only [drm]] checking 00000000a06f4024
[ 3277.580519] [drm:drm_atomic_get_private_obj_state [drm]] Added new private object 0000000002a633ab state 00000000695dff15 to 00000000a06f4024
[ 3277.580579] [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000a06f4024 nonblocking
[ 3277.582325] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000a06f4024
[ 3277.582393] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000a06f4024

From the log, I'm noticing that drm_atomic_nonblocking_commit is only
called once and that whatever is calling the second non-blocking
commit_tail on the same state doesn't seem to be using
drm_atomic_nonblocking_commit.

Perhaps someone with more knowledge of the code can give a possible
explanation as to why that's happening.

Thanks,
Mazin Rezk

>
> > > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > index 86ffa0c2880f..86d6652872f2 100644
> > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > >
> > > -   unset legacy_cursor_update
> > >     */
> > >
> > >
> > > -   drm_atomic_state_get(state);
> >
> > Also note that if the drm_atomic_helper_commit() call fails here then
> > we're going to never free this structure. So we should really be
> > checking the return code here below before trying to do this, if at all.
>
> Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
> see any return statements in there, so I thought it was safe.
>
> > Regards,
> > Nicholas Kazlauskas
> >
> > >       return drm_atomic_helper_commit(dev, state, nonblock);
> > >
> > >       /*TODO Handle EINTR, reenable IRQ*/
> > >
> > >
> > > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> > >
> > >       if (dc_state_temp)
> > >       	dc_release_state(dc_state_temp);
> > >
> > >
> > > -
> > > -   drm_atomic_state_put(state);
> > >     }
> > >
> > >
> > > --
> > > 2.27.0
>
> Thanks,
> Mazin Rezk


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 21:09       ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-24 21:09 UTC (permalink / raw)
  To: Kazlauskas, Nicholas
  Cc: pmenzel, anthony.ruhier, 1i5t5.duncan, keescook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	alexander.deucher, akpm, mphantomx, harry.wentland,
	christian.koenig

On Thursday, July 23, 2020 6:57 PM, Mazin Rezk <mnrzk@protonmail.com> wrote:

> It seems that I spoke too soon. I ran the system for another hour after
> submitting the patch and the bug just occurred. :/
>
> Sadly, that means the bug isn't really fixed and that I have to go
> investigate further.
>
> At the very least, this patch seems to delay the occurrence of the bug
> significantly which may help in further discovering the cause.
>
> On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas nicholas.kazlauskas@amd.com wrote:
>
> > On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> >
> > > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > > running, causing a race condition where state (and then dm_state) is
> > > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > > was stored at the beginning of dm_state (base), which was unused. After
> > > changing the freelist pointer to be stored in the middle of the struct, the
> > > freelist pointer overwrote the context, causing dc_state to become garbage
> > > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > > a freelist pointer.
> > > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > > Bugzilla [1].
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > > Reported-by: Duncan 1i5t5.duncan@cox.net
> > > Signed-off-by: Mazin Rezk mnrzk@protonmail.com
> >
> > Thanks for the investigation and your patch. I appreciate the help in
> > trying to narrow down the root cause as this issue has been difficult to
> > reproduce on my setups.
> > Though I'm not sure this really resolves the issue - we make use of the
> > drm_atomic_helper_commit helper function from DRM which internally does
> > what you're doing with this patch:
> > drm_atomic_state_get(state);
> > if (nonblock)
> > queue_work(system_unbound_wq, &state->commit_work);
> >
> >     else
> >     	commit_tail(state);
> >
> >
> > So even when it gets queued off to the unbound workqueue we still have a
> > reference on the state.
> > That reference gets dropped as part of commit tail helper in DRM as well:
> > if (funcs && funcs->atomic_commit_tail)
> >
> >     	funcs->atomic_commit_tail(old_state);
> >
> >     else
> >     	drm_atomic_helper_commit_tail(old_state);
> >
> >
> > commit_time_ms = ktime_ms_delta(ktime_get(), start);
> > if (commit_time_ms > 0)
> >
> >     	drm_self_refresh_helper_update_avg_times(old_state,
> >     					 (unsigned long)commit_time_ms,
> >     					 new_self_refresh_mask);
> >
> >
> > drm_atomic_helper_commit_cleanup_done(old_state);
> > drm_atomic_state_put(old_state);
>
> I initially noticed that right after I wrote this patch so I was expecting
> the patch to fail. However, after several hours of testing, the crash just
> didn't occur so I believed the bug was fixed.
>
> > So instead of a use after free happening when we access the state we get
> > a double-free happening later at the end of commit tail in DRM.
> > What I think would be the right next step here is to actually determine
> > what sequence of IOCTLs and atomic commits are happening under your
> > setup with a very verbose dmesg log. You can set a debug level for DRM
> > in your kernel parameters with something like:
> > drm.debug=0x54
> > I don't see anything in amdgpu_dm.c that looks like it would be freeing
> > the state so I suspect something in the core is this doing this.
>
> Going through the KASAN use-after-free bug report in the Bugzilla
> attachments, it appears that the state is being freed at the end of
> commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
> the same old state twice? I can't quite think of any other possible
> explanation as to why that happens.

I think I've more or less confirmed that this is the case.

I created two padding variables, one to store debug magic numbers and one
to store the freelist pointer. I had magic numbers for initialised,
preuse, and used states. When the dm_atomic_state is initialised, the
padding is set to the init magic number. Right before commit_tail is
called, the padding is set to the preuse magic number. During
dm_atomic_get_new_state checks the magic number to confirm that it
was in the preuse state and then set it to used. If it failed that check
and it was already in a used state, there was a breakpoint set so I could
gather further information.

At one point (presumably where the crash would have occurred), the debug
padding variable was set to the used state during the call to commit_tail
which I believe confirms my guess that amdgpu_dm_atomic_commit_tail is
being called on the same state twice.

What's weird, however, is that dmesg (w/ drm.debug=0x54) says this right
before amdgpu_dm_atomic_commit_tail is called:

[ 3277.580205] [drm:drm_atomic_state_init [drm]] Allocated atomic state 00000000a06f4024
[ 3277.580262] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:49:crtc-1] 000000003b9da5c1 state to 00000000a06f4024
[ 3277.580316] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:44:plane-4] 000000003488c027 state to 00000000a06f4024
[ 3277.580366] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:103] for [PLANE:44:plane-4] state 000000003488c027
[ 3277.580417] [drm:drm_atomic_check_only [drm]] checking 00000000a06f4024
[ 3277.580519] [drm:drm_atomic_get_private_obj_state [drm]] Added new private object 0000000002a633ab state 00000000695dff15 to 00000000a06f4024
[ 3277.580579] [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000a06f4024 nonblocking
[ 3277.582325] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000a06f4024
[ 3277.582393] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000a06f4024

From the log, I'm noticing that drm_atomic_nonblocking_commit is only
called once and that whatever is calling the second non-blocking
commit_tail on the same state doesn't seem to be using
drm_atomic_nonblocking_commit.

Perhaps someone with more knowledge of the code can give a possible
explanation as to why that's happening.

Thanks,
Mazin Rezk

>
> > > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > index 86ffa0c2880f..86d6652872f2 100644
> > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
> > >
> > > -   unset legacy_cursor_update
> > >     */
> > >
> > >
> > > -   drm_atomic_state_get(state);
> >
> > Also note that if the drm_atomic_helper_commit() call fails here then
> > we're going to never free this structure. So we should really be
> > checking the return code here below before trying to do this, if at all.
>
> Oh right, that's true. I looked at amdgpu_dm_atomic_commit_tail and didn't
> see any return statements in there, so I thought it was safe.
>
> > Regards,
> > Nicholas Kazlauskas
> >
> > >       return drm_atomic_helper_commit(dev, state, nonblock);
> > >
> > >       /*TODO Handle EINTR, reenable IRQ*/
> > >
> > >
> > > @@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
> > >
> > >       if (dc_state_temp)
> > >       	dc_release_state(dc_state_temp);
> > >
> > >
> > > -
> > > -   drm_atomic_state_put(state);
> > >     }
> > >
> > >
> > > --
> > > 2.27.0
>
> Thanks,
> Mazin Rezk


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-24 17:33       ` Kees Cook
  (?)
@ 2020-07-24 21:19         ` Paul Menzel
  -1 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24 21:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: Mazin Rezk, linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, Nicholas Kazlauskas,
	sunpeng.li, Alexander Deucher, 1i5t5.duncan, mphantomx,
	regressions, anthony.ruhier


Dear Kees,


Am 24.07.20 um 19:33 schrieb Kees Cook:
> On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
>> Am 24.07.20 um 00:32 schrieb Kees Cook:
>>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
>> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
>> reverted for now to fix the regression for the users according to Linux’ no
>> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
>> reapplied. I know it’s not optimal, but as some testing is going to be
>> involved for the fix, I’d argue it’s the best option for the users.
> 
> Well, the SLUB defense was already released in v5.7, so I'm not sure it
> really helps for amdgpu_dm users seeing it there too.

In my opinion, it would help, as the stable release could pick up the 
revert, ones it’s in Linus’ master branch.

> There was a fix to disable the async path for this driver that worked
> around the bug too, yes? That seems like a safer and more focused
> change that doesn't revert the SLUB defense for all users, and would
> actually provide a complete, I think, workaround whereas reverting
> the SLUB change means the race still exists. For example, it would be
> hit with slab poisoning, etc.

I do not know. If there is such a fix, that would be great. But if you 
do not know, how should a normal user? ;-)


Kind regards,

Paul


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 21:19         ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24 21:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Christian König


Dear Kees,


Am 24.07.20 um 19:33 schrieb Kees Cook:
> On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
>> Am 24.07.20 um 00:32 schrieb Kees Cook:
>>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
>> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
>> reverted for now to fix the regression for the users according to Linux’ no
>> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
>> reapplied. I know it’s not optimal, but as some testing is going to be
>> involved for the fix, I’d argue it’s the best option for the users.
> 
> Well, the SLUB defense was already released in v5.7, so I'm not sure it
> really helps for amdgpu_dm users seeing it there too.

In my opinion, it would help, as the stable release could pick up the 
revert, ones it’s in Linus’ master branch.

> There was a fix to disable the async path for this driver that worked
> around the bug too, yes? That seems like a safer and more focused
> change that doesn't revert the SLUB defense for all users, and would
> actually provide a complete, I think, workaround whereas reverting
> the SLUB change means the race still exists. For example, it would be
> hit with slab poisoning, etc.

I do not know. If there is such a fix, that would be great. But if you 
do not know, how should a normal user? ;-)


Kind regards,

Paul


Kind regards,

Paul
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-24 21:19         ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-24 21:19 UTC (permalink / raw)
  To: Kees Cook
  Cc: anthony.ruhier, 1i5t5.duncan, sunpeng.li, Mazin Rezk,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Harry Wentland, Christian König


Dear Kees,


Am 24.07.20 um 19:33 schrieb Kees Cook:
> On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
>> Am 24.07.20 um 00:32 schrieb Kees Cook:
>>> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>> As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
>> 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
>> reverted for now to fix the regression for the users according to Linux’ no
>> regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
>> reapplied. I know it’s not optimal, but as some testing is going to be
>> involved for the fix, I’d argue it’s the best option for the users.
> 
> Well, the SLUB defense was already released in v5.7, so I'm not sure it
> really helps for amdgpu_dm users seeing it there too.

In my opinion, it would help, as the stable release could pick up the 
revert, ones it’s in Linus’ master branch.

> There was a fix to disable the async path for this driver that worked
> around the bug too, yes? That seems like a safer and more focused
> change that doesn't revert the SLUB defense for all users, and would
> actually provide a complete, I think, workaround whereas reverting
> the SLUB change means the race still exists. For example, it would be
> hit with slab poisoning, etc.

I do not know. If there is such a fix, that would be great. But if you 
do not know, how should a normal user? ;-)


Kind regards,

Paul


Kind regards,

Paul
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-24 21:19         ` Paul Menzel
  (?)
@ 2020-07-25  3:03           ` Mazin Rezk
  -1 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  3:03 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Kees Cook, linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, Nicholas Kazlauskas,
	sunpeng.li, Alexander Deucher, 1i5t5.duncan, mphantomx,
	regressions, anthony.ruhier

On Friday, July 24, 2020 5:19 PM, Paul Menzel <pmenzel@molgen.mpg.de> wrote:

> Dear Kees,
>
> Am 24.07.20 um 19:33 schrieb Kees Cook:
>
> > On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> >
> > > Am 24.07.20 um 00:32 schrieb Kees Cook:
> > >
> > > > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> > > > As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> > > > 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> > > > reverted for now to fix the regression for the users according to Linux’ no
> > > > regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> > > > reapplied. I know it’s not optimal, but as some testing is going to be
> > > > involved for the fix, I’d argue it’s the best option for the users.
> >
> > Well, the SLUB defense was already released in v5.7, so I'm not sure it
> > really helps for amdgpu_dm users seeing it there too.
>
> In my opinion, it would help, as the stable release could pick up the
> revert, ones it’s in Linus’ master branch.
>
> > There was a fix to disable the async path for this driver that worked
> > around the bug too, yes? That seems like a safer and more focused
> > change that doesn't revert the SLUB defense for all users, and would
> > actually provide a complete, I think, workaround whereas reverting
> > the SLUB change means the race still exists. For example, it would be
> > hit with slab poisoning, etc.
>
> I do not know. If there is such a fix, that would be great. But if you
> do not know, how should a normal user? ;-)
>
> Kind regards,
>
> Paul
>
> Kind regards,
>
> Paul

If we're talking about workarounds now, I suggest simply swapping the base
and context variables in struct dm_atomic_state. By that way, we won't need
to change non-amdgpu parts of the code (e.g. by reverting the SLUB patch).

Prior to 3202fa62f, the freelist pointer was stored in dm_state->base which
was never dereferenced and therefore caused no noticeable issue. After
3202fa62f, the freelist pointer is stored in the middle of the struct (i.e.
dm_state->context).

Swapping the position of the base and context variables in dm_atomic_state
should, in theory, revert this code back to it's pre-5.7 state since the
code would be back to overwriting base instead.

If we decide to use this workaround, I can write the patch and do more
extended tests to confirm it works around the issues.

That said, I haven't seen the async disabling patch. If you could link to
it, I'd be glad to test it out and perhaps we can use that instead.

Thanks,
Mazin Rezk


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  3:03           ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  3:03 UTC (permalink / raw)
  To: Paul Menzel
  Cc: anthony.ruhier, 1i5t5.duncan, Kees Cook, sunpeng.li,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

On Friday, July 24, 2020 5:19 PM, Paul Menzel <pmenzel@molgen.mpg.de> wrote:

> Dear Kees,
>
> Am 24.07.20 um 19:33 schrieb Kees Cook:
>
> > On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> >
> > > Am 24.07.20 um 00:32 schrieb Kees Cook:
> > >
> > > > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> > > > As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> > > > 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> > > > reverted for now to fix the regression for the users according to Linux’ no
> > > > regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> > > > reapplied. I know it’s not optimal, but as some testing is going to be
> > > > involved for the fix, I’d argue it’s the best option for the users.
> >
> > Well, the SLUB defense was already released in v5.7, so I'm not sure it
> > really helps for amdgpu_dm users seeing it there too.
>
> In my opinion, it would help, as the stable release could pick up the
> revert, ones it’s in Linus’ master branch.
>
> > There was a fix to disable the async path for this driver that worked
> > around the bug too, yes? That seems like a safer and more focused
> > change that doesn't revert the SLUB defense for all users, and would
> > actually provide a complete, I think, workaround whereas reverting
> > the SLUB change means the race still exists. For example, it would be
> > hit with slab poisoning, etc.
>
> I do not know. If there is such a fix, that would be great. But if you
> do not know, how should a normal user? ;-)
>
> Kind regards,
>
> Paul
>
> Kind regards,
>
> Paul

If we're talking about workarounds now, I suggest simply swapping the base
and context variables in struct dm_atomic_state. By that way, we won't need
to change non-amdgpu parts of the code (e.g. by reverting the SLUB patch).

Prior to 3202fa62f, the freelist pointer was stored in dm_state->base which
was never dereferenced and therefore caused no noticeable issue. After
3202fa62f, the freelist pointer is stored in the middle of the struct (i.e.
dm_state->context).

Swapping the position of the base and context variables in dm_atomic_state
should, in theory, revert this code back to it's pre-5.7 state since the
code would be back to overwriting base instead.

If we decide to use this workaround, I can write the patch and do more
extended tests to confirm it works around the issues.

That said, I haven't seen the async disabling patch. If you could link to
it, I'd be glad to test it out and perhaps we can use that instead.

Thanks,
Mazin Rezk

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  3:03           ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  3:03 UTC (permalink / raw)
  To: Paul Menzel
  Cc: anthony.ruhier, 1i5t5.duncan, Kees Cook, sunpeng.li,
	linux-kernel, dri-devel, Nicholas Kazlauskas, regressions,
	amd-gfx, Alexander Deucher, Andrew Morton, mphantomx,
	Harry Wentland, Christian König

On Friday, July 24, 2020 5:19 PM, Paul Menzel <pmenzel@molgen.mpg.de> wrote:

> Dear Kees,
>
> Am 24.07.20 um 19:33 schrieb Kees Cook:
>
> > On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> >
> > > Am 24.07.20 um 00:32 schrieb Kees Cook:
> > >
> > > > On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
> > > > As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit
> > > > 3202fa62f ("slub: relocate freelist pointer to middle of object") should be
> > > > reverted for now to fix the regression for the users according to Linux’ no
> > > > regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> > > > reapplied. I know it’s not optimal, but as some testing is going to be
> > > > involved for the fix, I’d argue it’s the best option for the users.
> >
> > Well, the SLUB defense was already released in v5.7, so I'm not sure it
> > really helps for amdgpu_dm users seeing it there too.
>
> In my opinion, it would help, as the stable release could pick up the
> revert, ones it’s in Linus’ master branch.
>
> > There was a fix to disable the async path for this driver that worked
> > around the bug too, yes? That seems like a safer and more focused
> > change that doesn't revert the SLUB defense for all users, and would
> > actually provide a complete, I think, workaround whereas reverting
> > the SLUB change means the race still exists. For example, it would be
> > hit with slab poisoning, etc.
>
> I do not know. If there is such a fix, that would be great. But if you
> do not know, how should a normal user? ;-)
>
> Kind regards,
>
> Paul
>
> Kind regards,
>
> Paul

If we're talking about workarounds now, I suggest simply swapping the base
and context variables in struct dm_atomic_state. By that way, we won't need
to change non-amdgpu parts of the code (e.g. by reverting the SLUB patch).

Prior to 3202fa62f, the freelist pointer was stored in dm_state->base which
was never dereferenced and therefore caused no noticeable issue. After
3202fa62f, the freelist pointer is stored in the middle of the struct (i.e.
dm_state->context).

Swapping the position of the base and context variables in dm_atomic_state
should, in theory, revert this code back to it's pre-5.7 state since the
code would be back to overwriting base instead.

If we decide to use this workaround, I can write the patch and do more
extended tests to confirm it works around the issues.

That said, I haven't seen the async disabling patch. If you could link to
it, I'd be glad to test it out and perhaps we can use that instead.

Thanks,
Mazin Rezk

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-25  3:03           ` Mazin Rezk
  (?)
@ 2020-07-25  4:59             ` Duncan
  -1 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2020-07-25  4:59 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: Paul Menzel, Kees Cook, linux-kernel, amd-gfx, dri-devel,
	Andrew Morton, Christian König, Harry Wentland,
	Nicholas Kazlauskas, sunpeng.li, Alexander Deucher, mphantomx,
	regressions, anthony.ruhier

On Sat, 25 Jul 2020 03:03:52 +0000
Mazin Rezk <mnrzk@protonmail.com> wrote:

> > Am 24.07.20 um 19:33 schrieb Kees Cook:
> >  
> > > There was a fix to disable the async path for this driver that
> > > worked around the bug too, yes? That seems like a safer and more
> > > focused change that doesn't revert the SLUB defense for all
> > > users, and would actually provide a complete, I think, workaround
> 
> That said, I haven't seen the async disabling patch. If you could
> link to it, I'd be glad to test it out and perhaps we can use that
> instead.

I'm confused.  Not to put words in Kees' mouth; /I/ am confused (which
admittedly could well be just because I make no claims to be a
coder and am simply reading the bug and thread, but I'd appreciate some
"unconfusing" anyway).

My interpretation of the "async disabling" reference was that it was to
comment #30 on the bug:

https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30

... which (if I'm not confused on this point too) appears to be yours.
There it was stated...

>>>>
I've also found that this bug exclusively occurs when commit_work is on
the workqueue. After forcing drm_atomic_helper_commit to run all of the
commits without adding to the workqueue and running the OS, the issue
seems to have disappeared.
<<<<

Would not forcing all commits to run directly, without placing them on
the workqueue, be "async disabling"?  That's what I /thought/ he was
referencing.

OTOH your base/context swap idea sounds like a possibly "less
disturbance" workaround, if it works, and given the point in the
commit cycle... (But if it's out Sunday it's likely too late to test
and get it in now anyway; if it's another week, tho...)

-- 
Duncan - No HTML messages please; they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  4:59             ` Duncan
  0 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2020-07-25  4:59 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: Paul Menzel, anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

On Sat, 25 Jul 2020 03:03:52 +0000
Mazin Rezk <mnrzk@protonmail.com> wrote:

> > Am 24.07.20 um 19:33 schrieb Kees Cook:
> >  
> > > There was a fix to disable the async path for this driver that
> > > worked around the bug too, yes? That seems like a safer and more
> > > focused change that doesn't revert the SLUB defense for all
> > > users, and would actually provide a complete, I think, workaround
> 
> That said, I haven't seen the async disabling patch. If you could
> link to it, I'd be glad to test it out and perhaps we can use that
> instead.

I'm confused.  Not to put words in Kees' mouth; /I/ am confused (which
admittedly could well be just because I make no claims to be a
coder and am simply reading the bug and thread, but I'd appreciate some
"unconfusing" anyway).

My interpretation of the "async disabling" reference was that it was to
comment #30 on the bug:

https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30

... which (if I'm not confused on this point too) appears to be yours.
There it was stated...

>>>>
I've also found that this bug exclusively occurs when commit_work is on
the workqueue. After forcing drm_atomic_helper_commit to run all of the
commits without adding to the workqueue and running the OS, the issue
seems to have disappeared.
<<<<

Would not forcing all commits to run directly, without placing them on
the workqueue, be "async disabling"?  That's what I /thought/ he was
referencing.

OTOH your base/context swap idea sounds like a possibly "less
disturbance" workaround, if it works, and given the point in the
commit cycle... (But if it's out Sunday it's likely too late to test
and get it in now anyway; if it's another week, tho...)

-- 
Duncan - No HTML messages please; they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  4:59             ` Duncan
  0 siblings, 0 replies; 48+ messages in thread
From: Duncan @ 2020-07-25  4:59 UTC (permalink / raw)
  To: Mazin Rezk
  Cc: Paul Menzel, anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx, Harry Wentland,
	Christian König

On Sat, 25 Jul 2020 03:03:52 +0000
Mazin Rezk <mnrzk@protonmail.com> wrote:

> > Am 24.07.20 um 19:33 schrieb Kees Cook:
> >  
> > > There was a fix to disable the async path for this driver that
> > > worked around the bug too, yes? That seems like a safer and more
> > > focused change that doesn't revert the SLUB defense for all
> > > users, and would actually provide a complete, I think, workaround
> 
> That said, I haven't seen the async disabling patch. If you could
> link to it, I'd be glad to test it out and perhaps we can use that
> instead.

I'm confused.  Not to put words in Kees' mouth; /I/ am confused (which
admittedly could well be just because I make no claims to be a
coder and am simply reading the bug and thread, but I'd appreciate some
"unconfusing" anyway).

My interpretation of the "async disabling" reference was that it was to
comment #30 on the bug:

https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30

... which (if I'm not confused on this point too) appears to be yours.
There it was stated...

>>>>
I've also found that this bug exclusively occurs when commit_work is on
the workqueue. After forcing drm_atomic_helper_commit to run all of the
commits without adding to the workqueue and running the OS, the issue
seems to have disappeared.
<<<<

Would not forcing all commits to run directly, without placing them on
the workqueue, be "async disabling"?  That's what I /thought/ he was
referencing.

OTOH your base/context swap idea sounds like a possibly "less
disturbance" workaround, if it works, and given the point in the
commit cycle... (But if it's out Sunday it's likely too late to test
and get it in now anyway; if it's another week, tho...)

-- 
Duncan - No HTML messages please; they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-25  4:59             ` Duncan
  (?)
@ 2020-07-25  5:20               ` Mazin Rezk
  -1 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  5:20 UTC (permalink / raw)
  To: Duncan
  Cc: Paul Menzel, Kees Cook, linux-kernel, amd-gfx, dri-devel,
	Andrew Morton, Christian König, Harry Wentland,
	Nicholas Kazlauskas, sunpeng.li, Alexander Deucher, mphantomx,
	regressions, anthony.ruhier

On Saturday, July 25, 2020 12:59 AM, Duncan <1i5t5.duncan@cox.net> wrote:

> On Sat, 25 Jul 2020 03:03:52 +0000
> Mazin Rezk mnrzk@protonmail.com wrote:
>
> > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > >
> > > > There was a fix to disable the async path for this driver that
> > > > worked around the bug too, yes? That seems like a safer and more
> > > > focused change that doesn't revert the SLUB defense for all
> > > > users, and would actually provide a complete, I think, workaround
> >
> > That said, I haven't seen the async disabling patch. If you could
> > link to it, I'd be glad to test it out and perhaps we can use that
> > instead.
>
> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> admittedly could well be just because I make no claims to be a
> coder and am simply reading the bug and thread, but I'd appreciate some
> "unconfusing" anyway).
>
> My interpretation of the "async disabling" reference was that it was to
> comment #30 on the bug:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>
> ... which (if I'm not confused on this point too) appears to be yours.
> There it was stated...
>
> > > > >
>
> I've also found that this bug exclusively occurs when commit_work is on
> the workqueue. After forcing drm_atomic_helper_commit to run all of the
> commits without adding to the workqueue and running the OS, the issue
> seems to have disappeared.
> <<<<
>
> Would not forcing all commits to run directly, without placing them on
> the workqueue, be "async disabling"? That's what I /thought/ he was
> referencing.

Oh, I thought he was referring to a different patch. Kees, could I get
your confirmation on this?

The change I made actually affected all of the DRM code, although this could
easily be changed to be specific to amdgpu. (By forcing blocking on
amdgpu_dm's non-blocking commit code)

That said, I'd still need to test further because I only did test it for a
couple of hours then. Although it should work in theory.

>
> OTOH your base/context swap idea sounds like a possibly "less
> disturbance" workaround, if it works, and given the point in the
> commit cycle... (But if it's out Sunday it's likely too late to test
> and get it in now anyway; if it's another week, tho...)

The base/context swap idea should make the use-after-free behave how it
did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
"less disturbance" workaround and more of a "no disturbance" workaround.

Thanks,
Mazin Rezk

>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Duncan - No HTML messages please; they are filtered as spam.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  5:20               ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  5:20 UTC (permalink / raw)
  To: Duncan
  Cc: Paul Menzel, anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

On Saturday, July 25, 2020 12:59 AM, Duncan <1i5t5.duncan@cox.net> wrote:

> On Sat, 25 Jul 2020 03:03:52 +0000
> Mazin Rezk mnrzk@protonmail.com wrote:
>
> > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > >
> > > > There was a fix to disable the async path for this driver that
> > > > worked around the bug too, yes? That seems like a safer and more
> > > > focused change that doesn't revert the SLUB defense for all
> > > > users, and would actually provide a complete, I think, workaround
> >
> > That said, I haven't seen the async disabling patch. If you could
> > link to it, I'd be glad to test it out and perhaps we can use that
> > instead.
>
> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> admittedly could well be just because I make no claims to be a
> coder and am simply reading the bug and thread, but I'd appreciate some
> "unconfusing" anyway).
>
> My interpretation of the "async disabling" reference was that it was to
> comment #30 on the bug:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>
> ... which (if I'm not confused on this point too) appears to be yours.
> There it was stated...
>
> > > > >
>
> I've also found that this bug exclusively occurs when commit_work is on
> the workqueue. After forcing drm_atomic_helper_commit to run all of the
> commits without adding to the workqueue and running the OS, the issue
> seems to have disappeared.
> <<<<
>
> Would not forcing all commits to run directly, without placing them on
> the workqueue, be "async disabling"? That's what I /thought/ he was
> referencing.

Oh, I thought he was referring to a different patch. Kees, could I get
your confirmation on this?

The change I made actually affected all of the DRM code, although this could
easily be changed to be specific to amdgpu. (By forcing blocking on
amdgpu_dm's non-blocking commit code)

That said, I'd still need to test further because I only did test it for a
couple of hours then. Although it should work in theory.

>
> OTOH your base/context swap idea sounds like a possibly "less
> disturbance" workaround, if it works, and given the point in the
> commit cycle... (But if it's out Sunday it's likely too late to test
> and get it in now anyway; if it's another week, tho...)

The base/context swap idea should make the use-after-free behave how it
did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
"less disturbance" workaround and more of a "no disturbance" workaround.

Thanks,
Mazin Rezk

>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Duncan - No HTML messages please; they are filtered as spam.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-25  5:20               ` Mazin Rezk
  0 siblings, 0 replies; 48+ messages in thread
From: Mazin Rezk @ 2020-07-25  5:20 UTC (permalink / raw)
  To: Duncan
  Cc: Paul Menzel, anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel,
	dri-devel, Nicholas Kazlauskas, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx, Harry Wentland,
	Christian König

On Saturday, July 25, 2020 12:59 AM, Duncan <1i5t5.duncan@cox.net> wrote:

> On Sat, 25 Jul 2020 03:03:52 +0000
> Mazin Rezk mnrzk@protonmail.com wrote:
>
> > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > >
> > > > There was a fix to disable the async path for this driver that
> > > > worked around the bug too, yes? That seems like a safer and more
> > > > focused change that doesn't revert the SLUB defense for all
> > > > users, and would actually provide a complete, I think, workaround
> >
> > That said, I haven't seen the async disabling patch. If you could
> > link to it, I'd be glad to test it out and perhaps we can use that
> > instead.
>
> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> admittedly could well be just because I make no claims to be a
> coder and am simply reading the bug and thread, but I'd appreciate some
> "unconfusing" anyway).
>
> My interpretation of the "async disabling" reference was that it was to
> comment #30 on the bug:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>
> ... which (if I'm not confused on this point too) appears to be yours.
> There it was stated...
>
> > > > >
>
> I've also found that this bug exclusively occurs when commit_work is on
> the workqueue. After forcing drm_atomic_helper_commit to run all of the
> commits without adding to the workqueue and running the OS, the issue
> seems to have disappeared.
> <<<<
>
> Would not forcing all commits to run directly, without placing them on
> the workqueue, be "async disabling"? That's what I /thought/ he was
> referencing.

Oh, I thought he was referring to a different patch. Kees, could I get
your confirmation on this?

The change I made actually affected all of the DRM code, although this could
easily be changed to be specific to amdgpu. (By forcing blocking on
amdgpu_dm's non-blocking commit code)

That said, I'd still need to test further because I only did test it for a
couple of hours then. Although it should work in theory.

>
> OTOH your base/context swap idea sounds like a possibly "less
> disturbance" workaround, if it works, and given the point in the
> commit cycle... (But if it's out Sunday it's likely too late to test
> and get it in now anyway; if it's another week, tho...)

The base/context swap idea should make the use-after-free behave how it
did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
"less disturbance" workaround and more of a "no disturbance" workaround.

Thanks,
Mazin Rezk

>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Duncan - No HTML messages please; they are filtered as spam.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-25  5:20               ` Mazin Rezk
  (?)
@ 2020-07-28  9:22                 ` Paul Menzel
  -1 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-28  9:22 UTC (permalink / raw)
  To: Mazin Rezk, Duncan
  Cc: Kees Cook, linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, Nicholas Kazlauskas,
	sunpeng.li, Alexander Deucher, mphantomx, regressions,
	anthony.ruhier

Dear Linux folks,


Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> 
>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>
>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>
>>>>> There was a fix to disable the async path for this driver that
>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>> focused change that doesn't revert the SLUB defense for all
>>>>> users, and would actually provide a complete, I think, workaround
>>>
>>> That said, I haven't seen the async disabling patch. If you could
>>> link to it, I'd be glad to test it out and perhaps we can use that
>>> instead.
>>
>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>> admittedly could well be just because I make no claims to be a
>> coder and am simply reading the bug and thread, but I'd appreciate some
>> "unconfusing" anyway).
>>
>> My interpretation of the "async disabling" reference was that it was to
>> comment #30 on the bug:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>>
>> ... which (if I'm not confused on this point too) appears to be yours.
>> There it was stated...
>>
>> I've also found that this bug exclusively occurs when commit_work is on
>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>> commits without adding to the workqueue and running the OS, the issue
>> seems to have disappeared.
>> <<<<
>>
>> Would not forcing all commits to run directly, without placing them on
>> the workqueue, be "async disabling"? That's what I /thought/ he was
>> referencing.
> 
> Oh, I thought he was referring to a different patch. Kees, could I get
> your confirmation on this?
> 
> The change I made actually affected all of the DRM code, although this could
> easily be changed to be specific to amdgpu. (By forcing blocking on
> amdgpu_dm's non-blocking commit code)
> 
> That said, I'd still need to test further because I only did test it for a
> couple of hours then. Although it should work in theory.
> 
>> OTOH your base/context swap idea sounds like a possibly "less
>> disturbance" workaround, if it works, and given the point in the
>> commit cycle... (But if it's out Sunday it's likely too late to test
>> and get it in now anyway; if it's another week, tho...)
> 
> The base/context swap idea should make the use-after-free behave how it
> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> "less disturbance" workaround and more of a "no disturbance" workaround.

Sorry for bothering, but is there now a solution, besides reverting the 
commits, to avoid freezes/crashes *without* performance regressions?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28  9:22                 ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-28  9:22 UTC (permalink / raw)
  To: Mazin Rezk, Duncan
  Cc: anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel, dri-devel,
	Nicholas Kazlauskas, regressions, amd-gfx, Alexander Deucher,
	Andrew Morton, mphantomx, Christian König

Dear Linux folks,


Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> 
>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>
>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>
>>>>> There was a fix to disable the async path for this driver that
>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>> focused change that doesn't revert the SLUB defense for all
>>>>> users, and would actually provide a complete, I think, workaround
>>>
>>> That said, I haven't seen the async disabling patch. If you could
>>> link to it, I'd be glad to test it out and perhaps we can use that
>>> instead.
>>
>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>> admittedly could well be just because I make no claims to be a
>> coder and am simply reading the bug and thread, but I'd appreciate some
>> "unconfusing" anyway).
>>
>> My interpretation of the "async disabling" reference was that it was to
>> comment #30 on the bug:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>>
>> ... which (if I'm not confused on this point too) appears to be yours.
>> There it was stated...
>>
>> I've also found that this bug exclusively occurs when commit_work is on
>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>> commits without adding to the workqueue and running the OS, the issue
>> seems to have disappeared.
>> <<<<
>>
>> Would not forcing all commits to run directly, without placing them on
>> the workqueue, be "async disabling"? That's what I /thought/ he was
>> referencing.
> 
> Oh, I thought he was referring to a different patch. Kees, could I get
> your confirmation on this?
> 
> The change I made actually affected all of the DRM code, although this could
> easily be changed to be specific to amdgpu. (By forcing blocking on
> amdgpu_dm's non-blocking commit code)
> 
> That said, I'd still need to test further because I only did test it for a
> couple of hours then. Although it should work in theory.
> 
>> OTOH your base/context swap idea sounds like a possibly "less
>> disturbance" workaround, if it works, and given the point in the
>> commit cycle... (But if it's out Sunday it's likely too late to test
>> and get it in now anyway; if it's another week, tho...)
> 
> The base/context swap idea should make the use-after-free behave how it
> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> "less disturbance" workaround and more of a "no disturbance" workaround.

Sorry for bothering, but is there now a solution, besides reverting the 
commits, to avoid freezes/crashes *without* performance regressions?


Kind regards,

Paul
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28  9:22                 ` Paul Menzel
  0 siblings, 0 replies; 48+ messages in thread
From: Paul Menzel @ 2020-07-28  9:22 UTC (permalink / raw)
  To: Mazin Rezk, Duncan
  Cc: anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel, dri-devel,
	Nicholas Kazlauskas, regressions, amd-gfx, Alexander Deucher,
	Andrew Morton, mphantomx, Harry Wentland, Christian König

Dear Linux folks,


Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> 
>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>
>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>
>>>>> There was a fix to disable the async path for this driver that
>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>> focused change that doesn't revert the SLUB defense for all
>>>>> users, and would actually provide a complete, I think, workaround
>>>
>>> That said, I haven't seen the async disabling patch. If you could
>>> link to it, I'd be glad to test it out and perhaps we can use that
>>> instead.
>>
>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>> admittedly could well be just because I make no claims to be a
>> coder and am simply reading the bug and thread, but I'd appreciate some
>> "unconfusing" anyway).
>>
>> My interpretation of the "async disabling" reference was that it was to
>> comment #30 on the bug:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>>
>> ... which (if I'm not confused on this point too) appears to be yours.
>> There it was stated...
>>
>> I've also found that this bug exclusively occurs when commit_work is on
>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>> commits without adding to the workqueue and running the OS, the issue
>> seems to have disappeared.
>> <<<<
>>
>> Would not forcing all commits to run directly, without placing them on
>> the workqueue, be "async disabling"? That's what I /thought/ he was
>> referencing.
> 
> Oh, I thought he was referring to a different patch. Kees, could I get
> your confirmation on this?
> 
> The change I made actually affected all of the DRM code, although this could
> easily be changed to be specific to amdgpu. (By forcing blocking on
> amdgpu_dm's non-blocking commit code)
> 
> That said, I'd still need to test further because I only did test it for a
> couple of hours then. Although it should work in theory.
> 
>> OTOH your base/context swap idea sounds like a possibly "less
>> disturbance" workaround, if it works, and given the point in the
>> commit cycle... (But if it's out Sunday it's likely too late to test
>> and get it in now anyway; if it's another week, tho...)
> 
> The base/context swap idea should make the use-after-free behave how it
> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> "less disturbance" workaround and more of a "no disturbance" workaround.

Sorry for bothering, but is there now a solution, besides reverting the 
commits, to avoid freezes/crashes *without* performance regressions?


Kind regards,

Paul
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-28  9:22                 ` Paul Menzel
  (?)
@ 2020-07-28 17:07                   ` Kazlauskas, Nicholas
  -1 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-28 17:07 UTC (permalink / raw)
  To: Paul Menzel, Mazin Rezk, Duncan
  Cc: Kees Cook, linux-kernel, amd-gfx, dri-devel, Andrew Morton,
	Christian König, Harry Wentland, sunpeng.li,
	Alexander Deucher, mphantomx, regressions, anthony.ruhier

On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Am 25.07.20 um 07:20 schrieb Mazin Rezk:
>> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
>>
>>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>>
>>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>>
>>>>>> There was a fix to disable the async path for this driver that
>>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>>> focused change that doesn't revert the SLUB defense for all
>>>>>> users, and would actually provide a complete, I think, workaround
>>>>
>>>> That said, I haven't seen the async disabling patch. If you could
>>>> link to it, I'd be glad to test it out and perhaps we can use that
>>>> instead.
>>>
>>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>>> admittedly could well be just because I make no claims to be a
>>> coder and am simply reading the bug and thread, but I'd appreciate some
>>> "unconfusing" anyway).
>>>
>>> My interpretation of the "async disabling" reference was that it was to
>>> comment #30 on the bug:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30 
>>>
>>>
>>> ... which (if I'm not confused on this point too) appears to be yours.
>>> There it was stated...
>>>
>>> I've also found that this bug exclusively occurs when commit_work is on
>>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>>> commits without adding to the workqueue and running the OS, the issue
>>> seems to have disappeared.
>>> <<<<
>>>
>>> Would not forcing all commits to run directly, without placing them on
>>> the workqueue, be "async disabling"? That's what I /thought/ he was
>>> referencing.
>>
>> Oh, I thought he was referring to a different patch. Kees, could I get
>> your confirmation on this?
>>
>> The change I made actually affected all of the DRM code, although this 
>> could
>> easily be changed to be specific to amdgpu. (By forcing blocking on
>> amdgpu_dm's non-blocking commit code)
>>
>> That said, I'd still need to test further because I only did test it 
>> for a
>> couple of hours then. Although it should work in theory.
>>
>>> OTOH your base/context swap idea sounds like a possibly "less
>>> disturbance" workaround, if it works, and given the point in the
>>> commit cycle... (But if it's out Sunday it's likely too late to test
>>> and get it in now anyway; if it's another week, tho...)
>>
>> The base/context swap idea should make the use-after-free behave how it
>> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
>> "less disturbance" workaround and more of a "no disturbance" workaround.
> 
> Sorry for bothering, but is there now a solution, besides reverting the 
> commits, to avoid freezes/crashes *without* performance regressions?
> 
> 
> Kind regards,
> 
> Paul

Mazin's "drm/amd/display: Clear dm_state for fast updates" change 
accomplishes this, at least as a temporary hack.

I've started work on a more large scale fix that we could get in in after.

Regards,
Nicholas Kazlauskas

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28 17:07                   ` Kazlauskas, Nicholas
  0 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-28 17:07 UTC (permalink / raw)
  To: Paul Menzel, Mazin Rezk, Duncan
  Cc: anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel, dri-devel,
	regressions, amd-gfx, Alexander Deucher, Andrew Morton,
	mphantomx, Christian König

On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Am 25.07.20 um 07:20 schrieb Mazin Rezk:
>> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
>>
>>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>>
>>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>>
>>>>>> There was a fix to disable the async path for this driver that
>>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>>> focused change that doesn't revert the SLUB defense for all
>>>>>> users, and would actually provide a complete, I think, workaround
>>>>
>>>> That said, I haven't seen the async disabling patch. If you could
>>>> link to it, I'd be glad to test it out and perhaps we can use that
>>>> instead.
>>>
>>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>>> admittedly could well be just because I make no claims to be a
>>> coder and am simply reading the bug and thread, but I'd appreciate some
>>> "unconfusing" anyway).
>>>
>>> My interpretation of the "async disabling" reference was that it was to
>>> comment #30 on the bug:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30 
>>>
>>>
>>> ... which (if I'm not confused on this point too) appears to be yours.
>>> There it was stated...
>>>
>>> I've also found that this bug exclusively occurs when commit_work is on
>>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>>> commits without adding to the workqueue and running the OS, the issue
>>> seems to have disappeared.
>>> <<<<
>>>
>>> Would not forcing all commits to run directly, without placing them on
>>> the workqueue, be "async disabling"? That's what I /thought/ he was
>>> referencing.
>>
>> Oh, I thought he was referring to a different patch. Kees, could I get
>> your confirmation on this?
>>
>> The change I made actually affected all of the DRM code, although this 
>> could
>> easily be changed to be specific to amdgpu. (By forcing blocking on
>> amdgpu_dm's non-blocking commit code)
>>
>> That said, I'd still need to test further because I only did test it 
>> for a
>> couple of hours then. Although it should work in theory.
>>
>>> OTOH your base/context swap idea sounds like a possibly "less
>>> disturbance" workaround, if it works, and given the point in the
>>> commit cycle... (But if it's out Sunday it's likely too late to test
>>> and get it in now anyway; if it's another week, tho...)
>>
>> The base/context swap idea should make the use-after-free behave how it
>> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
>> "less disturbance" workaround and more of a "no disturbance" workaround.
> 
> Sorry for bothering, but is there now a solution, besides reverting the 
> commits, to avoid freezes/crashes *without* performance regressions?
> 
> 
> Kind regards,
> 
> Paul

Mazin's "drm/amd/display: Clear dm_state for fast updates" change 
accomplishes this, at least as a temporary hack.

I've started work on a more large scale fix that we could get in in after.

Regards,
Nicholas Kazlauskas
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28 17:07                   ` Kazlauskas, Nicholas
  0 siblings, 0 replies; 48+ messages in thread
From: Kazlauskas, Nicholas @ 2020-07-28 17:07 UTC (permalink / raw)
  To: Paul Menzel, Mazin Rezk, Duncan
  Cc: anthony.ruhier, Kees Cook, sunpeng.li, linux-kernel, dri-devel,
	regressions, amd-gfx, Alexander Deucher, Andrew Morton,
	mphantomx, Harry Wentland, Christian König

On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> Am 25.07.20 um 07:20 schrieb Mazin Rezk:
>> On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
>>
>>> On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
>>>
>>>>> Am 24.07.20 um 19:33 schrieb Kees Cook:
>>>>>
>>>>>> There was a fix to disable the async path for this driver that
>>>>>> worked around the bug too, yes? That seems like a safer and more
>>>>>> focused change that doesn't revert the SLUB defense for all
>>>>>> users, and would actually provide a complete, I think, workaround
>>>>
>>>> That said, I haven't seen the async disabling patch. If you could
>>>> link to it, I'd be glad to test it out and perhaps we can use that
>>>> instead.
>>>
>>> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
>>> admittedly could well be just because I make no claims to be a
>>> coder and am simply reading the bug and thread, but I'd appreciate some
>>> "unconfusing" anyway).
>>>
>>> My interpretation of the "async disabling" reference was that it was to
>>> comment #30 on the bug:
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30 
>>>
>>>
>>> ... which (if I'm not confused on this point too) appears to be yours.
>>> There it was stated...
>>>
>>> I've also found that this bug exclusively occurs when commit_work is on
>>> the workqueue. After forcing drm_atomic_helper_commit to run all of the
>>> commits without adding to the workqueue and running the OS, the issue
>>> seems to have disappeared.
>>> <<<<
>>>
>>> Would not forcing all commits to run directly, without placing them on
>>> the workqueue, be "async disabling"? That's what I /thought/ he was
>>> referencing.
>>
>> Oh, I thought he was referring to a different patch. Kees, could I get
>> your confirmation on this?
>>
>> The change I made actually affected all of the DRM code, although this 
>> could
>> easily be changed to be specific to amdgpu. (By forcing blocking on
>> amdgpu_dm's non-blocking commit code)
>>
>> That said, I'd still need to test further because I only did test it 
>> for a
>> couple of hours then. Although it should work in theory.
>>
>>> OTOH your base/context swap idea sounds like a possibly "less
>>> disturbance" workaround, if it works, and given the point in the
>>> commit cycle... (But if it's out Sunday it's likely too late to test
>>> and get it in now anyway; if it's another week, tho...)
>>
>> The base/context swap idea should make the use-after-free behave how it
>> did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
>> "less disturbance" workaround and more of a "no disturbance" workaround.
> 
> Sorry for bothering, but is there now a solution, besides reverting the 
> commits, to avoid freezes/crashes *without* performance regressions?
> 
> 
> Kind regards,
> 
> Paul

Mazin's "drm/amd/display: Clear dm_state for fast updates" change 
accomplishes this, at least as a temporary hack.

I've started work on a more large scale fix that we could get in in after.

Regards,
Nicholas Kazlauskas
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
  2020-07-28 17:07                   ` Kazlauskas, Nicholas
  (?)
@ 2020-07-28 21:58                     ` daniel
  -1 siblings, 0 replies; 48+ messages in thread
From: daniel @ 2020-07-28 21:58 UTC (permalink / raw)
  Cc: Paul Menzel, Mazin Rezk, Duncan, anthony.ruhier, Kees Cook,
	sunpeng.li, linux-kernel, dri-devel, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, mphantomx,
	Christian König

On Tue, Jul 28, 2020 at 01:07:13PM -0400, Kazlauskas, Nicholas wrote:
> On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> > > On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> > > 
> > > > On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
> > > > 
> > > > > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > > > > > 
> > > > > > > There was a fix to disable the async path for this driver that
> > > > > > > worked around the bug too, yes? That seems like a safer and more
> > > > > > > focused change that doesn't revert the SLUB defense for all
> > > > > > > users, and would actually provide a complete, I think, workaround
> > > > > 
> > > > > That said, I haven't seen the async disabling patch. If you could
> > > > > link to it, I'd be glad to test it out and perhaps we can use that
> > > > > instead.
> > > > 
> > > > I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> > > > admittedly could well be just because I make no claims to be a
> > > > coder and am simply reading the bug and thread, but I'd appreciate some
> > > > "unconfusing" anyway).
> > > > 
> > > > My interpretation of the "async disabling" reference was that it was to
> > > > comment #30 on the bug:
> > > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
> > > > 
> > > > 
> > > > ... which (if I'm not confused on this point too) appears to be yours.
> > > > There it was stated...
> > > > 
> > > > I've also found that this bug exclusively occurs when commit_work is on
> > > > the workqueue. After forcing drm_atomic_helper_commit to run all of the
> > > > commits without adding to the workqueue and running the OS, the issue
> > > > seems to have disappeared.
> > > > <<<<
> > > > 
> > > > Would not forcing all commits to run directly, without placing them on
> > > > the workqueue, be "async disabling"? That's what I /thought/ he was
> > > > referencing.
> > > 
> > > Oh, I thought he was referring to a different patch. Kees, could I get
> > > your confirmation on this?
> > > 
> > > The change I made actually affected all of the DRM code, although
> > > this could
> > > easily be changed to be specific to amdgpu. (By forcing blocking on
> > > amdgpu_dm's non-blocking commit code)
> > > 
> > > That said, I'd still need to test further because I only did test it
> > > for a
> > > couple of hours then. Although it should work in theory.
> > > 
> > > > OTOH your base/context swap idea sounds like a possibly "less
> > > > disturbance" workaround, if it works, and given the point in the
> > > > commit cycle... (But if it's out Sunday it's likely too late to test
> > > > and get it in now anyway; if it's another week, tho...)
> > > 
> > > The base/context swap idea should make the use-after-free behave how it
> > > did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> > > "less disturbance" workaround and more of a "no disturbance" workaround.
> > 
> > Sorry for bothering, but is there now a solution, besides reverting the
> > commits, to avoid freezes/crashes *without* performance regressions?
> > 
> > 
> > Kind regards,
> > 
> > Paul
> 
> Mazin's "drm/amd/display: Clear dm_state for fast updates" change
> accomplishes this, at least as a temporary hack.

Yeah I gets it's horrible, but better than nothing. Reverting the old
amdgpu change to a private state object is probably a lot more invasive.

> I've started work on a more large scale fix that we could get in in after.

Does that include a fix for the "stuff needed by irq handler"? Either way
pls cc dri-devel, I think this is something worth of a bit wider
discussion. Feels like unsolved homework from the entire "make DC
integrate into linux" saga ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28 21:58                     ` daniel
  0 siblings, 0 replies; 48+ messages in thread
From: daniel @ 2020-07-28 21:58 UTC (permalink / raw)
  Cc: Paul Menzel, mphantomx, Duncan, Kees Cook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, anthony.ruhier,
	Christian König

On Tue, Jul 28, 2020 at 01:07:13PM -0400, Kazlauskas, Nicholas wrote:
> On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> > > On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> > > 
> > > > On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
> > > > 
> > > > > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > > > > > 
> > > > > > > There was a fix to disable the async path for this driver that
> > > > > > > worked around the bug too, yes? That seems like a safer and more
> > > > > > > focused change that doesn't revert the SLUB defense for all
> > > > > > > users, and would actually provide a complete, I think, workaround
> > > > > 
> > > > > That said, I haven't seen the async disabling patch. If you could
> > > > > link to it, I'd be glad to test it out and perhaps we can use that
> > > > > instead.
> > > > 
> > > > I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> > > > admittedly could well be just because I make no claims to be a
> > > > coder and am simply reading the bug and thread, but I'd appreciate some
> > > > "unconfusing" anyway).
> > > > 
> > > > My interpretation of the "async disabling" reference was that it was to
> > > > comment #30 on the bug:
> > > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
> > > > 
> > > > 
> > > > ... which (if I'm not confused on this point too) appears to be yours.
> > > > There it was stated...
> > > > 
> > > > I've also found that this bug exclusively occurs when commit_work is on
> > > > the workqueue. After forcing drm_atomic_helper_commit to run all of the
> > > > commits without adding to the workqueue and running the OS, the issue
> > > > seems to have disappeared.
> > > > <<<<
> > > > 
> > > > Would not forcing all commits to run directly, without placing them on
> > > > the workqueue, be "async disabling"? That's what I /thought/ he was
> > > > referencing.
> > > 
> > > Oh, I thought he was referring to a different patch. Kees, could I get
> > > your confirmation on this?
> > > 
> > > The change I made actually affected all of the DRM code, although
> > > this could
> > > easily be changed to be specific to amdgpu. (By forcing blocking on
> > > amdgpu_dm's non-blocking commit code)
> > > 
> > > That said, I'd still need to test further because I only did test it
> > > for a
> > > couple of hours then. Although it should work in theory.
> > > 
> > > > OTOH your base/context swap idea sounds like a possibly "less
> > > > disturbance" workaround, if it works, and given the point in the
> > > > commit cycle... (But if it's out Sunday it's likely too late to test
> > > > and get it in now anyway; if it's another week, tho...)
> > > 
> > > The base/context swap idea should make the use-after-free behave how it
> > > did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> > > "less disturbance" workaround and more of a "no disturbance" workaround.
> > 
> > Sorry for bothering, but is there now a solution, besides reverting the
> > commits, to avoid freezes/crashes *without* performance regressions?
> > 
> > 
> > Kind regards,
> > 
> > Paul
> 
> Mazin's "drm/amd/display: Clear dm_state for fast updates" change
> accomplishes this, at least as a temporary hack.

Yeah I gets it's horrible, but better than nothing. Reverting the old
amdgpu change to a private state object is probably a lot more invasive.

> I've started work on a more large scale fix that we could get in in after.

Does that include a fix for the "stuff needed by irq handler"? Either way
pls cc dri-devel, I think this is something worth of a bit wider
discussion. Feels like unsolved homework from the entire "make DC
integrate into linux" saga ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
@ 2020-07-28 21:58                     ` daniel
  0 siblings, 0 replies; 48+ messages in thread
From: daniel @ 2020-07-28 21:58 UTC (permalink / raw)
  Cc: Paul Menzel, mphantomx, Duncan, Kees Cook, sunpeng.li,
	Mazin Rezk, linux-kernel, dri-devel, regressions, amd-gfx,
	Alexander Deucher, Andrew Morton, anthony.ruhier,
	Christian König

On Tue, Jul 28, 2020 at 01:07:13PM -0400, Kazlauskas, Nicholas wrote:
> On 2020-07-28 5:22 a.m., Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > Am 25.07.20 um 07:20 schrieb Mazin Rezk:
> > > On Saturday, July 25, 2020 12:59 AM, Duncan wrote:
> > > 
> > > > On Sat, 25 Jul 2020 03:03:52 +0000 Mazin Rezk wrote:
> > > > 
> > > > > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > > > > > 
> > > > > > > There was a fix to disable the async path for this driver that
> > > > > > > worked around the bug too, yes? That seems like a safer and more
> > > > > > > focused change that doesn't revert the SLUB defense for all
> > > > > > > users, and would actually provide a complete, I think, workaround
> > > > > 
> > > > > That said, I haven't seen the async disabling patch. If you could
> > > > > link to it, I'd be glad to test it out and perhaps we can use that
> > > > > instead.
> > > > 
> > > > I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> > > > admittedly could well be just because I make no claims to be a
> > > > coder and am simply reading the bug and thread, but I'd appreciate some
> > > > "unconfusing" anyway).
> > > > 
> > > > My interpretation of the "async disabling" reference was that it was to
> > > > comment #30 on the bug:
> > > > 
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
> > > > 
> > > > 
> > > > ... which (if I'm not confused on this point too) appears to be yours.
> > > > There it was stated...
> > > > 
> > > > I've also found that this bug exclusively occurs when commit_work is on
> > > > the workqueue. After forcing drm_atomic_helper_commit to run all of the
> > > > commits without adding to the workqueue and running the OS, the issue
> > > > seems to have disappeared.
> > > > <<<<
> > > > 
> > > > Would not forcing all commits to run directly, without placing them on
> > > > the workqueue, be "async disabling"? That's what I /thought/ he was
> > > > referencing.
> > > 
> > > Oh, I thought he was referring to a different patch. Kees, could I get
> > > your confirmation on this?
> > > 
> > > The change I made actually affected all of the DRM code, although
> > > this could
> > > easily be changed to be specific to amdgpu. (By forcing blocking on
> > > amdgpu_dm's non-blocking commit code)
> > > 
> > > That said, I'd still need to test further because I only did test it
> > > for a
> > > couple of hours then. Although it should work in theory.
> > > 
> > > > OTOH your base/context swap idea sounds like a possibly "less
> > > > disturbance" workaround, if it works, and given the point in the
> > > > commit cycle... (But if it's out Sunday it's likely too late to test
> > > > and get it in now anyway; if it's another week, tho...)
> > > 
> > > The base/context swap idea should make the use-after-free behave how it
> > > did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
> > > "less disturbance" workaround and more of a "no disturbance" workaround.
> > 
> > Sorry for bothering, but is there now a solution, besides reverting the
> > commits, to avoid freezes/crashes *without* performance regressions?
> > 
> > 
> > Kind regards,
> > 
> > Paul
> 
> Mazin's "drm/amd/display: Clear dm_state for fast updates" change
> accomplishes this, at least as a temporary hack.

Yeah I gets it's horrible, but better than nothing. Reverting the old
amdgpu change to a private state object is probably a lot more invasive.

> I've started work on a more large scale fix that we could get in in after.

Does that include a fix for the "stuff needed by irq handler"? Either way
pls cc dri-devel, I think this is something worth of a bit wider
discussion. Feels like unsolved homework from the entire "make DC
integrate into linux" saga ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2020-07-28 21:58 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-23 21:10 [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free Mazin Rezk
2020-07-23 21:10 ` Mazin Rezk
2020-07-23 21:10 ` Mazin Rezk
2020-07-23 22:16 ` Kazlauskas, Nicholas
2020-07-23 22:16   ` Kazlauskas, Nicholas
2020-07-23 22:16   ` Kazlauskas, Nicholas
2020-07-23 22:57   ` Mazin Rezk
2020-07-23 22:57     ` Mazin Rezk
2020-07-23 22:57     ` Mazin Rezk
2020-07-24 21:09     ` Mazin Rezk
2020-07-24 21:09       ` Mazin Rezk
2020-07-24 21:09       ` Mazin Rezk
2020-07-23 22:32 ` Kees Cook
2020-07-23 22:32   ` Kees Cook
2020-07-23 22:32   ` Kees Cook
2020-07-23 22:58   ` Mazin Rezk
2020-07-23 22:58     ` Mazin Rezk
2020-07-23 22:58     ` Mazin Rezk
2020-07-24  7:26     ` Christian König
2020-07-24  7:26       ` Christian König
2020-07-24  7:26       ` Christian König
2020-07-24  7:45   ` Paul Menzel
2020-07-24  7:45     ` Paul Menzel
2020-07-24  7:45     ` Paul Menzel
2020-07-24 17:33     ` Kees Cook
2020-07-24 17:33       ` Kees Cook
2020-07-24 17:33       ` Kees Cook
2020-07-24 21:19       ` Paul Menzel
2020-07-24 21:19         ` Paul Menzel
2020-07-24 21:19         ` Paul Menzel
2020-07-25  3:03         ` Mazin Rezk
2020-07-25  3:03           ` Mazin Rezk
2020-07-25  3:03           ` Mazin Rezk
2020-07-25  4:59           ` Duncan
2020-07-25  4:59             ` Duncan
2020-07-25  4:59             ` Duncan
2020-07-25  5:20             ` Mazin Rezk
2020-07-25  5:20               ` Mazin Rezk
2020-07-25  5:20               ` Mazin Rezk
2020-07-28  9:22               ` Paul Menzel
2020-07-28  9:22                 ` Paul Menzel
2020-07-28  9:22                 ` Paul Menzel
2020-07-28 17:07                 ` Kazlauskas, Nicholas
2020-07-28 17:07                   ` Kazlauskas, Nicholas
2020-07-28 17:07                   ` Kazlauskas, Nicholas
2020-07-28 21:58                   ` daniel
2020-07-28 21:58                     ` daniel
2020-07-28 21:58                     ` daniel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.