* [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
@ 2021-09-17 17:53 Zack Rusin
  2021-09-17 18:34 ` Andrey Grodzovsky
  2021-09-20  6:30 ` Christian König
  0 siblings, 2 replies; 6+ messages in thread
From: Zack Rusin @ 2021-09-17 17:53 UTC (permalink / raw)
  To: dri-devel; +Cc: Christian Koenig, Huang Rui, David Airlie, Daniel Vetter

On some hardware, in particular in virtualized environments, the
system memory can be shared with the "hardware". In those cases
the BO's allocated through the ttm system manager might be
busy during ttm_bo_put which results in them being scheduled
for a delayed deletion.

The problem is that the ttm system manager is disabled
before the final delayed deletion is run in ttm_device_fini.
This results in crashes during freeing of the BO resources
because they're trying to remove themselves from a no longer
existent ttm_resource_manager (e.g. in IGT's core_hotunplug
on vmwgfx).

In general, reloading any driver that could share system mem
resources with "hardware" could hit it, because nothing
prevents the system mem resources from being scheduled
for delayed deletion (other than the fact that they are
probably never busy outside of virtualized environments).

Signed-off-by: Zack Rusin <zackr@vmware.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
---
 drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 9eb8f54b66fc..4ef19cafc755 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
 	struct ttm_resource_manager *man;
 	unsigned i;
 
-	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
-	ttm_resource_manager_set_used(man, false);
-	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
-
 	mutex_lock(&ttm_global_mutex);
 	list_del(&bdev->device_list);
 	mutex_unlock(&ttm_global_mutex);
@@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
 	if (ttm_bo_delayed_delete(bdev, true))
 		pr_debug("Delayed destroy list was clean\n");
 
+	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
+	ttm_resource_manager_set_used(man, false);
+	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
+
 	spin_lock(&bdev->lru_lock);
 	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
 		if (list_empty(&man->lru[0]))
-- 
2.30.2
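
The ordering issue described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the model_* names below are hypothetical stand-ins, not the kernel's ttm_device or ttm_resource_manager, and a BO that finds no manager is merely counted here, whereas in the kernel that case is the crash being fixed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT the TTM types. */
struct model_manager {
	int live_resources;	/* resources still registered with the manager */
};

struct model_device {
	struct model_manager *sys_man;	/* NULL once the manager is disabled */
	int pending_delayed;		/* BOs queued for delayed deletion */
};

/*
 * Free every pending BO. Each BO has to unregister from its manager.
 * Returns how many BOs found no manager to unregister from; in the
 * kernel this would be a dereference of a manager that no longer exists.
 */
static int model_flush_delayed(struct model_device *dev)
{
	int orphaned = 0;

	while (dev->pending_delayed > 0) {
		dev->pending_delayed--;
		if (dev->sys_man)
			dev->sys_man->live_resources--;
		else
			orphaned++;
	}
	return orphaned;
}

/* Models disabling the system manager and clearing the device's pointer. */
static void model_disable_manager(struct model_device *dev)
{
	dev->sys_man = NULL;
}

/*
 * Teardown in either order. flush_first == true models the order after
 * this patch; false models the order before it.
 */
static int model_device_fini(struct model_device *dev, bool flush_first)
{
	int orphaned;

	if (flush_first) {
		orphaned = model_flush_delayed(dev);
		model_disable_manager(dev);
	} else {
		model_disable_manager(dev);
		orphaned = model_flush_delayed(dev);
	}
	return orphaned;
}
```

With two BOs pending, model_device_fini(dev, false) reports two orphaned BOs and leaves the manager's count untouched, while model_device_fini(dev, true) reports none and drains the manager before it is disabled.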



* Re: [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
  2021-09-17 17:53 [PATCH] drm/ttm: Don't delete the system manager before the delayed delete Zack Rusin
@ 2021-09-17 18:34 ` Andrey Grodzovsky
  2021-09-20  6:30 ` Christian König
  1 sibling, 0 replies; 6+ messages in thread
From: Andrey Grodzovsky @ 2021-09-17 18:34 UTC (permalink / raw)
  To: Zack Rusin, dri-devel
  Cc: Christian Koenig, Huang Rui, David Airlie, Daniel Vetter

On 2021-09-17 1:53 p.m., Zack Rusin wrote:

> On some hardware, in particular in virtualized environments, the
> system memory can be shared with the "hardware". In those cases
> the BO's allocated through the ttm system manager might be
> busy during ttm_bo_put which results in them being scheduled
> for a delayed deletion.
>
> The problem is that that the ttm system manager is disabled
> before the final delayed deletion is ran in ttm_device_fini.
> This results in crashes during freeing of the BO resources
> because they're trying to remove themselves from a no longer
> existent ttm_resource_manager (e.g. in IGT's core_hotunplug
> on vmwgfx)
>
> In general reloading any driver that could share system mem
> resources with "hardware" could hit it because nothing
> prevents the system mem resources from being scheduled
> for delayed deletion (apart from them not being busy probably
> anywhere apart from virtualized environments).
>
> Signed-off-by: Zack Rusin <zackr@vmware.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 9eb8f54b66fc..4ef19cafc755 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
>   	struct ttm_resource_manager *man;
>   	unsigned i;
>   
> -	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> -	ttm_resource_manager_set_used(man, false);
> -	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> -
>   	mutex_lock(&ttm_global_mutex);
>   	list_del(&bdev->device_list);
>   	mutex_unlock(&ttm_global_mutex);
> @@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
>   	if (ttm_bo_delayed_delete(bdev, true))
>   		pr_debug("Delayed destroy list was clean\n");
>   
> +	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> +	ttm_resource_manager_set_used(man, false);
> +	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> +


Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey


>   	spin_lock(&bdev->lru_lock);
>   	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
>   		if (list_empty(&man->lru[0]))


* Re: [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
  2021-09-17 17:53 [PATCH] drm/ttm: Don't delete the system manager before the delayed delete Zack Rusin
  2021-09-17 18:34 ` Andrey Grodzovsky
@ 2021-09-20  6:30 ` Christian König
  2021-09-20 14:59   ` Zack Rusin
  1 sibling, 1 reply; 6+ messages in thread
From: Christian König @ 2021-09-20  6:30 UTC (permalink / raw)
  To: Zack Rusin, dri-devel; +Cc: Huang Rui, David Airlie, Daniel Vetter

On 17.09.21 at 19:53, Zack Rusin wrote:
> On some hardware, in particular in virtualized environments, the
> system memory can be shared with the "hardware". In those cases
> the BO's allocated through the ttm system manager might be
> busy during ttm_bo_put which results in them being scheduled
> for a delayed deletion.

While the patch itself is probably fine, the reasoning here is a clear NAK.

Buffers in the system domain are not GPU accessible by definition, even 
in a shared environment and so *must* be idle.

Otherwise you break quite a number of assumptions in the code.

Regards,
Christian.

>
> The problem is that that the ttm system manager is disabled
> before the final delayed deletion is ran in ttm_device_fini.
> This results in crashes during freeing of the BO resources
> because they're trying to remove themselves from a no longer
> existent ttm_resource_manager (e.g. in IGT's core_hotunplug
> on vmwgfx)
>
> In general reloading any driver that could share system mem
> resources with "hardware" could hit it because nothing
> prevents the system mem resources from being scheduled
> for delayed deletion (apart from them not being busy probably
> anywhere apart from virtualized environments).
>
> Signed-off-by: Zack Rusin <zackr@vmware.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: dri-devel@lists.freedesktop.org
> ---
>   drivers/gpu/drm/ttm/ttm_device.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 9eb8f54b66fc..4ef19cafc755 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -225,10 +225,6 @@ void ttm_device_fini(struct ttm_device *bdev)
>   	struct ttm_resource_manager *man;
>   	unsigned i;
>   
> -	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> -	ttm_resource_manager_set_used(man, false);
> -	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> -
>   	mutex_lock(&ttm_global_mutex);
>   	list_del(&bdev->device_list);
>   	mutex_unlock(&ttm_global_mutex);
> @@ -238,6 +234,10 @@ void ttm_device_fini(struct ttm_device *bdev)
>   	if (ttm_bo_delayed_delete(bdev, true))
>   		pr_debug("Delayed destroy list was clean\n");
>   
> +	man = ttm_manager_type(bdev, TTM_PL_SYSTEM);
> +	ttm_resource_manager_set_used(man, false);
> +	ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> +
>   	spin_lock(&bdev->lru_lock);
>   	for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i)
>   		if (list_empty(&man->lru[0]))



* Re: [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
  2021-09-20  6:30 ` Christian König
@ 2021-09-20 14:59   ` Zack Rusin
  2021-09-23 13:53     ` Zack Rusin
  0 siblings, 1 reply; 6+ messages in thread
From: Zack Rusin @ 2021-09-20 14:59 UTC (permalink / raw)
  To: Christian König
  Cc: DRI Development, Huang Rui, David Airlie, Daniel Vetter



> On Sep 20, 2021, at 02:30, Christian König <christian.koenig@amd.com> wrote:
> 
> On 17.09.21 at 19:53, Zack Rusin wrote:
>> On some hardware, in particular in virtualized environments, the
>> system memory can be shared with the "hardware". In those cases
>> the BO's allocated through the ttm system manager might be
>> busy during ttm_bo_put which results in them being scheduled
>> for a delayed deletion.
> 
> While the patch itself is probably fine the reasoning here is a clear NAK.
> 
> Buffers in the system domain are not GPU accessible by definition, even in a shared environment and so *must* be idle.

I’m assuming that means they are not allowed to be ever fenced then, yes?

> Otherwise you break quite a number of assumptions in the code.

Are there more assumptions like that, or do you mean there are more places that depend on the assumption that system domain bo’s are always idle? If there are more assumptions like that in TTM, that would be incredibly valuable to know. I haven’t been paying much attention to the kernel code in years, and coming back now and looking at vmwgfx code that is a few years old, it’s almost impossible to tell the difference between “this assumption breaks the driver” and “this driver breaks this assumption”.

z



* Re: [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
  2021-09-20 14:59   ` Zack Rusin
@ 2021-09-23 13:53     ` Zack Rusin
  2021-09-23 14:49       ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: Zack Rusin @ 2021-09-23 13:53 UTC (permalink / raw)
  To: Christian König
  Cc: DRI Development, Huang Rui, David Airlie, Daniel Vetter

On 9/20/21 10:59 AM, Zack Rusin wrote:
>> On Sep 20, 2021, at 02:30, Christian König <christian.koenig@amd.com> wrote:
>>
>> On 17.09.21 at 19:53, Zack Rusin wrote:
>>> On some hardware, in particular in virtualized environments, the
>>> system memory can be shared with the "hardware". In those cases
>>> the BO's allocated through the ttm system manager might be
>>> busy during ttm_bo_put which results in them being scheduled
>>> for a delayed deletion.
>>
>> While the patch itself is probably fine the reasoning here is a clear NAK.
>>
>> Buffers in the system domain are not GPU accessible by definition, even in a shared environment and so *must* be idle.
> 
> I’m assuming that means they are not allowed to be ever fenced then, yes?

Any thoughts on this? I'd love a confirmation because it would mean I need to go and rewrite the vmwgfx_mob.c bits where we use TTM_PL_SYSTEM memory (through vmw_bo_create_and_populate) for a page table which is read by the host, and those bo's need to be fenced to prevent destruction of the page tables while the memory they point to is still used. So if those were never allowed to be fenced in the first place we probably need to add a new memory type to hold those page tables.

z


* Re: [PATCH] drm/ttm: Don't delete the system manager before the delayed delete
  2021-09-23 13:53     ` Zack Rusin
@ 2021-09-23 14:49       ` Christian König
  0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2021-09-23 14:49 UTC (permalink / raw)
  To: Zack Rusin; +Cc: DRI Development, Huang Rui, David Airlie, Daniel Vetter

On 23.09.21 at 15:53, Zack Rusin wrote:
> On 9/20/21 10:59 AM, Zack Rusin wrote:
>>> On Sep 20, 2021, at 02:30, Christian König 
>>> <christian.koenig@amd.com> wrote:
>>>
>>> On 17.09.21 at 19:53, Zack Rusin wrote:
>>>> On some hardware, in particular in virtualized environments, the
>>>> system memory can be shared with the "hardware". In those cases
>>>> the BO's allocated through the ttm system manager might be
>>>> busy during ttm_bo_put which results in them being scheduled
>>>> for a delayed deletion.
>>>
>>> While the patch itself is probably fine the reasoning here is a 
>>> clear NAK.
>>>
>>> Buffers in the system domain are not GPU accessible by definition, 
>>> even in a shared environment and so *must* be idle.
>>
>> I’m assuming that means they are not allowed to be ever fenced then, 
>> yes?
>
> Any thoughts on this? I'd love a confirmation because it would mean I 
> need to go and rewrite the vmwgfx_mob.c bits where we use 
> TTM_PL_SYSTEM memory (through vmw_bo_create_and_populate) for a page 
> table which is read by the host, and those bo's need to be fenced to 
> prevent destruction of the page tables while the memory they point to 
> is still used. So if those were never allowed to be fenced in the 
> first place we probably need to add a new memory type to hold those 
> page tables.

Yeah, as far as I can see that is pretty much illegal from a design 
point of view.

We could probably change that rule on the TTM side, but I think that 
keeping the design as it is and adding a placement in vmwgfx sounds like 
the cleaner approach.

Christian.
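
The design rule discussed above (system-domain buffers must always be idle, so fenced buffers belong in a separate, driver-defined placement) can be sketched as a userspace model. Everything named model_* or MODEL_* is hypothetical and exists only for illustration. The explicit check that refuses a fence on a system-domain buffer is an assumption of this sketch; the thread describes it as a design expectation and does not say that TTM enforces it.

```c
#include <stdbool.h>

/*
 * Hypothetical placement IDs for illustration only. Real TTM uses the
 * TTM_PL_* constants, and drivers number their own placements starting
 * from TTM_PL_PRIV.
 */
enum model_placement {
	MODEL_PL_SYSTEM,	/* plain system memory, never device-visible */
	MODEL_PL_DRV_SYSTEM,	/* hypothetical driver placement, host-readable */
};

struct model_bo {
	enum model_placement placement;
	int fence_count;	/* outstanding fences on this buffer */
};

/*
 * Attach a fence to a buffer. The model refuses to fence a buffer in the
 * plain system domain, so such buffers stay idle by construction.
 */
static bool model_bo_add_fence(struct model_bo *bo)
{
	if (bo->placement == MODEL_PL_SYSTEM)
		return false;
	bo->fence_count++;
	return true;
}

/* A buffer is idle when no fences are outstanding. */
static bool model_bo_is_idle(const struct model_bo *bo)
{
	return bo->fence_count == 0;
}
```

In this model a page-table buffer that the host reads would be created with MODEL_PL_DRV_SYSTEM so that it can be fenced, while anything left in MODEL_PL_SYSTEM can always be freed immediately.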

>
> z

