* [Patch v2] drm/ttm: Schedule delayed_delete worker closer
@ 2023-11-08 12:58 Rajneesh Bhardwaj
From: Rajneesh Bhardwaj @ 2023-11-08 12:58 UTC
  To: amd-gfx
  Cc: alexander.deucher, Felix Kuehling, christian.koenig, dri-devel,
	Rajneesh Bhardwaj

Try to allocate system memory on the NUMA node the device is closest to
and try to run delayed_delete workers on a CPU of this node as well.

To optimize the memory clearing operation when a TTM BO gets freed by
the delayed_delete worker, schedule the worker on a CPU of the NUMA node
where the memory was initially allocated. This avoids cases where the
worker gets randomly scheduled on a CPU core across an interconnect
boundary such as xGMI or PCIe.

This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---

Changes in v2:
 - Absorbed the feedback provided by Christian in the commit message and
   the comment.

 drivers/gpu/drm/ttm/ttm_bo.c     | 8 +++++++-
 drivers/gpu/drm/ttm/ttm_device.c | 3 ++-
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..6f28a77a565b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
 			spin_unlock(&bo->bdev->lru_lock);
 
 			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-			queue_work(bdev->wq, &bo->delayed_delete);
+
+			/* Schedule the worker on the closest NUMA node. This
+			 * improves performance since system memory might be
+			 * cleared on free and that is best done on a CPU core
+			 * close to it.
+			 */
+			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
 			return;
 		}
 
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
 	bdev->funcs = funcs;
 
 	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
-- 
2.34.1


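For context on the two calls in the hunks above: dev_to_node() returns
dev->numa_node when CONFIG_NUMA is enabled and NUMA_NO_NODE otherwise,
and queue_work_node() prefers a CPU of the given node, falling back to
normal placement when the node is NUMA_NO_NODE, offline, or has no
online CPUs, so non-NUMA systems keep the old queue_work() behaviour.
A minimal sketch of the resulting pattern follows; ttm_example_free()
is a hypothetical caller, not part of the patch.

#include <linux/numa.h>
#include <linux/workqueue.h>
#include <drm/ttm/ttm_device.h>

static void ttm_example_free(struct ttm_device *bdev,
			     struct work_struct *work)
{
	/* pool.nid was filled from dev_to_node(dev) at init time and is
	 * NUMA_NO_NODE on non-NUMA systems.
	 */
	int nid = bdev->pool.nid;

	/* Runs on a CPU of @nid when one is online; otherwise the work
	 * lands on an arbitrary CPU, as with plain queue_work().
	 */
	queue_work_node(nid, bdev->wq, work);
}

Note that dev_to_node() dereferences dev when CONFIG_NUMA is set, so a
TTM user that passes a NULL device to ttm_device_init() would need a
guard around the dev_to_node() call.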

* Re: [Patch v2] drm/ttm: Schedule delayed_delete worker closer
@ 2023-11-08 14:49 ` Christian König
From: Christian König @ 2023-11-08 14:49 UTC
  To: Rajneesh Bhardwaj, amd-gfx; +Cc: alexander.deucher, felix.kuehling, dri-devel

On 08.11.23 13:58, Rajneesh Bhardwaj wrote:
> Try to allocate system memory on the NUMA node the device is closest to
> and try to run delayed_delete workers on a CPU of this node as well.
> [...]
>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

Going to push this to drm-misc-next.

Thanks,
Christian.




* Re: [Patch v2] drm/ttm: Schedule delayed_delete worker closer
@ 2023-11-08 17:28   ` Bhardwaj, Rajneesh
From: Bhardwaj, Rajneesh @ 2023-11-08 17:28 UTC
  To: Christian König, amd-gfx
  Cc: alexander.deucher, felix.kuehling, dri-devel


On 11/8/2023 9:49 AM, Christian König wrote:
> On 08.11.23 13:58, Rajneesh Bhardwaj wrote:
>> [...]
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> Going to push this to drm-misc-next.

Thanks, Christian. There is a new regression reported and I am checking
on that. Please don't submit it yet.



* Re: [Patch v2] drm/ttm: Schedule delayed_delete worker closer
@ 2023-11-08 18:19   ` Felix Kuehling
From: Felix Kuehling @ 2023-11-08 18:19 UTC
  To: Christian König, Rajneesh Bhardwaj, amd-gfx
  Cc: alexander.deucher, dri-devel

On 2023-11-08 09:49, Christian König wrote:
> On 08.11.23 13:58, Rajneesh Bhardwaj wrote:
>> [...]
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> Going to push this to drm-misc-next.

Hold on. Rajneesh just pointed out a WARN regression from testing. I 
think the problem is that the bdev->wq is not unbound.

Regards,
   Felix
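
For reference, queue_work_node() starts with a
WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)) check, since its CPU selection
is only implemented for unbound workqueues. If that is the WARN that
fired, one possible fix is to allocate TTM's workqueue unbound. A
sketch, assuming ttm_device_init() keeps its current WQ_MEM_RECLAIM |
WQ_HIGHPRI flags and max_active of 16:

	bdev->wq = alloc_workqueue("ttm",
				   WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND,
				   16);
	if (!bdev->wq)
		return -ENOMEM;	/* error unwinding elided */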



