* [PATCH 0/3] drm, drm/amd, drm/radeon: Introduce a generic suballocator
@ 2023-02-16 14:48 ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-16 14:48 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström, Daniel Vetter, Christian Koenig,
	Dave Airlie, intel-xe


This series (or at least the suballocator helper) is a prerequisite
for the new Xe driver.

A variant of the series has been up for review before, making the
suballocator used by the amdgpu driver the generic one. However, we
ran into a number of issues with Xe when using it for context-less
allocations, hence this complete rewrite with simplification in
mind. More specifics are in the patch commit message and in the code.

There was one unresolved issue when the series was last up for review:
the per-allocation alignment. The last message was from Maarten
Lankhorst, arguing that the larger per-driver alignment used would only
incur a small memory cost. It would be good to have that resolved.

The generic suballocator has been extensively tested with the Xe driver.
The amdgpu and radeon adaptations are only compile-tested.
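
For reference, below is a minimal sketch of how a driver is expected to
use the helper introduced in patch 1. This is hypothetical, untested
code for illustration only; the pool size, alignment and the job fence
are made up:

    struct drm_suballoc_manager mgr;
    struct drm_suballoc *sa;

    /* One-time setup: manage a 256 KiB range with 256-byte alignment. */
    drm_suballoc_manager_init(&mgr, SZ_256K, 256);

    /* Per submission: carve out a chunk, sleeping interruptibly if full. */
    sa = drm_suballoc_new(&mgr, 4096, GFP_KERNEL, true);
    if (IS_ERR(sa))
            return PTR_ERR(sa);

    /* ... emit commands at offset drm_suballoc_soffset(sa) ... */

    /* Hand the range back; it is reused once the job's fence signals. */
    drm_suballoc_free(sa, fence);

    /* Teardown, once all fences added on free have signaled. */
    drm_suballoc_manager_fini(&mgr);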


Maarten Lankhorst (2):
  drm/amd: Convert amdgpu to use suballocation helper.
  drm/radeon: Use the drm suballocation manager implementation.

Thomas Hellström (1):
  drm/suballoc: Introduce a generic suballocation manager

 drivers/gpu/drm/Kconfig                    |   5 +
 drivers/gpu/drm/Makefile                   |   3 +
 drivers/gpu/drm/amd/amdgpu/Kconfig         |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  26 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  23 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c     | 320 ++-------------------
 drivers/gpu/drm/drm_suballoc.c             | 301 +++++++++++++++++++
 drivers/gpu/drm/radeon/radeon.h            |  55 +---
 drivers/gpu/drm/radeon/radeon_ib.c         |  12 +-
 drivers/gpu/drm/radeon/radeon_object.h     |  25 +-
 drivers/gpu/drm/radeon/radeon_sa.c         | 314 ++------------------
 drivers/gpu/drm/radeon/radeon_semaphore.c  |   6 +-
 include/drm/drm_suballoc.h                 | 112 ++++++++
 15 files changed, 518 insertions(+), 693 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_suballoc.c
 create mode 100644 include/drm/drm_suballoc.h

-- 
2.34.1



* [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-16 14:48 ` [Intel-xe] " Thomas Hellström
@ 2023-02-16 14:48   ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-16 14:48 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström, Daniel Vetter, Christian Koenig,
	Dave Airlie, intel-xe

Initially we tried to leverage the amdgpu suballocation manager.
It turns out, however, that it tries extremely hard not to enable
signalling on the fences that hold the memory up for freeing, which makes
it hard to understand and to fix potential issues with it.

So, in a simplification effort, introduce a drm suballocation manager as a
wrapper around an existing allocator (drm_mm) and avoid using queues
for freeing. This avoids throttling on free, which is an undesired
feature, as the throttling would typically need to be done uninterruptibly.

This variant is probably more cpu-hungry but can be improved at the cost
of additional complexity. Ideas for that are documented in the
drm_suballoc.c file.
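
To make the "no throttling on free" point concrete, here is a rough,
hypothetical sketch; driver_submit_job() is a made-up placeholder:

    /* Submission path: freeing is cheap and never sleeps. */
    fence = driver_submit_job(job);
    drm_suballoc_free(sa, fence);   /* range reclaimed when fence signals */

    /*
     * Any waiting happens on the allocation side instead: if the pool is
     * full, drm_suballoc_new() sleeps (interruptibly if requested) until
     * a fence callback returns enough space to the drm_mm range manager.
     */
    sa = drm_suballoc_new(&sa_manager, size, GFP_KERNEL, true);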

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/Kconfig        |   4 +
 drivers/gpu/drm/Makefile       |   3 +
 drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
 include/drm/drm_suballoc.h     | 112 ++++++++++++
 4 files changed, 420 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_suballoc.c
 create mode 100644 include/drm/drm_suballoc.h

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index dc0f94f02a82..8fbe57407c60 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
 	help
 	  Choose this if you need the GEM shmem helper functions
 
+config DRM_SUBALLOC_HELPER
+	tristate
+	depends on DRM
+
 config DRM_SCHED
 	tristate
 	depends on DRM
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index ab4460fcd63f..1e04d135e866 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
 drm_shmem_helper-y := drm_gem_shmem_helper.o
 obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
 
+drm_suballoc_helper-y := drm_suballoc.o
+obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
+
 drm_vram_helper-y := drm_gem_vram_helper.o
 obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
 
diff --git a/drivers/gpu/drm/drm_suballoc.c b/drivers/gpu/drm/drm_suballoc.c
new file mode 100644
index 000000000000..6e0292dea548
--- /dev/null
+++ b/drivers/gpu/drm/drm_suballoc.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+#include <drm/drm_suballoc.h>
+
+/**
+ * DOC: Overview
+ *
+ * This suballocator is intended to be a wrapper around a range allocator
+ * that is also aware of deferred range freeing with fences. Currently
+ * we hard-code the drm_mm as the range allocator.
+ * The approach, while rather simple, suffers from a few performance
+ * issues that can all be fixed if needed, at the cost of more and/or
+ * more complex code:
+ *
+ * 1) It's cpu-hungry; the drm_mm allocator is overkill. Either code a
+ * much simpler range allocator, or let the caller decide by providing
+ * ops that wrap any range allocator. We could also avoid waking up unless
+ * there is a reasonable chance of enough space in the range manager.
+ *
+ * 2) We unnecessarily install the fence callbacks too early, forcing
+ * enable_signaling() too early and causing extra driver effort. This is
+ * likely not an issue if used with the drm_scheduler, since it calls
+ * enable_signaling() early anyway.
+ *
+ * 3) Long processing in irq (disabled) context. We've mostly worked around
+ * that already by using the idle_list. If that workaround is deemed too
+ * complex for little gain, we can remove it and use spin_lock_irq()
+ * throughout the manager. If we want to shorten processing in irq context
+ * even further, we can skip the spin_trylock in __drm_suballoc_free() and
+ * avoid freeing allocations from irq context altogether. However, drm_mm
+ * should be quite fast at freeing ranges.
+ *
+ * 4) In addition, a shrinker could be added that starts processing the
+ * list items in 2) and 3) to play better with the system.
+ */
+
+static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager);
+
+/**
+ * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
+ * @sa_manager: pointer to the sa_manager
+ * @size: size in bytes of the range to manage
+ * @align: alignment for each suballocated chunk
+ *
+ * Prepares the suballocation manager for suballocations.
+ */
+void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
+			       u64 size, u64 align)
+{
+	spin_lock_init(&sa_manager->lock);
+	spin_lock_init(&sa_manager->idle_list_lock);
+	mutex_init(&sa_manager->alloc_mutex);
+	drm_mm_init(&sa_manager->mm, 0, size);
+	init_waitqueue_head(&sa_manager->wq);
+	sa_manager->range_size = size;
+	sa_manager->alignment = align;
+	INIT_LIST_HEAD(&sa_manager->idle_list);
+}
+EXPORT_SYMBOL(drm_suballoc_manager_init);
+
+/**
+ * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
+ * @sa_manager: pointer to the sa_manager
+ *
+ * Cleans up the suballocation manager after use. All fences added
+ * with drm_suballoc_free() must be signaled, or we cannot clean up
+ * the entire manager.
+ */
+void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
+{
+	drm_suballoc_process_idle(sa_manager);
+	drm_mm_takedown(&sa_manager->mm);
+	mutex_destroy(&sa_manager->alloc_mutex);
+}
+EXPORT_SYMBOL(drm_suballoc_manager_fini);
+
+static void __drm_suballoc_free(struct drm_suballoc *sa)
+{
+	struct drm_suballoc_manager *sa_manager = sa->manager;
+	struct dma_fence *fence;
+
+	/*
+	 * In order to avoid protecting the potentially lengthy drm_mm manager
+	 * *allocation* processing with an irq-disabling lock,
+	 * defer touching the drm_mm for freeing until we're in task context,
+	 * with no irqs disabled, or happen to succeed in taking the manager
+	 * lock.
+	 */
+	if (!in_task() || irqs_disabled()) {
+		unsigned long irqflags;
+
+		if (spin_trylock(&sa_manager->lock))
+			goto locked;
+
+		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
+		list_add_tail(&sa->idle_link, &sa_manager->idle_list);
+		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
+		wake_up(&sa_manager->wq);
+		return;
+	}
+
+	spin_lock(&sa_manager->lock);
+locked:
+	drm_mm_remove_node(&sa->node);
+
+	fence = sa->fence;
+	sa->fence = NULL;
+	spin_unlock(&sa_manager->lock);
+	/* Maybe only wake if first mm hole is sufficiently large? */
+	wake_up(&sa_manager->wq);
+	dma_fence_put(fence);
+	kfree(sa);
+}
+
+/* Free all deferred idle allocations */
+static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager)
+{
+	/*
+	 * prepare_to_wait() / wake_up() semantics ensure that any list
+	 * addition that was done before wake_up() is visible when
+	 * this code is called from the wait loop.
+	 */
+	if (!list_empty_careful(&sa_manager->idle_list)) {
+		struct drm_suballoc *sa, *next;
+		unsigned long irqflags;
+		LIST_HEAD(list);
+
+		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
+		list_splice_init(&sa_manager->idle_list, &list);
+		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
+
+		list_for_each_entry_safe(sa, next, &list, idle_link)
+			__drm_suballoc_free(sa);
+	}
+}
+
+static void
+drm_suballoc_fence_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
+{
+	struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
+
+	__drm_suballoc_free(sa);
+}
+
+static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
+{
+	struct drm_suballoc_manager *sa_manager = sa->manager;
+	int err;
+
+	drm_suballoc_process_idle(sa_manager);
+	spin_lock(&sa_manager->lock);
+	err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
+					 sa_manager->alignment, 0,
+					 DRM_MM_INSERT_EVICT);
+	spin_unlock(&sa_manager->lock);
+	return err;
+}
+
+/**
+ * drm_suballoc_new() - Make a suballocation.
+ * @sa_manager: pointer to the sa_manager
+ * @size: number of bytes we want to suballocate.
+ * @gfp: Allocation context.
+ * @intr: Whether to sleep interruptibly if sleeping.
+ *
+ * Try to make a suballocation of size @size, which will be rounded
+ * up to the alignment specified in drm_suballoc_manager_init().
+ *
+ * Return: a new suballocation on success, or an ERR_PTR on failure.
+ */
+struct drm_suballoc*
+drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
+		 gfp_t gfp, bool intr)
+{
+	struct drm_suballoc *sa;
+	DEFINE_WAIT(wait);
+	int err = 0;
+
+	if (size > sa_manager->range_size)
+		return ERR_PTR(-ENOSPC);
+
+	sa = kzalloc(sizeof(*sa), gfp);
+	if (!sa)
+		return ERR_PTR(-ENOMEM);
+
+	/* Avoid starvation using the alloc_mutex */
+	if (intr)
+		err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
+	else
+		mutex_lock(&sa_manager->alloc_mutex);
+	if (err) {
+		kfree(sa);
+		return ERR_PTR(err);
+	}
+
+	sa->manager = sa_manager;
+	err = drm_suballoc_tryalloc(sa, size);
+	if (err != -ENOSPC)
+		goto out;
+
+	for (;;) {
+		prepare_to_wait(&sa_manager->wq, &wait,
+				intr ? TASK_INTERRUPTIBLE :
+				TASK_UNINTERRUPTIBLE);
+
+		err = drm_suballoc_tryalloc(sa, size);
+		if (err != -ENOSPC)
+			break;
+
+		if (intr && signal_pending(current)) {
+			err = -ERESTARTSYS;
+			break;
+		}
+
+		io_schedule();
+	}
+	finish_wait(&sa_manager->wq, &wait);
+
+out:
+	mutex_unlock(&sa_manager->alloc_mutex);
+	if (!sa->node.size) {
+		kfree(sa);
+		WARN_ON(!err);
+		sa = ERR_PTR(err);
+	}
+
+	return sa;
+}
+EXPORT_SYMBOL(drm_suballoc_new);
+
+/**
+ * drm_suballoc_free() - Free a suballocation
+ * @sa: pointer to the suballocation
+ * @fence: fence that signals when the suballocation is idle
+ *
+ * Free the suballocation. The suballocation can be re-used after @fence
+ * signals.
+ */
+void
+drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
+{
+	if (!sa)
+		return;
+
+	if (!fence || dma_fence_is_signaled(fence)) {
+		__drm_suballoc_free(sa);
+		return;
+	}
+
+	sa->fence = dma_fence_get(fence);
+	if (dma_fence_add_callback(fence, &sa->cb, drm_suballoc_fence_signaled))
+		__drm_suballoc_free(sa);
+}
+EXPORT_SYMBOL(drm_suballoc_free);
+
+#ifdef CONFIG_DEBUG_FS
+
+/**
+ * drm_suballoc_dump_debug_info() - Dump the suballocator state
+ * @sa_manager: The suballoc manager.
+ * @p: Pointer to a drm printer for output.
+ * @suballoc_base: Constant to add to the suballocated offsets on printout.
+ *
+ * This function dumps the suballocator state. Note that the caller has
+ * to explicitly order frees and calls to this function in order for the
+ * freed node to show up as protected by a fence.
+ */
+void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+				  struct drm_printer *p, u64 suballoc_base)
+{
+	const struct drm_mm_node *entry;
+
+	spin_lock(&sa_manager->lock);
+	drm_mm_for_each_node(entry, &sa_manager->mm) {
+		struct drm_suballoc *sa =
+			container_of(entry, typeof(*sa), node);
+
+		drm_printf(p, " ");
+		drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
+			   (unsigned long long)suballoc_base + entry->start,
+			   (unsigned long long)suballoc_base + entry->start +
+			   entry->size, (unsigned long long)entry->size);
+
+		if (sa->fence)
+			drm_printf(p, " protected by 0x%016llx on context %llu",
+				   (unsigned long long)sa->fence->seqno,
+				   (unsigned long long)sa->fence->context);
+
+		drm_printf(p, "\n");
+	}
+	spin_unlock(&sa_manager->lock);
+}
+EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
+#endif
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_DESCRIPTION("Simple range suballocator helper");
+MODULE_LICENSE("GPL and additional rights");
diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
new file mode 100644
index 000000000000..910952b3383b
--- /dev/null
+++ b/include/drm/drm_suballoc.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+#ifndef _DRM_SUBALLOC_H_
+#define _DRM_SUBALLOC_H_
+
+#include <drm/drm_mm.h>
+
+#include <linux/dma-fence.h>
+#include <linux/types.h>
+
+/**
+ * struct drm_suballoc_manager - Wrapper for fenced range allocations
+ * @mm: The range manager. Protected by @lock.
+ * @range_size: The total size of the range.
+ * @alignment: Range alignment.
+ * @wq: Wait queue for sleeping allocations on contention.
+ * @idle_list: List of idle but not yet freed allocations. Protected by
+ * @idle_list_lock.
+ */
+struct drm_suballoc_manager {
+	/** @lock: Manager lock. Protects @mm. */
+	spinlock_t lock;
+	/**
+	 * @idle_list_lock: Lock to protect the idle_list.
+	 * Disable irqs when locking.
+	 */
+	spinlock_t idle_list_lock;
+	/** @alloc_mutex: Mutex to protect against starvation. */
+	struct mutex alloc_mutex;
+	struct drm_mm mm;
+	u64 range_size;
+	u64 alignment;
+	wait_queue_head_t wq;
+	struct list_head idle_list;
+};
+
+/**
+ * struct drm_suballoc - Suballocated range.
+ * @node: The drm_mm representation of the range.
+ * @fence: dma-fence indicating whether allocation is active or idle.
+ * Assigned on call to free the allocation, so it doesn't need protection.
+ * @cb: dma-fence callback structure. Used for callbacks when the fence signals.
+ * @manager: The struct drm_suballoc_manager the range belongs to. Immutable.
+ * @idle_link: Link for the manager idle_list. Protected by the
+ * drm_suballoc_manager::idle_list_lock.
+ */
+struct drm_suballoc {
+	struct drm_mm_node node;
+	struct dma_fence *fence;
+	struct dma_fence_cb cb;
+	struct drm_suballoc_manager *manager;
+	struct list_head idle_link;
+};
+
+void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
+			       u64 size, u64 align);
+
+void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager);
+
+struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager *sa_manager,
+				      u64 size, gfp_t gfp, bool intr);
+
+void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence);
+
+/**
+ * drm_suballoc_soffset - Range start.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The start of the allocated range.
+ */
+static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
+{
+	return sa->node.start;
+}
+
+/**
+ * drm_suballoc_eoffset - Range end.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The end of the allocated range + 1.
+ */
+static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
+{
+	return sa->node.start + sa->node.size;
+}
+
+/**
+ * drm_suballoc_size - Range size.
+ * @sa: The struct drm_suballoc.
+ *
+ * Return: The size of the allocated range.
+ */
+static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
+{
+	return sa->node.size;
+}
+
+#ifdef CONFIG_DEBUG_FS
+void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+				  struct drm_printer *p, u64 suballoc_base);
+#else
+static inline void
+drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
+			     struct drm_printer *p, u64 suballoc_base)
+{ }
+
+#endif
+
+#endif /* _DRM_SUBALLOC_H_ */
-- 
2.34.1



* [PATCH 2/3] drm/amd: Convert amdgpu to use suballocation helper.
  2023-02-16 14:48 ` [Intel-xe] " Thomas Hellström
@ 2023-02-16 14:48   ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-16 14:48 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström, Daniel Vetter, Christian Koenig,
	Dave Airlie, intel-xe

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Now that we have a generic suballocation helper, use it in amdgpu.
The debug output is slightly different and suballocation may be
slightly more cpu-hungry.
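
The caller-visible change is essentially that the per-allocation
alignment argument goes away; the alignment is instead fixed when the
pool is initialized. Roughly, taken from the amdgpu_ib.c hunk below:

    /* before */
    r = amdgpu_sa_bo_new(&adev->ib_pools[pool_type], &ib->sa_bo, size, 256);

    /* after: 256-byte alignment is set once in amdgpu_sa_bo_manager_init() */
    r = amdgpu_sa_bo_new(&adev->ib_pools[pool_type], &ib->sa_bo, size);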

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/Kconfig                    |   1 +
 drivers/gpu/drm/amd/amdgpu/Kconfig         |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  26 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c     |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  23 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c     | 320 ++-------------------
 7 files changed, 43 insertions(+), 336 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 8fbe57407c60..73ddfdf3a894 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -77,6 +77,7 @@ config DRM_KUNIT_TEST
 	select DRM_DISPLAY_HELPER
 	select DRM_LIB_RANDOM
 	select DRM_KMS_HELPER
+	select DRM_SUBALLOC_HELPER
 	select DRM_BUDDY
 	select DRM_EXPORT_FOR_TESTS if m
 	select DRM_KUNIT_TEST_HELPERS
diff --git a/drivers/gpu/drm/amd/amdgpu/Kconfig b/drivers/gpu/drm/amd/amdgpu/Kconfig
index 5341b6b242c3..0ed12171450b 100644
--- a/drivers/gpu/drm/amd/amdgpu/Kconfig
+++ b/drivers/gpu/drm/amd/amdgpu/Kconfig
@@ -18,6 +18,7 @@ config DRM_AMDGPU
 	select BACKLIGHT_CLASS_DEVICE
 	select INTERVAL_TREE
 	select DRM_BUDDY
+	select DRM_SUBALLOC_HELPER
 	# amdgpu depends on ACPI_VIDEO when ACPI is enabled, for select to work
 	# ACPI_VIDEO's dependencies must also be selected.
 	select INPUT if ACPI
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 164141bc8b4a..dda88090f044 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -424,29 +424,11 @@ struct amdgpu_clock {
  * alignment).
  */
 
-#define AMDGPU_SA_NUM_FENCE_LISTS	32
-
 struct amdgpu_sa_manager {
-	wait_queue_head_t	wq;
-	struct amdgpu_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[AMDGPU_SA_NUM_FENCE_LISTS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-/* sub-allocation buffer */
-struct amdgpu_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct amdgpu_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct dma_fence	        *fence;
+	struct drm_suballoc_manager	base;
+	struct amdgpu_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
 };
 
 int amdgpu_fence_slab_init(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
index bcccc348dbe2..5621b63c7f42 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
@@ -69,7 +69,7 @@ int amdgpu_ib_get(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 
 	if (size) {
 		r = amdgpu_sa_bo_new(&adev->ib_pools[pool_type],
-				      &ib->sa_bo, size, 256);
+				      &ib->sa_bo, size);
 		if (r) {
 			dev_err(adev->dev, "failed to get a new IB (%d)\n", r);
 			return r;
@@ -309,8 +309,7 @@ int amdgpu_ib_pool_init(struct amdgpu_device *adev)
 
 	for (i = 0; i < AMDGPU_IB_POOL_MAX; i++) {
 		r = amdgpu_sa_bo_manager_init(adev, &adev->ib_pools[i],
-					      AMDGPU_IB_POOL_SIZE,
-					      AMDGPU_GPU_PAGE_SIZE,
+					      AMDGPU_IB_POOL_SIZE, 256,
 					      AMDGPU_GEM_DOMAIN_GTT);
 		if (r)
 			goto error;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 93207badf83f..568baf15d5b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -336,15 +336,22 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
 /*
  * sub allocation
  */
+static inline struct amdgpu_sa_manager *
+to_amdgpu_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct amdgpu_sa_manager, base);
+}
 
-static inline uint64_t amdgpu_sa_bo_gpu_addr(struct amdgpu_sa_bo *sa_bo)
+static inline uint64_t amdgpu_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_amdgpu_sa_manager(sa_bo->manager)->gpu_addr +
+		drm_suballoc_soffset(sa_bo);
 }
 
-static inline void * amdgpu_sa_bo_cpu_addr(struct amdgpu_sa_bo *sa_bo)
+static inline void * amdgpu_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_amdgpu_sa_manager(sa_bo->manager)->cpu_ptr +
+		drm_suballoc_soffset(sa_bo);
 }
 
 int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
@@ -355,11 +362,11 @@ void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
 int amdgpu_sa_bo_manager_start(struct amdgpu_device *adev,
 				      struct amdgpu_sa_manager *sa_manager);
 int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
-		     struct amdgpu_sa_bo **sa_bo,
-		     unsigned size, unsigned align);
+		     struct drm_suballoc **sa_bo,
+		     unsigned size);
 void amdgpu_sa_bo_free(struct amdgpu_device *adev,
-			      struct amdgpu_sa_bo **sa_bo,
-			      struct dma_fence *fence);
+		       struct drm_suballoc **sa_bo,
+		       struct dma_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 void amdgpu_sa_bo_dump_debug_info(struct amdgpu_sa_manager *sa_manager,
 					 struct seq_file *m);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 3989e755a5b4..018f36b10de8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -27,6 +27,7 @@
 #include <drm/amdgpu_drm.h>
 #include <drm/gpu_scheduler.h>
 #include <drm/drm_print.h>
+#include <drm/drm_suballoc.h>
 
 struct amdgpu_device;
 struct amdgpu_ring;
@@ -92,7 +93,7 @@ enum amdgpu_ib_pool_type {
 };
 
 struct amdgpu_ib {
-	struct amdgpu_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
index 524d10b21041..e7b3539e0294 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sa.c
@@ -44,327 +44,61 @@
 
 #include "amdgpu.h"
 
-static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo);
-static void amdgpu_sa_bo_try_free(struct amdgpu_sa_manager *sa_manager);
-
 int amdgpu_sa_bo_manager_init(struct amdgpu_device *adev,
 			      struct amdgpu_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain)
+			      unsigned size, u32 suballoc_align, u32 domain)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
+	int r;
 
-	r = amdgpu_bo_create_kernel(adev, size, align, domain, &sa_manager->bo,
+	r = amdgpu_bo_create_kernel(adev, size, AMDGPU_GPU_PAGE_SIZE, domain, &sa_manager->bo,
 				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
 	if (r) {
 		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
-	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
+	memset(sa_manager->cpu_ptr, 0, size);
+	drm_suballoc_manager_init(&sa_manager->base, size, suballoc_align);
 	return r;
 }
 
 void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
 			       struct amdgpu_sa_manager *sa_manager)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
-
 	if (sa_manager->bo == NULL) {
 		dev_err(adev->dev, "no bo for sa manager\n");
 		return;
 	}
 
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		amdgpu_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		amdgpu_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 
 	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
-	sa_manager->size = 0;
 }
 
-static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)
-{
-	struct amdgpu_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	dma_fence_put(sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void amdgpu_sa_bo_try_free(struct amdgpu_sa_manager *sa_manager)
+int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size, GFP_KERNEL, true);
 
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
 
-	sa_bo = list_entry(sa_manager->hole->next, struct amdgpu_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL ||
-		    !dma_fence_is_signaled(sa_bo->fence)) {
-			return;
-		}
-		amdgpu_sa_bo_remove_locked(sa_bo);
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned amdgpu_sa_bo_hole_soffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct amdgpu_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned amdgpu_sa_bo_hole_eoffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct amdgpu_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool amdgpu_sa_bo_try_alloc(struct amdgpu_sa_manager *sa_manager,
-				   struct amdgpu_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * amdgpu_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool amdgpu_sa_event(struct amdgpu_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-		if (!list_empty(&sa_manager->flist[i]))
-			return true;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool amdgpu_sa_bo_next_hole(struct amdgpu_sa_manager *sa_manager,
-				   struct dma_fence **fences,
-				   unsigned *tries)
-{
-	struct amdgpu_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i) {
-		struct amdgpu_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i]))
-			continue;
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct amdgpu_sa_bo, flist);
-
-		if (!dma_fence_is_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		uint32_t idx = best_bo->fence->context;
-
-		idx %= AMDGPU_SA_NUM_FENCE_LISTS;
-		++tries[idx];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		amdgpu_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
-		     struct amdgpu_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct dma_fence *fences[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned tries[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned count;
-	int i, r;
-	signed long t;
-
-	if (WARN_ON_ONCE(align > sa_manager->align))
-		return -EINVAL;
-
-	if (WARN_ON_ONCE(size > sa_manager->size))
-		return -EINVAL;
-
-	*sa_bo = kmalloc(sizeof(struct amdgpu_sa_bo), GFP_KERNEL);
-	if (!(*sa_bo))
-		return -ENOMEM;
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			tries[i] = 0;
-
-		do {
-			amdgpu_sa_bo_try_free(sa_manager);
-
-			if (amdgpu_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (amdgpu_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0, count = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			if (fences[i])
-				fences[count++] = dma_fence_get(fences[i]);
-
-		if (count) {
-			spin_unlock(&sa_manager->wq.lock);
-			t = dma_fence_wait_any_timeout(fences, count, false,
-						       MAX_SCHEDULE_TIMEOUT,
-						       NULL);
-			for (i = 0; i < count; ++i)
-				dma_fence_put(fences[i]);
-
-			r = (t > 0) ? 0 : t;
-			spin_lock(&sa_manager->wq.lock);
-		} else {
-			/* if we have nothing to wait for block */
-			r = wait_event_interruptible_locked(
-				sa_manager->wq,
-				amdgpu_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
+void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct drm_suballoc **sa_bo,
 		       struct dma_fence *fence)
 {
-	struct amdgpu_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !dma_fence_is_signaled(fence)) {
-		uint32_t idx;
-
-		(*sa_bo)->fence = dma_fence_get(fence);
-		idx = fence->context % AMDGPU_SA_NUM_FENCE_LISTS;
-		list_add_tail(&(*sa_bo)->flist, &sa_manager->flist[idx]);
-	} else {
-		amdgpu_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_free(*sa_bo, fence);
 	*sa_bo = NULL;
 }
 
@@ -373,26 +107,8 @@ void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
 void amdgpu_sa_bo_dump_debug_info(struct amdgpu_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct amdgpu_sa_bo *i;
-
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
+	struct drm_printer p = drm_seq_file_printer(m);
 
-		if (i->fence)
-			seq_printf(m, " protected by 0x%016llx on context %llu",
-				   i->fence->seqno, i->fence->context);
-
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
-- 
2.34.1



 				&sa_manager->gpu_addr, &sa_manager->cpu_ptr);
 	if (r) {
 		dev_err(adev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
-	memset(sa_manager->cpu_ptr, 0, sa_manager->size);
+	memset(sa_manager->cpu_ptr, 0, size);
+	drm_suballoc_manager_init(&sa_manager->base, size, suballoc_align);
 	return r;
 }
 
 void amdgpu_sa_bo_manager_fini(struct amdgpu_device *adev,
 			       struct amdgpu_sa_manager *sa_manager)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
-
 	if (sa_manager->bo == NULL) {
 		dev_err(adev->dev, "no bo for sa manager\n");
 		return;
 	}
 
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		amdgpu_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(adev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		amdgpu_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 
 	amdgpu_bo_free_kernel(&sa_manager->bo, &sa_manager->gpu_addr, &sa_manager->cpu_ptr);
-	sa_manager->size = 0;
 }
 
-static void amdgpu_sa_bo_remove_locked(struct amdgpu_sa_bo *sa_bo)
-{
-	struct amdgpu_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	dma_fence_put(sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void amdgpu_sa_bo_try_free(struct amdgpu_sa_manager *sa_manager)
+int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct amdgpu_sa_bo *sa_bo, *tmp;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size, GFP_KERNEL, true);
 
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
 
-	sa_bo = list_entry(sa_manager->hole->next, struct amdgpu_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL ||
-		    !dma_fence_is_signaled(sa_bo->fence)) {
-			return;
-		}
-		amdgpu_sa_bo_remove_locked(sa_bo);
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned amdgpu_sa_bo_hole_soffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct amdgpu_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned amdgpu_sa_bo_hole_eoffset(struct amdgpu_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct amdgpu_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool amdgpu_sa_bo_try_alloc(struct amdgpu_sa_manager *sa_manager,
-				   struct amdgpu_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * amdgpu_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool amdgpu_sa_event(struct amdgpu_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-		if (!list_empty(&sa_manager->flist[i]))
-			return true;
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	eoffset = amdgpu_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool amdgpu_sa_bo_next_hole(struct amdgpu_sa_manager *sa_manager,
-				   struct dma_fence **fences,
-				   unsigned *tries)
-{
-	struct amdgpu_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = amdgpu_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i) {
-		struct amdgpu_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i]))
-			continue;
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct amdgpu_sa_bo, flist);
-
-		if (!dma_fence_is_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		uint32_t idx = best_bo->fence->context;
-
-		idx %= AMDGPU_SA_NUM_FENCE_LISTS;
-		++tries[idx];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		amdgpu_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int amdgpu_sa_bo_new(struct amdgpu_sa_manager *sa_manager,
-		     struct amdgpu_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct dma_fence *fences[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned tries[AMDGPU_SA_NUM_FENCE_LISTS];
-	unsigned count;
-	int i, r;
-	signed long t;
-
-	if (WARN_ON_ONCE(align > sa_manager->align))
-		return -EINVAL;
-
-	if (WARN_ON_ONCE(size > sa_manager->size))
-		return -EINVAL;
-
-	*sa_bo = kmalloc(sizeof(struct amdgpu_sa_bo), GFP_KERNEL);
-	if (!(*sa_bo))
-		return -ENOMEM;
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			tries[i] = 0;
-
-		do {
-			amdgpu_sa_bo_try_free(sa_manager);
-
-			if (amdgpu_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (amdgpu_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0, count = 0; i < AMDGPU_SA_NUM_FENCE_LISTS; ++i)
-			if (fences[i])
-				fences[count++] = dma_fence_get(fences[i]);
-
-		if (count) {
-			spin_unlock(&sa_manager->wq.lock);
-			t = dma_fence_wait_any_timeout(fences, count, false,
-						       MAX_SCHEDULE_TIMEOUT,
-						       NULL);
-			for (i = 0; i < count; ++i)
-				dma_fence_put(fences[i]);
-
-			r = (t > 0) ? 0 : t;
-			spin_lock(&sa_manager->wq.lock);
-		} else {
-			/* if we have nothing to wait for block */
-			r = wait_event_interruptible_locked(
-				sa_manager->wq,
-				amdgpu_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
+void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct drm_suballoc **sa_bo,
 		       struct dma_fence *fence)
 {
-	struct amdgpu_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !dma_fence_is_signaled(fence)) {
-		uint32_t idx;
-
-		(*sa_bo)->fence = dma_fence_get(fence);
-		idx = fence->context % AMDGPU_SA_NUM_FENCE_LISTS;
-		list_add_tail(&(*sa_bo)->flist, &sa_manager->flist[idx]);
-	} else {
-		amdgpu_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_free(*sa_bo, fence);
 	*sa_bo = NULL;
 }
 
@@ -373,26 +107,8 @@ void amdgpu_sa_bo_free(struct amdgpu_device *adev, struct amdgpu_sa_bo **sa_bo,
 void amdgpu_sa_bo_dump_debug_info(struct amdgpu_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct amdgpu_sa_bo *i;
-
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
+	struct drm_printer p = drm_seq_file_printer(m);
 
-		if (i->fence)
-			seq_printf(m, " protected by 0x%016llx on context %llu",
-				   i->fence->seqno, i->fence->context);
-
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2023-02-16 14:48 ` [Intel-xe] " Thomas Hellström
@ 2023-02-16 14:48   ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-16 14:48 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström, Daniel Vetter, Christian Koenig,
	Dave Airlie, intel-xe

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Use the generic suballocation helper.
Note that the generic suballocator only allows a single alignment
per manager, so we may waste a few more bytes for radeon_semaphore
(semaphores are 8 bytes but are now placed at the pool-wide 256-byte
alignment). That shouldn't be a big deal, and per-allocation alignment
could be re-added if needed. Also, similar to amdgpu, the debug output
changes slightly and suballocator CPU usage may be slightly higher.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/radeon/radeon.h           |  55 +---
 drivers/gpu/drm/radeon/radeon_ib.c        |  12 +-
 drivers/gpu/drm/radeon/radeon_object.h    |  25 +-
 drivers/gpu/drm/radeon/radeon_sa.c        | 314 ++--------------------
 drivers/gpu/drm/radeon/radeon_semaphore.c |   6 +-
 5 files changed, 55 insertions(+), 357 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 57e20780a458..d19a4b1c1a8f 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -79,6 +79,7 @@
 
 #include <drm/drm_gem.h>
 #include <drm/drm_audio_component.h>
+#include <drm/drm_suballoc.h>
 
 #include "radeon_family.h"
 #include "radeon_mode.h"
@@ -511,52 +512,12 @@ struct radeon_bo {
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
-/* sub-allocation manager, it has to be protected by another lock.
- * By conception this is an helper for other part of the driver
- * like the indirect buffer or semaphore, which both have their
- * locking.
- *
- * Principe is simple, we keep a list of sub allocation in offset
- * order (first entry has offset == 0, last entry has the highest
- * offset).
- *
- * When allocating new object we first check if there is room at
- * the end total_size - (last_object_offset + last_object_size) >=
- * alloc_size. If so we allocate new object there.
- *
- * When there is not enough room at the end, we start waiting for
- * each sub object until we reach object_offset+object_size >=
- * alloc_size, this object then become the sub object we return.
- *
- * Alignment can't be bigger than page size.
- *
- * Hole are not considered for allocation to keep things simple.
- * Assumption is that there won't be hole (all object on same
- * alignment).
- */
 struct radeon_sa_manager {
-	wait_queue_head_t	wq;
-	struct radeon_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[RADEON_NUM_RINGS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-struct radeon_sa_bo;
-
-/* sub-allocation buffer */
-struct radeon_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct radeon_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct radeon_fence		*fence;
+	struct drm_suballoc_manager	base;
+	struct radeon_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
+	u32 domain;
 };
 
 /*
@@ -587,7 +548,7 @@ int radeon_mode_dumb_mmap(struct drm_file *filp,
  * Semaphores.
  */
 struct radeon_semaphore {
-	struct radeon_sa_bo	*sa_bo;
+	struct drm_suballoc	*sa_bo;
 	signed			waiters;
 	uint64_t		gpu_addr;
 };
@@ -816,7 +777,7 @@ void radeon_irq_kms_disable_hpd(struct radeon_device *rdev, unsigned hpd_mask);
  */
 
 struct radeon_ib {
-	struct radeon_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/radeon/radeon_ib.c b/drivers/gpu/drm/radeon/radeon_ib.c
index 62b116727b4f..63fcfe65d814 100644
--- a/drivers/gpu/drm/radeon/radeon_ib.c
+++ b/drivers/gpu/drm/radeon/radeon_ib.c
@@ -61,7 +61,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 {
 	int r;
 
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &ib->sa_bo, size);
 	if (r) {
 		dev_err(rdev->dev, "failed to get a new IB (%d)\n", r);
 		return r;
@@ -77,7 +77,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 		/* ib pool is bound at RADEON_VA_IB_OFFSET in virtual address
 		 * space and soffset is the offset inside the pool bo
 		 */
-		ib->gpu_addr = ib->sa_bo->soffset + RADEON_VA_IB_OFFSET;
+		ib->gpu_addr = drm_suballoc_soffset(ib->sa_bo) + RADEON_VA_IB_OFFSET;
 	} else {
 		ib->gpu_addr = radeon_sa_bo_gpu_addr(ib->sa_bo);
 	}
@@ -97,7 +97,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib *ib)
 {
 	radeon_sync_free(rdev, &ib->sync, ib->fence);
-	radeon_sa_bo_free(rdev, &ib->sa_bo, ib->fence);
+	radeon_sa_bo_free(&ib->sa_bo, ib->fence);
 	radeon_fence_unref(&ib->fence);
 }
 
@@ -201,8 +201,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 
 	if (rdev->family >= CHIP_BONAIRE) {
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT,
 					      RADEON_GEM_GTT_WC);
 	} else {
@@ -210,8 +209,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 		 * to the command stream checking
 		 */
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT, 0);
 	}
 	if (r) {
diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
index 0a6ef49e990a..b7c5087a7dbc 100644
--- a/drivers/gpu/drm/radeon/radeon_object.h
+++ b/drivers/gpu/drm/radeon/radeon_object.h
@@ -169,15 +169,22 @@ extern void radeon_bo_fence(struct radeon_bo *bo, struct radeon_fence *fence,
 /*
  * sub allocation
  */
+static inline struct radeon_sa_manager *
+to_radeon_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct radeon_sa_manager, base);
+}
 
-static inline uint64_t radeon_sa_bo_gpu_addr(struct radeon_sa_bo *sa_bo)
+static inline uint64_t radeon_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->gpu_addr +
+		drm_suballoc_soffset(sa_bo);
 }
 
-static inline void * radeon_sa_bo_cpu_addr(struct radeon_sa_bo *sa_bo)
+static inline void * radeon_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->cpu_ptr +
+		drm_suballoc_soffset(sa_bo);
 }
 
 extern int radeon_sa_bo_manager_init(struct radeon_device *rdev,
@@ -190,12 +197,10 @@ extern int radeon_sa_bo_manager_start(struct radeon_device *rdev,
 				      struct radeon_sa_manager *sa_manager);
 extern int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 					struct radeon_sa_manager *sa_manager);
-extern int radeon_sa_bo_new(struct radeon_device *rdev,
-			    struct radeon_sa_manager *sa_manager,
-			    struct radeon_sa_bo **sa_bo,
-			    unsigned size, unsigned align);
-extern void radeon_sa_bo_free(struct radeon_device *rdev,
-			      struct radeon_sa_bo **sa_bo,
+extern int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+			    struct drm_suballoc **sa_bo,
+			    unsigned size);
+extern void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 			      struct radeon_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 extern void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
diff --git a/drivers/gpu/drm/radeon/radeon_sa.c b/drivers/gpu/drm/radeon/radeon_sa.c
index 0981948bd9ed..b5555750aa0d 100644
--- a/drivers/gpu/drm/radeon/radeon_sa.c
+++ b/drivers/gpu/drm/radeon/radeon_sa.c
@@ -44,53 +44,31 @@
 
 #include "radeon.h"
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo);
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager);
-
 int radeon_sa_bo_manager_init(struct radeon_device *rdev,
 			      struct radeon_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain, u32 flags)
+			      unsigned size, u32 sa_align, u32 domain, u32 flags)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
-	}
+	int r;
 
-	r = radeon_bo_create(rdev, size, align, true,
+	r = radeon_bo_create(rdev, size, RADEON_GPU_PAGE_SIZE, true,
 			     domain, flags, NULL, NULL, &sa_manager->bo);
 	if (r) {
 		dev_err(rdev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
+	sa_manager->domain = domain;
+
+	drm_suballoc_manager_init(&sa_manager->base, size, sa_align);
+
 	return r;
 }
 
 void radeon_sa_bo_manager_fini(struct radeon_device *rdev,
 			       struct radeon_sa_manager *sa_manager)
 {
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		radeon_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(rdev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		radeon_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 	radeon_bo_unref(&sa_manager->bo);
-	sa_manager->size = 0;
 }
 
 int radeon_sa_bo_manager_start(struct radeon_device *rdev,
@@ -139,260 +117,33 @@ int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 	return r;
 }
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo)
+int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct radeon_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	radeon_fence_unref(&sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager)
-{
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size, GFP_KERNEL, true);
 
-	sa_bo = list_entry(sa_manager->hole->next, struct radeon_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL || !radeon_fence_signaled(sa_bo->fence)) {
-			return;
-		}
-		radeon_sa_bo_remove_locked(sa_bo);
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned radeon_sa_bo_hole_soffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct radeon_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned radeon_sa_bo_hole_eoffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct radeon_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool radeon_sa_bo_try_alloc(struct radeon_sa_manager *sa_manager,
-				   struct radeon_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * radeon_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		if (!list_empty(&sa_manager->flist[i])) {
-			return true;
-		}
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
-				   struct radeon_fence **fences,
-				   unsigned *tries)
-{
-	struct radeon_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		struct radeon_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i])) {
-			continue;
-		}
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct radeon_sa_bo, flist);
-
-		if (!radeon_fence_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		++tries[best_bo->fence->ring];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		radeon_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int radeon_sa_bo_new(struct radeon_device *rdev,
-		     struct radeon_sa_manager *sa_manager,
-		     struct radeon_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct radeon_fence *fences[RADEON_NUM_RINGS];
-	unsigned tries[RADEON_NUM_RINGS];
-	int i, r;
-
-	BUG_ON(align > sa_manager->align);
-	BUG_ON(size > sa_manager->size);
-
-	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
-	if ((*sa_bo) == NULL) {
-		return -ENOMEM;
-	}
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			tries[i] = 0;
-
-		do {
-			radeon_sa_bo_try_free(sa_manager);
-
-			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_ref(fences[i]);
-
-		spin_unlock(&sa_manager->wq.lock);
-		r = radeon_fence_wait_any(rdev, fences, false);
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_unref(&fences[i]);
-		spin_lock(&sa_manager->wq.lock);
-		/* if we have nothing to wait for block */
-		if (r == -ENOENT) {
-			r = wait_event_interruptible_locked(
-				sa_manager->wq, 
-				radeon_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
+void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 		       struct radeon_fence *fence)
 {
-	struct radeon_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !radeon_fence_signaled(fence)) {
-		(*sa_bo)->fence = radeon_fence_ref(fence);
-		list_add_tail(&(*sa_bo)->flist,
-			      &sa_manager->flist[fence->ring]);
-	} else {
-		radeon_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	if (fence)
+		drm_suballoc_free(*sa_bo, &fence->base);
+	else
+		drm_suballoc_free(*sa_bo, NULL);
+
 	*sa_bo = NULL;
 }
 
@@ -400,25 +151,8 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
 void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct radeon_sa_bo *i;
+	struct drm_printer p = drm_seq_file_printer(m);
 
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
-		if (i->fence) {
-			seq_printf(m, " protected by 0x%016llx on ring %d",
-				   i->fence->seq, i->fence->ring);
-		}
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
index 221e59476f64..3e2b0bf0d55d 100644
--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
+++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
@@ -40,8 +40,8 @@ int radeon_semaphore_create(struct radeon_device *rdev,
 	if (*semaphore == NULL) {
 		return -ENOMEM;
 	}
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
-			     &(*semaphore)->sa_bo, 8, 8);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
+			     &(*semaphore)->sa_bo, 8);
 	if (r) {
 		kfree(*semaphore);
 		*semaphore = NULL;
@@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
 		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
 			" hardware lockup imminent!\n", *semaphore);
 	}
-	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
+	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
 	kfree(*semaphore);
 	*semaphore = NULL;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
  (?)
@ 2023-02-17  1:52   ` kernel test robot
  -1 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-02-17  1:52 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: oe-kbuild-all

Hi Thomas,

I love your patch! Yet something to improve:

[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.2-rc8 next-20230216]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Thomas-Hellstr-m/drm-suballoc-Introduce-a-generic-suballocation-manager/20230216-225152
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20230216144847.216259-4-thomas.hellstrom%40linux.intel.com
patch subject: [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
config: ia64-defconfig (https://download.01.org/0day-ci/archive/20230217/202302170904.Qtjoklnc-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/ccbca3b1e02d931c5540d8f3dbb2985fd4663075
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Thomas-Hellstr-m/drm-suballoc-Introduce-a-generic-suballocation-manager/20230216-225152
        git checkout ccbca3b1e02d931c5540d8f3dbb2985fd4663075
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=ia64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302170904.Qtjoklnc-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "drm_suballoc_new" [drivers/gpu/drm/radeon/radeon.ko] undefined!
>> ERROR: modpost: "drm_suballoc_free" [drivers/gpu/drm/radeon/radeon.ko] undefined!
>> ERROR: modpost: "drm_suballoc_manager_init" [drivers/gpu/drm/radeon/radeon.ko] undefined!
>> ERROR: modpost: "drm_suballoc_manager_fini" [drivers/gpu/drm/radeon/radeon.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
@ 2023-02-17 11:00     ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 11:00 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Am 16.02.23 um 15:48 schrieb Thomas Hellström:
> Initially we tried to leverage the amdgpu suballocation manager.
> It turns out, however, that it tries extremely hard not to enable
> signalling on the fences that hold the memory up for freeing, which makes
> it hard to understand and to fix potential issues with it.
>
> So in a simplification effort, introduce a drm suballocation manager as a
> wrapper around an existing allocator (drm_mm) and to avoid using queues
> for freeing, thus avoiding throttling on free which is an undesired
> feature as typically the throttling needs to be done uninterruptibly.
>
> This variant is probably more cpu-hungry but can be improved at the cost
> of additional complexity. Ideas for that are documented in the
> drm_suballoc.c file.
>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> ---
>   drivers/gpu/drm/Kconfig        |   4 +
>   drivers/gpu/drm/Makefile       |   3 +
>   drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>   4 files changed, 420 insertions(+)
>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>   create mode 100644 include/drm/drm_suballoc.h
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index dc0f94f02a82..8fbe57407c60 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>   	help
>   	  Choose this if you need the GEM shmem helper functions
>   
> +config DRM_SUBALLOC_HELPER
> +	tristate
> +	depends on DRM
> +
>   config DRM_SCHED
>   	tristate
>   	depends on DRM
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index ab4460fcd63f..1e04d135e866 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>   
> +drm_suballoc_helper-y := drm_suballoc.o
> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
> +
>   drm_vram_helper-y := drm_gem_vram_helper.o
>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>   
> diff --git a/drivers/gpu/drm/drm_suballoc.c b/drivers/gpu/drm/drm_suballoc.c
> new file mode 100644
> index 000000000000..6e0292dea548
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_suballoc.c
> @@ -0,0 +1,301 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +#include <drm/drm_suballoc.h>
> +
> +/**
> + * DOC:
> + * This suballocator intends to be a wrapper around a range allocator
> + * that is aware also of deferred range freeing with fences. Currently
> + * we hard-code the drm_mm as the range allocator.
> + * The approach, while rather simple, suffers from three performance
> + * issues that can all be fixed if needed at the tradeoff of more and / or
> + * more complex code:
> + *
> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
> + * much simpler range allocator, or let the caller decide by providing
> + * ops that wrap any range allocator. Also could avoid waking up unless
> + * there is a reasonable chance of enough space in the range manager.

That's most likely highly problematic.

The suballocator in radeon/amdgpu was designed so that it resembles a 
ring buffer and is therefore rather CPU efficient.

We could make the allocator much more trivial, but using drm_mm for this 
is a sledgehammer and therefore a pretty clear no-go.
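
For comparison, the fast path of such a ring-like suballocator is roughly
the following (only a sketch to illustrate the cost difference, not the
actual radeon/amdgpu code; wrap-around contiguity and fence retirement are
omitted):

struct ring_sa {
	u64 head, tail, size;	/* tail advances as fences signal */
};

/* O(1): a bounds check and a pointer bump, no search structure. */
static bool ring_sa_alloc(struct ring_sa *sa, u64 bytes, u64 *offset)
{
	if (bytes > sa->size - (sa->head - sa->tail))
		return false;	/* caller waits for the oldest fence */

	*offset = sa->head % sa->size;
	sa->head += bytes;
	return true;
}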

Regards,
Christian.

> + *
> + * 2) We unnecessarily install the fence callbacks too early, forcing
> + * enable_signaling() too early causing extra driver effort. This is likely
> + * not an issue if used with the drm_scheduler since it calls
> + * enable_signaling() early anyway.
> + *
> + * 3) Long processing in irq (disabled) context. We've mostly worked around
> + * that already by using the idle_list. If that workaround is deemed too
> + * complex for little gain, we can remove it and use spin_lock_irq()
> + * throughout the manager. If we want to shorten processing in irq context
> + * even further, we can skip the spin_trylock in __drm_suballoc_free() and
> + * avoid freeing allocations from irq context altogether. However drm_mm
> + * should be quite fast at freeing ranges.
> + *
> + * 4) Shrinker that starts processing the list items in 2) and 3) to play
> + * better with the system.
> + */
> +
> +static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager);
> +
> +/**
> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
> + * @sa_manager: pointer to the sa_manager
> + * @size: number of bytes we want to suballocate
> + * @align: alignment for each suballocated chunk
> + *
> + * Prepares the suballocation manager for suballocations.
> + */
> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
> +			       u64 size, u64 align)
> +{
> +	spin_lock_init(&sa_manager->lock);
> +	spin_lock_init(&sa_manager->idle_list_lock);
> +	mutex_init(&sa_manager->alloc_mutex);
> +	drm_mm_init(&sa_manager->mm, 0, size);
> +	init_waitqueue_head(&sa_manager->wq);
> +	sa_manager->range_size = size;
> +	sa_manager->alignment = align;
> +	INIT_LIST_HEAD(&sa_manager->idle_list);
> +}
> +EXPORT_SYMBOL(drm_suballoc_manager_init);
> +
> +/**
> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
> + * @sa_manager: pointer to the sa_manager
> + *
> + * Cleans up the suballocation manager after use. All fences added
> + * with drm_suballoc_free() must be signaled, or we cannot clean up
> + * the entire manager.
> + */
> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
> +{
> +	drm_suballoc_process_idle(sa_manager);
> +	drm_mm_takedown(&sa_manager->mm);
> +	mutex_destroy(&sa_manager->alloc_mutex);
> +}
> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
> +
> +static void __drm_suballoc_free(struct drm_suballoc *sa)
> +{
> +	struct drm_suballoc_manager *sa_manager = sa->manager;
> +	struct dma_fence *fence;
> +
> +	/*
> +	 * In order to avoid protecting the potentially lengthy drm_mm manager
> +	 * *allocation* processing with an irq-disabling lock,
> +	 * defer touching the drm_mm for freeing until we're in task context,
> +	 * with no irqs disabled, or happen to succeed in taking the manager
> +	 * lock.
> +	 */
> +	if (!in_task() || irqs_disabled()) {
> +		unsigned long irqflags;
> +
> +		if (spin_trylock(&sa_manager->lock))
> +			goto locked;
> +
> +		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
> +		list_add_tail(&sa->idle_link, &sa_manager->idle_list);
> +		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
> +		wake_up(&sa_manager->wq);
> +		return;
> +	}
> +
> +	spin_lock(&sa_manager->lock);
> +locked:
> +	drm_mm_remove_node(&sa->node);
> +
> +	fence = sa->fence;
> +	sa->fence = NULL;
> +	spin_unlock(&sa_manager->lock);
> +	/* Maybe only wake if first mm hole is sufficiently large? */
> +	wake_up(&sa_manager->wq);
> +	dma_fence_put(fence);
> +	kfree(sa);
> +}
> +
> +/* Free all deferred idle allocations */
> +static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager)
> +{
> +	/*
> +	 * prepare_to_wait() / wake_up() semantics ensure that any list
> +	 * addition that was done before wake_up() is visible when
> +	 * this code is called from the wait loop.
> +	 */
> +	if (!list_empty_careful(&sa_manager->idle_list)) {
> +		struct drm_suballoc *sa, *next;
> +		unsigned long irqflags;
> +		LIST_HEAD(list);
> +
> +		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
> +		list_splice_init(&sa_manager->idle_list, &list);
> +		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
> +
> +		list_for_each_entry_safe(sa, next, &list, idle_link)
> +			__drm_suballoc_free(sa);
> +	}
> +}
> +
> +static void
> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
> +{
> +	struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
> +
> +	__drm_suballoc_free(sa);
> +}
> +
> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
> +{
> +	struct drm_suballoc_manager *sa_manager = sa->manager;
> +	int err;
> +
> +	drm_suballoc_process_idle(sa_manager);
> +	spin_lock(&sa_manager->lock);
> +	err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
> +					 sa_manager->alignment, 0,
> +					 DRM_MM_INSERT_EVICT);
> +	spin_unlock(&sa_manager->lock);
> +	return err;
> +}
> +
> +/**
> + * drm_suballoc_new() - Make a suballocation.
> + * @sa_manager: pointer to the sa_manager
> + * @size: number of bytes we want to suballocate.
> + * @gfp: Allocation context.
> + * @intr: Whether to sleep interruptibly if sleeping.
> + *
> + * Try to make a suballocation of size @size, which will be rounded
> + * up to the alignment specified in drm_suballoc_manager_init().
> + *
> + * Returns a new suballocation, or an ERR_PTR on error.
> + */
> +struct drm_suballoc*
> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
> +		 gfp_t gfp, bool intr)
> +{
> +	struct drm_suballoc *sa;
> +	DEFINE_WAIT(wait);
> +	int err = 0;
> +
> +	if (size > sa_manager->range_size)
> +		return ERR_PTR(-ENOSPC);
> +
> +	sa = kzalloc(sizeof(*sa), gfp);
> +	if (!sa)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Avoid starvation using the alloc_mutex */
> +	if (intr)
> +		err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
> +	else
> +		mutex_lock(&sa_manager->alloc_mutex);
> +	if (err) {
> +		kfree(sa);
> +		return ERR_PTR(err);
> +	}
> +
> +	sa->manager = sa_manager;
> +	err = drm_suballoc_tryalloc(sa, size);
> +	if (err != -ENOSPC)
> +		goto out;
> +
> +	for (;;) {
> +		prepare_to_wait(&sa_manager->wq, &wait,
> +				intr ? TASK_INTERRUPTIBLE :
> +				TASK_UNINTERRUPTIBLE);
> +
> +		err = drm_suballoc_tryalloc(sa, size);
> +		if (err != -ENOSPC)
> +			break;
> +
> +		if (intr && signal_pending(current)) {
> +			err = -ERESTARTSYS;
> +			break;
> +		}
> +
> +		io_schedule();
> +	}
> +	finish_wait(&sa_manager->wq, &wait);
> +
> +out:
> +	mutex_unlock(&sa_manager->alloc_mutex);
> +	if (!sa->node.size) {
> +		kfree(sa);
> +		WARN_ON(!err);
> +		sa = ERR_PTR(err);
> +	}
> +
> +	return sa;
> +}
> +EXPORT_SYMBOL(drm_suballoc_new);
> +
> +/**
> + * drm_suballoc_free() - Free a suballocation
> + * @sa: pointer to the suballocation
> + * @fence: fence that signals when suballocation is idle
> + *
> + * Free the suballocation. The suballocation can be re-used after @fence
> + * signals.
> + */
> +void
> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
> +{
> +	if (!sa)
> +		return;
> +
> +	if (!fence || dma_fence_is_signaled(fence)) {
> +		__drm_suballoc_free(sa);
> +		return;
> +	}
> +
> +	sa->fence = dma_fence_get(fence);
> +	if (dma_fence_add_callback(fence, &sa->cb, drm_suballoc_fence_signaled))
> +		__drm_suballoc_free(sa);
> +}
> +EXPORT_SYMBOL(drm_suballoc_free);
> +
> +#ifdef CONFIG_DEBUG_FS
> +
> +/**
> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
> + * @sa_manager: The suballoc manager.
> + * @p: Pointer to a drm printer for output.
> + * @suballoc_base: Constant to add to the suballocated offsets on printout.
> + *
> + * This function dumps the suballocator state. Note that the caller has
> + * to explicitly order frees and calls to this function in order for the
> + * freed node to show up as protected by a fence.
> + */
> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +				  struct drm_printer *p, u64 suballoc_base)
> +{
> +	const struct drm_mm_node *entry;
> +
> +	spin_lock(&sa_manager->lock);
> +	drm_mm_for_each_node(entry, &sa_manager->mm) {
> +		struct drm_suballoc *sa =
> +			container_of(entry, typeof(*sa), node);
> +
> +		drm_printf(p, " ");
> +		drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
> +			   (unsigned long long)suballoc_base + entry->start,
> +			   (unsigned long long)suballoc_base + entry->start +
> +			   entry->size, (unsigned long long)entry->size);
> +
> +		if (sa->fence)
> +			drm_printf(p, " protected by 0x%016llx on context %llu",
> +				   (unsigned long long)sa->fence->seqno,
> +				   (unsigned long long)sa->fence->context);
> +
> +		drm_printf(p, "\n");
> +	}
> +	spin_unlock(&sa_manager->lock);
> +}
> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
> +#endif
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_DESCRIPTION("Simple range suballocator helper");
> +MODULE_LICENSE("GPL and additional rights");
> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
> new file mode 100644
> index 000000000000..910952b3383b
> --- /dev/null
> +++ b/include/drm/drm_suballoc.h
> @@ -0,0 +1,112 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +#ifndef _DRM_SUBALLOC_H_
> +#define _DRM_SUBALLOC_H_
> +
> +#include <drm/drm_mm.h>
> +
> +#include <linux/dma-fence.h>
> +#include <linux/types.h>
> +
> +/**
> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
> + * @mm: The range manager. Protected by @lock.
> + * @range_size: The total size of the range.
> + * @alignment: Range alignment.
> + * @wq: Wait queue for sleeping allocations on contention.
> + * @idle_list: List of idle but not yet freed allocations. Protected by
> + * @idle_list_lock.
> + */
> +struct drm_suballoc_manager {
> +	/** @lock: Manager lock. Protects @mm. */
> +	spinlock_t lock;
> +	/**
> +	 * @idle_list_lock: Lock to protect the idle_list.
> +	 * Disable irqs when locking.
> +	 */
> +	spinlock_t idle_list_lock;
> +	/** @alloc_mutex: Mutex to protect against starvation. */
> +	struct mutex alloc_mutex;
> +	struct drm_mm mm;
> +	u64 range_size;
> +	u64 alignment;
> +	wait_queue_head_t wq;
> +	struct list_head idle_list;
> +};
> +
> +/**
> + * struct drm_suballoc: Suballocated range.
> + * @node: The drm_mm representation of the range.
> + * @fence: dma-fence indicating whether allocation is active or idle.
> + * Assigned on call to free the allocation so it doesn't need protection.
> + * @cb: dma-fence callback structure. Used for callbacks when the fence signals.
> + * @manager: The struct drm_suballoc_manager the range belongs to. Immutable.
> + * @idle_link: Link for the manager idle_list. Protected by the
> + * drm_suballoc_manager::idle_list_lock.
> + */
> +struct drm_suballoc {
> +	struct drm_mm_node node;
> +	struct dma_fence *fence;
> +	struct dma_fence_cb cb;
> +	struct drm_suballoc_manager *manager;
> +	struct list_head idle_link;
> +};
> +
> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
> +			       u64 size, u64 align);
> +
> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager);
> +
> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager *sa_manager,
> +				      u64 size, gfp_t gfp, bool intr);
> +
> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence);
> +
> +/**
> + * drm_suballoc_soffset - Range start.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The start of the allocated range.
> + */
> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
> +{
> +	return sa->node.start;
> +}
> +
> +/**
> + * drm_suballoc_eoffset - Range end.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The end of the allocated range + 1.
> + */
> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
> +{
> +	return sa->node.start + sa->node.size;
> +}
> +
> +/**
> + * drm_suballoc_size - Range size.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The size of the allocated range.
> + */
> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
> +{
> +	return sa->node.size;
> +}
> +
> +#ifdef CONFIG_DEBUG_FS
> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +				  struct drm_printer *p, u64 suballoc_base);
> +#else
> +static inline void
> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +			     struct drm_printer *p, u64 suballoc_base)
> +{ }
> +
> +#endif
> +
> +#endif /* _DRM_SUBALLOC_H_ */


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-17 11:00     ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 11:00 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie

Am 16.02.23 um 15:48 schrieb Thomas Hellström:
> Initially we tried to leverage the amdgpu suballocation manager.
> It turns out, however, that it tries extremely hard not to enable
> signalling on the fences that hold the memory up for freeing, which makes
> it hard to understand and to fix potential issues with it.
>
> So in a simplification effort, introduce a drm suballocation manager as a
> wrapper around an existing allocator (drm_mm) and to avoid using queues
> for freeing, thus avoiding throttling on free which is an undesired
> feature, as typically the throttling needs to be done uninterruptibly.
>
> This variant is probably more cpu-hungry but can be improved at the cost
> of additional complexity. Ideas for that are documented in the
> drm_suballoc.c file.
>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> ---
>   drivers/gpu/drm/Kconfig        |   4 +
>   drivers/gpu/drm/Makefile       |   3 +
>   drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>   4 files changed, 420 insertions(+)
>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>   create mode 100644 include/drm/drm_suballoc.h
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index dc0f94f02a82..8fbe57407c60 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>   	help
>   	  Choose this if you need the GEM shmem helper functions
>   
> +config DRM_SUBALLOC_HELPER
> +	tristate
> +	depends on DRM
> +
>   config DRM_SCHED
>   	tristate
>   	depends on DRM
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index ab4460fcd63f..1e04d135e866 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>   
> +drm_suballoc_helper-y := drm_suballoc.o
> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
> +
>   drm_vram_helper-y := drm_gem_vram_helper.o
>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>   
> diff --git a/drivers/gpu/drm/drm_suballoc.c b/drivers/gpu/drm/drm_suballoc.c
> new file mode 100644
> index 000000000000..6e0292dea548
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_suballoc.c
> @@ -0,0 +1,301 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +
> +#include <drm/drm_suballoc.h>
> +
> +/**
> + * DOC:
> + * This suballocator intends to be a wrapper around a range allocator
> + * that is aware also of deferred range freeing with fences. Currently
> + * we hard-code the drm_mm as the range allocator.
> + * The approach, while rather simple, suffers from three performance
> + * issues that can all be fixed if needed at the tradeoff of more and / or
> + * more complex code:
> + *
> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
> + * much simpler range allocator, or let the caller decide by providing
> + * ops that wrap any range allocator. Also could avoid waking up unless
> + * there is a reasonable chance of enough space in the range manager.

That's most likely highly problematic.

The suballocator in radeon/amdgpu was designed so that it resembles a 
ring buffer and is therefore rather CPU efficient.

We could make the allocator much more trivial, but using drm_mm for this 
is a sledgehammer and therefore a pretty clear no-go.

Regards,
Christian.

> + *
> + * 2) We unnecessarily install the fence callbacks too early, forcing
> + * enable_signaling() too early causing extra driver effort. This is likely
> + * not an issue if used with the drm_scheduler since it calls
> + * enable_signaling() early anyway.
> + *
> + * 3) Long processing in irq (disabled) context. We've mostly worked around
> + * that already by using the idle_list. If that workaround is deemed too
> + * complex for little gain, we can remove it and use spin_lock_irq()
> + * throughout the manager. If we want to shorten processing in irq context
> + * even further, we can skip the spin_trylock in __drm_suballoc_free() and
> + * avoid freeing allocations from irq context altogether. However drm_mm
> + * should be quite fast at freeing ranges.
> + *
> + * 4) Shrinker that starts processing the list items in 2) and 3) to play
> + * better with the system.
> + */
> +
> +static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager);
> +
> +/**
> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
> + * @sa_manager: pointer to the sa_manager
> + * @size: number of bytes we want to suballocate
> + * @align: alignment for each suballocated chunk
> + *
> + * Prepares the suballocation manager for suballocations.
> + */
> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
> +			       u64 size, u64 align)
> +{
> +	spin_lock_init(&sa_manager->lock);
> +	spin_lock_init(&sa_manager->idle_list_lock);
> +	mutex_init(&sa_manager->alloc_mutex);
> +	drm_mm_init(&sa_manager->mm, 0, size);
> +	init_waitqueue_head(&sa_manager->wq);
> +	sa_manager->range_size = size;
> +	sa_manager->alignment = align;
> +	INIT_LIST_HEAD(&sa_manager->idle_list);
> +}
> +EXPORT_SYMBOL(drm_suballoc_manager_init);
> +
> +/**
> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
> + * @sa_manager: pointer to the sa_manager
> + *
> + * Cleans up the suballocation manager after use. All fences added
> + * with drm_suballoc_free() must be signaled, or we cannot clean up
> + * the entire manager.
> + */
> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
> +{
> +	drm_suballoc_process_idle(sa_manager);
> +	drm_mm_takedown(&sa_manager->mm);
> +	mutex_destroy(&sa_manager->alloc_mutex);
> +}
> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
> +
> +static void __drm_suballoc_free(struct drm_suballoc *sa)
> +{
> +	struct drm_suballoc_manager *sa_manager = sa->manager;
> +	struct dma_fence *fence;
> +
> +	/*
> +	 * In order to avoid protecting the potentially lengthy drm_mm manager
> +	 * *allocation* processing with an irq-disabling lock,
> +	 * defer touching the drm_mm for freeing until we're in task context,
> +	 * with no irqs disabled, or happen to succeed in taking the manager
> +	 * lock.
> +	 */
> +	if (!in_task() || irqs_disabled()) {
> +		unsigned long irqflags;
> +
> +		if (spin_trylock(&sa_manager->lock))
> +			goto locked;
> +
> +		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
> +		list_add_tail(&sa->idle_link, &sa_manager->idle_list);
> +		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
> +		wake_up(&sa_manager->wq);
> +		return;
> +	}
> +
> +	spin_lock(&sa_manager->lock);
> +locked:
> +	drm_mm_remove_node(&sa->node);
> +
> +	fence = sa->fence;
> +	sa->fence = NULL;
> +	spin_unlock(&sa_manager->lock);
> +	/* Maybe only wake if first mm hole is sufficiently large? */
> +	wake_up(&sa_manager->wq);
> +	dma_fence_put(fence);
> +	kfree(sa);
> +}
> +
> +/* Free all deferred idle allocations */
> +static void drm_suballoc_process_idle(struct drm_suballoc_manager *sa_manager)
> +{
> +	/*
> +	 * prepare_to_wait() / wake_up() semantics ensure that any list
> +	 * addition that was done before wake_up() is visible when
> +	 * this code is called from the wait loop.
> +	 */
> +	if (!list_empty_careful(&sa_manager->idle_list)) {
> +		struct drm_suballoc *sa, *next;
> +		unsigned long irqflags;
> +		LIST_HEAD(list);
> +
> +		spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
> +		list_splice_init(&sa_manager->idle_list, &list);
> +		spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
> +
> +		list_for_each_entry_safe(sa, next, &list, idle_link)
> +			__drm_suballoc_free(sa);
> +	}
> +}
> +
> +static void
> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
> +{
> +	struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
> +
> +	__drm_suballoc_free(sa);
> +}
> +
> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
> +{
> +	struct drm_suballoc_manager *sa_manager = sa->manager;
> +	int err;
> +
> +	drm_suballoc_process_idle(sa_manager);
> +	spin_lock(&sa_manager->lock);
> +	err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
> +					 sa_manager->alignment, 0,
> +					 DRM_MM_INSERT_EVICT);
> +	spin_unlock(&sa_manager->lock);
> +	return err;
> +}
> +
> +/**
> + * drm_suballoc_new() - Make a suballocation.
> + * @sa_manager: pointer to the sa_manager
> + * @size: number of bytes we want to suballocate.
> + * @gfp: Allocation context.
> + * @intr: Whether to sleep interruptibly if sleeping.
> + *
> + * Try to make a suballocation of size @size, which will be rounded
> + * up to the alignment specified in drm_suballoc_manager_init().
> + *
> + * Returns a new suballocation, or an ERR_PTR on error.
> + */
> +struct drm_suballoc*
> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
> +		 gfp_t gfp, bool intr)
> +{
> +	struct drm_suballoc *sa;
> +	DEFINE_WAIT(wait);
> +	int err = 0;
> +
> +	if (size > sa_manager->range_size)
> +		return ERR_PTR(-ENOSPC);
> +
> +	sa = kzalloc(sizeof(*sa), gfp);
> +	if (!sa)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Avoid starvation using the alloc_mutex */
> +	if (intr)
> +		err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
> +	else
> +		mutex_lock(&sa_manager->alloc_mutex);
> +	if (err) {
> +		kfree(sa);
> +		return ERR_PTR(err);
> +	}
> +
> +	sa->manager = sa_manager;
> +	err = drm_suballoc_tryalloc(sa, size);
> +	if (err != -ENOSPC)
> +		goto out;
> +
> +	for (;;) {
> +		prepare_to_wait(&sa_manager->wq, &wait,
> +				intr ? TASK_INTERRUPTIBLE :
> +				TASK_UNINTERRUPTIBLE);
> +
> +		err = drm_suballoc_tryalloc(sa, size);
> +		if (err != -ENOSPC)
> +			break;
> +
> +		if (intr && signal_pending(current)) {
> +			err = -ERESTARTSYS;
> +			break;
> +		}
> +
> +		io_schedule();
> +	}
> +	finish_wait(&sa_manager->wq, &wait);
> +
> +out:
> +	mutex_unlock(&sa_manager->alloc_mutex);
> +	if (!sa->node.size) {
> +		kfree(sa);
> +		WARN_ON(!err);
> +		sa = ERR_PTR(err);
> +	}
> +
> +	return sa;
> +}
> +EXPORT_SYMBOL(drm_suballoc_new);
> +
> +/**
> + * drm_suballoc_free() - Free a suballocation
> + * @sa: pointer to the suballocation
> + * @fence: fence that signals when suballocation is idle
> + *
> + * Free the suballocation. The suballocation can be re-used after @fence
> + * signals.
> + */
> +void
> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
> +{
> +	if (!sa)
> +		return;
> +
> +	if (!fence || dma_fence_is_signaled(fence)) {
> +		__drm_suballoc_free(sa);
> +		return;
> +	}
> +
> +	sa->fence = dma_fence_get(fence);
> +	if (dma_fence_add_callback(fence, &sa->cb, drm_suballoc_fence_signaled))
> +		__drm_suballoc_free(sa);
> +}
> +EXPORT_SYMBOL(drm_suballoc_free);
> +
> +#ifdef CONFIG_DEBUG_FS
> +
> +/**
> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
> + * @sa_manager: The suballoc manager.
> + * @p: Pointer to a drm printer for output.
> + * @suballoc_base: Constant to add to the suballocated offsets on printout.
> + *
> + * This function dumps the suballocator state. Note that the caller has
> + * to explicitly order frees and calls to this function in order for the
> + * freed node to show up as protected by a fence.
> + */
> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +				  struct drm_printer *p, u64 suballoc_base)
> +{
> +	const struct drm_mm_node *entry;
> +
> +	spin_lock(&sa_manager->lock);
> +	drm_mm_for_each_node(entry, &sa_manager->mm) {
> +		struct drm_suballoc *sa =
> +			container_of(entry, typeof(*sa), node);
> +
> +		drm_printf(p, " ");
> +		drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
> +			   (unsigned long long)suballoc_base + entry->start,
> +			   (unsigned long long)suballoc_base + entry->start +
> +			   entry->size, (unsigned long long)entry->size);
> +
> +		if (sa->fence)
> +			drm_printf(p, " protected by 0x%016llx on context %llu",
> +				   (unsigned long long)sa->fence->seqno,
> +				   (unsigned long long)sa->fence->context);
> +
> +		drm_printf(p, "\n");
> +	}
> +	spin_unlock(&sa_manager->lock);
> +}
> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
> +#endif
> +
> +MODULE_AUTHOR("Intel Corporation");
> +MODULE_DESCRIPTION("Simple range suballocator helper");
> +MODULE_LICENSE("GPL and additional rights");
> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
> new file mode 100644
> index 000000000000..910952b3383b
> --- /dev/null
> +++ b/include/drm/drm_suballoc.h
> @@ -0,0 +1,112 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2022 Intel Corporation
> + */
> +#ifndef _DRM_SUBALLOC_H_
> +#define _DRM_SUBALLOC_H_
> +
> +#include <drm/drm_mm.h>
> +
> +#include <linux/dma-fence.h>
> +#include <linux/types.h>
> +
> +/**
> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
> + * @mm: The range manager. Protected by @lock.
> + * @range_size: The total size of the range.
> + * @alignment: Range alignment.
> + * @wq: Wait queue for sleeping allocations on contention.
> + * @idle_list: List of idle but not yet freed allocations. Protected by
> + * @idle_list_lock.
> + */
> +struct drm_suballoc_manager {
> +	/** @lock: Manager lock. Protects @mm. */
> +	spinlock_t lock;
> +	/**
> +	 * @idle_list_lock: Lock to protect the idle_list.
> +	 * Disable irqs when locking.
> +	 */
> +	spinlock_t idle_list_lock;
> +	/** @alloc_mutex: Mutex to protect against starvation. */
> +	struct mutex alloc_mutex;
> +	struct drm_mm mm;
> +	u64 range_size;
> +	u64 alignment;
> +	wait_queue_head_t wq;
> +	struct list_head idle_list;
> +};
> +
> +/**
> + * struct drm_suballoc: Suballocated range.
> + * @node: The drm_mm representation of the range.
> + * @fence: dma-fence indicating whether allocation is active or idle.
> + * Assigned on call to free the allocation so it doesn't need protection.
> + * @cb: dma-fence callback structure. Used for callbacks when the fence signals.
> + * @manager: The struct drm_suballoc_manager the range belongs to. Immutable.
> + * @idle_link: Link for the manager idle_list. Protected by the
> + * drm_suballoc_manager::idle_list_lock.
> + */
> +struct drm_suballoc {
> +	struct drm_mm_node node;
> +	struct dma_fence *fence;
> +	struct dma_fence_cb cb;
> +	struct drm_suballoc_manager *manager;
> +	struct list_head idle_link;
> +};
> +
> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
> +			       u64 size, u64 align);
> +
> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager);
> +
> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager *sa_manager,
> +				      u64 size, gfp_t gfp, bool intr);
> +
> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence);
> +
> +/**
> + * drm_suballoc_soffset - Range start.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The start of the allocated range.
> + */
> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
> +{
> +	return sa->node.start;
> +}
> +
> +/**
> + * drm_suballoc_eoffset - Range end.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The end of the allocated range + 1.
> + */
> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
> +{
> +	return sa->node.start + sa->node.size;
> +}
> +
> +/**
> + * drm_suballoc_size - Range size.
> + * @sa: The struct drm_suballoc.
> + *
> + * Return: The size of the allocated range.
> + */
> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
> +{
> +	return sa->node.size;
> +}
> +
> +#ifdef CONFIG_DEBUG_FS
> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +				  struct drm_printer *p, u64 suballoc_base);
> +#else
> +static inline void
> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
> +			     struct drm_printer *p, u64 suballoc_base)
> +{ }
> +
> +#endif
> +
> +#endif /* _DRM_SUBALLOC_H_ */


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 11:00     ` [Intel-xe] " Christian König
@ 2023-02-17 11:21       ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 11:21 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie


On 2/17/23 12:00, Christian König wrote:
> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>> Initially we tried to leverage the amdgpu suballocation manager.
>> It turns out, however, that it tries extremely hard not to enable
>> signalling on the fences that hold the memory up for freeing, which 
>> makes
>> it hard to understand and to fix potential issues with it.
>>
>> So in a simplification effort, introduce a drm suballocation manager 
>> as a
>> wrapper around an existing allocator (drm_mm) and to avoid using queues
>> for freeing, thus avoiding throttling on free which is an undesired
>> feature, as typically the throttling needs to be done uninterruptibly.
>>
>> This variant is probably more cpu-hungry but can be improved at the cost
>> of additional complexity. Ideas for that are documented in the
>> drm_suballoc.c file.
>>
>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> ---
>>   drivers/gpu/drm/Kconfig        |   4 +
>>   drivers/gpu/drm/Makefile       |   3 +
>>   drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>   4 files changed, 420 insertions(+)
>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>   create mode 100644 include/drm/drm_suballoc.h
>>
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index dc0f94f02a82..8fbe57407c60 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>       help
>>         Choose this if you need the GEM shmem helper functions
>>   +config DRM_SUBALLOC_HELPER
>> +    tristate
>> +    depends on DRM
>> +
>>   config DRM_SCHED
>>       tristate
>>       depends on DRM
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index ab4460fcd63f..1e04d135e866 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>   +drm_suballoc_helper-y := drm_suballoc.o
>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>> +
>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>> b/drivers/gpu/drm/drm_suballoc.c
>> new file mode 100644
>> index 000000000000..6e0292dea548
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_suballoc.c
>> @@ -0,0 +1,301 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_suballoc.h>
>> +
>> +/**
>> + * DOC:
>> + * This suballocator intends to be a wrapper around a range allocator
>> + * that is aware also of deferred range freeing with fences. Currently
>> + * we hard-code the drm_mm as the range allocator.
>> + * The approach, while rather simple, suffers from three performance
>> + * issues that can all be fixed if needed at the tradeoff of more 
>> and / or
>> + * more complex code:
>> + *
>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
>> + * much simpler range allocator, or let the caller decide by providing
>> + * ops that wrap any range allocator. Also could avoid waking up unless
>> + * there is a reasonable chance of enough space in the range manager.
>
> That's most likely highly problematic.
>
> The suballocator in radeon/amdgpu was designed so that it resembles a 
> ring buffer and is therefore rather CPU efficient.
>
> We could make the allocator much more trivial, but using drm_mm for 
> this is a sledgehammer and therefore a pretty clear no-go.
>
I don't think the ring vs non-ring distinction is the big problem here,
because (at least with the original implementation) if allocations are
actually made and released in a ring-like fashion, the drm_mm free-list
would consist of one or two blocks and therefore be pretty efficient even
for that case, and even if it grew slightly longer that would still not be
an issue compared to the fence lists maintained in the older allocator.
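
To make the expected pattern concrete, a typical submission path with the
proposed helpers would look roughly like this (sketch only; ib_size,
job_fence and the command emission are made-up placeholders):

	sa = drm_suballoc_new(&sa_manager, ib_size, GFP_KERNEL, true);
	if (IS_ERR(sa))
		return PTR_ERR(sa);

	/* write the commands at drm_suballoc_soffset(sa) ... */

	/* reuse of the range is deferred until the job fence signals */
	drm_suballoc_free(sa, job_fence);

Allocations are then typically retired in roughly submission order, which
is what keeps the drm_mm free-list short in practice.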

The problem is more about all the other stuff that was added and built on
top of it, like the interval / rb tree.

I still like the idea (originating from Gallium's helpers) of separating
whatever is allocating from the fence-delayed free.
Any chance you could do a quick performance comparison? If not, anything 
against merging this without the amd / radeon changes until we can land 
a simpler allocator?

Thanks,
Thomas




> Regards,
> Christian.
>
>> + *
>> + * 2) We unnecessarily install the fence callbacks too early, forcing
>> + * enable_signaling() too early causing extra driver effort. This is 
>> likely
>> + * not an issue if used with the drm_scheduler since it calls
>> + * enable_signaling() early anyway.
>> + *
>> + * 3) Long processing in irq (disabled) context. We've mostly worked 
>> around
>> + * that already by using the idle_list. If that workaround is deemed too
>> + * complex for little gain, we can remove it and use spin_lock_irq()
>> + * throughout the manager. If we want to shorten processing in irq 
>> context
>> + * even further, we can skip the spin_trylock in 
>> __drm_suballoc_free() and
>> + * avoid freeing allocations from irq context altogether. However drm_mm
>> + * should be quite fast at freeing ranges.
>> + *
>> + * 4) Shrinker that starts processing the list items in 2) and 3) to 
>> play
>> + * better with the system.
>> + */
>> +
>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>> *sa_manager);
>> +
>> +/**
>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>> + * @sa_manager: pointer to the sa_manager
>> + * @size: number of bytes we want to suballocate
>> + * @align: alignment for each suballocated chunk
>> + *
>> + * Prepares the suballocation manager for suballocations.
>> + */
>> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
>> +                   u64 size, u64 align)
>> +{
>> +    spin_lock_init(&sa_manager->lock);
>> +    spin_lock_init(&sa_manager->idle_list_lock);
>> +    mutex_init(&sa_manager->alloc_mutex);
>> +    drm_mm_init(&sa_manager->mm, 0, size);
>> +    init_waitqueue_head(&sa_manager->wq);
>> +    sa_manager->range_size = size;
>> +    sa_manager->alignment = align;
>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>> +
>> +/**
>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>> + * @sa_manager: pointer to the sa_manager
>> + *
>> + * Cleans up the suballocation manager after use. All fences added
>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>> + * the entire manager.
>> + */
>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
>> +{
>> +    drm_suballoc_process_idle(sa_manager);
>> +    drm_mm_takedown(&sa_manager->mm);
>> +    mutex_destroy(&sa_manager->alloc_mutex);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>> +
>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>> +{
>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>> +    struct dma_fence *fence;
>> +
>> +    /*
>> +     * In order to avoid protecting the potentially lengthy drm_mm 
>> manager
>> +     * *allocation* processing with an irq-disabling lock,
>> +     * defer touching the drm_mm for freeing until we're in task 
>> context,
>> +     * with no irqs disabled, or happen to succeed in taking the 
>> manager
>> +     * lock.
>> +     */
>> +    if (!in_task() || irqs_disabled()) {
>> +        unsigned long irqflags;
>> +
>> +        if (spin_trylock(&sa_manager->lock))
>> +            goto locked;
>> +
>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>> +        wake_up(&sa_manager->wq);
>> +        return;
>> +    }
>> +
>> +    spin_lock(&sa_manager->lock);
>> +locked:
>> +    drm_mm_remove_node(&sa->node);
>> +
>> +    fence = sa->fence;
>> +    sa->fence = NULL;
>> +    spin_unlock(&sa_manager->lock);
>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>> +    wake_up(&sa_manager->wq);
>> +    dma_fence_put(fence);
>> +    kfree(sa);
>> +}
>> +
>> +/* Free all deferred idle allocations */
>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>> *sa_manager)
>> +{
>> +    /*
>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>> +     * addition that was done before wake_up() is visible when
>> +     * this code is called from the wait loop.
>> +     */
>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>> +        struct drm_suballoc *sa, *next;
>> +        unsigned long irqflags;
>> +        LIST_HEAD(list);
>> +
>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>> +        list_splice_init(&sa_manager->idle_list, &list);
>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>> +
>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>> +            __drm_suballoc_free(sa);
>> +    }
>> +}
>> +
>> +static void
>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>> dma_fence_cb *cb)
>> +{
>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>> +
>> +    __drm_suballoc_free(sa);
>> +}
>> +
>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>> +{
>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>> +    int err;
>> +
>> +    drm_suballoc_process_idle(sa_manager);
>> +    spin_lock(&sa_manager->lock);
>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
>> +                     sa_manager->alignment, 0,
>> +                     DRM_MM_INSERT_EVICT);
>> +    spin_unlock(&sa_manager->lock);
>> +    return err;
>> +}
>> +
>> +/**
>> + * drm_suballoc_new() - Make a suballocation.
>> + * @sa_manager: pointer to the sa_manager
>> + * @size: number of bytes we want to suballocate.
>> + * @gfp: Allocation context.
>> + * @intr: Whether to sleep interruptibly if sleeping.
>> + *
>> + * Try to make a suballocation of size @size, which will be rounded
>> + * up to the alignment specified in
>> drm_suballoc_manager_init().
>> + *
>> + * Returns a new suballocation, or an ERR_PTR on error.
>> + */
>> +struct drm_suballoc*
>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>> +         gfp_t gfp, bool intr)
>> +{
>> +    struct drm_suballoc *sa;
>> +    DEFINE_WAIT(wait);
>> +    int err = 0;
>> +
>> +    if (size > sa_manager->range_size)
>> +        return ERR_PTR(-ENOSPC);
>> +
>> +    sa = kzalloc(sizeof(*sa), gfp);
>> +    if (!sa)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    /* Avoid starvation using the alloc_mutex */
>> +    if (intr)
>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>> +    else
>> +        mutex_lock(&sa_manager->alloc_mutex);
>> +    if (err) {
>> +        kfree(sa);
>> +        return ERR_PTR(err);
>> +    }
>> +
>> +    sa->manager = sa_manager;
>> +    err = drm_suballoc_tryalloc(sa, size);
>> +    if (err != -ENOSPC)
>> +        goto out;
>> +
>> +    for (;;) {
>> +        prepare_to_wait(&sa_manager->wq, &wait,
>> +                intr ? TASK_INTERRUPTIBLE :
>> +                TASK_UNINTERRUPTIBLE);
>> +
>> +        err = drm_suballoc_tryalloc(sa, size);
>> +        if (err != -ENOSPC)
>> +            break;
>> +
>> +        if (intr && signal_pending(current)) {
>> +            err = -ERESTARTSYS;
>> +            break;
>> +        }
>> +
>> +        io_schedule();
>> +    }
>> +    finish_wait(&sa_manager->wq, &wait);
>> +
>> +out:
>> +    mutex_unlock(&sa_manager->alloc_mutex);
>> +    if (!sa->node.size) {
>> +        kfree(sa);
>> +        WARN_ON(!err);
>> +        sa = ERR_PTR(err);
>> +    }
>> +
>> +    return sa;
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_new);
>> +
>> +/**
>> + * drm_suballoc_free() - Free a suballocation
>> + * @sa: pointer to the suballocation
>> + * @fence: fence that signals when suballocation is idle
>> + *
>> + * Free the suballocation. The suballocation can be re-used after 
>> @fence
>> + * signals.
>> + */
>> +void
>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>> +{
>> +    if (!sa)
>> +        return;
>> +
>> +    if (!fence || dma_fence_is_signaled(fence)) {
>> +        __drm_suballoc_free(sa);
>> +        return;
>> +    }
>> +
>> +    sa->fence = dma_fence_get(fence);
>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>> drm_suballoc_fence_signaled))
>> +        __drm_suballoc_free(sa);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_free);
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +
>> +/**
>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>> + * @sa_manager: The suballoc manager.
>> + * @p: Pointer to a drm printer for output.
>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>> printout.
>> + *
>> + * This function dumps the suballocator state. Note that the caller has
>> + * to explicitly order frees and calls to this function in order for 
>> the
>> + * freed node to show up as protected by a fence.
>> + */
>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>> *sa_manager,
>> +                  struct drm_printer *p, u64 suballoc_base)
>> +{
>> +    const struct drm_mm_node *entry;
>> +
>> +    spin_lock(&sa_manager->lock);
>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>> +        struct drm_suballoc *sa =
>> +            container_of(entry, typeof(*sa), node);
>> +
>> +        drm_printf(p, " ");
>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>> +               (unsigned long long)suballoc_base + entry->start,
>> +               (unsigned long long)suballoc_base + entry->start +
>> +               entry->size, (unsigned long long)entry->size);
>> +
>> +        if (sa->fence)
>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>> +                   (unsigned long long)sa->fence->seqno,
>> +                   (unsigned long long)sa->fence->context);
>> +
>> +        drm_printf(p, "\n");
>> +    }
>> +    spin_unlock(&sa_manager->lock);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>> +#endif
>> +
>> +MODULE_AUTHOR("Intel Corporation");
>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>> +MODULE_LICENSE("GPL and additional rights");
>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>> new file mode 100644
>> index 000000000000..910952b3383b
>> --- /dev/null
>> +++ b/include/drm/drm_suballoc.h
>> @@ -0,0 +1,112 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +#ifndef _DRM_SUBALLOC_H_
>> +#define _DRM_SUBALLOC_H_
>> +
>> +#include <drm/drm_mm.h>
>> +
>> +#include <linux/dma-fence.h>
>> +#include <linux/types.h>
>> +
>> +/**
>> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
>> + * @mm: The range manager. Protected by @lock.
>> + * @range_size: The total size of the range.
>> + * @alignment: Range alignment.
>> + * @wq: Wait queue for sleeping allocations on contention.
>> + * @idle_list: List of idle but not yet freed allocations. Protected by
>> + * @idle_list_lock.
>> + */
>> +struct drm_suballoc_manager {
>> +    /** @lock: Manager lock. Protects @mm. */
>> +    spinlock_t lock;
>> +    /**
>> +     * @idle_list_lock: Lock to protect the idle_list.
>> +     * Disable irqs when locking.
>> +     */
>> +    spinlock_t idle_list_lock;
>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>> +    struct mutex alloc_mutex;
>> +    struct drm_mm mm;
>> +    u64 range_size;
>> +    u64 alignment;
>> +    wait_queue_head_t wq;
>> +    struct list_head idle_list;
>> +};
>> +
>> +/**
>> + * struct drm_suballoc: Suballocated range.
>> + * @node: The drm_mm representation of the range.
>> + * @fence: dma-fence indicating whether allocation is active or idle.
>> + * Assigned on call to free the allocation so it doesn't need protection.
>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>> fence signals.
>> + * @manager: The struct drm_suballoc_manager the range belongs to. 
>> Immutable.
>> + * @idle_link: Link for the manager idle_list. Protected by the
>> + * drm_suballoc_manager::idle_list_lock.
>> + */
>> +struct drm_suballoc {
>> +    struct drm_mm_node node;
>> +    struct dma_fence *fence;
>> +    struct dma_fence_cb cb;
>> +    struct drm_suballoc_manager *manager;
>> +    struct list_head idle_link;
>> +};
>> +
>> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
>> +                   u64 size, u64 align);
>> +
>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>> *sa_manager);
>> +
>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>> *sa_manager,
>> +                      u64 size, gfp_t gfp, bool intr);
>> +
>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>> *fence);
>> +
>> +/**
>> + * drm_suballoc_soffset - Range start.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The start of the allocated range.
>> + */
>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.start;
>> +}
>> +
>> +/**
>> + * drm_suballoc_eoffset - Range end.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The end of the allocated range + 1.
>> + */
>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.start + sa->node.size;
>> +}
>> +
>> +/**
>> + * drm_suballoc_size - Range size.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The size of the allocated range.
>> + */
>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.size;
>> +}
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>> *sa_manager,
>> +                  struct drm_printer *p, u64 suballoc_base);
>> +#else
>> +static inline void
>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
>> +                 struct drm_printer *p, u64 suballoc_base)
>> +{ }
>> +
>> +#endif
>> +
>> +#endif /* _DRM_SUBALLOC_H_ */
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-17 11:21       ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 11:21 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie


On 2/17/23 12:00, Christian König wrote:
> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>> Initially we tried to leverage the amdgpu suballocation manager.
>> It turns out, however, that it tries extremely hard not to enable
>> signalling on the fences that hold the memory up for freeing, which 
>> makes
>> it hard to understand and to fix potential issues with it.
>>
>> So in a simplification effort, introduce a drm suballocation manager 
>> as a
>> wrapper around an existing allocator (drm_mm) and to avoid using queues
>> for freeing, thus avoiding throttling on free which is an undesired
>> feature, as typically the throttling needs to be done uninterruptibly.
>>
>> This variant is probably more cpu-hungry but can be improved at the cost
>> of additional complexity. Ideas for that are documented in the
>> drm_suballoc.c file.
>>
>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> ---
>>   drivers/gpu/drm/Kconfig        |   4 +
>>   drivers/gpu/drm/Makefile       |   3 +
>>   drivers/gpu/drm/drm_suballoc.c | 301 +++++++++++++++++++++++++++++++++
>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>   4 files changed, 420 insertions(+)
>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>   create mode 100644 include/drm/drm_suballoc.h
>>
>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>> index dc0f94f02a82..8fbe57407c60 100644
>> --- a/drivers/gpu/drm/Kconfig
>> +++ b/drivers/gpu/drm/Kconfig
>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>       help
>>         Choose this if you need the GEM shmem helper functions
>>   +config DRM_SUBALLOC_HELPER
>> +    tristate
>> +    depends on DRM
>> +
>>   config DRM_SCHED
>>       tristate
>>       depends on DRM
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index ab4460fcd63f..1e04d135e866 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>   +drm_suballoc_helper-y := drm_suballoc.o
>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>> +
>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>> b/drivers/gpu/drm/drm_suballoc.c
>> new file mode 100644
>> index 000000000000..6e0292dea548
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_suballoc.c
>> @@ -0,0 +1,301 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_suballoc.h>
>> +
>> +/**
>> + * DOC:
>> + * This suballocator intends to be a wrapper around a range allocator
>> + * that is aware also of deferred range freeing with fences. Currently
>> + * we hard-code the drm_mm as the range allocator.
>> + * The approach, while rather simple, suffers from three performance
>> + * issues that can all be fixed if needed at the tradeoff of more 
>> and / or
>> + * more complex code:
>> + *
>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
>> + * much simpler range allocator, or let the caller decide by providing
>> + * ops that wrap any range allocator. Also could avoid waking up unless
>> + * there is a reasonable chance of enough space in the range manager.
>
> That's most likely highly problematic.
>
> The suballocator in radeon/amdgpu was designed so that it resembles a 
> ring buffer and is therefore rather CPU efficient.
>
> We could make the allocator much more trivial, but using drm_mm for 
> this is a sledgehammer and therefore a pretty clear no-go.
>
I don't think the ring vs non-ring distinction is the big problem here,
because (at least with the original implementation) if allocations are
actually made and released in a ring-like fashion, the drm_mm free-list
would consist of one or two blocks and therefore be pretty efficient even
for that case, and even if it grew slightly longer that would still not be
an issue compared to the fence lists maintained in the older allocator.

The problem is more about all the other stuff that was added and built on
top of it, like the interval / rb tree.

I still like the idea (originating from Gallium's helpers) of separating
whatever is allocating from the fence-delayed free.
Any chance you could do a quick performance comparison? If not, anything 
against merging this without the amd / radeon changes until we can land 
a simpler allocator?

Thanks,
Thomas




> Regards,
> Christian.
>
>> + *
>> + * 2) We unnecessarily install the fence callbacks too early, forcing
>> + * enable_signaling() too early causing extra driver effort. This is 
>> likely
>> + * not an issue if used with the drm_scheduler since it calls
>> + * enable_signaling() early anyway.
>> + *
>> + * 3) Long processing in irq (disabled) context. We've mostly worked 
>> around
>> + * that already by using the idle_list. If that workaround is deemed too
>> + * complex for little gain, we can remove it and use spin_lock_irq()
>> + * throughout the manager. If we want to shorten processing in irq 
>> context
>> + * even further, we can skip the spin_trylock in 
>> __drm_suballoc_free() and
>> + * avoid freeing allocations from irq context altogether. However drm_mm
>> + * should be quite fast at freeing ranges.
>> + *
>> + * 4) Shrinker that starts processing the list items in 2) and 3) to 
>> play
>> + * better with the system.
>> + */
>> +
>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>> *sa_manager);
>> +
>> +/**
>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>> + * @sa_manager: pointer to the sa_manager
>> + * @size: number of bytes we want to suballocate
>> + * @align: alignment for each suballocated chunk
>> + *
>> + * Prepares the suballocation manager for suballocations.
>> + */
>> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
>> +                   u64 size, u64 align)
>> +{
>> +    spin_lock_init(&sa_manager->lock);
>> +    spin_lock_init(&sa_manager->idle_list_lock);
>> +    mutex_init(&sa_manager->alloc_mutex);
>> +    drm_mm_init(&sa_manager->mm, 0, size);
>> +    init_waitqueue_head(&sa_manager->wq);
>> +    sa_manager->range_size = size;
>> +    sa_manager->alignment = align;
>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>> +
>> +/**
>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>> + * @sa_manager: pointer to the sa_manager
>> + *
>> + * Cleans up the suballocation manager after use. All fences added
>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>> + * the entire manager.
>> + */
>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager *sa_manager)
>> +{
>> +    drm_suballoc_process_idle(sa_manager);
>> +    drm_mm_takedown(&sa_manager->mm);
>> +    mutex_destroy(&sa_manager->alloc_mutex);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>> +
>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>> +{
>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>> +    struct dma_fence *fence;
>> +
>> +    /*
>> +     * In order to avoid protecting the potentially lengthy drm_mm 
>> manager
>> +     * *allocation* processing with an irq-disabling lock,
>> +     * defer touching the drm_mm for freeing until we're in task 
>> context,
>> +     * with no irqs disabled, or happen to succeed in taking the 
>> manager
>> +     * lock.
>> +     */
>> +    if (!in_task() || irqs_disabled()) {
>> +        unsigned long irqflags;
>> +
>> +        if (spin_trylock(&sa_manager->lock))
>> +            goto locked;
>> +
>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>> +        wake_up(&sa_manager->wq);
>> +        return;
>> +    }
>> +
>> +    spin_lock(&sa_manager->lock);
>> +locked:
>> +    drm_mm_remove_node(&sa->node);
>> +
>> +    fence = sa->fence;
>> +    sa->fence = NULL;
>> +    spin_unlock(&sa_manager->lock);
>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>> +    wake_up(&sa_manager->wq);
>> +    dma_fence_put(fence);
>> +    kfree(sa);
>> +}
>> +
>> +/* Free all deferred idle allocations */
>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>> *sa_manager)
>> +{
>> +    /*
>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>> +     * addition that was done before wake_up() is visible when
>> +     * this code is called from the wait loop.
>> +     */
>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>> +        struct drm_suballoc *sa, *next;
>> +        unsigned long irqflags;
>> +        LIST_HEAD(list);
>> +
>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>> +        list_splice_init(&sa_manager->idle_list, &list);
>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>> +
>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>> +            __drm_suballoc_free(sa);
>> +    }
>> +}
>> +
>> +static void
>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>> dma_fence_cb *cb)
>> +{
>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>> +
>> +    __drm_suballoc_free(sa);
>> +}
>> +
>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>> +{
>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>> +    int err;
>> +
>> +    drm_suballoc_process_idle(sa_manager);
>> +    spin_lock(&sa_manager->lock);
>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
>> +                     sa_manager->alignment, 0,
>> +                     DRM_MM_INSERT_EVICT);
>> +    spin_unlock(&sa_manager->lock);
>> +    return err;
>> +}
>> +
>> +/**
>> + * drm_suballoc_new() - Make a suballocation.
>> + * @sa_manager: pointer to the sa_manager
>> + * @size: number of bytes we want to suballocate.
>> + * @gfp: Allocation context.
>> + * @intr: Whether to sleep interruptibly if sleeping.
>> + *
>> + * Try to make a suballocation of size @size, which will be rounded
>> + * up to the alignment specified in 
>> drm_suballoc_manager_init().
>> + *
>> + * Returns a new suballocation, or an ERR_PTR on error.
>> + */
>> +struct drm_suballoc*
>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>> +         gfp_t gfp, bool intr)
>> +{
>> +    struct drm_suballoc *sa;
>> +    DEFINE_WAIT(wait);
>> +    int err = 0;
>> +
>> +    if (size > sa_manager->range_size)
>> +        return ERR_PTR(-ENOSPC);
>> +
>> +    sa = kzalloc(sizeof(*sa), gfp);
>> +    if (!sa)
>> +        return ERR_PTR(-ENOMEM);
>> +
>> +    /* Avoid starvation using the alloc_mutex */
>> +    if (intr)
>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>> +    else
>> +        mutex_lock(&sa_manager->alloc_mutex);
>> +    if (err) {
>> +        kfree(sa);
>> +        return ERR_PTR(err);
>> +    }
>> +
>> +    sa->manager = sa_manager;
>> +    err = drm_suballoc_tryalloc(sa, size);
>> +    if (err != -ENOSPC)
>> +        goto out;
>> +
>> +    for (;;) {
>> +        prepare_to_wait(&sa_manager->wq, &wait,
>> +                intr ? TASK_INTERRUPTIBLE :
>> +                TASK_UNINTERRUPTIBLE);
>> +
>> +        err = drm_suballoc_tryalloc(sa, size);
>> +        if (err != -ENOSPC)
>> +            break;
>> +
>> +        if (intr && signal_pending(current)) {
>> +            err = -ERESTARTSYS;
>> +            break;
>> +        }
>> +
>> +        io_schedule();
>> +    }
>> +    finish_wait(&sa_manager->wq, &wait);
>> +
>> +out:
>> +    mutex_unlock(&sa_manager->alloc_mutex);
>> +    if (!sa->node.size) {
>> +        kfree(sa);
>> +        WARN_ON(!err);
>> +        sa = ERR_PTR(err);
>> +    }
>> +
>> +    return sa;
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_new);
>> +
>> +/**
>> + * drm_suballoc_free() - Free a suballocation
>> + * @sa: pointer to the suballocation
>> + * @fence: fence that signals when suballocation is idle
>> + *
>> + * Free the suballocation. The suballocation can be re-used after 
>> @fence
>> + * signals.
>> + */
>> +void
>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>> +{
>> +    if (!sa)
>> +        return;
>> +
>> +    if (!fence || dma_fence_is_signaled(fence)) {
>> +        __drm_suballoc_free(sa);
>> +        return;
>> +    }
>> +
>> +    sa->fence = dma_fence_get(fence);
>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>> drm_suballoc_fence_signaled))
>> +        __drm_suballoc_free(sa);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_free);
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +
>> +/**
>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>> + * @sa_manager: The suballoc manager.
>> + * @p: Pointer to a drm printer for output.
>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>> printout.
>> + *
>> + * This function dumps the suballocator state. Note that the caller has
>> + * to explicitly order frees and calls to this function in order for 
>> the
>> + * freed node to show up as protected by a fence.
>> + */
>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>> *sa_manager,
>> +                  struct drm_printer *p, u64 suballoc_base)
>> +{
>> +    const struct drm_mm_node *entry;
>> +
>> +    spin_lock(&sa_manager->lock);
>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>> +        struct drm_suballoc *sa =
>> +            container_of(entry, typeof(*sa), node);
>> +
>> +        drm_printf(p, " ");
>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>> +               (unsigned long long)suballoc_base + entry->start,
>> +               (unsigned long long)suballoc_base + entry->start +
>> +               entry->size, (unsigned long long)entry->size);
>> +
>> +        if (sa->fence)
>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>> +                   (unsigned long long)sa->fence->seqno,
>> +                   (unsigned long long)sa->fence->context);
>> +
>> +        drm_printf(p, "\n");
>> +    }
>> +    spin_unlock(&sa_manager->lock);
>> +}
>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>> +#endif
>> +
>> +MODULE_AUTHOR("Intel Corporation");
>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>> +MODULE_LICENSE("GPL and additional rights");
>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>> new file mode 100644
>> index 000000000000..910952b3383b
>> --- /dev/null
>> +++ b/include/drm/drm_suballoc.h
>> @@ -0,0 +1,112 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2022 Intel Corporation
>> + */
>> +#ifndef _DRM_SUBALLOC_H_
>> +#define _DRM_SUBALLOC_H_
>> +
>> +#include <drm/drm_mm.h>
>> +
>> +#include <linux/dma-fence.h>
>> +#include <linux/types.h>
>> +
>> +/**
>> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
>> + * @mm: The range manager. Protected by @lock.
>> + * @range_size: The total size of the range.
>> + * @alignment: Range alignment.
>> + * @wq: Wait queue for sleeping allocations on contention.
>> + * @idle_list: List of idle but not yet freed allocations. Protected by
>> + * @idle_list_lock.
>> + */
>> +struct drm_suballoc_manager {
>> +    /** @lock: Manager lock. Protects @mm. */
>> +    spinlock_t lock;
>> +    /**
>> +     * @idle_list_lock: Lock to protect the idle_list.
>> +     * Disable irqs when locking.
>> +     */
>> +    spinlock_t idle_list_lock;
>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>> +    struct mutex alloc_mutex;
>> +    struct drm_mm mm;
>> +    u64 range_size;
>> +    u64 alignment;
>> +    wait_queue_head_t wq;
>> +    struct list_head idle_list;
>> +};
>> +
>> +/**
>> + * struct drm_suballoc: Suballocated range.
>> + * @node: The drm_mm representation of the range.
>> + * @fence: dma-fence indicating whether allocation is active or idle.
>> + * Assigned on call to free the allocation so doesn't need protection.
>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>> fence signals.
>> + * @manager: The struct drm_suballoc_manager the range belongs to. 
>> Immutable.
>> + * @idle_link: Link for the manager idle_list. Protected by the
>> + * drm_suballoc_manager::idle_list_lock.
>> + */
>> +struct drm_suballoc {
>> +    struct drm_mm_node node;
>> +    struct dma_fence *fence;
>> +    struct dma_fence_cb cb;
>> +    struct drm_suballoc_manager *manager;
>> +    struct list_head idle_link;
>> +};
>> +
>> +void drm_suballoc_manager_init(struct drm_suballoc_manager *sa_manager,
>> +                   u64 size, u64 align);
>> +
>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>> *sa_manager);
>> +
>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>> *sa_manager,
>> +                      u64 size, gfp_t gfp, bool intr);
>> +
>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>> *fence);
>> +
>> +/**
>> + * drm_suballoc_soffset - Range start.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The start of the allocated range.
>> + */
>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.start;
>> +}
>> +
>> +/**
>> + * drm_suballoc_eoffset - Range end.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The end of the allocated range + 1.
>> + */
>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.start + sa->node.size;
>> +}
>> +
>> +/**
>> + * drm_suballoc_size - Range size.
>> + * @sa: The struct drm_suballoc.
>> + *
>> + * Return: The size of the allocated range.
>> + */
>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>> +{
>> +    return sa->node.size;
>> +}
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>> *sa_manager,
>> +                  struct drm_printer *p, u64 suballoc_base);
>> +#else
>> +static inline void
>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
>> +                 struct drm_printer *p, u64 suballoc_base)
>> +{ }
>> +
>> +#endif
>> +
>> +#endif /* _DRM_SUBALLOC_H_ */
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 11:21       ` [Intel-xe] " Thomas Hellström
@ 2023-02-17 11:28         ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 11:28 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Am 17.02.23 um 12:21 schrieb Thomas Hellström:
>
> On 2/17/23 12:00, Christian König wrote:
>> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>>> Initially we tried to leverage the amdgpu suballocation manager.
>>> It turns out, however, that it tries extremely hard not to enable
>>> signalling on the fences that hold the memory up for freeing, which 
>>> makes
>>> it hard to understand and to fix potential issues with it.
>>>
>>> So in a simplification effort, introduce a drm suballocation manager 
>>> as a
>>> wrapper around an existing allocator (drm_mm) and to avoid using queues
>>> for freeing, thus avoiding throttling on free which is an undesired
>>> feature as typically the throttling needs to be done uninterruptibly.
>>>
>>> This variant is probably more cpu-hungry but can be improved at the 
>>> cost
>>> of additional complexity. Ideas for that are documented in the
>>> drm_suballoc.c file.
>>>
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>   drivers/gpu/drm/Makefile       |   3 +
>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>> +++++++++++++++++++++++++++++++++
>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>   4 files changed, 420 insertions(+)
>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>   create mode 100644 include/drm/drm_suballoc.h
>>>
>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>> index dc0f94f02a82..8fbe57407c60 100644
>>> --- a/drivers/gpu/drm/Kconfig
>>> +++ b/drivers/gpu/drm/Kconfig
>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>       help
>>>         Choose this if you need the GEM shmem helper functions
>>>   +config DRM_SUBALLOC_HELPER
>>> +    tristate
>>> +    depends on DRM
>>> +
>>>   config DRM_SCHED
>>>       tristate
>>>       depends on DRM
>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>> index ab4460fcd63f..1e04d135e866 100644
>>> --- a/drivers/gpu/drm/Makefile
>>> +++ b/drivers/gpu/drm/Makefile
>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>> +
>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>> b/drivers/gpu/drm/drm_suballoc.c
>>> new file mode 100644
>>> index 000000000000..6e0292dea548
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>> @@ -0,0 +1,301 @@
>>> +// SPDX-License-Identifier: MIT
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +
>>> +#include <drm/drm_suballoc.h>
>>> +
>>> +/**
>>> + * DOC:
>>> + * This suballocator intends to be a wrapper around a range allocator
>>> + * that is aware also of deferred range freeing with fences. Currently
>>> + * we hard-code the drm_mm as the range allocator.
>>> + * The approach, while rather simple, suffers from three performance
>>> + * issues that can all be fixed if needed at the tradeoff of more 
>>> and / or
>>> + * more complex code:
>>> + *
>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either code a
>>> + * much simpler range allocator, or let the caller decide by providing
>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>> unless
>>> + * there is a reasonable chance of enough space in the range manager.
>>
>> That's most likely highly problematic.
>>
>> The suballocator in radeon/amdgpu was designed so that it resembles a 
>> ring buffer and is therefore rather CPU efficient.
>>
>> We could make the allocator much more trivial, but using drm_mm for 
>> this is a sledgehammer and therefore a pretty clear no-go.
>>
> I don't think the ring vs non-ring is the big problem here, because 
> (at least with the original implementation), if allocations are 
> actually made and released in a ring-like fashion, the drm_mm 
> free-list would consist of one or two blocks and therefore be pretty 
> efficient even for that case, and if slightly longer that would still 
> not be an issue compared to the fence lists maintained in the older 
> allocator.
>
> The problem is more all the other stuff that was added and built on 
> top like the interval / rb tree.
>
> I still like the idea (originating from Gallium's helpers) to separate 
> whatever is allocating from the fence delayed free.

That's actually a bad idea. See, the ring-like approach works because the 
fences used in amdgpu/radeon are used in a ring-like fashion. E.g. the 
sub allocator mainly provides the temporary space for page table 
updates. Those in turn are then used by commands written into a ring buffer.
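
To make that concrete, the path looks roughly like the sketch below —
hypothetical code only, not the actual amdgpu functions; struct my_ring,
sa_cpu_addr() and emit_ib() are stand-ins, and the proposed drm_suballoc
helper is used in place of the existing amdgpu_sa_bo API:

static int update_ptes(struct drm_suballoc_manager *sa_manager,
                       struct my_ring *ring,
                       const void *cmds, size_t nbytes)
{
    struct drm_suballoc *sa;
    struct dma_fence *ring_fence;

    /* Grab a temporary window for this batch of page-table updates. */
    sa = drm_suballoc_new(sa_manager, nbytes, GFP_KERNEL, false);
    if (IS_ERR(sa))
        return PTR_ERR(sa);

    /* CPU-copy the update commands into the suballocated window ... */
    memcpy(sa_cpu_addr(sa), cmds, nbytes);

    /* ... emit an IB on the ring that consumes them ... */
    ring_fence = emit_ib(ring, drm_suballoc_soffset(sa), nbytes);

    /* ... and hand the space back once the ring has passed that point. */
    drm_suballoc_free(sa, ring_fence);
    dma_fence_put(ring_fence);

    return 0;
}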

>
> Any chance you could do a quick performance comparison? If not, 
> anything against merging this without the amd / radeon changes until 
> we can land a simpler allocator?

Only if you can stick the allocator inside Xe and not drm, because this 
seems to be for a different use case than the allocators inside 
radeon/amdgpu.

Regards,
Christian.

>
> Thanks,
> Thomas
>
>
> Thomas
>
>
>> Regards,
>> Christian.
>>
>>> + *
>>> + * 2) We unnecessarily install the fence callbacks too early, forcing
>>> + * enable_signaling() too early causing extra driver effort. This 
>>> is likely
>>> + * not an issue if used with the drm_scheduler since it calls
>>> + * enable_signaling() early anyway.
>>> + *
>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>> worked around
>>> + * that already by using the idle_list. If that workaround is 
>>> deemed too
>>> + * complex for little gain, we can remove it and use spin_lock_irq()
>>> + * throughout the manager. If we want to shorten processing in irq 
>>> context
>>> + * even further, we can skip the spin_trylock in 
>>> __drm_suballoc_free() and
>>> + * avoid freeing allocations from irq context altogether. However 
>>> drm_mm
>>> + * should be quite fast at freeing ranges.
>>> + *
>>> + * 4) Shrinker that starts processing the list items in 2) and 3) 
>>> to play
>>> + * better with the system.
>>> + */
>>> +
>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>> *sa_manager);
>>> +
>>> +/**
>>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>>> + * @sa_manager: pointer to the sa_manager
>>> + * @size: number of bytes we want to suballocate
>>> + * @align: alignment for each suballocated chunk
>>> + *
>>> + * Prepares the suballocation manager for suballocations.
>>> + */
>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>> *sa_manager,
>>> +                   u64 size, u64 align)
>>> +{
>>> +    spin_lock_init(&sa_manager->lock);
>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>> +    mutex_init(&sa_manager->alloc_mutex);
>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>> +    init_waitqueue_head(&sa_manager->wq);
>>> +    sa_manager->range_size = size;
>>> +    sa_manager->alignment = align;
>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>> +}
>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>> +
>>> +/**
>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>> + * @sa_manager: pointer to the sa_manager
>>> + *
>>> + * Cleans up the suballocation manager after use. All fences added
>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>> + * the entire manager.
>>> + */
>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>> *sa_manager)
>>> +{
>>> +    drm_suballoc_process_idle(sa_manager);
>>> +    drm_mm_takedown(&sa_manager->mm);
>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>> +}
>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>> +
>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>> +{
>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>> +    struct dma_fence *fence;
>>> +
>>> +    /*
>>> +     * In order to avoid protecting the potentially lengthy drm_mm 
>>> manager
>>> +     * *allocation* processing with an irq-disabling lock,
>>> +     * defer touching the drm_mm for freeing until we're in task 
>>> context,
>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>> manager
>>> +     * lock.
>>> +     */
>>> +    if (!in_task() || irqs_disabled()) {
>>> +        unsigned long irqflags;
>>> +
>>> +        if (spin_trylock(&sa_manager->lock))
>>> +            goto locked;
>>> +
>>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>> +        wake_up(&sa_manager->wq);
>>> +        return;
>>> +    }
>>> +
>>> +    spin_lock(&sa_manager->lock);
>>> +locked:
>>> +    drm_mm_remove_node(&sa->node);
>>> +
>>> +    fence = sa->fence;
>>> +    sa->fence = NULL;
>>> +    spin_unlock(&sa_manager->lock);
>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>> +    wake_up(&sa_manager->wq);
>>> +    dma_fence_put(fence);
>>> +    kfree(sa);
>>> +}
>>> +
>>> +/* Free all deferred idle allocations */
>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>> *sa_manager)
>>> +{
>>> +    /*
>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>> +     * addition that was done before wake_up() is visible when
>>> +     * this code is called from the wait loop.
>>> +     */
>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>> +        struct drm_suballoc *sa, *next;
>>> +        unsigned long irqflags;
>>> +        LIST_HEAD(list);
>>> +
>>> +        spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>> +
>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>> +            __drm_suballoc_free(sa);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>> dma_fence_cb *cb)
>>> +{
>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>> +
>>> +    __drm_suballoc_free(sa);
>>> +}
>>> +
>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>> +{
>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>> +    int err;
>>> +
>>> +    drm_suballoc_process_idle(sa_manager);
>>> +    spin_lock(&sa_manager->lock);
>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, size,
>>> +                     sa_manager->alignment, 0,
>>> +                     DRM_MM_INSERT_EVICT);
>>> +    spin_unlock(&sa_manager->lock);
>>> +    return err;
>>> +}
>>> +
>>> +/**
>>> + * drm_suballoc_new() - Make a suballocation.
>>> + * @sa_manager: pointer to the sa_manager
>>> + * @size: number of bytes we want to suballocate.
>>> + * @gfp: Allocation context.
>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>> + *
>>> + * Try to make a suballocation of size @size, which will be rounded
>>> + * up to the alignment specified in 
>>> drm_suballoc_manager_init().
>>> + *
>>> + * Returns a new suballocation, or an ERR_PTR on error.
>>> + */
>>> +struct drm_suballoc*
>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>> +         gfp_t gfp, bool intr)
>>> +{
>>> +    struct drm_suballoc *sa;
>>> +    DEFINE_WAIT(wait);
>>> +    int err = 0;
>>> +
>>> +    if (size > sa_manager->range_size)
>>> +        return ERR_PTR(-ENOSPC);
>>> +
>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>> +    if (!sa)
>>> +        return ERR_PTR(-ENOMEM);
>>> +
>>> +    /* Avoid starvation using the alloc_mutex */
>>> +    if (intr)
>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>> +    else
>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>> +    if (err) {
>>> +        kfree(sa);
>>> +        return ERR_PTR(err);
>>> +    }
>>> +
>>> +    sa->manager = sa_manager;
>>> +    err = drm_suballoc_tryalloc(sa, size);
>>> +    if (err != -ENOSPC)
>>> +        goto out;
>>> +
>>> +    for (;;) {
>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>> +                intr ? TASK_INTERRUPTIBLE :
>>> +                TASK_UNINTERRUPTIBLE);
>>> +
>>> +        err = drm_suballoc_tryalloc(sa, size);
>>> +        if (err != -ENOSPC)
>>> +            break;
>>> +
>>> +        if (intr && signal_pending(current)) {
>>> +            err = -ERESTARTSYS;
>>> +            break;
>>> +        }
>>> +
>>> +        io_schedule();
>>> +    }
>>> +    finish_wait(&sa_manager->wq, &wait);
>>> +
>>> +out:
>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>> +    if (!sa->node.size) {
>>> +        kfree(sa);
>>> +        WARN_ON(!err);
>>> +        sa = ERR_PTR(err);
>>> +    }
>>> +
>>> +    return sa;
>>> +}
>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>> +
>>> +/**
>>> + * drm_suballoc_free() - Free a suballocation
>>> + * @sa: pointer to the suballocation
>>> + * @fence: fence that signals when suballocation is idle
>>> + *
>>> + * Free the suballocation. The suballocation can be re-used after 
>>> @fence
>>> + * signals.
>>> + */
>>> +void
>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>> +{
>>> +    if (!sa)
>>> +        return;
>>> +
>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>> +        __drm_suballoc_free(sa);
>>> +        return;
>>> +    }
>>> +
>>> +    sa->fence = dma_fence_get(fence);
>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>> drm_suballoc_fence_signaled))
>>> +        __drm_suballoc_free(sa);
>>> +}
>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>> +
>>> +#ifdef CONFIG_DEBUG_FS
>>> +
>>> +/**
>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>> + * @sa_manager: The suballoc manager.
>>> + * @p: Pointer to a drm printer for output.
>>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>>> printout.
>>> + *
>>> + * This function dumps the suballocator state. Note that the caller 
>>> has
>>> + * to explicitly order frees and calls to this function in order 
>>> for the
>>> + * freed node to show up as protected by a fence.
>>> + */
>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>> *sa_manager,
>>> +                  struct drm_printer *p, u64 suballoc_base)
>>> +{
>>> +    const struct drm_mm_node *entry;
>>> +
>>> +    spin_lock(&sa_manager->lock);
>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>> +        struct drm_suballoc *sa =
>>> +            container_of(entry, typeof(*sa), node);
>>> +
>>> +        drm_printf(p, " ");
>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>> +               (unsigned long long)suballoc_base + entry->start,
>>> +               (unsigned long long)suballoc_base + entry->start +
>>> +               entry->size, (unsigned long long)entry->size);
>>> +
>>> +        if (sa->fence)
>>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>>> +                   (unsigned long long)sa->fence->seqno,
>>> +                   (unsigned long long)sa->fence->context);
>>> +
>>> +        drm_printf(p, "\n");
>>> +    }
>>> +    spin_unlock(&sa_manager->lock);
>>> +}
>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>> +#endif
>>> +
>>> +MODULE_AUTHOR("Intel Corporation");
>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>> +MODULE_LICENSE("GPL and additional rights");
>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>> new file mode 100644
>>> index 000000000000..910952b3383b
>>> --- /dev/null
>>> +++ b/include/drm/drm_suballoc.h
>>> @@ -0,0 +1,112 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2022 Intel Corporation
>>> + */
>>> +#ifndef _DRM_SUBALLOC_H_
>>> +#define _DRM_SUBALLOC_H_
>>> +
>>> +#include <drm/drm_mm.h>
>>> +
>>> +#include <linux/dma-fence.h>
>>> +#include <linux/types.h>
>>> +
>>> +/**
>>> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
>>> + * @mm: The range manager. Protected by @lock.
>>> + * @range_size: The total size of the range.
>>> + * @alignment: Range alignment.
>>> + * @wq: Wait queue for sleeping allocations on contention.
>>> + * @idle_list: List of idle but not yet freed allocations. 
>>> Protected by
>>> + * @idle_list_lock.
>>> + */
>>> +struct drm_suballoc_manager {
>>> +    /** @lock: Manager lock. Protects @mm. */
>>> +    spinlock_t lock;
>>> +    /**
>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>> +     * Disable irqs when locking.
>>> +     */
>>> +    spinlock_t idle_list_lock;
>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>> +    struct mutex alloc_mutex;
>>> +    struct drm_mm mm;
>>> +    u64 range_size;
>>> +    u64 alignment;
>>> +    wait_queue_head_t wq;
>>> +    struct list_head idle_list;
>>> +};
>>> +
>>> +/**
>>> + * struct drm_suballoc: Suballocated range.
>>> + * @node: The drm_mm representation of the range.
>>> + * @fence: dma-fence indicating whether allocation is active or idle.
>>> + * Assigned on call to free the allocation so doesn't need protection.
>>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>>> fence signals.
>>> + * @manager: The struct drm_suballoc_manager the range belongs to. 
>>> Immutable.
>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>> + * drm_suballoc_manager::idle_list_lock.
>>> + */
>>> +struct drm_suballoc {
>>> +    struct drm_mm_node node;
>>> +    struct dma_fence *fence;
>>> +    struct dma_fence_cb cb;
>>> +    struct drm_suballoc_manager *manager;
>>> +    struct list_head idle_link;
>>> +};
>>> +
>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>> *sa_manager,
>>> +                   u64 size, u64 align);
>>> +
>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>> *sa_manager);
>>> +
>>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>>> *sa_manager,
>>> +                      u64 size, gfp_t gfp, bool intr);
>>> +
>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>> *fence);
>>> +
>>> +/**
>>> + * drm_suballoc_soffset - Range start.
>>> + * @sa: The struct drm_suballoc.
>>> + *
>>> + * Return: The start of the allocated range.
>>> + */
>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>> +{
>>> +    return sa->node.start;
>>> +}
>>> +
>>> +/**
>>> + * drm_suballoc_eoffset - Range end.
>>> + * @sa: The struct drm_suballoc.
>>> + *
>>> + * Return: The end of the allocated range + 1.
>>> + */
>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>> +{
>>> +    return sa->node.start + sa->node.size;
>>> +}
>>> +
>>> +/**
>>> + * drm_suballoc_size - Range size.
>>> + * @sa: The struct drm_suballoc.
>>> + *
>>> + * Return: The size of the allocated range.
>>> + */
>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>> +{
>>> +    return sa->node.size;
>>> +}
>>> +
>>> +#ifdef CONFIG_DEBUG_FS
>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>> *sa_manager,
>>> +                  struct drm_printer *p, u64 suballoc_base);
>>> +#else
>>> +static inline void
>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
>>> +                 struct drm_printer *p, u64 suballoc_base)
>>> +{ }
>>> +
>>> +#endif
>>> +
>>> +#endif /* _DRM_SUBALLOC_H_ */
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 11:28         ` [Intel-xe] " Christian König
@ 2023-02-17 12:24           ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 12:24 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie


On 2/17/23 12:28, Christian König wrote:
> Am 17.02.23 um 12:21 schrieb Thomas Hellström:
>>
>> On 2/17/23 12:00, Christian König wrote:
>>> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>> It turns out, however, that it tries extremely hard not to enable
>>>> signalling on the fences that hold the memory up for freeing, which 
>>>> makes
>>>> it hard to understand and to fix potential issues with it.
>>>>
>>>> So in a simplification effort, introduce a drm suballocation 
>>>> manager as a
>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>> queues
>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>> feature as typically the throttling needs to be done uninterruptibly.
>>>>
>>>> This variant is probably more cpu-hungry but can be improved at the 
>>>> cost
>>>> of additional complexity. Ideas for that are documented in the
>>>> drm_suballoc.c file.
>>>>
>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> ---
>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>> +++++++++++++++++++++++++++++++++
>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>   4 files changed, 420 insertions(+)
>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>
>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>> --- a/drivers/gpu/drm/Kconfig
>>>> +++ b/drivers/gpu/drm/Kconfig
>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>       help
>>>>         Choose this if you need the GEM shmem helper functions
>>>>   +config DRM_SUBALLOC_HELPER
>>>> +    tristate
>>>> +    depends on DRM
>>>> +
>>>>   config DRM_SCHED
>>>>       tristate
>>>>       depends on DRM
>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>> index ab4460fcd63f..1e04d135e866 100644
>>>> --- a/drivers/gpu/drm/Makefile
>>>> +++ b/drivers/gpu/drm/Makefile
>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>> +
>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>> new file mode 100644
>>>> index 000000000000..6e0292dea548
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>> @@ -0,0 +1,301 @@
>>>> +// SPDX-License-Identifier: MIT
>>>> +/*
>>>> + * Copyright © 2022 Intel Corporation
>>>> + */
>>>> +
>>>> +#include <drm/drm_suballoc.h>
>>>> +
>>>> +/**
>>>> + * DOC:
>>>> + * This suballocator intends to be a wrapper around a range allocator
>>>> + * that is aware also of deferred range freeing with fences. 
>>>> Currently
>>>> + * we hard-code the drm_mm as the range allocator.
>>>> + * The approach, while rather simple, suffers from three performance
>>>> + * issues that can all be fixed if needed at the tradeoff of more 
>>>> and / or
>>>> + * more complex code:
>>>> + *
>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>> code a
>>>> + * much simpler range allocator, or let the caller decide by 
>>>> providing
>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>> unless
>>>> + * there is a reasonable chance of enough space in the range manager.
>>>
>>> That's most likely highly problematic.
>>>
>>> The suballocator in radeon/amdgpu was designed so that it resembles 
>>> a ring buffer and is therefore rather CPU efficient.
>>>
>>> We could make the allocator much more trivial, but using drm_mm for 
>>> this is a sledgehammer and therefore a pretty clear no-go.
>>>
>> I don't think the ring vs non-ring is the big problem here, because 
>> (at least with the original implementation), if allocations are 
>> actually made and released in a ring-like fashion, the drm_mm 
>> free-list would consist of one or two blocks and therefore be pretty 
>> efficient even for that case, and if slightly longer that would still 
>> not be an issue compared to the fence lists maintained in the older 
>> allocator.
>>
>> The problem is more all the other stuff that was added and built on 
>> top like the interval / rb tree.
>>
>> I still like the idea (originating from Gallium's helpers) to 
>> separate whatever is allocating from the fence delayed free.
>
> That's actually a bad idea. See the ring like approach works because 
> the fences used in amdgpu/radeon are used in a ring like fashion. E.g. 
> the sub allocator mainly provides the temporary space for page table 
> updates. Those in turn are then used by commands written into a ring 
> buffer.

Well, what I'm saying is that *even* if you have a ring-like allocation 
algorithm, given a simpler drm_mm, I think the suggested code would be 
performing just as well as the one in amdgpu / radeon, on top of 
avoiding throttling on free, or do you have a particular scenario in 
mind that you think would be particularly pathological on this allocator?
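
To make that concrete, here is a minimal sketch of the ring-like usage
pattern I have in mind with the helper as posted; the submission call and
its fence are hypothetical placeholders, not part of the series:

        /* Set up once, e.g. at driver init. */
        struct drm_suballoc_manager sa_manager;

        drm_suballoc_manager_init(&sa_manager, SZ_256K, 256);

        /* Per submission: suballocate, fill, submit, free against the fence. */
        struct drm_suballoc *sa;
        struct dma_fence *fence;

        sa = drm_suballoc_new(&sa_manager, SZ_4K, GFP_KERNEL, true);
        if (IS_ERR(sa))
                return PTR_ERR(sa);

        /* Write e.g. page-table update data at drm_suballoc_soffset(sa). */
        fence = hypothetical_ring_submit(drm_suballoc_soffset(sa),
                                         drm_suballoc_size(sa));

        /* The range becomes reusable once the fence signals; no wait on free. */
        drm_suballoc_free(sa, fence);
        dma_fence_put(fence);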

>
>>
>> Any chance you could do a quick performance comparison? If not, 
>> anything against merging this without the amd / radeon changes until 
>> we can land a simpler allocator?
>
> Only if you can stick the allocator inside Xe and not drm, cause this 
> seems to be for a different use case than the allocators inside 
> radeon/amdgpu.

Hmm. No, it's allocating in a ring-like fashion as well. Let me put 
together a unit test for benchmarking. I think it would be a failure for 
the community to end up with three separate suballocators doing the 
exact same thing for the same problem, really.
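
Roughly the kind of micro-benchmark I'm thinking of (hypothetical sketch
only; sizes, alignment and iteration count are placeholders):

        static void suballoc_ring_bench(void)
        {
                struct drm_suballoc_manager sa_manager;
                struct drm_suballoc *sa[64] = {};
                unsigned int i, rounds = 100000;
                ktime_t start;

                drm_suballoc_manager_init(&sa_manager, 64 * SZ_4K, 64);

                start = ktime_get();
                for (i = 0; i < rounds; i++) {
                        unsigned int slot = i % ARRAY_SIZE(sa);

                        /* Recycle the oldest slot first, ring-style. */
                        drm_suballoc_free(sa[slot], NULL);
                        sa[slot] = drm_suballoc_new(&sa_manager, SZ_4K,
                                                    GFP_KERNEL, false);
                        /* Error handling omitted for brevity. */
                }
                pr_info("suballoc: %llu ns per alloc/free pair\n",
                        div_u64(ktime_to_ns(ktime_sub(ktime_get(), start)),
                                rounds));

                for (i = 0; i < ARRAY_SIZE(sa); i++)
                        drm_suballoc_free(sa[i], NULL);
                drm_suballoc_manager_fini(&sa_manager);
        }

The same loop could then be pointed at the amdgpu and radeon suballocators
to get comparable numbers.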

/Thomas

>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Thomas
>>
>>
>> Thomas
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> + *
>>>> + * 2) We unnecessarily install the fence callbacks too early, forcing
>>>> + * enable_signaling() too early causing extra driver effort. This 
>>>> is likely
>>>> + * not an issue if used with the drm_scheduler since it calls
>>>> + * enable_signaling() early anyway.
>>>> + *
>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>> worked around
>>>> + * that already by using the idle_list. If that workaround is 
>>>> deemed too
>>>> + * complex for little gain, we can remove it and use spin_lock_irq()
>>>> + * throughout the manager. If we want to shorten processing in irq 
>>>> context
>>>> + * even further, we can skip the spin_trylock in 
>>>> __drm_suballoc_free() and
>>>> + * avoid freeing allocations from irq context altogether. However 
>>>> drm_mm
>>>> + * should be quite fast at freeing ranges.
>>>> + *
>>>> + * 4) Shrinker that starts processing the list items in 2) and 3) 
>>>> to play
>>>> + * better with the system.
>>>> + */
>>>> +
>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>> *sa_manager);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + * @size: number of bytes we want to suballocate
>>>> + * @align: alignment for each suballocated chunk
>>>> + *
>>>> + * Prepares the suballocation manager for suballocations.
>>>> + */
>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                   u64 size, u64 align)
>>>> +{
>>>> +    spin_lock_init(&sa_manager->lock);
>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>> +    sa_manager->range_size = size;
>>>> +    sa_manager->alignment = align;
>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + *
>>>> + * Cleans up the suballocation manager after use. All fences added
>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>> + * the entire manager.
>>>> + */
>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>> *sa_manager)
>>>> +{
>>>> +    drm_suballoc_process_idle(sa_manager);
>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>> +
>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>> +{
>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>> +    struct dma_fence *fence;
>>>> +
>>>> +    /*
>>>> +     * In order to avoid protecting the potentially lengthy drm_mm 
>>>> manager
>>>> +     * *allocation* processing with an irq-disabling lock,
>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>> context,
>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>> manager
>>>> +     * lock.
>>>> +     */
>>>> +    if (!in_task() || irqs_disabled()) {
>>>> +        unsigned long irqflags;
>>>> +
>>>> +        if (spin_trylock(&sa_manager->lock))
>>>> +            goto locked;
>>>> +
>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>> +        wake_up(&sa_manager->wq);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    spin_lock(&sa_manager->lock);
>>>> +locked:
>>>> +    drm_mm_remove_node(&sa->node);
>>>> +
>>>> +    fence = sa->fence;
>>>> +    sa->fence = NULL;
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>> +    wake_up(&sa_manager->wq);
>>>> +    dma_fence_put(fence);
>>>> +    kfree(sa);
>>>> +}
>>>> +
>>>> +/* Free all deferred idle allocations */
>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>> *sa_manager)
>>>> +{
>>>> +    /*
>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>> +     * addition that was done before wake_up() is visible when
>>>> +     * this code is called from the wait loop.
>>>> +     */
>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>> +        struct drm_suballoc *sa, *next;
>>>> +        unsigned long irqflags;
>>>> +        LIST_HEAD(list);
>>>> +
>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>> +
>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>> +            __drm_suballoc_free(sa);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void
>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>> dma_fence_cb *cb)
>>>> +{
>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>> +
>>>> +    __drm_suballoc_free(sa);
>>>> +}
>>>> +
>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>> +{
>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>> +    int err;
>>>> +
>>>> +    drm_suballoc_process_idle(sa_manager);
>>>> +    spin_lock(&sa_manager->lock);
>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>> size,
>>>> +                     sa_manager->alignment, 0,
>>>> +                     DRM_MM_INSERT_EVICT);
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +    return err;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_new() - Make a suballocation.
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + * @size: number of bytes we want to suballocate.
>>>> + * @gfp: Allocation context.
>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>> + *
>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>> + * up to the alignment specified in 
>>>> drm_suballoc_manager_init().
>>>> + *
>>>> + * Returns a new suballocation, or an ERR_PTR.
>>>> + */
>>>> +struct drm_suballoc*
>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>> +         gfp_t gfp, bool intr)
>>>> +{
>>>> +    struct drm_suballoc *sa;
>>>> +    DEFINE_WAIT(wait);
>>>> +    int err = 0;
>>>> +
>>>> +    if (size > sa_manager->range_size)
>>>> +        return ERR_PTR(-ENOSPC);
>>>> +
>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>> +    if (!sa)
>>>> +        return ERR_PTR(-ENOMEM);
>>>> +
>>>> +    /* Avoid starvation using the alloc_mutex */
>>>> +    if (intr)
>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>> +    else
>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>> +    if (err) {
>>>> +        kfree(sa);
>>>> +        return ERR_PTR(err);
>>>> +    }
>>>> +
>>>> +    sa->manager = sa_manager;
>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>> +    if (err != -ENOSPC)
>>>> +        goto out;
>>>> +
>>>> +    for (;;) {
>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>> +                TASK_UNINTERRUPTIBLE);
>>>> +
>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>> +        if (err != -ENOSPC)
>>>> +            break;
>>>> +
>>>> +        if (intr && signal_pending(current)) {
>>>> +            err = -ERESTARTSYS;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        io_schedule();
>>>> +    }
>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>> +
>>>> +out:
>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>> +    if (!sa->node.size) {
>>>> +        kfree(sa);
>>>> +        WARN_ON(!err);
>>>> +        sa = ERR_PTR(err);
>>>> +    }
>>>> +
>>>> +    return sa;
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_free() - Free a suballocation
>>>> + * @sa: pointer to the suballocation
>>>> + * @fence: fence that signals when the suballocation is idle
>>>> + *
>>>> + * Free the suballocation. The suballocation can be re-used after 
>>>> @fence
>>>> + * signals.
>>>> + */
>>>> +void
>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>> +{
>>>> +    if (!sa)
>>>> +        return;
>>>> +
>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>> +        __drm_suballoc_free(sa);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    sa->fence = dma_fence_get(fence);
>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>> drm_suballoc_fence_signaled))
>>>> +        __drm_suballoc_free(sa);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>> +
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +
>>>> +/**
>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>> + * @sa_manager: The suballoc manager.
>>>> + * @p: Pointer to a drm printer for output.
>>>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>>>> printout.
>>>> + *
>>>> + * This function dumps the suballocator state. Note that the 
>>>> caller has
>>>> + * to explicitly order frees and calls to this function in order 
>>>> for the
>>>> + * freed node to show up as protected by a fence.
>>>> + */
>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>> +{
>>>> +    const struct drm_mm_node *entry;
>>>> +
>>>> +    spin_lock(&sa_manager->lock);
>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>> +        struct drm_suballoc *sa =
>>>> +            container_of(entry, typeof(*sa), node);
>>>> +
>>>> +        drm_printf(p, " ");
>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>> +               entry->size, (unsigned long long)entry->size);
>>>> +
>>>> +        if (sa->fence)
>>>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>>>> +                   (unsigned long long)sa->fence->seqno,
>>>> +                   (unsigned long long)sa->fence->context);
>>>> +
>>>> +        drm_printf(p, "\n");
>>>> +    }
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>> +#endif
>>>> +
>>>> +MODULE_AUTHOR("Intel Corporation");
>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>> +MODULE_LICENSE("GPL and additional rights");
>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>> new file mode 100644
>>>> index 000000000000..910952b3383b
>>>> --- /dev/null
>>>> +++ b/include/drm/drm_suballoc.h
>>>> @@ -0,0 +1,112 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2022 Intel Corporation
>>>> + */
>>>> +#ifndef _DRM_SUBALLOC_H_
>>>> +#define _DRM_SUBALLOC_H_
>>>> +
>>>> +#include <drm/drm_mm.h>
>>>> +
>>>> +#include <linux/dma-fence.h>
>>>> +#include <linux/types.h>
>>>> +
>>>> +/**
>>>> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
>>>> + * @mm: The range manager. Protected by @lock.
>>>> + * @range_size: The total size of the range.
>>>> + * @alignment: Range alignment.
>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>> Protected by
>>>> + * @idle_list_lock.
>>>> + */
>>>> +struct drm_suballoc_manager {
>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>> +    spinlock_t lock;
>>>> +    /**
>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>> +     * Disable irqs when locking.
>>>> +     */
>>>> +    spinlock_t idle_list_lock;
>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>> +    struct mutex alloc_mutex;
>>>> +    struct drm_mm mm;
>>>> +    u64 range_size;
>>>> +    u64 alignment;
>>>> +    wait_queue_head_t wq;
>>>> +    struct list_head idle_list;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_suballoc: Suballocated range.
>>>> + * @node: The drm_mm representation of the range.
>>>> + * @fence: dma-fence indicating whether allocation is active or idle.
>>>> + * Assigned on call to free the allocation so doesn't need 
>>>> protection.
>>>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>>>> fence signals.
>>>> + * @manager: The struct drm_suballoc_manager the range belongs to. 
>>>> Immutable.
>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>> + * drm_suballoc_manager::idle_list_lock.
>>>> + */
>>>> +struct drm_suballoc {
>>>> +    struct drm_mm_node node;
>>>> +    struct dma_fence *fence;
>>>> +    struct dma_fence_cb cb;
>>>> +    struct drm_suballoc_manager *manager;
>>>> +    struct list_head idle_link;
>>>> +};
>>>> +
>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                   u64 size, u64 align);
>>>> +
>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>> *sa_manager);
>>>> +
>>>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>> +
>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>> *fence);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_soffset - Range start.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The start of the allocated range.
>>>> + */
>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.start;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_eoffset - Range end.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The end of the allocated range + 1.
>>>> + */
>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.start + sa->node.size;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_size - Range size.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The size of the allocated range.
>>>> + */
>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.size;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>> +#else
>>>> +static inline void
>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>> +{ }
>>>> +
>>>> +#endif
>>>> +
>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-17 12:24           ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 12:24 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie


On 2/17/23 12:28, Christian König wrote:
> Am 17.02.23 um 12:21 schrieb Thomas Hellström:
>>
>> On 2/17/23 12:00, Christian König wrote:
>>> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>> It turns out, however, that it tries extremely hard not to enable
>>>> signalling on the fences that hold the memory up for freeing, which 
>>>> makes
>>>> it hard to understand and to fix potential issues with it.
>>>>
>>>> So in a simplification effort, introduce a drm suballocation 
>>>> manager as a
>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>> queues
>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>> feature as typically the throttling needs to be done uninterruptibly.
>>>>
>>>> This variant is probably more cpu-hungry but can be improved at the 
>>>> cost
>>>> of additional complexity. Ideas for that are documented in the
>>>> drm_suballoc.c file.
>>>>
>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>> Co-developed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> ---
>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>> +++++++++++++++++++++++++++++++++
>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>   4 files changed, 420 insertions(+)
>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>
>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>> --- a/drivers/gpu/drm/Kconfig
>>>> +++ b/drivers/gpu/drm/Kconfig
>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>       help
>>>>         Choose this if you need the GEM shmem helper functions
>>>>   +config DRM_SUBALLOC_HELPER
>>>> +    tristate
>>>> +    depends on DRM
>>>> +
>>>>   config DRM_SCHED
>>>>       tristate
>>>>       depends on DRM
>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>> index ab4460fcd63f..1e04d135e866 100644
>>>> --- a/drivers/gpu/drm/Makefile
>>>> +++ b/drivers/gpu/drm/Makefile
>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += drm_dma_helper.o
>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>> +
>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>> new file mode 100644
>>>> index 000000000000..6e0292dea548
>>>> --- /dev/null
>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>> @@ -0,0 +1,301 @@
>>>> +// SPDX-License-Identifier: MIT
>>>> +/*
>>>> + * Copyright © 2022 Intel Corporation
>>>> + */
>>>> +
>>>> +#include <drm/drm_suballoc.h>
>>>> +
>>>> +/**
>>>> + * DOC:
>>>> + * This suballocator intends to be a wrapper around a range allocator
>>>> + * that is aware also of deferred range freeing with fences. 
>>>> Currently
>>>> + * we hard-code the drm_mm as the range allocator.
>>>> + * The approach, while rather simple, suffers from three performance
>>>> + * issues that can all be fixed if needed at the tradeoff of more 
>>>> and / or
>>>> + * more complex code:
>>>> + *
>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>> code a
>>>> + * much simpler range allocator, or let the caller decide by 
>>>> providing
>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>> unless
>>>> + * there is a reasonable chance of enough space in the range manager.
>>>
>>> That's most likely highly problematic.
>>>
>>> The suballocator in radeon/amdgpu was designed so that it resembles 
>>> a ring buffer and is therefore rather CPU efficient.
>>>
>>> We could make the allocator much more trivial, but using drm_mm for 
>>> this is a sledgehammer and therefore a pretty clear no-go.
>>>
>> I don't think the ring vs non-ring is the big problem here, because 
>> (at least with the original implementation), if allocations are 
>> actually made and released in a ring-like fashion, the drm_mm 
>> free-list would consist of one or two blocks and therefore be pretty 
>> efficient even for that case, and if slightly longer that would still 
>> not be an issue compared to the fence lists maintained in the older 
>> allocator.
>>
>> The problem is more all the other stuff that was added and built on 
>> top like the interval / rb tree.
>>
>> I still like the idea (originating from Gallium's helpers) to 
>> separate whatever is allocating from the fence delayed free.
>
> That's actually a bad idea. See the ring like approach works because 
> the fences used in amdgpu/radeon are used in a ring like fashion. E.g. 
> the sub allocator mainly provides the temporary space for page table 
> updates. Those in turn are then used by commands written into a ring 
> buffer.

Well, what I'm saying is that *even* if you have a ring-like allocation 
algorithm, given a simpler drm_mm, I think the suggested code would be 
performing just as well as the one in amdgpu / radeon, on top of 
avoiding throttling on free, or do you have a particular scenario in 
mind that you think would be particularly pathological on this allocator?

>
>>
>> Any chance you could do a quick performance comparison? If not, 
>> anything against merging this without the amd / radeon changes until 
>> we can land a simpler allocator?
>
> Only if you can stick the allocator inside Xe and not drm, cause this 
> seems to be for a different use case than the allocators inside 
> radeon/amdgpu.

Hmm. No, it's allocating in a ring-like fashion as well. Let me put 
together a unit test for benchmarking. I think it would be a failure for 
the community to end up with three separate suballocators doing the 
exact same thing for the same problem, really.

/Thomas

>
> Regards,
> Christian.
>
>>
>> Thanks,
>> Thomas
>>
>>
>> Thomas
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> + *
>>>> + * 2) We unnecessarily install the fence callbacks too early, forcing
>>>> + * enable_signaling() too early causing extra driver effort. This 
>>>> is likely
>>>> + * not an issue if used with the drm_scheduler since it calls
>>>> + * enable_signaling() early anyway.
>>>> + *
>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>> worked around
>>>> + * that already by using the idle_list. If that workaround is 
>>>> deemed too
>>>> + * complex for little gain, we can remove it and use spin_lock_irq()
>>>> + * throughout the manager. If we want to shorten processing in irq 
>>>> context
>>>> + * even further, we can skip the spin_trylock in 
>>>> __drm_suballoc_free() and
>>>> + * avoid freeing allocations from irq context altogether. However 
>>>> drm_mm
>>>> + * should be quite fast at freeing ranges.
>>>> + *
>>>> + * 4) Shrinker that starts processing the list items in 2) and 3) 
>>>> to play
>>>> + * better with the system.
>>>> + */
>>>> +
>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>> *sa_manager);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + * @size: number of bytes we want to suballocate
>>>> + * @align: alignment for each suballocated chunk
>>>> + *
>>>> + * Prepares the suballocation manager for suballocations.
>>>> + */
>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                   u64 size, u64 align)
>>>> +{
>>>> +    spin_lock_init(&sa_manager->lock);
>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>> +    sa_manager->range_size = size;
>>>> +    sa_manager->alignment = align;
>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + *
>>>> + * Cleans up the suballocation manager after use. All fences added
>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>> + * the entire manager.
>>>> + */
>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>> *sa_manager)
>>>> +{
>>>> +    drm_suballoc_process_idle(sa_manager);
>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>> +
>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>> +{
>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>> +    struct dma_fence *fence;
>>>> +
>>>> +    /*
>>>> +     * In order to avoid protecting the potentially lengthy drm_mm 
>>>> manager
>>>> +     * *allocation* processing with an irq-disabling lock,
>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>> context,
>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>> manager
>>>> +     * lock.
>>>> +     */
>>>> +    if (!in_task() || irqs_disabled()) {
>>>> +        unsigned long irqflags;
>>>> +
>>>> +        if (spin_trylock(&sa_manager->lock))
>>>> +            goto locked;
>>>> +
>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>> +        wake_up(&sa_manager->wq);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    spin_lock(&sa_manager->lock);
>>>> +locked:
>>>> +    drm_mm_remove_node(&sa->node);
>>>> +
>>>> +    fence = sa->fence;
>>>> +    sa->fence = NULL;
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>> +    wake_up(&sa_manager->wq);
>>>> +    dma_fence_put(fence);
>>>> +    kfree(sa);
>>>> +}
>>>> +
>>>> +/* Free all deferred idle allocations */
>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>> *sa_manager)
>>>> +{
>>>> +    /*
>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>> +     * addition that was done before wake_up() is visible when
>>>> +     * this code is called from the wait loop.
>>>> +     */
>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>> +        struct drm_suballoc *sa, *next;
>>>> +        unsigned long irqflags;
>>>> +        LIST_HEAD(list);
>>>> +
>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>> +
>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>> +            __drm_suballoc_free(sa);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void
>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>> dma_fence_cb *cb)
>>>> +{
>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>> +
>>>> +    __drm_suballoc_free(sa);
>>>> +}
>>>> +
>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>> +{
>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>> +    int err;
>>>> +
>>>> +    drm_suballoc_process_idle(sa_manager);
>>>> +    spin_lock(&sa_manager->lock);
>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>> size,
>>>> +                     sa_manager->alignment, 0,
>>>> +                     DRM_MM_INSERT_EVICT);
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +    return err;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_new() - Make a suballocation.
>>>> + * @sa_manager: pointer to the sa_manager
>>>> + * @size: number of bytes we want to suballocate.
>>>> + * @gfp: Allocation context.
>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>> + *
>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>> + * up to the alignment specified in 
>>>> drm_suballoc_manager_init().
>>>> + *
>>>> + * Returns a new suballocation, or an ERR_PTR.
>>>> + */
>>>> +struct drm_suballoc*
>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>> +         gfp_t gfp, bool intr)
>>>> +{
>>>> +    struct drm_suballoc *sa;
>>>> +    DEFINE_WAIT(wait);
>>>> +    int err = 0;
>>>> +
>>>> +    if (size > sa_manager->range_size)
>>>> +        return ERR_PTR(-ENOSPC);
>>>> +
>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>> +    if (!sa)
>>>> +        return ERR_PTR(-ENOMEM);
>>>> +
>>>> +    /* Avoid starvation using the alloc_mutex */
>>>> +    if (intr)
>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>> +    else
>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>> +    if (err) {
>>>> +        kfree(sa);
>>>> +        return ERR_PTR(err);
>>>> +    }
>>>> +
>>>> +    sa->manager = sa_manager;
>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>> +    if (err != -ENOSPC)
>>>> +        goto out;
>>>> +
>>>> +    for (;;) {
>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>> +                TASK_UNINTERRUPTIBLE);
>>>> +
>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>> +        if (err != -ENOSPC)
>>>> +            break;
>>>> +
>>>> +        if (intr && signal_pending(current)) {
>>>> +            err = -ERESTARTSYS;
>>>> +            break;
>>>> +        }
>>>> +
>>>> +        io_schedule();
>>>> +    }
>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>> +
>>>> +out:
>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>> +    if (!sa->node.size) {
>>>> +        kfree(sa);
>>>> +        WARN_ON(!err);
>>>> +        sa = ERR_PTR(err);
>>>> +    }
>>>> +
>>>> +    return sa;
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_free() - Free a suballocation
>>>> + * @sa: pointer to the suballocation
>>>> + * @fence: fence that signals when the suballocation is idle
>>>> + *
>>>> + * Free the suballocation. The suballocation can be re-used after 
>>>> @fence
>>>> + * signals.
>>>> + */
>>>> +void
>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>> +{
>>>> +    if (!sa)
>>>> +        return;
>>>> +
>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>> +        __drm_suballoc_free(sa);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    sa->fence = dma_fence_get(fence);
>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>> drm_suballoc_fence_signaled))
>>>> +        __drm_suballoc_free(sa);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>> +
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +
>>>> +/**
>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>> + * @sa_manager: The suballoc manager.
>>>> + * @p: Pointer to a drm printer for output.
>>>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>>>> printout.
>>>> + *
>>>> + * This function dumps the suballocator state. Note that the 
>>>> caller has
>>>> + * to explicitly order frees and calls to this function in order 
>>>> for the
>>>> + * freed node to show up as protected by a fence.
>>>> + */
>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>> +{
>>>> +    const struct drm_mm_node *entry;
>>>> +
>>>> +    spin_lock(&sa_manager->lock);
>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>> +        struct drm_suballoc *sa =
>>>> +            container_of(entry, typeof(*sa), node);
>>>> +
>>>> +        drm_printf(p, " ");
>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>> +               entry->size, (unsigned long long)entry->size);
>>>> +
>>>> +        if (sa->fence)
>>>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>>>> +                   (unsigned long long)sa->fence->seqno,
>>>> +                   (unsigned long long)sa->fence->context);
>>>> +
>>>> +        drm_printf(p, "\n");
>>>> +    }
>>>> +    spin_unlock(&sa_manager->lock);
>>>> +}
>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>> +#endif
>>>> +
>>>> +MODULE_AUTHOR("Intel Corporation");
>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>> +MODULE_LICENSE("GPL and additional rights");
>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>> new file mode 100644
>>>> index 000000000000..910952b3383b
>>>> --- /dev/null
>>>> +++ b/include/drm/drm_suballoc.h
>>>> @@ -0,0 +1,112 @@
>>>> +/* SPDX-License-Identifier: MIT */
>>>> +/*
>>>> + * Copyright © 2022 Intel Corporation
>>>> + */
>>>> +#ifndef _DRM_SUBALLOC_H_
>>>> +#define _DRM_SUBALLOC_H_
>>>> +
>>>> +#include <drm/drm_mm.h>
>>>> +
>>>> +#include <linux/dma-fence.h>
>>>> +#include <linux/types.h>
>>>> +
>>>> +/**
>>>> + * struct drm_suballoc_manager - Wrapper for fenced range allocations
>>>> + * @mm: The range manager. Protected by @lock.
>>>> + * @range_size: The total size of the range.
>>>> + * @alignment: Range alignment.
>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>> Protected by
>>>> + * @idle_list_lock.
>>>> + */
>>>> +struct drm_suballoc_manager {
>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>> +    spinlock_t lock;
>>>> +    /**
>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>> +     * Disable irqs when locking.
>>>> +     */
>>>> +    spinlock_t idle_list_lock;
>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>> +    struct mutex alloc_mutex;
>>>> +    struct drm_mm mm;
>>>> +    u64 range_size;
>>>> +    u64 alignment;
>>>> +    wait_queue_head_t wq;
>>>> +    struct list_head idle_list;
>>>> +};
>>>> +
>>>> +/**
>>>> + * struct drm_suballoc: Suballocated range.
>>>> + * @node: The drm_mm representation of the range.
>>>> + * @fence: dma-fence indicating whether allocation is active or idle.
>>>> + * Assigned on call to free the allocation so doesn't need 
>>>> protection.
>>>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>>>> fence signals.
>>>> + * @manager: The struct drm_suballoc_manager the range belongs to. 
>>>> Immutable.
>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>> + * drm_suballoc_manager::idle_list_lock.
>>>> + */
>>>> +struct drm_suballoc {
>>>> +    struct drm_mm_node node;
>>>> +    struct dma_fence *fence;
>>>> +    struct dma_fence_cb cb;
>>>> +    struct drm_suballoc_manager *manager;
>>>> +    struct list_head idle_link;
>>>> +};
>>>> +
>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                   u64 size, u64 align);
>>>> +
>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>> *sa_manager);
>>>> +
>>>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>> +
>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>> *fence);
>>>> +
>>>> +/**
>>>> + * drm_suballoc_soffset - Range start.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The start of the allocated range.
>>>> + */
>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.start;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_eoffset - Range end.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The end of the allocated range + 1.
>>>> + */
>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.start + sa->node.size;
>>>> +}
>>>> +
>>>> +/**
>>>> + * drm_suballoc_size - Range size.
>>>> + * @sa: The struct drm_suballoc.
>>>> + *
>>>> + * Return: The size of the allocated range.
>>>> + */
>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>> +{
>>>> +    return sa->node.size;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_DEBUG_FS
>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>> *sa_manager,
>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>> +#else
>>>> +static inline void
>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager *sa_manager,
>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>> +{ }
>>>> +
>>>> +#endif
>>>> +
>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 12:24           ` [Intel-xe] " Thomas Hellström
@ 2023-02-17 12:28             ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 12:28 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Am 17.02.23 um 13:24 schrieb Thomas Hellström:
>
> On 2/17/23 12:28, Christian König wrote:
>> Am 17.02.23 um 12:21 schrieb Thomas Hellström:
>>>
>>> On 2/17/23 12:00, Christian König wrote:
>>>> Am 16.02.23 um 15:48 schrieb Thomas Hellström:
>>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>>> It turns out, however, that it tries extremely hard not to enable
>>>>> signalling on the fences that hold the memory up for freeing, 
>>>>> which makes
>>>>> it hard to understand and to fix potential issues with it.
>>>>>
>>>>> So in a simplification effort, introduce a drm suballocation 
>>>>> manager as a
>>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>>> queues
>>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>>> feature as typically the throttling needs to be done uninterruptibly.
>>>>>
>>>>> This variant is probably more cpu-hungry but can be improved at 
>>>>> the cost
>>>>> of additional complexity. Ideas for that are documented in the
>>>>> drm_suballoc.c file.
>>>>>
>>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>> Co-developed-by: Maarten Lankhorst 
>>>>> <maarten.lankhorst@linux.intel.com>
>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>>> +++++++++++++++++++++++++++++++++
>>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>>   4 files changed, 420 insertions(+)
>>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>>
>>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>>> --- a/drivers/gpu/drm/Kconfig
>>>>> +++ b/drivers/gpu/drm/Kconfig
>>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>>       help
>>>>>         Choose this if you need the GEM shmem helper functions
>>>>>   +config DRM_SUBALLOC_HELPER
>>>>> +    tristate
>>>>> +    depends on DRM
>>>>> +
>>>>>   config DRM_SCHED
>>>>>       tristate
>>>>>       depends on DRM
>>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>>> index ab4460fcd63f..1e04d135e866 100644
>>>>> --- a/drivers/gpu/drm/Makefile
>>>>> +++ b/drivers/gpu/drm/Makefile
>>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += 
>>>>> drm_dma_helper.o
>>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>>> +
>>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>>> new file mode 100644
>>>>> index 000000000000..6e0292dea548
>>>>> --- /dev/null
>>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>>> @@ -0,0 +1,301 @@
>>>>> +// SPDX-License-Identifier: MIT
>>>>> +/*
>>>>> + * Copyright © 2022 Intel Corporation
>>>>> + */
>>>>> +
>>>>> +#include <drm/drm_suballoc.h>
>>>>> +
>>>>> +/**
>>>>> + * DOC:
>>>>> + * This suballocator intends to be a wrapper around a range 
>>>>> allocator
>>>>> + * that is aware also of deferred range freeing with fences. 
>>>>> Currently
>>>>> + * we hard-code the drm_mm as the range allocator.
>>>>> + * The approach, while rather simple, suffers from three performance
>>>>> + * issues that can all be fixed if needed at the tradeoff of more 
>>>>> and / or
>>>>> + * more complex code:
>>>>> + *
>>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>>> code a
>>>>> + * much simpler range allocator, or let the caller decide by 
>>>>> providing
>>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>>> unless
>>>>> + * there is a reasonable chance of enough space in the range 
>>>>> manager.
>>>>
>>>> That's most likely highly problematic.
>>>>
>>>> The suballocator in radeon/amdgpu was designed so that it resembles 
>>>> a ring buffer and is therefore rather CPU efficient.
>>>>
>>>> We could make the allocator much more trivial, but using drm_mm for 
>>>> this is a sledgehammer and therefore a pretty clear no-go.
>>>>
>>> I don't think the ring vs non-ring is the big problem here, because 
>>> (at least with the original implementation), if allocations are 
>>> actually made and released in a ring-like fashion, the drm_mm 
>>> free-list would consist of one or two blocks and therefore be pretty 
>>> efficient even for that case, and if slightly longer that would 
>>> still not be an issue compared to the fence lists maintained in the 
>>> older allocator.
>>>
>>> The problem is more all the other stuff that was added and built on 
>>> top like the interval / rb tree.
>>>
>>> I still like the idea (originating from Gallium's helpers) to 
>>> separate whatever is allocating from the fence delayed free.
>>
>> That's actually a bad idea. See the ring like approach works because 
>> the fences used in amdgpu/radeon are used in a ring like fashion. 
>> E.g. the sub allocator mainly provides the temporary space for page 
>> table updates. Those in turn are then used by commands written into a 
>> ring buffer.
>
> Well, what I'm saying is that *even* if you have a ring-like 
> allocation algorithm, given a simpler drm_mm, I think the suggested 
> code would be performing just as well as the one in amdgpu / radeon, 
> on top of avoiding throttling on free, or do you have a particular 
> scenario in mind that you think would be particularly pathological on 
> this allocator?

What do you mean by avoiding throttling on free?

>
>>
>>>
>>> Any chance you could do a quick performance comparison? If not, 
>>> anything against merging this without the amd / radeon changes until 
>>> we can land a simpler allocator?
>>
>> Only if you can stick the allocator inside Xe and not drm, cause this 
>> seems to be for a different use case than the allocators inside 
>> radeon/amdgpu.
>
> Hmm. No, it's allocating in a ring-like fashion as well. Let me put 
> together a unit test for benchmarking. I think it would be a failure 
> for the community to end up with three separate suballocators doing 
> the exact same thing for the same problem, really.

Well, that's exactly the point. Those allocators aren't the same because 
they handle different problems.

The allocator in radeon is simpler because it only had to deal with a 
limited number of fence timelines. The one in amdgpu is a bit more 
complex because it has to deal with more fence timelines.

We could take the one from amdgpu and use it for radeon and others as 
well, but the allocator proposed here doesn't even remotely match the 
requirements.

Regards,
Christian.

>
> /Thomas
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> Thomas
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> + *
>>>>> + * 2) We unnecessarily install the fence callbacks too early, 
>>>>> forcing
>>>>> + * enable_signaling() too early causing extra driver effort. This 
>>>>> is likely
>>>>> + * not an issue if used with the drm_scheduler since it calls
>>>>> + * enable_signaling() early anyway.
>>>>> + *
>>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>>> worked around
>>>>> + * that already by using the idle_list. If that workaround is 
>>>>> deemed too
>>>>> + * complex for little gain, we can remove it and use spin_lock_irq()
>>>>> + * throughout the manager. If we want to shorten processing in 
>>>>> irq context
>>>>> + * even further, we can skip the spin_trylock in 
>>>>> __drm_suballoc_free() and
>>>>> + * avoid freeing allocations from irq context altogether. However 
>>>>> drm_mm
>>>>> + * should be quite fast at freeing ranges.
>>>>> + *
>>>>> + * 4) Shrinker that starts processing the list items in 2) and 3) 
>>>>> to play
>>>>> + * better with the system.
>>>>> + */
>>>>> +
>>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>>> *sa_manager);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + * @size: number of bytes we want to suballocate
>>>>> + * @align: alignment for each suballocated chunk
>>>>> + *
>>>>> + * Prepares the suballocation manager for suballocations.
>>>>> + */
>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                   u64 size, u64 align)
>>>>> +{
>>>>> +    spin_lock_init(&sa_manager->lock);
>>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>>> +    sa_manager->range_size = size;
>>>>> +    sa_manager->alignment = align;
>>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + *
>>>>> + * Cleans up the suballocation manager after use. All fences added
>>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>>> + * the entire manager.
>>>>> + */
>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>> *sa_manager)
>>>>> +{
>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>>> +
>>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>> +    struct dma_fence *fence;
>>>>> +
>>>>> +    /*
>>>>> +     * In order to avoid protecting the potentially lengthy 
>>>>> drm_mm manager
>>>>> +     * *allocation* processing with an irq-disabling lock,
>>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>>> context,
>>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>>> manager
>>>>> +     * lock.
>>>>> +     */
>>>>> +    if (!in_task() || irqs_disabled()) {
>>>>> +        unsigned long irqflags;
>>>>> +
>>>>> +        if (spin_trylock(&sa_manager->lock))
>>>>> +            goto locked;
>>>>> +
>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>> +        wake_up(&sa_manager->wq);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +locked:
>>>>> +    drm_mm_remove_node(&sa->node);
>>>>> +
>>>>> +    fence = sa->fence;
>>>>> +    sa->fence = NULL;
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>>> +    wake_up(&sa_manager->wq);
>>>>> +    dma_fence_put(fence);
>>>>> +    kfree(sa);
>>>>> +}
>>>>> +
>>>>> +/* Free all deferred idle allocations */
>>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>>> *sa_manager)
>>>>> +{
>>>>> +    /*
>>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>>> +     * addition that was done before wake_up() is visible when
>>>>> +     * this code is called from the wait loop.
>>>>> +     */
>>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>>> +        struct drm_suballoc *sa, *next;
>>>>> +        unsigned long irqflags;
>>>>> +        LIST_HEAD(list);
>>>>> +
>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>> +
>>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>>> +            __drm_suballoc_free(sa);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>>> dma_fence_cb *cb)
>>>>> +{
>>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>>> +
>>>>> +    __drm_suballoc_free(sa);
>>>>> +}
>>>>> +
>>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>>> +{
>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>> +    int err;
>>>>> +
>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>>> size,
>>>>> +                     sa_manager->alignment, 0,
>>>>> +                     DRM_MM_INSERT_EVICT);
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_new() - Make a suballocation.
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + * @size: number of bytes we want to suballocate.
>>>>> + * @gfp: Allocation context.
>>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>>> + *
>>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>>> + * up to the alignment specified in 
>>>>> drm_suballoc_manager_init().
>>>>> + *
>>>>> + * Returns a new suballocation, or an ERR_PTR.
>>>>> + */
>>>>> +struct drm_suballoc*
>>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>>> +         gfp_t gfp, bool intr)
>>>>> +{
>>>>> +    struct drm_suballoc *sa;
>>>>> +    DEFINE_WAIT(wait);
>>>>> +    int err = 0;
>>>>> +
>>>>> +    if (size > sa_manager->range_size)
>>>>> +        return ERR_PTR(-ENOSPC);
>>>>> +
>>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>>> +    if (!sa)
>>>>> +        return ERR_PTR(-ENOMEM);
>>>>> +
>>>>> +    /* Avoid starvation using the alloc_mutex */
>>>>> +    if (intr)
>>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>>> +    else
>>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>>> +    if (err) {
>>>>> +        kfree(sa);
>>>>> +        return ERR_PTR(err);
>>>>> +    }
>>>>> +
>>>>> +    sa->manager = sa_manager;
>>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>>> +    if (err != -ENOSPC)
>>>>> +        goto out;
>>>>> +
>>>>> +    for (;;) {
>>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>>> +                TASK_UNINTERRUPTIBLE);
>>>>> +
>>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>>> +        if (err != -ENOSPC)
>>>>> +            break;
>>>>> +
>>>>> +        if (intr && signal_pending(current)) {
>>>>> +            err = -ERESTARTSYS;
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        io_schedule();
>>>>> +    }
>>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>>> +
>>>>> +out:
>>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>>> +    if (!sa->node.size) {
>>>>> +        kfree(sa);
>>>>> +        WARN_ON(!err);
>>>>> +        sa = ERR_PTR(err);
>>>>> +    }
>>>>> +
>>>>> +    return sa;
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_free() - Free a suballocation
>>>>> + * @sa: pointer to the suballocation
>>>>> + * @fence: fence that signals when the suballocation is idle
>>>>> + *
>>>>> + * Free the suballocation. The suballocation can be re-used after 
>>>>> @fence
>>>>> + * signals.
>>>>> + */
>>>>> +void
>>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>>> +{
>>>>> +    if (!sa)
>>>>> +        return;
>>>>> +
>>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>>> +        __drm_suballoc_free(sa);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    sa->fence = dma_fence_get(fence);
>>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>>> drm_suballoc_fence_signaled))
>>>>> +        __drm_suballoc_free(sa);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>>> +
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>>> + * @sa_manager: The suballoc manager.
>>>>> + * @p: Pointer to a drm printer for output.
>>>>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>>>>> printout.
>>>>> + *
>>>>> + * This function dumps the suballocator state. Note that the 
>>>>> caller has
>>>>> + * to explicitly order frees and calls to this function in order 
>>>>> for the
>>>>> + * freed node to show up as protected by a fence.
>>>>> + */
>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>>> +{
>>>>> +    const struct drm_mm_node *entry;
>>>>> +
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>>> +        struct drm_suballoc *sa =
>>>>> +            container_of(entry, typeof(*sa), node);
>>>>> +
>>>>> +        drm_printf(p, " ");
>>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>>> +               entry->size, (unsigned long long)entry->size);
>>>>> +
>>>>> +        if (sa->fence)
>>>>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>>>>> +                   (unsigned long long)sa->fence->seqno,
>>>>> +                   (unsigned long long)sa->fence->context);
>>>>> +
>>>>> +        drm_printf(p, "\n");
>>>>> +    }
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>>> +#endif
>>>>> +
>>>>> +MODULE_AUTHOR("Intel Corporation");
>>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>>> +MODULE_LICENSE("GPL and additional rights");
>>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>>> new file mode 100644
>>>>> index 000000000000..910952b3383b
>>>>> --- /dev/null
>>>>> +++ b/include/drm/drm_suballoc.h
>>>>> @@ -0,0 +1,112 @@
>>>>> +/* SPDX-License-Identifier: MIT */
>>>>> +/*
>>>>> + * Copyright © 2022 Intel Corporation
>>>>> + */
>>>>> +#ifndef _DRM_SUBALLOC_H_
>>>>> +#define _DRM_SUBALLOC_H_
>>>>> +
>>>>> +#include <drm/drm_mm.h>
>>>>> +
>>>>> +#include <linux/dma-fence.h>
>>>>> +#include <linux/types.h>
>>>>> +
>>>>> +/**
>>>>> + * struct drm_suballoc_manager - Wrapper for fenced range 
>>>>> allocations
>>>>> + * @mm: The range manager. Protected by @lock.
>>>>> + * @range_size: The total size of the range.
>>>>> + * @alignment: Range alignment.
>>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>>> Protected by
>>>>> + * @idle_list_lock.
>>>>> + */
>>>>> +struct drm_suballoc_manager {
>>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>>> +    spinlock_t lock;
>>>>> +    /**
>>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>>> +     * Disable irqs when locking.
>>>>> +     */
>>>>> +    spinlock_t idle_list_lock;
>>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>>> +    struct mutex alloc_mutex;
>>>>> +    struct drm_mm mm;
>>>>> +    u64 range_size;
>>>>> +    u64 alignment;
>>>>> +    wait_queue_head_t wq;
>>>>> +    struct list_head idle_list;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct drm_suballoc: Suballocated range.
>>>>> + * @node: The drm_mm representation of the range.
>>>>> + * @fence: dma-fence indicating whether allocation is active or 
>>>>> idle.
>>>>> + * Assigned on call to free the allocation so doesn't need 
>>>>> protection.
>>>>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>>>>> fence signals.
>>>>> + * @manager: The struct drm_suballoc_manager the range belongs 
>>>>> to. Immutable.
>>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>>> + * drm_suballoc_manager::idle_list_lock.
>>>>> + */
>>>>> +struct drm_suballoc {
>>>>> +    struct drm_mm_node node;
>>>>> +    struct dma_fence *fence;
>>>>> +    struct dma_fence_cb cb;
>>>>> +    struct drm_suballoc_manager *manager;
>>>>> +    struct list_head idle_link;
>>>>> +};
>>>>> +
>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                   u64 size, u64 align);
>>>>> +
>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>> *sa_manager);
>>>>> +
>>>>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>>> +
>>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>>> *fence);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_soffset - Range start.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The start of the allocated range.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.start;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_eoffset - Range end.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The end of the allocated range + 1.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.start + sa->node.size;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_size - Range size.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The size of the allocated range.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.size;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>>> +#else
>>>>> +static inline void
>>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>>> +{ }
>>>>> +
>>>>> +#endif
>>>>> +
>>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-17 12:28             ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 12:28 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie

On 2/17/23 13:24, Thomas Hellström wrote:
>
> On 2/17/23 12:28, Christian König wrote:
>> On 2/17/23 12:21, Thomas Hellström wrote:
>>>
>>> On 2/17/23 12:00, Christian König wrote:
>>>> On 2/16/23 15:48, Thomas Hellström wrote:
>>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>>> It turns out, however, that it tries extremely hard not to enable
>>>>> signalling on the fences that hold the memory up for freeing, 
>>>>> which makes
>>>>> it hard to understand and to fix potential issues with it.
>>>>>
>>>>> So in a simplification effort, introduce a drm suballocation 
>>>>> manager as a
>>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>>> queues
>>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>>> feature as typically the throttling needs to be done uninterruptibly.
>>>>>
>>>>> This variant is probably more cpu-hungry but can be improved at 
>>>>> the cost
>>>>> of additional complexity. Ideas for that are documented in the
>>>>> drm_suballoc.c file.
>>>>>
>>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>> Co-developed-by: Maarten Lankhorst 
>>>>> <maarten.lankhorst@linux.intel.com>
>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>>> +++++++++++++++++++++++++++++++++
>>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>>   4 files changed, 420 insertions(+)
>>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>>
>>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>>> --- a/drivers/gpu/drm/Kconfig
>>>>> +++ b/drivers/gpu/drm/Kconfig
>>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>>       help
>>>>>         Choose this if you need the GEM shmem helper functions
>>>>>   +config DRM_SUBALLOC_HELPER
>>>>> +    tristate
>>>>> +    depends on DRM
>>>>> +
>>>>>   config DRM_SCHED
>>>>>       tristate
>>>>>       depends on DRM
>>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>>> index ab4460fcd63f..1e04d135e866 100644
>>>>> --- a/drivers/gpu/drm/Makefile
>>>>> +++ b/drivers/gpu/drm/Makefile
>>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += 
>>>>> drm_dma_helper.o
>>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>>> +
>>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>>> new file mode 100644
>>>>> index 000000000000..6e0292dea548
>>>>> --- /dev/null
>>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>>> @@ -0,0 +1,301 @@
>>>>> +// SPDX-License-Identifier: MIT
>>>>> +/*
>>>>> + * Copyright © 2022 Intel Corporation
>>>>> + */
>>>>> +
>>>>> +#include <drm/drm_suballoc.h>
>>>>> +
>>>>> +/**
>>>>> + * DOC:
>>>>> + * This suballocator intends to be a wrapper around a range 
>>>>> allocator
>>>>> + * that is aware also of deferred range freeing with fences. 
>>>>> Currently
>>>>> + * we hard-code the drm_mm as the range allocator.
>>>>> + * The approach, while rather simple, suffers from a number of performance
>>>>> + * issues that can all be fixed if needed at the tradeoff of more 
>>>>> and / or
>>>>> + * more complex code:
>>>>> + *
>>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>>> code a
>>>>> + * much simpler range allocator, or let the caller decide by 
>>>>> providing
>>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>>> unless
>>>>> + * there is a reasonable chance of enough space in the range 
>>>>> manager.
>>>>
>>>> That's most likely highly problematic.
>>>>
>>>> The suballocator in radeon/amdgpu was designed so that it resembles 
>>>> a ring buffer and is therefore rather CPU efficient.
>>>>
>>>> We could make the allocator much more trivial, but using drm_mm for 
>>>> this is a sledgehammer and therefore a pretty clear no-go.
>>>>
>>> I don't think the ring vs non-ring is the big problem here, because 
>>> (at least with the original implementation), if allocations are 
>>> actually made and released in a ring-like fashion, the drm_mm 
>>> free-list would consist of one or two blocks and therefore pretty 
>>> efficient even for that case, and if slightly longer that would 
>>> still not be an issue compared to the fence lists maintained in the 
>>> older allocator.
>>>
>>> The problem is more all the other stuff that was added and built on 
>>> top like the interval / rb tree.
>>>
>>> I still like the idea (originating from Gallium's helpers) to 
>>> separate whatever is allocating from the fence delayed free.
>>
>> That's actually a bad idea. See the ring like approach works because 
>> the fences used in amdgpu/radeon are used in a ring like fashion. 
>> E.g. the sub allocator mainly provides the temporary space for page 
>> table updates. Those in turn are then used by commands written into a 
>> ring buffer.
>
> Well, what I'm saying is that *even* if you have a ring-like 
> allocation algorithm, given a simpler drm_mm, I think the suggested 
> code would be performing just as well as the one in amdgpu / radeon, 
> on top of avoiding throttling on free, or do you have a particular 
> scenario in mind that you think would be particularly pathological on 
> this allocator?

What do you mean with avoiding throttling on free?

>
>>
>>>
>>> Any chance you could do a quick performance comparison? If not, 
>>> anything against merging this without the amd / radeon changes until 
>>> we can land a simpler allocator?
>>
>> Only if you can stick the allocator inside Xe and not drm, cause this 
>> seems to be for a different use case than the allocators inside 
>> radeon/amdgpu.
>
> Hmm. No, it's allocating in a ring-like fashion as well.  Let me put 
> together a unit test for benchmarking. I think it would be a failure 
> for the community to end up with three separate suballocators doing 
> the exact same thing for the same problem, really.

Well exactly that's the point. Those allocators aren't the same because 
they handle different problems.

The allocator in radeon is simpler because it only had to deal with a 
limited number of fence timelines. The one in amdgpu is a bit more 
complex because of the added handling for more fence timelines.

We could take the one from amdgpu and use it for radeon and others as 
well, but the allocator proposed here doesn't even remotely match the 
requirements.

Regards,
Christian.

>
> /Thomas
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> Thomas
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> + *
>>>>> + * 2) We unnecessarily install the fence callbacks too early, 
>>>>> forcing
>>>>> + * enable_signaling() too early causing extra driver effort. This 
>>>>> is likely
>>>>> + * not an issue if used with the drm_scheduler since it calls
>>>>> + * enable_signaling() early anyway.
>>>>> + *
>>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>>> worked around
>>>>> + * that already by using the idle_list. If that workaround is 
>>>>> deemed too
>>>>> + * complex for little gain, we can remove it and use spin_lock_irq()
>>>>> + * throughout the manager. If we want to shorten processing in 
>>>>> irq context
>>>>> + * even further, we can skip the spin_trylock in 
>>>>> __drm_suballoc_free() and
>>>>> + * avoid freeing allocations from irq context altogether. However 
>>>>> drm_mm
>>>>> + * should be quite fast at freeing ranges.
>>>>> + *
>>>>> + * 4) Shrinker that starts processing the list items in 2) and 3) 
>>>>> to play
>>>>> + * better with the system.
>>>>> + */
>>>>> +
>>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>>> *sa_manager);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_manager_init() - Initialise the drm_suballoc_manager
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + * @size: number of bytes we want to suballocate
>>>>> + * @align: alignment for each suballocated chunk
>>>>> + *
>>>>> + * Prepares the suballocation manager for suballocations.
>>>>> + */
>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                   u64 size, u64 align)
>>>>> +{
>>>>> +    spin_lock_init(&sa_manager->lock);
>>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>>> +    sa_manager->range_size = size;
>>>>> +    sa_manager->alignment = align;
>>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + *
>>>>> + * Cleans up the suballocation manager after use. All fences added
>>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>>> + * the entire manager.
>>>>> + */
>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>> *sa_manager)
>>>>> +{
>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>>> +
>>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>> +    struct dma_fence *fence;
>>>>> +
>>>>> +    /*
>>>>> +     * In order to avoid protecting the potentially lengthy 
>>>>> drm_mm manager
>>>>> +     * *allocation* processing with an irq-disabling lock,
>>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>>> context,
>>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>>> manager
>>>>> +     * lock.
>>>>> +     */
>>>>> +    if (!in_task() || irqs_disabled()) {
>>>>> +        unsigned long irqflags;
>>>>> +
>>>>> +        if (spin_trylock(&sa_manager->lock))
>>>>> +            goto locked;
>>>>> +
>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>> +        wake_up(&sa_manager->wq);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +locked:
>>>>> +    drm_mm_remove_node(&sa->node);
>>>>> +
>>>>> +    fence = sa->fence;
>>>>> +    sa->fence = NULL;
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>>> +    wake_up(&sa_manager->wq);
>>>>> +    dma_fence_put(fence);
>>>>> +    kfree(sa);
>>>>> +}
>>>>> +
>>>>> +/* Free all deferred idle allocations */
>>>>> +static void drm_suballoc_process_idle(struct drm_suballoc_manager 
>>>>> *sa_manager)
>>>>> +{
>>>>> +    /*
>>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>>> +     * addition that was done before wake_up() is visible when
>>>>> +     * this code is called from the wait loop.
>>>>> +     */
>>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>>> +        struct drm_suballoc *sa, *next;
>>>>> +        unsigned long irqflags;
>>>>> +        LIST_HEAD(list);
>>>>> +
>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>> +
>>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>>> +            __drm_suballoc_free(sa);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>>> dma_fence_cb *cb)
>>>>> +{
>>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>>> +
>>>>> +    __drm_suballoc_free(sa);
>>>>> +}
>>>>> +
>>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>>> +{
>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>> +    int err;
>>>>> +
>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>>> size,
>>>>> +                     sa_manager->alignment, 0,
>>>>> +                     DRM_MM_INSERT_EVICT);
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_new() - Make a suballocation.
>>>>> + * @sa_manager: pointer to the sa_manager
>>>>> + * @size: number of bytes we want to suballocate.
>>>>> + * @gfp: Allocation context.
>>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>>> + *
>>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>>> + * up to the alignment specified in 
>>>>> drm_suballoc_manager_init().
>>>>> + *
>>>>> + * Returns a new suballocation, or an ERR_PTR.
>>>>> + */
>>>>> +struct drm_suballoc*
>>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>>> +         gfp_t gfp, bool intr)
>>>>> +{
>>>>> +    struct drm_suballoc *sa;
>>>>> +    DEFINE_WAIT(wait);
>>>>> +    int err = 0;
>>>>> +
>>>>> +    if (size > sa_manager->range_size)
>>>>> +        return ERR_PTR(-ENOSPC);
>>>>> +
>>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>>> +    if (!sa)
>>>>> +        return ERR_PTR(-ENOMEM);
>>>>> +
>>>>> +    /* Avoid starvation using the alloc_mutex */
>>>>> +    if (intr)
>>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>>> +    else
>>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>>> +    if (err) {
>>>>> +        kfree(sa);
>>>>> +        return ERR_PTR(err);
>>>>> +    }
>>>>> +
>>>>> +    sa->manager = sa_manager;
>>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>>> +    if (err != -ENOSPC)
>>>>> +        goto out;
>>>>> +
>>>>> +    for (;;) {
>>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>>> +                TASK_UNINTERRUPTIBLE);
>>>>> +
>>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>>> +        if (err != -ENOSPC)
>>>>> +            break;
>>>>> +
>>>>> +        if (intr && signal_pending(current)) {
>>>>> +            err = -ERESTARTSYS;
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>> +        io_schedule();
>>>>> +    }
>>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>>> +
>>>>> +out:
>>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>>> +    if (!sa->node.size) {
>>>>> +        kfree(sa);
>>>>> +        WARN_ON(!err);
>>>>> +        sa = ERR_PTR(err);
>>>>> +    }
>>>>> +
>>>>> +    return sa;
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_free() - Free a suballocation
>>>>> + * @sa: pointer to the suballocation
>>>>> + * @fence: fence that signals when suballocation is idle
>>>>> + *
>>>>> + * Free the suballocation. The suballocation can be re-used after 
>>>>> @fence
>>>>> + * signals.
>>>>> + */
>>>>> +void
>>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>>> +{
>>>>> +    if (!sa)
>>>>> +        return;
>>>>> +
>>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>>> +        __drm_suballoc_free(sa);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    sa->fence = dma_fence_get(fence);
>>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>>> drm_suballoc_fence_signaled))
>>>>> +        __drm_suballoc_free(sa);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>>> +
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>>> + * @sa_manager: The suballoc manager.
>>>>> + * @p: Pointer to a drm printer for output.
>>>>> + * @suballoc_base: Constant to add to the suballocated offsets on 
>>>>> printout.
>>>>> + *
>>>>> + * This function dumps the suballocator state. Note that the 
>>>>> caller has
>>>>> + * to explicitly order frees and calls to this function in order 
>>>>> for the
>>>>> + * freed node to show up as protected by a fence.
>>>>> + */
>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>>> +{
>>>>> +    const struct drm_mm_node *entry;
>>>>> +
>>>>> +    spin_lock(&sa_manager->lock);
>>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>>> +        struct drm_suballoc *sa =
>>>>> +            container_of(entry, typeof(*sa), node);
>>>>> +
>>>>> +        drm_printf(p, " ");
>>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>>> +               entry->size, (unsigned long long)entry->size);
>>>>> +
>>>>> +        if (sa->fence)
>>>>> +            drm_printf(p, " protected by 0x%016llx on context %llu",
>>>>> +                   (unsigned long long)sa->fence->seqno,
>>>>> +                   (unsigned long long)sa->fence->context);
>>>>> +
>>>>> +        drm_printf(p, "\n");
>>>>> +    }
>>>>> +    spin_unlock(&sa_manager->lock);
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>>> +#endif
>>>>> +
>>>>> +MODULE_AUTHOR("Intel Corporation");
>>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>>> +MODULE_LICENSE("GPL and additional rights");
>>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>>> new file mode 100644
>>>>> index 000000000000..910952b3383b
>>>>> --- /dev/null
>>>>> +++ b/include/drm/drm_suballoc.h
>>>>> @@ -0,0 +1,112 @@
>>>>> +/* SPDX-License-Identifier: MIT */
>>>>> +/*
>>>>> + * Copyright © 2022 Intel Corporation
>>>>> + */
>>>>> +#ifndef _DRM_SUBALLOC_H_
>>>>> +#define _DRM_SUBALLOC_H_
>>>>> +
>>>>> +#include <drm/drm_mm.h>
>>>>> +
>>>>> +#include <linux/dma-fence.h>
>>>>> +#include <linux/types.h>
>>>>> +
>>>>> +/**
>>>>> + * struct drm_suballoc_manager - Wrapper for fenced range 
>>>>> allocations
>>>>> + * @mm: The range manager. Protected by @lock.
>>>>> + * @range_size: The total size of the range.
>>>>> + * @alignment: Range alignment.
>>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>>> Protected by
>>>>> + * @idle_list_lock.
>>>>> + */
>>>>> +struct drm_suballoc_manager {
>>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>>> +    spinlock_t lock;
>>>>> +    /**
>>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>>> +     * Disable irqs when locking.
>>>>> +     */
>>>>> +    spinlock_t idle_list_lock;
>>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>>> +    struct mutex alloc_mutex;
>>>>> +    struct drm_mm mm;
>>>>> +    u64 range_size;
>>>>> +    u64 alignment;
>>>>> +    wait_queue_head_t wq;
>>>>> +    struct list_head idle_list;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * struct drm_suballoc: Suballocated range.
>>>>> + * @node: The drm_mm representation of the range.
>>>>> + * @fence: dma-fence indicating whether allocation is active or 
>>>>> idle.
>>>>> + * Assigned on call to free the allocation so doesn't need 
>>>>> protection.
>>>>> + * @cb: dma-fence callback structure. Used for callbacks when the 
>>>>> fence signals.
>>>>> + * @manager: The struct drm_suballoc_manager the range belongs 
>>>>> to. Immutable.
>>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>>> + * drm_suballoc_manager::idle_list_lock.
>>>>> + */
>>>>> +struct drm_suballoc {
>>>>> +    struct drm_mm_node node;
>>>>> +    struct dma_fence *fence;
>>>>> +    struct dma_fence_cb cb;
>>>>> +    struct drm_suballoc_manager *manager;
>>>>> +    struct list_head idle_link;
>>>>> +};
>>>>> +
>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                   u64 size, u64 align);
>>>>> +
>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>> *sa_manager);
>>>>> +
>>>>> +struct drm_suballoc *drm_suballoc_new(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>>> +
>>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>>> *fence);
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_soffset - Range start.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The start of the allocated range.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.start;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_eoffset - Range end.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The end of the allocated range + 1.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.start + sa->node.size;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_suballoc_size - Range size.
>>>>> + * @sa: The struct drm_suballoc.
>>>>> + *
>>>>> + * Return: The size of the allocated range.
>>>>> + */
>>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>>> +{
>>>>> +    return sa->node.size;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>>> +#else
>>>>> +static inline void
>>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>> *sa_manager,
>>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>>> +{ }
>>>>> +
>>>>> +#endif
>>>>> +
>>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
  (?)
  (?)
@ 2023-02-17 12:32   ` kernel test robot
  -1 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-02-17 12:32 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: oe-kbuild-all

Hi Thomas,

I love your patch! Yet something to improve:

[auto build test ERROR on drm-misc/drm-misc-next]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.2-rc8 next-20230217]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Thomas-Hellstr-m/drm-suballoc-Introduce-a-generic-suballocation-manager/20230216-225152
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20230216144847.216259-4-thomas.hellstrom%40linux.intel.com
patch subject: [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20230217/202302172029.mDYHJIqD-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/ccbca3b1e02d931c5540d8f3dbb2985fd4663075
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Thomas-Hellstr-m/drm-suballoc-Introduce-a-generic-suballocation-manager/20230216-225152
        git checkout ccbca3b1e02d931c5540d8f3dbb2985fd4663075
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302172029.mDYHJIqD-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

ERROR: modpost: "drm_suballoc_new" [drivers/gpu/drm/radeon/radeon.ko] undefined!
>> ERROR: modpost: "drm_suballoc_dump_debug_info" [drivers/gpu/drm/radeon/radeon.ko] undefined!
ERROR: modpost: "drm_suballoc_free" [drivers/gpu/drm/radeon/radeon.ko] undefined!
ERROR: modpost: "drm_suballoc_manager_init" [drivers/gpu/drm/radeon/radeon.ko] undefined!
ERROR: modpost: "drm_suballoc_manager_fini" [drivers/gpu/drm/radeon/radeon.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 12:28             ` [Intel-xe] " Christian König
@ 2023-02-17 13:10               ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 13:10 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie


On 2/17/23 13:28, Christian König wrote:
> On 2/17/23 13:24, Thomas Hellström wrote:
>>
>> On 2/17/23 12:28, Christian König wrote:
>>> On 2/17/23 12:21, Thomas Hellström wrote:
>>>>
>>>> On 2/17/23 12:00, Christian König wrote:
>>>>> On 2/16/23 15:48, Thomas Hellström wrote:
>>>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>>>> It turns out, however, that it tries extremely hard not to enable
>>>>>> signalling on the fences that hold the memory up for freeing, 
>>>>>> which makes
>>>>>> it hard to understand and to fix potential issues with it.
>>>>>>
>>>>>> So in a simplification effort, introduce a drm suballocation 
>>>>>> manager as a
>>>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>>>> queues
>>>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>>>> feature as typically the throttling needs to be done 
>>>>>> uninterruptibly.
>>>>>>
>>>>>> This variant is probably more cpu-hungry but can be improved at 
>>>>>> the cost
>>>>>> of additional complexity. Ideas for that are documented in the
>>>>>> drm_suballoc.c file.
>>>>>>
>>>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>> Co-developed-by: Maarten Lankhorst 
>>>>>> <maarten.lankhorst@linux.intel.com>
>>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>>>> +++++++++++++++++++++++++++++++++
>>>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>>>   4 files changed, 420 insertions(+)
>>>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>>>> --- a/drivers/gpu/drm/Kconfig
>>>>>> +++ b/drivers/gpu/drm/Kconfig
>>>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>>>       help
>>>>>>         Choose this if you need the GEM shmem helper functions
>>>>>>   +config DRM_SUBALLOC_HELPER
>>>>>> +    tristate
>>>>>> +    depends on DRM
>>>>>> +
>>>>>>   config DRM_SCHED
>>>>>>       tristate
>>>>>>       depends on DRM
>>>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>>>> index ab4460fcd63f..1e04d135e866 100644
>>>>>> --- a/drivers/gpu/drm/Makefile
>>>>>> +++ b/drivers/gpu/drm/Makefile
>>>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += 
>>>>>> drm_dma_helper.o
>>>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>>>> +
>>>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..6e0292dea548
>>>>>> --- /dev/null
>>>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>>>> @@ -0,0 +1,301 @@
>>>>>> +// SPDX-License-Identifier: MIT
>>>>>> +/*
>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>> + */
>>>>>> +
>>>>>> +#include <drm/drm_suballoc.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * DOC:
>>>>>> + * This suballocator intends to be a wrapper around a range 
>>>>>> allocator
>>>>>> + * that is aware also of deferred range freeing with fences. 
>>>>>> Currently
>>>>>> + * we hard-code the drm_mm as the range allocator.
>>>>>> + * The approach, while rather simple, suffers from a number of 
>>>>>> performance
>>>>>> + * issues that can all be fixed if needed at the tradeoff of 
>>>>>> more and / or
>>>>>> + * more complex code:
>>>>>> + *
>>>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>>>> code a
>>>>>> + * much simpler range allocator, or let the caller decide by 
>>>>>> providing
>>>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>>>> unless
>>>>>> + * there is a reasonable chance of enough space in the range 
>>>>>> manager.
>>>>>
>>>>> That's most likely highly problematic.
>>>>>
>>>>> The suballocator in radeon/amdgpu was designed so that it 
>>>>> resembles a ring buffer and is therefore rather CPU efficient.
>>>>>
>>>>> We could make the allocator much more trivial, but using drm_mm 
>>>>> for this is a sledgehammer and therefore a pretty clear no-go.
>>>>>
>>>> I don't think the ring vs non-ring is the big problem here, because 
>>>> (at least with the original implementation), if allocations are 
>>>> actually made and released in a ring-like fashion, the drm_mm 
>>>> free-list would consist of one or two blocks and therefore pretty 
>>>> efficient even for that case, and if slightly longer that would 
>>>> still not be an issue compared to the fence lists maintained in the 
>>>> older allocator.
>>>>
>>>> The problem is more all the other stuff that was added and built on 
>>>> top like the interval / rb tree.
>>>>
>>>> I still like the idea (originating from Gallium's helpers) to 
>>>> separate whatever is allocating from the fence delayed free.
>>>
>>> That's actually a bad idea. See the ring like approach works because 
>>> the fences used in amdgpu/radeon are used in a ring like fashion. 
>>> E.g. the sub allocator mainly provides the temporary space for page 
>>> table updates. Those in turn are then used by commands written into 
>>> a ring buffer.
>>
>> Well, what I'm saying is that *even* if you have a ring-like 
>> allocation algorithm, given a simpler drm_mm, I think the suggested 
>> code would be performing just as well as the one in amdgpu / radeon, 
>> on top of avoiding throttling on free, or do you have a particular 
>> scenario in mind that you think would be particularly pathological on 
>> this allocator?
>
> What do you mean with avoiding throttling on free?

Hmm, my bad. That was with a temporary version that was tried for Xe.

>
>>
>>>
>>>>
>>>> Any chance you could do a quick performance comparison? If not, 
>>>> anything against merging this without the amd / radeon changes 
>>>> until we can land a simpler allocator?
>>>
>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>> this seems to be for a different use case than the allocators inside 
>>> radeon/amdgpu.
>>
>> Hmm. No, it's allocating in a ring-like fashion as well.  Let me put 
>> together a unit test for benchmarking. I think it would be a failure 
>> for the community to end up with three separate suballocators doing 
>> the exact same thing for the same problem, really.
>
> Well exactly that's the point. Those allocators aren't the same 
> because they handle different problems.
>
> The allocator in radeon is simpler because it only had to deal with a 
> limited number of fence timelines. The one in amdgpu is a bit more 
> complex because of the added handling for more fence timelines.
>
> We could take the one from amdgpu and use it for radeon and others as 
> well, but the allocator proposed here doesn't even remotely match 
> the requirements.

But again, what *are* those missing requirements exactly? What is the 
pathological case you see for the current code?

 From what I can tell the amdgpu suballocator introduces excessive 
complexity to coalesce waits for fences from the same contexts, whereas 
the present code just frees from the fence callback if the fence wasn't 
already signaled. The fence signalling code that fires that callback is 
typically always run anyway on scheduler fences.

The reason we had for not using the amdgpu suballocator as originally 
planned was that this complexity made it very hard for us to understand 
it and to fix issues we had with it.
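
For reference, this is roughly how a driver is expected to drive the helper,
going by the declarations in drm_suballoc.h. Purely an illustrative sketch:
"my_ring" and the sizes are made up, and error handling and the actual
submission are left out.

#include <linux/err.h>
#include <linux/gfp.h>
#include <drm/drm_suballoc.h>

struct my_ring {
	struct drm_suballoc_manager sa_manager;
};

static void my_ring_sa_init(struct my_ring *ring)
{
	/* 256 KiB suballocation range, 256-byte alignment (hypothetical). */
	drm_suballoc_manager_init(&ring->sa_manager, 256 * 1024, 256);
}

static int my_ring_emit(struct my_ring *ring, u64 bytes,
			struct dma_fence *done_fence)
{
	struct drm_suballoc *sa;

	/* May sleep (interruptibly here) until enough of the range is idle. */
	sa = drm_suballoc_new(&ring->sa_manager, bytes, GFP_KERNEL, true);
	if (IS_ERR(sa))
		return PTR_ERR(sa);

	/* ... write the payload at drm_suballoc_soffset(sa) and submit ... */

	/* The range only becomes reusable once done_fence signals. */
	drm_suballoc_free(sa, done_fence);
	return 0;
}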

Regards,

Thomas

>
> Regards,
> Christian.
>
>>
>> /Thomas
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> + *
>>>>>> + * 2) We unnecessarily install the fence callbacks too early, 
>>>>>> forcing
>>>>>> + * enable_signaling() too early causing extra driver effort. 
>>>>>> This is likely
>>>>>> + * not an issue if used with the drm_scheduler since it calls
>>>>>> + * enable_signaling() early anyway.
>>>>>> + *
>>>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>>>> worked around
>>>>>> + * that already by using the idle_list. If that workaround is 
>>>>>> deemed too
>>>>>> + * complex for little gain, we can remove it and use 
>>>>>> spin_lock_irq()
>>>>>> + * throughout the manager. If we want to shorten processing in 
>>>>>> irq context
>>>>>> + * even further, we can skip the spin_trylock in 
>>>>>> __drm_suballoc_free() and
>>>>>> + * avoid freeing allocations from irq context altogether. However 
>>>>>> drm_mm
>>>>>> + * should be quite fast at freeing ranges.
>>>>>> + *
>>>>>> + * 4) Shrinker that starts processing the list items in 2) and 
>>>>>> 3) to play
>>>>>> + * better with the system.
>>>>>> + */
>>>>>> +
>>>>>> +static void drm_suballoc_process_idle(struct 
>>>>>> drm_suballoc_manager *sa_manager);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_manager_init() - Initialise the 
>>>>>> drm_suballoc_manager
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + * @size: number of bytes we want to suballocate
>>>>>> + * @align: alignment for each suballocated chunk
>>>>>> + *
>>>>>> + * Prepares the suballocation manager for suballocations.
>>>>>> + */
>>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                   u64 size, u64 align)
>>>>>> +{
>>>>>> +    spin_lock_init(&sa_manager->lock);
>>>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>>>> +    sa_manager->range_size = size;
>>>>>> +    sa_manager->alignment = align;
>>>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + *
>>>>>> + * Cleans up the suballocation manager after use. All fences added
>>>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>>>> + * the entire manager.
>>>>>> + */
>>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>>> *sa_manager)
>>>>>> +{
>>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>>>> +
>>>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>>> +    struct dma_fence *fence;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * In order to avoid protecting the potentially lengthy 
>>>>>> drm_mm manager
>>>>>> +     * *allocation* processing with an irq-disabling lock,
>>>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>>>> context,
>>>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>>>> manager
>>>>>> +     * lock.
>>>>>> +     */
>>>>>> +    if (!in_task() || irqs_disabled()) {
>>>>>> +        unsigned long irqflags;
>>>>>> +
>>>>>> +        if (spin_trylock(&sa_manager->lock))
>>>>>> +            goto locked;
>>>>>> +
>>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        wake_up(&sa_manager->wq);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +locked:
>>>>>> +    drm_mm_remove_node(&sa->node);
>>>>>> +
>>>>>> +    fence = sa->fence;
>>>>>> +    sa->fence = NULL;
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>>>> +    wake_up(&sa_manager->wq);
>>>>>> +    dma_fence_put(fence);
>>>>>> +    kfree(sa);
>>>>>> +}
>>>>>> +
>>>>>> +/* Free all deferred idle allocations */
>>>>>> +static void drm_suballoc_process_idle(struct 
>>>>>> drm_suballoc_manager *sa_manager)
>>>>>> +{
>>>>>> +    /*
>>>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>>>> +     * addition that was done before wake_up() is visible when
>>>>>> +     * this code is called from the wait loop.
>>>>>> +     */
>>>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>>>> +        struct drm_suballoc *sa, *next;
>>>>>> +        unsigned long irqflags;
>>>>>> +        LIST_HEAD(list);
>>>>>> +
>>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>>> +
>>>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>>>> +            __drm_suballoc_free(sa);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void
>>>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>>>> dma_fence_cb *cb)
>>>>>> +{
>>>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>>>> +
>>>>>> +    __drm_suballoc_free(sa);
>>>>>> +}
>>>>>> +
>>>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>>>> +{
>>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>>> +    int err;
>>>>>> +
>>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>>>> size,
>>>>>> +                     sa_manager->alignment, 0,
>>>>>> +                     DRM_MM_INSERT_EVICT);
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +    return err;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_new() - Make a suballocation.
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + * @size: number of bytes we want to suballocate.
>>>>>> + * @gfp: Allocation context.
>>>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>>>> + *
>>>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>>>> + * up to the alignment specified in 
>>>>>> drm_suballoc_manager_init().
>>>>>> + *
>>>>>> + * Returns a new suballocation, or an ERR_PTR.
>>>>>> + */
>>>>>> +struct drm_suballoc*
>>>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>>>> +         gfp_t gfp, bool intr)
>>>>>> +{
>>>>>> +    struct drm_suballoc *sa;
>>>>>> +    DEFINE_WAIT(wait);
>>>>>> +    int err = 0;
>>>>>> +
>>>>>> +    if (size > sa_manager->range_size)
>>>>>> +        return ERR_PTR(-ENOSPC);
>>>>>> +
>>>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>>>> +    if (!sa)
>>>>>> +        return ERR_PTR(-ENOMEM);
>>>>>> +
>>>>>> +    /* Avoid starvation using the alloc_mutex */
>>>>>> +    if (intr)
>>>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>>>> +    else
>>>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>>>> +    if (err) {
>>>>>> +        kfree(sa);
>>>>>> +        return ERR_PTR(err);
>>>>>> +    }
>>>>>> +
>>>>>> +    sa->manager = sa_manager;
>>>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>>>> +    if (err != -ENOSPC)
>>>>>> +        goto out;
>>>>>> +
>>>>>> +    for (;;) {
>>>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>>>> +                TASK_UNINTERRUPTIBLE);
>>>>>> +
>>>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>>>> +        if (err != -ENOSPC)
>>>>>> +            break;
>>>>>> +
>>>>>> +        if (intr && signal_pending(current)) {
>>>>>> +            err = -ERESTARTSYS;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +
>>>>>> +        io_schedule();
>>>>>> +    }
>>>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>>>> +
>>>>>> +out:
>>>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>>>> +    if (!sa->node.size) {
>>>>>> +        kfree(sa);
>>>>>> +        WARN_ON(!err);
>>>>>> +        sa = ERR_PTR(err);
>>>>>> +    }
>>>>>> +
>>>>>> +    return sa;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_free() - Free a suballocation
>>>>>> + * @sa: pointer to the suballocation
>>>>>> + * @fence: fence that signals when suballocation is idle
>>>>>> + *
>>>>>> + * Free the suballocation. The suballocation can be re-used 
>>>>>> after @fence
>>>>>> + * signals.
>>>>>> + */
>>>>>> +void
>>>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>>>> +{
>>>>>> +    if (!sa)
>>>>>> +        return;
>>>>>> +
>>>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>>>> +        __drm_suballoc_free(sa);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    sa->fence = dma_fence_get(fence);
>>>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>>>> drm_suballoc_fence_signaled))
>>>>>> +        __drm_suballoc_free(sa);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>>>> +
>>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>>>> + * @sa_manager: The suballoc manager.
>>>>>> + * @p: Pointer to a drm printer for output.
>>>>>> + * @suballoc_base: Constant to add to the suballocated offsets 
>>>>>> on printout.
>>>>>> + *
>>>>>> + * This function dumps the suballocator state. Note that the 
>>>>>> caller has
>>>>>> + * to explicitly order frees and calls to this function in order 
>>>>>> for the
>>>>>> + * freed node to show up as protected by a fence.
>>>>>> + */
>>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>>>> +{
>>>>>> +    const struct drm_mm_node *entry;
>>>>>> +
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>>>> +        struct drm_suballoc *sa =
>>>>>> +            container_of(entry, typeof(*sa), node);
>>>>>> +
>>>>>> +        drm_printf(p, " ");
>>>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>>>> +               entry->size, (unsigned long long)entry->size);
>>>>>> +
>>>>>> +        if (sa->fence)
>>>>>> +            drm_printf(p, " protected by 0x%016llx on context 
>>>>>> %llu",
>>>>>> +                   (unsigned long long)sa->fence->seqno,
>>>>>> +                   (unsigned long long)sa->fence->context);
>>>>>> +
>>>>>> +        drm_printf(p, "\n");
>>>>>> +    }
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>>>> +#endif
>>>>>> +
>>>>>> +MODULE_AUTHOR("Intel Corporation");
>>>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>>>> +MODULE_LICENSE("GPL and additional rights");
>>>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..910952b3383b
>>>>>> --- /dev/null
>>>>>> +++ b/include/drm/drm_suballoc.h
>>>>>> @@ -0,0 +1,112 @@
>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>> +/*
>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>> + */
>>>>>> +#ifndef _DRM_SUBALLOC_H_
>>>>>> +#define _DRM_SUBALLOC_H_
>>>>>> +
>>>>>> +#include <drm/drm_mm.h>
>>>>>> +
>>>>>> +#include <linux/dma-fence.h>
>>>>>> +#include <linux/types.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_suballoc_manager - Wrapper for fenced range 
>>>>>> allocations
>>>>>> + * @mm: The range manager. Protected by @lock.
>>>>>> + * @range_size: The total size of the range.
>>>>>> + * @alignment: Range alignment.
>>>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>>>> Protected by
>>>>>> + * @idle_list_lock.
>>>>>> + */
>>>>>> +struct drm_suballoc_manager {
>>>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>>>> +    spinlock_t lock;
>>>>>> +    /**
>>>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>>>> +     * Disable irqs when locking.
>>>>>> +     */
>>>>>> +    spinlock_t idle_list_lock;
>>>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>>>> +    struct mutex alloc_mutex;
>>>>>> +    struct drm_mm mm;
>>>>>> +    u64 range_size;
>>>>>> +    u64 alignment;
>>>>>> +    wait_queue_head_t wq;
>>>>>> +    struct list_head idle_list;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_suballoc: Suballocated range.
>>>>>> + * @node: The drm_mm representation of the range.
>>>>>> + * @fence: dma-fence indicating whether allocation is active or 
>>>>>> idle.
>>>>>> + * Assigned on call to free the allocation so doesn't need 
>>>>>> protection.
>>>>>> + * @cb: dma-fence callback structure. Used for callbacks when 
>>>>>> the fence signals.
>>>>>> + * @manager: The struct drm_suballoc_manager the range belongs 
>>>>>> to. Immutable.
>>>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>>>> + * drm_suballoc_manager::idle_list_lock.
>>>>>> + */
>>>>>> +struct drm_suballoc {
>>>>>> +    struct drm_mm_node node;
>>>>>> +    struct dma_fence *fence;
>>>>>> +    struct dma_fence_cb cb;
>>>>>> +    struct drm_suballoc_manager *manager;
>>>>>> +    struct list_head idle_link;
>>>>>> +};
>>>>>> +
>>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                   u64 size, u64 align);
>>>>>> +
>>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>>> *sa_manager);
>>>>>> +
>>>>>> +struct drm_suballoc *drm_suballoc_new(struct 
>>>>>> drm_suballoc_manager *sa_manager,
>>>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>>>> +
>>>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>>>> *fence);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_soffset - Range start.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The start of the allocated range.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.start;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_eoffset - Range end.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The end of the allocated range + 1.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.start + sa->node.size;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_size - Range size.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The size of the allocated range.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.size;
>>>>>> +}
>>>>>> +
>>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>>>> +#else
>>>>>> +static inline void
>>>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>>>> +{ }
>>>>>> +
>>>>>> +#endif
>>>>>> +
>>>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-17 13:10               ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 13:10 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie


On 2/17/23 13:28, Christian König wrote:
> On 2/17/23 13:24, Thomas Hellström wrote:
>>
>> On 2/17/23 12:28, Christian König wrote:
>>> On 2/17/23 12:21, Thomas Hellström wrote:
>>>>
>>>> On 2/17/23 12:00, Christian König wrote:
>>>>> On 2/16/23 15:48, Thomas Hellström wrote:
>>>>>> Initially we tried to leverage the amdgpu suballocation manager.
>>>>>> It turns out, however, that it tries extremely hard not to enable
>>>>>> signalling on the fences that hold the memory up for freeing, 
>>>>>> which makes
>>>>>> it hard to understand and to fix potential issues with it.
>>>>>>
>>>>>> So in a simplification effort, introduce a drm suballocation 
>>>>>> manager as a
>>>>>> wrapper around an existing allocator (drm_mm) and to avoid using 
>>>>>> queues
>>>>>> for freeing, thus avoiding throttling on free which is an undesired
>>>>>> feature as typically the throttling needs to be done 
>>>>>> uninterruptibly.
>>>>>>
>>>>>> This variant is probably more cpu-hungry but can be improved at 
>>>>>> the cost
>>>>>> of additional complexity. Ideas for that are documented in the
>>>>>> drm_suballoc.c file.
>>>>>>
>>>>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>>>>> Co-developed-by: Maarten Lankhorst 
>>>>>> <maarten.lankhorst@linux.intel.com>
>>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/Kconfig        |   4 +
>>>>>>   drivers/gpu/drm/Makefile       |   3 +
>>>>>>   drivers/gpu/drm/drm_suballoc.c | 301 
>>>>>> +++++++++++++++++++++++++++++++++
>>>>>>   include/drm/drm_suballoc.h     | 112 ++++++++++++
>>>>>>   4 files changed, 420 insertions(+)
>>>>>>   create mode 100644 drivers/gpu/drm/drm_suballoc.c
>>>>>>   create mode 100644 include/drm/drm_suballoc.h
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
>>>>>> index dc0f94f02a82..8fbe57407c60 100644
>>>>>> --- a/drivers/gpu/drm/Kconfig
>>>>>> +++ b/drivers/gpu/drm/Kconfig
>>>>>> @@ -232,6 +232,10 @@ config DRM_GEM_SHMEM_HELPER
>>>>>>       help
>>>>>>         Choose this if you need the GEM shmem helper functions
>>>>>>   +config DRM_SUBALLOC_HELPER
>>>>>> +    tristate
>>>>>> +    depends on DRM
>>>>>> +
>>>>>>   config DRM_SCHED
>>>>>>       tristate
>>>>>>       depends on DRM
>>>>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>>>>> index ab4460fcd63f..1e04d135e866 100644
>>>>>> --- a/drivers/gpu/drm/Makefile
>>>>>> +++ b/drivers/gpu/drm/Makefile
>>>>>> @@ -88,6 +88,9 @@ obj-$(CONFIG_DRM_GEM_DMA_HELPER) += 
>>>>>> drm_dma_helper.o
>>>>>>   drm_shmem_helper-y := drm_gem_shmem_helper.o
>>>>>>   obj-$(CONFIG_DRM_GEM_SHMEM_HELPER) += drm_shmem_helper.o
>>>>>>   +drm_suballoc_helper-y := drm_suballoc.o
>>>>>> +obj-$(CONFIG_DRM_SUBALLOC_HELPER) += drm_suballoc_helper.o
>>>>>> +
>>>>>>   drm_vram_helper-y := drm_gem_vram_helper.o
>>>>>>   obj-$(CONFIG_DRM_VRAM_HELPER) += drm_vram_helper.o
>>>>>>   diff --git a/drivers/gpu/drm/drm_suballoc.c 
>>>>>> b/drivers/gpu/drm/drm_suballoc.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..6e0292dea548
>>>>>> --- /dev/null
>>>>>> +++ b/drivers/gpu/drm/drm_suballoc.c
>>>>>> @@ -0,0 +1,301 @@
>>>>>> +// SPDX-License-Identifier: MIT
>>>>>> +/*
>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>> + */
>>>>>> +
>>>>>> +#include <drm/drm_suballoc.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * DOC:
>>>>>> + * This suballocator intends to be a wrapper around a range 
>>>>>> allocator
>>>>>> + * that is aware also of deferred range freeing with fences. 
>>>>>> Currently
>>>>>> + * we hard-code the drm_mm as the range allocator.
>>>>>> + * The approach, while rather simple, suffers from three 
>>>>>> performance
>>>>>> + * issues that can all be fixed if needed at the tradeoff of 
>>>>>> more and / or
>>>>>> + * more complex code:
>>>>>> + *
>>>>>> + * 1) It's cpu-hungry, the drm_mm allocator is overkill. Either 
>>>>>> code a
>>>>>> + * much simpler range allocator, or let the caller decide by 
>>>>>> providing
>>>>>> + * ops that wrap any range allocator. Also could avoid waking up 
>>>>>> unless
>>>>>> + * there is a reasonable chance of enough space in the range 
>>>>>> manager.
>>>>>
>>>>> That's most likely highly problematic.
>>>>>
>>>>> The suballocator in radeon/amdgpu was designed so that it 
>>>>> resembles a ring buffer and is therefore rather CPU efficient.
>>>>>
>>>>> We could make the allocator much more trivial, but using drm_mm 
>>>>> for this is a sledgehammer and therefore a pretty clear no-go.
>>>>>
>>>> I don't think the ring vs non-ring is the big problem here, because 
>>>> (at least with the original implementation), if allocations are 
>>>> actually made and released in a ring-like fashion, the drm_mm 
>>>> free-list would consist of one or two blocks and therefore pretty 
>>>> efficient even for that case, and if slightly longer that would 
>>>> still not be an issue compared to the fence lists maintained in the 
>>>> older allocator.
>>>>
>>>> The problem is more all the other stuff that was added and built on 
>>>> top like the interval / rb tree.
>>>>
>>>> I still like the idea (originating from Gallium's helpers) to 
>>>> separate whatever is allocating from the fence delayed free.
>>>
>>> That's actually a bad idea. See the ring like approach works because 
>>> the fences used in amdgpu/radeon are used in a ring like fashion. 
>>> E.g. the sub allocator mainly provides the temporary space for page 
>>> table updates. Those in turn are then used by commands written into 
>>> a ring buffer.
>>
>> Well, what I'm saying is that *even* if you have a ring-like 
>> allocation algorithm, given a simpler drm_mm, I think the suggested 
>> code would be performing just as well as the one in amdgpu / radeon, 
>> on top of avoiding throttling on free, or do you have a particular 
>> scenario in mind that you think would be particularly pathological on 
>> this allocator?
>
> What do you mean with avoiding throttling on free?

Hmm, my bad. That was with a temporary version that was tried for Xe.

>
>>
>>>
>>>>
>>>> Any chance you could do a quick performance comparison? If not, 
>>>> anything against merging this without the amd / radeon changes 
>>>> until we can land a simpler allocator?
>>>
>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>> this seems to be for a different use case than the allocators inside 
>>> radeon/amdgpu.
>>
>> Hmm. No, it's allocating in a ring-like fashion as well.  Let me put 
>> together a unit test for benchmarking. I think it would be a failure 
>> for the community to end up with three separate suballocators doing 
>> the exact same thing for the same problem, really.
>
> Well exactly that's the point. Those allocators aren't the same 
> because they handle different problems.
>
> The allocator in radeon is simpler because it only had to deal with a 
> limited number of fence timelines. The one in amdgpu is a bit more 
> complex because of the added complexity for more fence timelines.
>
> We could take the one from amdgpu and use it for radeon and others as 
> well, but the allocator proposed here doesn't even remotely matches 
> the requirements.

But again, what *are* those missing requirements exactly? What is the 
pathological case you see for the current code?

From what I can tell the amdgpu suballocator introduces excessive 
complexity to coalesce waits for fences from the same contexts, whereas 
the present code just frees from the fence callback if the fence wasn't 
already signaled. The fence signalling code that fires that callback is 
typically always run anyway on scheduler fences.

The reason we had for not using the amdgpu suballocator as originally 
planned was that this complexity made it very hard for us to understand 
it and to fix issues we had with it.
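
For reference, the intended use of the helper in this patch is roughly the
following minimal sketch (the emit function, the fence source and the 4K
size are made up for illustration only, not taken from any driver):

#include <linux/sizes.h>
#include <drm/drm_suballoc.h>

static int example_emit(struct drm_suballoc_manager *sa_manager,
			struct dma_fence *job_fence)
{
	struct drm_suballoc *sa;

	/* Carve 4K out of the managed range, sleeping interruptibly on contention. */
	sa = drm_suballoc_new(sa_manager, SZ_4K, GFP_KERNEL, true);
	if (IS_ERR(sa))
		return PTR_ERR(sa);

	/* Emit commands into drm_suballoc_soffset(sa) .. drm_suballoc_eoffset(sa). */

	/*
	 * Hand the range back: it is reused once @job_fence signals, and it is
	 * freed directly from the fence callback if the fence wasn't already
	 * signaled at this point.
	 */
	drm_suballoc_free(sa, job_fence);
	return 0;
}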

Regards,

Thomas

>
> Regards,
> Christian.
>
>>
>> /Thomas
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> + *
>>>>>> + * 2) We unnecessarily install the fence callbacks too early, 
>>>>>> forcing
>>>>>> + * enable_signaling() too early causing extra driver effort. 
>>>>>> This is likely
>>>>>> + * not an issue if used with the drm_scheduler since it calls
>>>>>> + * enable_signaling() early anyway.
>>>>>> + *
>>>>>> + * 3) Long processing in irq (disabled) context. We've mostly 
>>>>>> worked around
>>>>>> + * that already by using the idle_list. If that workaround is 
>>>>>> deemed to
>>>>>> + * complex for little gain, we can remove it and use 
>>>>>> spin_lock_irq()
>>>>>> + * throughout the manager. If we want to shorten processing in 
>>>>>> irq context
>>>>>> + * even further, we can skip the spin_trylock in 
>>>>>> __drm_suballoc_free() and
>>>>>> + * avoid freeing allocations from irq context altogether. However 
>>>>>> drm_mm
>>>>>> + * should be quite fast at freeing ranges.
>>>>>> + *
>>>>>> + * 4) Shrinker that starts processing the list items in 2) and 
>>>>>> 3) to play
>>>>>> + * better with the system.
>>>>>> + */
>>>>>> +
>>>>>> +static void drm_suballoc_process_idle(struct 
>>>>>> drm_suballoc_manager *sa_manager);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_manager_init() - Initialise the 
>>>>>> drm_suballoc_manager
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + * @size: number of bytes we want to suballocate
>>>>>> + * @align: alignment for each suballocated chunk
>>>>>> + *
>>>>>> + * Prepares the suballocation manager for suballocations.
>>>>>> + */
>>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                   u64 size, u64 align)
>>>>>> +{
>>>>>> +    spin_lock_init(&sa_manager->lock);
>>>>>> +    spin_lock_init(&sa_manager->idle_list_lock);
>>>>>> +    mutex_init(&sa_manager->alloc_mutex);
>>>>>> +    drm_mm_init(&sa_manager->mm, 0, size);
>>>>>> +    init_waitqueue_head(&sa_manager->wq);
>>>>>> +    sa_manager->range_size = size;
>>>>>> +    sa_manager->alignment = align;
>>>>>> +    INIT_LIST_HEAD(&sa_manager->idle_list);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_init);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_manager_fini() - Destroy the drm_suballoc_manager
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + *
>>>>>> + * Cleans up the suballocation manager after use. All fences added
>>>>>> + * with drm_suballoc_free() must be signaled, or we cannot clean up
>>>>>> + * the entire manager.
>>>>>> + */
>>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>>> *sa_manager)
>>>>>> +{
>>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>>> +    drm_mm_takedown(&sa_manager->mm);
>>>>>> +    mutex_destroy(&sa_manager->alloc_mutex);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_manager_fini);
>>>>>> +
>>>>>> +static void __drm_suballoc_free(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>>> +    struct dma_fence *fence;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * In order to avoid protecting the potentially lengthy 
>>>>>> drm_mm manager
>>>>>> +     * *allocation* processing with an irq-disabling lock,
>>>>>> +     * defer touching the drm_mm for freeing until we're in task 
>>>>>> context,
>>>>>> +     * with no irqs disabled, or happen to succeed in taking the 
>>>>>> manager
>>>>>> +     * lock.
>>>>>> +     */
>>>>>> +    if (!in_task() || irqs_disabled()) {
>>>>>> +        unsigned long irqflags;
>>>>>> +
>>>>>> +        if (spin_trylock(&sa_manager->lock))
>>>>>> +            goto locked;
>>>>>> +
>>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        list_add_tail(&sa->idle_link, &sa_manager->idle_list);
>>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        wake_up(&sa_manager->wq);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +locked:
>>>>>> +    drm_mm_remove_node(&sa->node);
>>>>>> +
>>>>>> +    fence = sa->fence;
>>>>>> +    sa->fence = NULL;
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +    /* Maybe only wake if first mm hole is sufficiently large? */
>>>>>> +    wake_up(&sa_manager->wq);
>>>>>> +    dma_fence_put(fence);
>>>>>> +    kfree(sa);
>>>>>> +}
>>>>>> +
>>>>>> +/* Free all deferred idle allocations */
>>>>>> +static void drm_suballoc_process_idle(struct 
>>>>>> drm_suballoc_manager *sa_manager)
>>>>>> +{
>>>>>> +    /*
>>>>>> +     * prepare_to_wait() / wake_up() semantics ensure that any list
>>>>>> +     * addition that was done before wake_up() is visible when
>>>>>> +     * this code is called from the wait loop.
>>>>>> +     */
>>>>>> +    if (!list_empty_careful(&sa_manager->idle_list)) {
>>>>>> +        struct drm_suballoc *sa, *next;
>>>>>> +        unsigned long irqflags;
>>>>>> +        LIST_HEAD(list);
>>>>>> +
>>>>>> + spin_lock_irqsave(&sa_manager->idle_list_lock, irqflags);
>>>>>> +        list_splice_init(&sa_manager->idle_list, &list);
>>>>>> + spin_unlock_irqrestore(&sa_manager->idle_list_lock, irqflags);
>>>>>> +
>>>>>> +        list_for_each_entry_safe(sa, next, &list, idle_link)
>>>>>> +            __drm_suballoc_free(sa);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +static void
>>>>>> +drm_suballoc_fence_signaled(struct dma_fence *fence, struct 
>>>>>> dma_fence_cb *cb)
>>>>>> +{
>>>>>> +    struct drm_suballoc *sa = container_of(cb, typeof(*sa), cb);
>>>>>> +
>>>>>> +    __drm_suballoc_free(sa);
>>>>>> +}
>>>>>> +
>>>>>> +static int drm_suballoc_tryalloc(struct drm_suballoc *sa, u64 size)
>>>>>> +{
>>>>>> +    struct drm_suballoc_manager *sa_manager = sa->manager;
>>>>>> +    int err;
>>>>>> +
>>>>>> +    drm_suballoc_process_idle(sa_manager);
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +    err = drm_mm_insert_node_generic(&sa_manager->mm, &sa->node, 
>>>>>> size,
>>>>>> +                     sa_manager->alignment, 0,
>>>>>> +                     DRM_MM_INSERT_EVICT);
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +    return err;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_new() - Make a suballocation.
>>>>>> + * @sa_manager: pointer to the sa_manager
>>>>>> + * @size: number of bytes we want to suballocate.
>>>>>> + * @gfp: Allocation context.
>>>>>> + * @intr: Whether to sleep interruptibly if sleeping.
>>>>>> + *
>>>>>> + * Try to make a suballocation of size @size, which will be rounded
>>>>>> + * up to the alignment specified in 
>>>>>> drm_suballoc_manager_init().
>>>>>> + *
>>>>>> + * Returns a new suballocated bo, or an ERR_PTR.
>>>>>> + */
>>>>>> +struct drm_suballoc*
>>>>>> +drm_suballoc_new(struct drm_suballoc_manager *sa_manager, u64 size,
>>>>>> +         gfp_t gfp, bool intr)
>>>>>> +{
>>>>>> +    struct drm_suballoc *sa;
>>>>>> +    DEFINE_WAIT(wait);
>>>>>> +    int err = 0;
>>>>>> +
>>>>>> +    if (size > sa_manager->range_size)
>>>>>> +        return ERR_PTR(-ENOSPC);
>>>>>> +
>>>>>> +    sa = kzalloc(sizeof(*sa), gfp);
>>>>>> +    if (!sa)
>>>>>> +        return ERR_PTR(-ENOMEM);
>>>>>> +
>>>>>> +    /* Avoid starvation using the alloc_mutex */
>>>>>> +    if (intr)
>>>>>> +        err = mutex_lock_interruptible(&sa_manager->alloc_mutex);
>>>>>> +    else
>>>>>> +        mutex_lock(&sa_manager->alloc_mutex);
>>>>>> +    if (err) {
>>>>>> +        kfree(sa);
>>>>>> +        return ERR_PTR(err);
>>>>>> +    }
>>>>>> +
>>>>>> +    sa->manager = sa_manager;
>>>>>> +    err = drm_suballoc_tryalloc(sa, size);
>>>>>> +    if (err != -ENOSPC)
>>>>>> +        goto out;
>>>>>> +
>>>>>> +    for (;;) {
>>>>>> +        prepare_to_wait(&sa_manager->wq, &wait,
>>>>>> +                intr ? TASK_INTERRUPTIBLE :
>>>>>> +                TASK_UNINTERRUPTIBLE);
>>>>>> +
>>>>>> +        err = drm_suballoc_tryalloc(sa, size);
>>>>>> +        if (err != -ENOSPC)
>>>>>> +            break;
>>>>>> +
>>>>>> +        if (intr && signal_pending(current)) {
>>>>>> +            err = -ERESTARTSYS;
>>>>>> +            break;
>>>>>> +        }
>>>>>> +
>>>>>> +        io_schedule();
>>>>>> +    }
>>>>>> +    finish_wait(&sa_manager->wq, &wait);
>>>>>> +
>>>>>> +out:
>>>>>> +    mutex_unlock(&sa_manager->alloc_mutex);
>>>>>> +    if (!sa->node.size) {
>>>>>> +        kfree(sa);
>>>>>> +        WARN_ON(!err);
>>>>>> +        sa = ERR_PTR(err);
>>>>>> +    }
>>>>>> +
>>>>>> +    return sa;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_new);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_free() - Free a suballocation
>>>>>> + * @suballoc: pointer to the suballocation
>>>>>> + * @fence: fence that signals when suballocation is idle
>>>>>> + * @queue: the index to which queue the suballocation will be 
>>>>>> placed on the free list.
>>>>>> + *
>>>>>> + * Free the suballocation. The suballocation can be re-used 
>>>>>> after @fence
>>>>>> + * signals.
>>>>>> + */
>>>>>> +void
>>>>>> +drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence *fence)
>>>>>> +{
>>>>>> +    if (!sa)
>>>>>> +        return;
>>>>>> +
>>>>>> +    if (!fence || dma_fence_is_signaled(fence)) {
>>>>>> +        __drm_suballoc_free(sa);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    sa->fence = dma_fence_get(fence);
>>>>>> +    if (dma_fence_add_callback(fence, &sa->cb, 
>>>>>> drm_suballoc_fence_signaled))
>>>>>> +        __drm_suballoc_free(sa);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_free);
>>>>>> +
>>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_dump_debug_info() - Dump the suballocator state
>>>>>> + * @sa_manager: The suballoc manager.
>>>>>> + * @p: Pointer to a drm printer for output.
>>>>>> + * @suballoc_base: Constant to add to the suballocated offsets 
>>>>>> on printout.
>>>>>> + *
>>>>>> + * This function dumps the suballocator state. Note that the 
>>>>>> caller has
>>>>>> + * to explicitly order frees and calls to this function in order 
>>>>>> for the
>>>>>> + * freed node to show up as protected by a fence.
>>>>>> + */
>>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                  struct drm_printer *p, u64 suballoc_base)
>>>>>> +{
>>>>>> +    const struct drm_mm_node *entry;
>>>>>> +
>>>>>> +    spin_lock(&sa_manager->lock);
>>>>>> +    drm_mm_for_each_node(entry, &sa_manager->mm) {
>>>>>> +        struct drm_suballoc *sa =
>>>>>> +            container_of(entry, typeof(*sa), node);
>>>>>> +
>>>>>> +        drm_printf(p, " ");
>>>>>> +        drm_printf(p, "[0x%010llx 0x%010llx] size %8lld",
>>>>>> +               (unsigned long long)suballoc_base + entry->start,
>>>>>> +               (unsigned long long)suballoc_base + entry->start +
>>>>>> +               entry->size, (unsigned long long)entry->size);
>>>>>> +
>>>>>> +        if (sa->fence)
>>>>>> +            drm_printf(p, " protected by 0x%016llx on context 
>>>>>> %llu",
>>>>>> +                   (unsigned long long)sa->fence->seqno,
>>>>>> +                   (unsigned long long)sa->fence->context);
>>>>>> +
>>>>>> +        drm_printf(p, "\n");
>>>>>> +    }
>>>>>> +    spin_unlock(&sa_manager->lock);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(drm_suballoc_dump_debug_info);
>>>>>> +#endif
>>>>>> +
>>>>>> +MODULE_AUTHOR("Intel Corporation");
>>>>>> +MODULE_DESCRIPTION("Simple range suballocator helper");
>>>>>> +MODULE_LICENSE("GPL and additional rights");
>>>>>> diff --git a/include/drm/drm_suballoc.h b/include/drm/drm_suballoc.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..910952b3383b
>>>>>> --- /dev/null
>>>>>> +++ b/include/drm/drm_suballoc.h
>>>>>> @@ -0,0 +1,112 @@
>>>>>> +/* SPDX-License-Identifier: MIT */
>>>>>> +/*
>>>>>> + * Copyright © 2022 Intel Corporation
>>>>>> + */
>>>>>> +#ifndef _DRM_SUBALLOC_H_
>>>>>> +#define _DRM_SUBALLOC_H_
>>>>>> +
>>>>>> +#include <drm/drm_mm.h>
>>>>>> +
>>>>>> +#include <linux/dma-fence.h>
>>>>>> +#include <linux/types.h>
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_suballoc_manager - Wrapper for fenced range 
>>>>>> allocations
>>>>>> + * @mm: The range manager. Protected by @lock.
>>>>>> + * @range_size: The total size of the range.
>>>>>> + * @alignment: Range alignment.
>>>>>> + * @wq: Wait queue for sleeping allocations on contention.
>>>>>> + * @idle_list: List of idle but not yet freed allocations. 
>>>>>> Protected by
>>>>>> + * @idle_list_lock.
>>>>>> + * @task: Task waiting for allocation. Protected by @lock.
>>>>>> + */
>>>>>> +struct drm_suballoc_manager {
>>>>>> +    /** @lock: Manager lock. Protects @mm. */
>>>>>> +    spinlock_t lock;
>>>>>> +    /**
>>>>>> +     * @idle_list_lock: Lock to protect the idle_list.
>>>>>> +     * Disable irqs when locking.
>>>>>> +     */
>>>>>> +    spinlock_t idle_list_lock;
>>>>>> +    /** @alloc_mutex: Mutex to protect against starvation. */
>>>>>> +    struct mutex alloc_mutex;
>>>>>> +    struct drm_mm mm;
>>>>>> +    u64 range_size;
>>>>>> +    u64 alignment;
>>>>>> +    wait_queue_head_t wq;
>>>>>> +    struct list_head idle_list;
>>>>>> +};
>>>>>> +
>>>>>> +/**
>>>>>> + * struct drm_suballoc: Suballocated range.
>>>>>> + * @node: The drm_mm representation of the range.
>>>>>> + * @fence: dma-fence indicating whether allocation is active or 
>>>>>> idle.
>>>>>> + * Assigned on call to free the allocation so doesn't need 
>>>>>> protection.
>>>>>> + * @cb: dma-fence callback structure. Used for callbacks when 
>>>>>> the fence signals.
>>>>>> + * @manager: The struct drm_suballoc_manager the range belongs 
>>>>>> to. Immutable.
>>>>>> + * @idle_link: Link for the manager idle_list. Protected by the
>>>>>> + * drm_suballoc_manager::idle_lock.
>>>>>> + */
>>>>>> +struct drm_suballoc {
>>>>>> +    struct drm_mm_node node;
>>>>>> +    struct dma_fence *fence;
>>>>>> +    struct dma_fence_cb cb;
>>>>>> +    struct drm_suballoc_manager *manager;
>>>>>> +    struct list_head idle_link;
>>>>>> +};
>>>>>> +
>>>>>> +void drm_suballoc_manager_init(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                   u64 size, u64 align);
>>>>>> +
>>>>>> +void drm_suballoc_manager_fini(struct drm_suballoc_manager 
>>>>>> *sa_manager);
>>>>>> +
>>>>>> +struct drm_suballoc *drm_suballoc_new(struct 
>>>>>> drm_suballoc_manager *sa_manager,
>>>>>> +                      u64 size, gfp_t gfp, bool intr);
>>>>>> +
>>>>>> +void drm_suballoc_free(struct drm_suballoc *sa, struct dma_fence 
>>>>>> *fence);
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_soffset - Range start.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The start of the allocated range.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_soffset(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.start;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_eoffset - Range end.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The end of the allocated range + 1.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_eoffset(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.start + sa->node.size;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * drm_suballoc_size - Range size.
>>>>>> + * @sa: The struct drm_suballoc.
>>>>>> + *
>>>>>> + * Return: The size of the allocated range.
>>>>>> + */
>>>>>> +static inline u64 drm_suballoc_size(struct drm_suballoc *sa)
>>>>>> +{
>>>>>> +    return sa->node.size;
>>>>>> +}
>>>>>> +
>>>>>> +#ifdef CONFIG_DEBUG_FS
>>>>>> +void drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                  struct drm_printer *p, u64 suballoc_base);
>>>>>> +#else
>>>>>> +static inline void
>>>>>> +drm_suballoc_dump_debug_info(struct drm_suballoc_manager 
>>>>>> *sa_manager,
>>>>>> +                 struct drm_printer *p, u64 suballoc_base)
>>>>>> +{ }
>>>>>> +
>>>>>> +#endif
>>>>>> +
>>>>>> +#endif /* _DRM_SUBALLOC_H_ */
>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 13:10               ` [Intel-xe] " Thomas Hellström
@ 2023-02-17 13:18                 ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-17 13:18 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

On 17.02.23 at 14:10, Thomas Hellström wrote:
> [SNIP]
>>>>>
>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>> anything against merging this without the amd / radeon changes 
>>>>> until we can land a simpler allocator?
>>>>
>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>> this seems to be for a different use case than the allocators 
>>>> inside radeon/amdgpu.
>>>
>>> Hmm. No It's allocating in a ring-like fashion as well.  Let me put 
>>> together a unit test for benchmaking. I think it would be a failure 
>>> for the community to end up with three separate suballocators doing 
>>> the exact same thing for the same problem, really.
>>
>> Well exactly that's the point. Those allocators aren't the same 
>> because they handle different problems.
>>
>> The allocator in radeon is simpler because it only had to deal with a 
>> limited number of fence timelines. The one in amdgpu is a bit more 
>> complex because of the added complexity for more fence timelines.
>>
>> We could take the one from amdgpu and use it for radeon and others as 
>> well, but the allocator proposed here doesn't even remotely matches 
>> the requirements.
>
> But again, what *are* those missing requirements exactly? What is the 
> pathological case you see for the current code?

Well, very low CPU overhead, and not doing anything in a callback.

>
> From what I can tell the amdgpu suballocator introduces excessive 
> complexity to coalesce waits for fences from the same contexts, 
> whereas the present code just frees from the fence callback if the 
> fence wasn't already signaled.

And this is exactly the design we had previously which we removed after 
Dave stumbled over tons of problems with it.

> The fence signalling code that fires that callback is typically always 
> run anyway on scheduler fences.
>
> The reason we had for not using the amdgpu suballocator as originally 
> planned was that this complexity made it very hard for us to understand 
> it and to fix issues we had with it.

Well, what are those problems? The idea is actually not that hard to 
understand.

We could simplify it massively, at the cost of only waiting for the 
oldest fence, if that helps.
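
(A rough sketch of that direction, with made-up names and all locking
omitted, just to illustrate the idea:)

#include <linux/dma-fence.h>
#include <linux/list.h>

/* Keep pending frees in allocation order and only ever wait for the head. */
struct simple_sa {
	struct list_head link;		/* FIFO: list order == allocation order */
	struct dma_fence *fence;	/* signals when the range may be reused */
};

static long simple_sa_retire_oldest(struct list_head *pending, bool intr)
{
	struct simple_sa *oldest;
	long ret;

	if (list_empty(pending))
		return -ENOSPC;

	oldest = list_first_entry(pending, struct simple_sa, link);
	ret = dma_fence_wait(oldest->fence, intr);
	if (ret)
		return ret;

	/* The oldest range is idle now; give it back to the range allocator. */
	list_del(&oldest->link);
	dma_fence_put(oldest->fence);
	return 0;
}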

Regards,
Christian.

>
> Regards,
>
> Thomas


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 13:18                 ` [Intel-xe] " Christian König
@ 2023-02-17 13:51                   ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-17 13:51 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie


On 2/17/23 14:18, Christian König wrote:
> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>> [SNIP]
>>>>>>
>>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>>> anything against merging this without the amd / radeon changes 
>>>>>> until we can land a simpler allocator?
>>>>>
>>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>>> this seems to be for a different use case than the allocators 
>>>>> inside radeon/amdgpu.
>>>>
>>>> Hmm. No It's allocating in a ring-like fashion as well.  Let me put 
>>>> together a unit test for benchmaking. I think it would be a failure 
>>>> for the community to end up with three separate suballocators doing 
>>>> the exact same thing for the same problem, really.
>>>
>>> Well exactly that's the point. Those allocators aren't the same 
>>> because they handle different problems.
>>>
>>> The allocator in radeon is simpler because it only had to deal with 
>>> a limited number of fence timelines. The one in amdgpu is a bit more 
>>> complex because of the added complexity for more fence timelines.
>>>
>>> We could take the one from amdgpu and use it for radeon and others 
>>> as well, but the allocator proposed here doesn't even remotely 
>>> matches the requirements.
>>
>> But again, what *are* those missing requirements exactly? What is the 
>> pathological case you see for the current code?
>
> Well, very low CPU overhead, and not doing anything in a callback.

Well, dma_fence_wait_any() will IIRC register callbacks on all affected 
fences, although admittedly there is no actual allocator processing in them.
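
For reference, that wait boils down to roughly the following (the fence
array is illustrative); dma_fence_wait_any_timeout() installs a callback
on each fence that hasn't signaled yet before sleeping:

#include <linux/dma-fence.h>
#include <linux/sched.h>

static long wait_for_any(struct dma_fence **fences, uint32_t count)
{
	uint32_t first;
	long t;

	t = dma_fence_wait_any_timeout(fences, count, true,
				       MAX_SCHEDULE_TIMEOUT, &first);
	if (t < 0)
		return t;

	/* fences[first] has signaled; its range can be reused. */
	return 0;
}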

>
>>
>> From what I can tell the amdgpu suballocator introduces excessive 
>> complexity to coalesce waits for fences from the same contexts, 
>> whereas the present code just frees from the fence callback if the 
>> fence wasn't already signaled.
>
> And this is exactly the design we had previously which we removed 
> after Dave stumbled over tons of problems with it.

So is the worry that those problems have spilled over into this code 
then? It's been pretty extensively tested. Or is the concern that one 
should never really use dma-fence callbacks?

>
>> The fence signalling code that fires that callback is typcally always 
>> run anyway on scheduler fences.
>>
>> The reason we had for not using the amdgpu suballocator as originally 
>> planned was that this complexity made it very hard for us to 
>> undertand it and to fix issues we had with it.
>
> Well, what are those problems? The idea is actually not that hard 
> to understand.

We hit memory corruption, and we spent substantially more time trying to 
debug it than to put together this patch, while never really 
understanding what happened, nor why you don't see that with amdgpu.

>
> We could simplify it massively for the cost of only waiting for the 
> oldest fence if that helps.

Let me grab the latest version from amdgpu and give it a try again, but 
yes I think that to make it common code we'll need it simpler (and my 
personal wish would be to separate the allocator functionality a bit 
more from the fence waiting, which I guess should be OK if the fence 
waiting is vastly simplified).

/Thomas


>
>
> Regards,
> Christian.
>
>>
>> Regards,
>>
>> Thomas
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-17 13:51                   ` [Intel-xe] " Thomas Hellström
@ 2023-02-22 11:00                     ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-22 11:00 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Hi, Christian,

So I resurrected Maarten's previous patch series around this (the amdgpu 
suballocator), slightly modified the code to match the API of this patch 
series, re-introduced the per-allocation alignment as per a previous 
review comment from you on that series, made checkpatch.pl mostly pass 
(except for pre-existing style problems), and added / fixed some 
comments. No memory corruption seen so far in limited Xe testing.

To move this forward, I suggest starting with that as a common drm 
suballocator. I'll post the series later today. We can follow up with 
potential simplifications if needed.

I also made a kunit test that reports some timing information. Will 
post that as a follow-up. Some interesting preliminary conclusions:

* drm_mm is per se not a CPU hog. If the rb tree processing is disabled 
and the EVICT algorithm is changed from MRU to ring-like LRU traversal, 
it's more or less just as fast as the ring suballocator.

* With a single ring, and the suballocation buffer never completely 
filled (no sleeps), the amd suballocator is a bit faster per allocation / 
free (around 250 ns instead of 350). Allocation is slightly slower on 
the amdgpu one, freeing is faster, mostly due to the locking overhead 
incurred when setting up the fence callbacks, and to avoiding 
irq-disabled processing on the one I proposed.

* With multiple rings and varying allocation sizes and signalling times 
creating fragmentation, the picture becomes different, as the amdgpu 
allocator starts to sleep/throttle already at around 50% - 75% fill, and 
the one I proposed at 75% - 90% fill. Once that happens, the CPU cost of 
putting to sleep and waking up should really overshadow the above numbers.

So it's really a tradeoff, where IMO code size and maintainability 
should also play a role.
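
(For context, the per-allocation timings above come from a loop shaped
roughly like the sketch below; the sizes, round count and fence source are
illustrative only, the actual kunit test follows separately:)

#include <linux/ktime.h>
#include <linux/sizes.h>
#include <drm/drm_suballoc.h>

static void suballoc_timing_sketch(struct drm_suballoc_manager *sa_manager,
				   struct dma_fence *(*next_fence)(void))
{
	const unsigned int rounds = 100000;
	ktime_t start, end;
	unsigned int i;

	start = ktime_get();
	for (i = 0; i < rounds; ++i) {
		struct drm_suballoc *sa;

		sa = drm_suballoc_new(sa_manager, SZ_4K, GFP_KERNEL, true);
		if (IS_ERR(sa))
			break;
		/* The fence is assumed to signal shortly after, e.g. from an hrtimer. */
		drm_suballoc_free(sa, next_fence());
	}
	end = ktime_get();

	pr_info("suballoc: %lld ns per alloc+free\n",
		ktime_to_ns(ktime_sub(end, start)) / rounds);
}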

Also, I looked at the history of the amdgpu allocator, originating back in 
Radeon 2012-ish, but couldn't find any commits mentioning fence 
callbacks nor problems with those. Could you point me to that discussion?

Thanks,

Thomas



On 2/17/23 14:51, Thomas Hellström wrote:
>
> On 2/17/23 14:18, Christian König wrote:
>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>> [SNIP]
>>>>>>>
>>>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>>>> anything against merging this without the amd / radeon changes 
>>>>>>> until we can land a simpler allocator?
>>>>>>
>>>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>>>> this seems to be for a different use case than the allocators 
>>>>>> inside radeon/amdgpu.
>>>>>
>>>>> Hmm. No It's allocating in a ring-like fashion as well. Let me put 
>>>>> together a unit test for benchmaking. I think it would be a 
>>>>> failure for the community to end up with three separate 
>>>>> suballocators doing the exact same thing for the same problem, 
>>>>> really.
>>>>
>>>> Well exactly that's the point. Those allocators aren't the same 
>>>> because they handle different problems.
>>>>
>>>> The allocator in radeon is simpler because it only had to deal with 
>>>> a limited number of fence timelines. The one in amdgpu is a bit 
>>>> more complex because of the added complexity for more fence timelines.
>>>>
>>>> We could take the one from amdgpu and use it for radeon and others 
>>>> as well, but the allocator proposed here doesn't even remotely 
>>>> matches the requirements.
>>>
>>> But again, what *are* those missing requirements exactly? What is 
>>> the pathological case you see for the current code?
>>
>> Well very low CPU overhead and don't do anything in a callback.
>
> Well, dma_fence_wait_any() will IIRC register callbacks on all 
> affected fences, although admittedly there is no actual allocator 
> processing in them.
>
>>
>>>
>>> From what I can tell the amdgpu suballocator introduces excessive 
>>> complexity to coalesce waits for fences from the same contexts, 
>>> whereas the present code just frees from the fence callback if the 
>>> fence wasn't already signaled.
>>
>> And this is exactly the design we had previously which we removed 
>> after Dave stumbled over tons of problems with it.
>
> So is the worry that those problems have spilled over in this code 
> then? It's been pretty extensively tested, or is it you should never 
> really use dma-fence callbacks?
>
>>
>>> The fence signalling code that fires that callback is typcally 
>>> always run anyway on scheduler fences.
>>>
>>> The reason we had for not using the amdgpu suballocator as 
>>> originally planned was that this complexity made it very hard for us 
>>> to undertand it and to fix issues we had with it.
>>
>> Well what are those problems? The idea is actually not that hardware 
>> to understand.
>
> We hit memory corruption, and we spent substantially more time trying 
> to debug it than to put together this patch, while never really 
> understanding what  happened, nor why you don't see that with amdgpu.
>
>>
>> We could simplify it massively for the cost of only waiting for the 
>> oldest fence if that helps.
>
> Let me grab the latest version from amdgpu and give it a try again, 
> but yes I think that to make it common code we'll need it simpler (and 
> my personal wish would be to separate the allocator functionality a 
> bit more from the fence waiting, which I guess should be OK if the 
> fence waiting is vastly simplified).
>
> /Thomas
>
>
>>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>>
>>> Thomas
>>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-22 11:00                     ` Thomas Hellström
@ 2023-02-22 11:39                       ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-22 11:39 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Hi Thomas,

On 22.02.23 at 12:00, Thomas Hellström wrote:
> Hi, Christian,
>
> So I resurrected Maarten's previous patch series around this (the 
> amdgpu suballocator) slightly modified the code to match the API of 
> this patch series, re-introduced the per-allocation alignment as per a 
> previous review comment from you on that series, and made 
> checkpatch.pl pass mostly, except for pre-existing style problems, and 
> added / fixed some comments. No memory corruption seen so far on 
> limited Xe testing.
>
> To move this forward I suggest starting with that as a common drm 
> suballocator. I'll post the series later today. We can follow up with 
> potential simplifications if needed.
>
> I also made a kunit test also reporting some timing information. Will 
> post that as a follow up. Some interesting preliminary conclusions:
>
> * drm_mm is per se not a cpu hog, If the rb tree processing is 
> disabled and the EVICT algorithm is changed from MRU to ring-like LRU 
> traversal, it's more or less just as fast as the ring suballocator.
>
> * With a single ring, and the suballocation buffer never completely 
> filled (no sleeps) the amd suballocator is a bit faster per allocation 
> / free. (Around 250 ns instead of 350). Allocation is slightly slower 
> on the amdgpu one, freeing is faster, mostly due to the locking 
> overhead incurred when setting up the fence callbacks, and for 
> avoiding irq-disabled processing on the one I proposed.

For some more realistic numbers, try to signal the fence from another 
CPU. Alternatively, you can invalidate all the CPU read cache lines touched 
by the fence callback so that they need to be read in again from the 
allocating CPU.
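
(Something along these lines; the work struct and the choice of system_wq
are just one way to force the signaling path onto another CPU:)

#include <linux/dma-fence.h>
#include <linux/workqueue.h>

struct remote_signal_work {
	struct work_struct work;
	struct dma_fence *fence;
};

static void remote_signal_fn(struct work_struct *work)
{
	struct remote_signal_work *w =
		container_of(work, struct remote_signal_work, work);

	dma_fence_signal(w->fence);
	dma_fence_put(w->fence);
}

static void signal_on_other_cpu(struct remote_signal_work *w,
				struct dma_fence *fence, unsigned int cpu)
{
	w->fence = dma_fence_get(fence);
	INIT_WORK(&w->work, remote_signal_fn);
	/* Run the signaling, and hence the fence callback, on @cpu. */
	queue_work_on(cpu, system_wq, &w->work);
}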

>
> * With multiple rings and varying allocation sizes and signalling 
> times creating fragmentation, the picture becomes different as the 
> amdgpu allocator starts to sleep/throttle already round 50% - 75% 
> fill. The one I proposed between 75% to 90% fill, and once that 
> happens, the CPU cost of putting to sleep and waking up should really 
> shadow the above numbers.
>
> So it's really a tradeoff. Where IMO also code size and 
> maintainability should play a role.
>
> Also I looked at the history of the amdgpu allocator originating back 
> to Radeon 2012-ish, but couldn't find any commits mentioning fence 
> callbacks nor problem with those. Could you point me to that discussion?

Uff that was ~10 years ago. I don't think I can find that again.

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
> On 2/17/23 14:51, Thomas Hellström wrote:
>>
>> On 2/17/23 14:18, Christian König wrote:
>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>> [SNIP]
>>>>>>>>
>>>>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>>>>> anything against merging this without the amd / radeon changes 
>>>>>>>> until we can land a simpler allocator?
>>>>>>>
>>>>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>>>>> this seems to be for a different use case than the allocators 
>>>>>>> inside radeon/amdgpu.
>>>>>>
>>>>>> Hmm. No It's allocating in a ring-like fashion as well. Let me 
>>>>>> put together a unit test for benchmaking. I think it would be a 
>>>>>> failure for the community to end up with three separate 
>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>> really.
>>>>>
>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>> because they handle different problems.
>>>>>
>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>> bit more complex because of the added complexity for more fence 
>>>>> timelines.
>>>>>
>>>>> We could take the one from amdgpu and use it for radeon and others 
>>>>> as well, but the allocator proposed here doesn't even remotely 
>>>>> matches the requirements.
>>>>
>>>> But again, what *are* those missing requirements exactly? What is 
>>>> the pathological case you see for the current code?
>>>
>>> Well very low CPU overhead and don't do anything in a callback.
>>
>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>> affected fences, although admittedly there is no actual allocator 
>> processing in them.
>>
>>>
>>>>
>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>> complexity to coalesce waits for fences from the same contexts, 
>>>> whereas the present code just frees from the fence callback if the 
>>>> fence wasn't already signaled.
>>>
>>> And this is exactly the design we had previously which we removed 
>>> after Dave stumbled over tons of problems with it.
>>
>> So is the worry that those problems have spilled over in this code 
>> then? It's been pretty extensively tested, or is it you should never 
>> really use dma-fence callbacks?
>>
>>>
>>>> The fence signalling code that fires that callback is typcally 
>>>> always run anyway on scheduler fences.
>>>>
>>>> The reason we had for not using the amdgpu suballocator as 
>>>> originally planned was that this complexity made it very hard for 
>>>> us to undertand it and to fix issues we had with it.
>>>
>>> Well what are those problems? The idea is actually not that hardware 
>>> to understand.
>>
>> We hit memory corruption, and we spent substantially more time trying 
>> to debug it than to put together this patch, while never really 
>> understanding what  happened, nor why you don't see that with amdgpu.
>>
>>>
>>> We could simplify it massively for the cost of only waiting for the 
>>> oldest fence if that helps.
>>
>> Let me grab the latest version from amdgpu and give it a try again, 
>> but yes I think that to make it common code we'll need it simpler 
>> (and my personal wish would be to separate the allocator 
>> functionality a bit more from the fence waiting, which I guess should 
>> be OK if the fence waiting is vastly simplified).
>>
>> /Thomas
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-22 11:39                       ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-22 11:39 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie

Hi Thomas,

Am 22.02.23 um 12:00 schrieb Thomas Hellström:
> Hi, Christian,
>
> So I resurrected Maarten's previous patch series around this (the 
> amdgpu suballocator) slightly modified the code to match the API of 
> this patch series, re-introduced the per-allocation alignment as per a 
> previous review comment from you on that series, and made 
> checkpatch.pl pass mostly, except for pre-existing style problems, and 
> added / fixed some comments. No memory corruption seen so far on 
> limited Xe testing.
>
> To move this forward I suggest starting with that as a common drm 
> suballocator. I'll post the series later today. We can follow up with 
> potential simplifactions lif needed.
>
> I also made a kunit test also reporting some timing information. Will 
> post that as a follow up. Some interesting preliminary conclusions:
>
> * drm_mm is per se not a cpu hog, If the rb tree processing is 
> disabled and the EVICT algorithm is changed from MRU to ring-like LRU 
> traversal, it's more or less just as fast as the ring suballocator.
>
> * With a single ring, and the suballocation buffer never completely 
> filled (no sleeps) the amd suballocator is a bit faster per allocation 
> / free. (Around 250 ns instead of 350). Allocation is slightly slower 
> on the amdgpu one, freeing is faster, mostly due to the locking 
> overhead incurred when setting up the fence callbacks, and for 
> avoiding irq-disabled processing on the one I proposed.

For some more realistic numbers try to signal the fence from another 
CPU. Alternative you can invalidate all the CPU read cache lines touched 
by the fence callback so that they need to be read in again from the 
allocating CPU.

>
> * With multiple rings and varying allocation sizes and signalling 
> times creating fragmentation, the picture becomes different as the 
> amdgpu allocator starts to sleep/throttle already round 50% - 75% 
> fill. The one I proposed between 75% to 90% fill, and once that 
> happens, the CPU cost of putting to sleep and waking up should really 
> shadow the above numbers.
>
> So it's really a tradeoff. Where IMO also code size and 
> maintainability should play a role.
>
> Also I looked at the history of the amdgpu allocator originating back 
> to Radeon 2012-ish, but couldn't find any commits mentioning fence 
> callbacks nor problem with those. Could you point me to that discussion?

Uff that was ~10 years ago. I don't think I can find that again.

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
> On 2/17/23 14:51, Thomas Hellström wrote:
>>
>> On 2/17/23 14:18, Christian König wrote:
>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>> [SNIP]
>>>>>>>>
>>>>>>>> Any chance you could do a quick performance comparison? If not, 
>>>>>>>> anything against merging this without the amd / radeon changes 
>>>>>>>> until we can land a simpler allocator?
>>>>>>>
>>>>>>> Only if you can stick the allocator inside Xe and not drm, cause 
>>>>>>> this seems to be for a different use case than the allocators 
>>>>>>> inside radeon/amdgpu.
>>>>>>
>>>>>> Hmm. No It's allocating in a ring-like fashion as well. Let me 
>>>>>> put together a unit test for benchmaking. I think it would be a 
>>>>>> failure for the community to end up with three separate 
>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>> really.
>>>>>
>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>> because they handle different problems.
>>>>>
>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>> bit more complex because of the added complexity for more fence 
>>>>> timelines.
>>>>>
>>>>> We could take the one from amdgpu and use it for radeon and others 
>>>>> as well, but the allocator proposed here doesn't even remotely 
>>>>> matches the requirements.
>>>>
>>>> But again, what *are* those missing requirements exactly? What is 
>>>> the pathological case you see for the current code?
>>>
>>> Well very low CPU overhead and don't do anything in a callback.
>>
>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>> affected fences, although admittedly there is no actual allocator 
>> processing in them.
>>
>>>
>>>>
>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>> complexity to coalesce waits for fences from the same contexts, 
>>>> whereas the present code just frees from the fence callback if the 
>>>> fence wasn't already signaled.
>>>
>>> And this is exactly the design we had previously which we removed 
>>> after Dave stumbled over tons of problems with it.
>>
>> So is the worry that those problems have spilled over in this code 
>> then? It's been pretty extensively tested, or is it you should never 
>> really use dma-fence callbacks?
>>
>>>
>>>> The fence signalling code that fires that callback is typcally 
>>>> always run anyway on scheduler fences.
>>>>
>>>> The reason we had for not using the amdgpu suballocator as 
>>>> originally planned was that this complexity made it very hard for 
>>>> us to understand it and to fix issues we had with it.
>>>
>>> Well what are those problems? The idea is actually not that hard 
>>> to understand.
>>
>> We hit memory corruption, and we spent substantially more time trying 
>> to debug it than to put together this patch, while never really 
>> understanding what  happened, nor why you don't see that with amdgpu.
>>
>>>
>>> We could simplify it massively for the cost of only waiting for the 
>>> oldest fence if that helps.
>>
>> Let me grab the latest version from amdgpu and give it a try again, 
>> but yes I think that to make it common code we'll need it simpler 
>> (and my personal wish would be to separate the allocator 
>> functionality a bit more from the fence waiting, which I guess should 
>> be OK if the fence waiting is vastly simplified).
>>
>> /Thomas
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-22 11:39                       ` Christian König
@ 2023-02-22 13:54                         ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-22 13:54 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Hi,

On 2/22/23 12:39, Christian König wrote:
> Hi Thomas,
>
> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>> Hi, Christian,
>>
>> So I resurrected Maarten's previous patch series around this (the 
>> amdgpu suballocator) slightly modified the code to match the API of 
>> this patch series, re-introduced the per-allocation alignment as per 
>> a previous review comment from you on that series, and made 
>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>> and added / fixed some comments. No memory corruption seen so far on 
>> limited Xe testing.
>>
>> To move this forward I suggest starting with that as a common drm 
>> suballocator. I'll post the series later today. We can follow up with 
>> potential simplifications if needed.
>>
>> I also made a kunit test also reporting some timing information. Will 
>> post that as a follow up. Some interesting preliminary conclusions:
>>
>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>> disabled and the EVICT algorithm is changed from MRU to ring-like LRU 
>> traversal, it's more or less just as fast as the ring suballocator.
>>
>> * With a single ring, and the suballocation buffer never completely 
>> filled (no sleeps) the amd suballocator is a bit faster per 
>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>> the locking overhead incurred when setting up the fence callbacks, 
>> and for avoiding irq-disabled processing on the one I proposed.
>
> For some more realistic numbers try to signal the fence from another 
> CPU. Alternatively, you can invalidate all the CPU read cache lines 
> touched by the fence callback so that they need to be read in again 
> from the allocating CPU.

Fences are signalled using hr-timer driven fake "ring"s, so should 
probably be distributed among cpus in a pretty realistic way. But anyway 
I agree results obtained from that kunit test can and should be 
challenged before we actually use them for improvements.
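
(For reference, a fake ring along those lines can be as small as the sketch 
below; this is not the actual test code and the names here are made up.)

#include <linux/dma-fence.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

/* Sketch of an hrtimer-driven fake "ring": one timer, one fence to signal. */
struct fake_ring {
	struct hrtimer timer;
	struct dma_fence *fence;
};

static enum hrtimer_restart fake_ring_expire(struct hrtimer *timer)
{
	struct fake_ring *ring = container_of(timer, struct fake_ring, timer);

	dma_fence_signal(ring->fence); /* fires in hrtimer (softirq) context */
	return HRTIMER_NORESTART;
}

static void fake_ring_submit(struct fake_ring *ring, struct dma_fence *fence,
			     u64 delay_ns)
{
	ring->fence = fence;
	hrtimer_init(&ring->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	ring->timer.function = fake_ring_expire;
	/* Unless migrated, expiry runs on the CPU that called hrtimer_start(). */
	hrtimer_start(&ring->timer, ns_to_ktime(delay_ns), HRTIMER_MODE_REL);
}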

>
>>
>> * With multiple rings and varying allocation sizes and signalling 
>> times creating fragmentation, the picture becomes different as the 
>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>> fill. The one I proposed between 75% to 90% fill, and once that 
>> happens, the CPU cost of putting to sleep and waking up should really 
>> shadow the above numbers.
>>
>> So it's really a tradeoff. Where IMO also code size and 
>> maintainability should play a role.
>>
>> Also I looked at the history of the amdgpu allocator originating back 
>> to Radeon 2012-ish, but couldn't find any commits mentioning fence 
>> callbacks nor problem with those. Could you point me to that discussion?
>
> Uff that was ~10 years ago. I don't think I can find that again.

OK, fair enough. But what was the objective reasoning against using 
fence callbacks for this sort of stuff, was it unforeseen locking 
problems, caching issues or something else?

Thanks,

Thomas



>
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>
>>> On 2/17/23 14:18, Christian König wrote:
>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>
>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>
>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>> put together a unit test for benchmarking. I think it would be a 
>>>>>>> failure for the community to end up with three separate 
>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>> really.
>>>>>>
>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>> because they handle different problems.
>>>>>>
>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>>> bit more complex because of the added complexity for more fence 
>>>>>> timelines.
>>>>>>
>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>> remotely match the requirements.
>>>>>
>>>>> But again, what *are* those missing requirements exactly? What is 
>>>>> the pathological case you see for the current code?
>>>>
>>>> Well very low CPU overhead and don't do anything in a callback.
>>>
>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>> affected fences, although admittedly there is no actual allocator 
>>> processing in them.
>>>
>>>>
>>>>>
>>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>>> complexity to coalesce waits for fences from the same contexts, 
>>>>> whereas the present code just frees from the fence callback if the 
>>>>> fence wasn't already signaled.
>>>>
>>>> And this is exactly the design we had previously which we removed 
>>>> after Dave stumbled over tons of problems with it.
>>>
>>> So is the worry that those problems have spilled over in this code 
>>> then? It's been pretty extensively tested, or is it you should never 
>>> really use dma-fence callbacks?
>>>
>>>>
>>>>> The fence signalling code that fires that callback is typically 
>>>>> always run anyway on scheduler fences.
>>>>>
>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>> originally planned was that this complexity made it very hard for 
>>>>> us to understand it and to fix issues we had with it.
>>>>
>>>> Well what are those problems? The idea is actually not that 
>>>> hard to understand.
>>>
>>> We hit memory corruption, and we spent substantially more time 
>>> trying to debug it than to put together this patch, while never 
>>> really understanding what  happened, nor why you don't see that with 
>>> amdgpu.
>>>
>>>>
>>>> We could simplify it massively for the cost of only waiting for the 
>>>> oldest fence if that helps.
>>>
>>> Let me grab the latest version from amdgpu and give it a try again, 
>>> but yes I think that to make it common code we'll need it simpler 
>>> (and my personal wish would be to separate the allocator 
>>> functionality a bit more from the fence waiting, which I guess 
>>> should be OK if the fence waiting is vastly simplified).
>>>
>>> /Thomas
>>>
>>>
>>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Thomas
>>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-22 13:54                         ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-22 13:54 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie

Hi,

On 2/22/23 12:39, Christian König wrote:
> Hi Thomas,
>
> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>> Hi, Christian,
>>
>> So I resurrected Maarten's previous patch series around this (the 
>> amdgpu suballocator) slightly modified the code to match the API of 
>> this patch series, re-introduced the per-allocation alignment as per 
>> a previous review comment from you on that series, and made 
>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>> and added / fixed some comments. No memory corruption seen so far on 
>> limited Xe testing.
>>
>> To move this forward I suggest starting with that as a common drm 
>> suballocator. I'll post the series later today. We can follow up with 
>> potential simplifications if needed.
>>
>> I also made a kunit test also reporting some timing information. Will 
>> post that as a follow up. Some interesting preliminary conclusions:
>>
>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>> disabled and the EVICT algorithm is changed from MRU to ring-like LRU 
>> traversal, it's more or less just as fast as the ring suballocator.
>>
>> * With a single ring, and the suballocation buffer never completely 
>> filled (no sleeps) the amd suballocator is a bit faster per 
>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>> the locking overhead incurred when setting up the fence callbacks, 
>> and for avoiding irq-disabled processing on the one I proposed.
>
> For some more realistic numbers try to signal the fence from another 
> CPU. Alternatively, you can invalidate all the CPU read cache lines 
> touched by the fence callback so that they need to be read in again 
> from the allocating CPU.

Fences are signalled using hr-timer driven fake "ring"s, so should 
probably be distributed among cpus in a pretty realistic way. But anyway 
I agree results obtained from that kunit test can and should be 
challenged before we actually use them for improvements.

>
>>
>> * With multiple rings and varying allocation sizes and signalling 
>> times creating fragmentation, the picture becomes different as the 
>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>> fill. The one I proposed between 75% to 90% fill, and once that 
>> happens, the CPU cost of putting to sleep and waking up should really 
>> shadow the above numbers.
>>
>> So it's really a tradeoff. Where IMO also code size and 
>> maintainability should play a role.
>>
>> Also I looked at the history of the amdgpu allocator originating back 
>> to Radeon 2012-ish, but couldn't find any commits mentioning fence 
>> callbacks nor problem with those. Could you point me to that discussion?
>
> Uff that was ~10 years ago. I don't think I can find that again.

OK, fair enough. But what was the objective reasoning against using 
fence callbacks for this sort of stuff, was it unforeseen locking 
problems, caching issues or something else?

Thanks,

Thomas



>
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>
>>> On 2/17/23 14:18, Christian König wrote:
>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>
>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>
>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>> put together a unit test for benchmarking. I think it would be a 
>>>>>>> failure for the community to end up with three separate 
>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>> really.
>>>>>>
>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>> because they handle different problems.
>>>>>>
>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>>> bit more complex because of the added complexity for more fence 
>>>>>> timelines.
>>>>>>
>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>> remotely match the requirements.
>>>>>
>>>>> But again, what *are* those missing requirements exactly? What is 
>>>>> the pathological case you see for the current code?
>>>>
>>>> Well very low CPU overhead and don't do anything in a callback.
>>>
>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>> affected fences, although admittedly there is no actual allocator 
>>> processing in them.
>>>
>>>>
>>>>>
>>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>>> complexity to coalesce waits for fences from the same contexts, 
>>>>> whereas the present code just frees from the fence callback if the 
>>>>> fence wasn't already signaled.
>>>>
>>>> And this is exactly the design we had previously which we removed 
>>>> after Dave stumbled over tons of problems with it.
>>>
>>> So is the worry that those problems have spilled over in this code 
>>> then? It's been pretty extensively tested, or is it you should never 
>>> really use dma-fence callbacks?
>>>
>>>>
>>>>> The fence signalling code that fires that callback is typically 
>>>>> always run anyway on scheduler fences.
>>>>>
>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>> originally planned was that this complexity made it very hard for 
>>>>> us to understand it and to fix issues we had with it.
>>>>
>>>> Well what are those problems? The idea is actually not that 
>>>>> hard to understand.
>>>
>>> We hit memory corruption, and we spent substantially more time 
>>> trying to debug it than to put together this patch, while never 
>>> really understanding what  happened, nor why you don't see that with 
>>> amdgpu.
>>>
>>>>
>>>> We could simplify it massively for the cost of only waiting for the 
>>>> oldest fence if that helps.
>>>
>>> Let me grab the latest version from amdgpu and give it a try again, 
>>> but yes I think that to make it common code we'll need it simpler 
>>> (and my personal wish would be to separate the allocator 
>>> functionality a bit more from the fence waiting, which I guess 
>>> should be OK if the fence waiting is vastly simplified).
>>>
>>> /Thomas
>>>
>>>
>>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Thomas
>>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-22 13:54                         ` Thomas Hellström
@ 2023-02-22 14:20                           ` Christian König
  -1 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-22 14:20 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Am 22.02.23 um 14:54 schrieb Thomas Hellström:
> Hi,
>
> On 2/22/23 12:39, Christian König wrote:
>> Hi Thomas,
>>
>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>> Hi, Christian,
>>>
>>> So I resurrected Maarten's previous patch series around this (the 
>>> amdgpu suballocator) slightly modified the code to match the API of 
>>> this patch series, re-introduced the per-allocation alignment as per 
>>> a previous review comment from you on that series, and made 
>>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>>> and added / fixed some comments. No memory corruption seen so far on 
>>> limited Xe testing.
>>>
>>> To move this forward I suggest starting with that as a common drm 
>>> suballocator. I'll post the series later today. We can follow up 
>>> with potential simplifications if needed.
>>>
>>> I also made a kunit test also reporting some timing information. 
>>> Will post that as a follow up. Some interesting preliminary 
>>> conclusions:
>>>
>>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>>> disabled and the EVICT algorithm is changed from MRU to ring-like 
>>> LRU traversal, it's more or less just as fast as the ring suballocator.
>>>
>>> * With a single ring, and the suballocation buffer never completely 
>>> filled (no sleeps) the amd suballocator is a bit faster per 
>>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>>> the locking overhead incurred when setting up the fence callbacks, 
>>> and for avoiding irq-disabled processing on the one I proposed.
>>
>> For some more realistic numbers try to signal the fence from another 
>> CPU. Alternatively, you can invalidate all the CPU read cache lines 
>> touched by the fence callback so that they need to be read in again 
>> from the allocating CPU.
>
> Fences are signalled using hr-timer driven fake "ring"s, so should 
> probably be distributed among cpus in a pretty realistic way. But 
> anyway I agree results obtained from that kunit test can and should be 
> challenged before we actually use them for improvements.

I would double check that. My expectation is that hr-timers execute by 
default on the CPU from which they are started.
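
An easy way to double-check would be a per-CPU counter bumped from the 
expiry callback, along the lines of this sketch (hypothetical names):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/printk.h>

static DEFINE_PER_CPU(unsigned long, fake_ring_expiries);

/* Call from the hrtimer expiry function of the fake ring. */
static inline void fake_ring_account_expiry(void)
{
	this_cpu_inc(fake_ring_expiries);
}

/* Dump the distribution at the end of the test. */
static void fake_ring_dump_expiry_cpus(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		pr_info("cpu%d: %lu expiries\n",
			cpu, per_cpu(fake_ring_expiries, cpu));
}
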

>
>>
>>>
>>> * With multiple rings and varying allocation sizes and signalling 
>>> times creating fragmentation, the picture becomes different as the 
>>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>>> fill. The one I proposed between 75% to 90% fill, and once that 
>>> happens, the CPU cost of putting to sleep and waking up should 
>>> really shadow the above numbers.
>>>
>>> So it's really a tradeoff. Where IMO also code size and 
>>> maintainability should play a role.
>>>
>>> Also I looked at the history of the amdgpu allocator originating 
>>> back to Radeon 2012-ish, but couldn't find any commits mentioning 
>>> fence callbacks nor problem with those. Could you point me to that 
>>> discussion?
>>
>> Uff that was ~10 years ago. I don't think I can find that again.
>
> OK, fair enough. But what was the objective reasoning against using 
> fence callbacks for this sort of stuff, was it unforeseen locking 
> problems, caching issues or something else?

Well, cache line bouncing is one major problem. Also take a look at the 
discussion about using list_head in interrupt handlers; that should be 
easy to find on LWN.

The allocator usually manages enough memory so that it never runs into 
waiting for anything; only in extreme cases like GPU resets do we actually 
wait for allocations to be freed.

So the only cache line which is accessed from more than one CPU should 
be the signaled flag of the fence.

With moving list work into the interrupt handler you have at least 3 
cache lines which start to bounce between different CPUs.
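
To make the cache-line argument concrete, here is an illustrative sketch of 
a free-from-the-fence-callback design (not the code of either driver, names 
invented here):

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Illustrative only: a suballocation freed straight from the fence callback. */
struct sub_manager {
	spinlock_t lock;
	struct list_head free_list;
};

struct sub_alloc {
	struct list_head link;	/* linked on the manager's lists */
	struct dma_fence_cb cb;
	struct sub_manager *mgr;
};

static void sub_alloc_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct sub_alloc *sa = container_of(cb, struct sub_alloc, cb);
	unsigned long flags;

	/*
	 * Runs on whichever CPU signals the fence: the manager lock, this
	 * entry's list_head and the neighbouring entries' list_heads are
	 * all written here, so those cache lines migrate away from the
	 * allocating CPU.
	 */
	spin_lock_irqsave(&sa->mgr->lock, flags);
	list_move_tail(&sa->link, &sa->mgr->free_list);
	spin_unlock_irqrestore(&sa->mgr->lock, flags);
}

A design that instead only checks dma_fence_is_signaled() from the 
allocating side keeps those lines local and shares little more than the 
fence's signaled state.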

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
>>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>>
>>>
>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>
>>>> On 2/17/23 14:18, Christian König wrote:
>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>
>>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>
>>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>>> put together a unit test for benchmarking. I think it would be a 
>>>>>>>> failure for the community to end up with three separate 
>>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>>> really.
>>>>>>>
>>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>>> because they handle different problems.
>>>>>>>
>>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>>>> bit more complex because of the added complexity for more fence 
>>>>>>> timelines.
>>>>>>>
>>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>>> remotely match the requirements.
>>>>>>
>>>>>> But again, what *are* those missing requirements exactly? What is 
>>>>>> the pathological case you see for the current code?
>>>>>
>>>>> Well very low CPU overhead and don't do anything in a callback.
>>>>
>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>>> affected fences, although admittedly there is no actual allocator 
>>>> processing in them.
>>>>
>>>>>
>>>>>>
>>>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>>>> complexity to coalesce waits for fences from the same contexts, 
>>>>>> whereas the present code just frees from the fence callback if 
>>>>>> the fence wasn't already signaled.
>>>>>
>>>>> And this is exactly the design we had previously which we removed 
>>>>> after Dave stumbled over tons of problems with it.
>>>>
>>>> So is the worry that those problems have spilled over in this code 
>>>> then? It's been pretty extensively tested, or is it you should 
>>>> never really use dma-fence callbacks?
>>>>
>>>>>
>>>>>> The fence signalling code that fires that callback is typically 
>>>>>> always run anyway on scheduler fences.
>>>>>>
>>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>>> originally planned was that this complexity made it very hard for 
>>>>>> us to understand it and to fix issues we had with it.
>>>>>
>>>>> Well what are those problems? The idea is actually not that 
>>>>> hard to understand.
>>>>
>>>> We hit memory corruption, and we spent substantially more time 
>>>> trying to debug it than to put together this patch, while never 
>>>> really understanding what  happened, nor why you don't see that 
>>>> with amdgpu.
>>>>
>>>>>
>>>>> We could simplify it massively for the cost of only waiting for 
>>>>> the oldest fence if that helps.
>>>>
>>>> Let me grab the latest version from amdgpu and give it a try again, 
>>>> but yes I think that to make it common code we'll need it simpler 
>>>> (and my personal wish would be to separate the allocator 
>>>> functionality a bit more from the fence waiting, which I guess 
>>>> should be OK if the fence waiting is vastly simplified).
>>>>
>>>> /Thomas
>>>>
>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Thomas
>>>>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-22 14:20                           ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-22 14:20 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie

Am 22.02.23 um 14:54 schrieb Thomas Hellström:
> Hi,
>
> On 2/22/23 12:39, Christian König wrote:
>> Hi Thomas,
>>
>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>> Hi, Christian,
>>>
>>> So I resurrected Maarten's previous patch series around this (the 
>>> amdgpu suballocator) slightly modified the code to match the API of 
>>> this patch series, re-introduced the per-allocation alignment as per 
>>> a previous review comment from you on that series, and made 
>>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>>> and added / fixed some comments. No memory corruption seen so far on 
>>> limited Xe testing.
>>>
>>> To move this forward I suggest starting with that as a common drm 
>>> suballocator. I'll post the series later today. We can follow up 
>>> with potential simplifications if needed.
>>>
>>> I also made a kunit test also reporting some timing information. 
>>> Will post that as a follow up. Some interesting preliminary 
>>> conclusions:
>>>
>>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>>> disabled and the EVICT algorithm is changed from MRU to ring-like 
>>> LRU traversal, it's more or less just as fast as the ring suballocator.
>>>
>>> * With a single ring, and the suballocation buffer never completely 
>>> filled (no sleeps) the amd suballocator is a bit faster per 
>>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>>> the locking overhead incurred when setting up the fence callbacks, 
>>> and for avoiding irq-disabled processing on the one I proposed.
>>
>> For some more realistic numbers try to signal the fence from another 
>> CPU. Alternatively, you can invalidate all the CPU read cache lines 
>> touched by the fence callback so that they need to be read in again 
>> from the allocating CPU.
>
> Fences are signalled using hr-timer driven fake "ring"s, so should 
> probably be distributed among cpus in a pretty realistic way. But 
> anyway I agree results obtained from that kunit test can and should be 
> challenged before we actually use them for improvements.

I would double check that. My expectation is that hr-timers execute by 
default on the CPU from which they are started.

>
>>
>>>
>>> * With multiple rings and varying allocation sizes and signalling 
>>> times creating fragmentation, the picture becomes different as the 
>>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>>> fill. The one I proposed between 75% to 90% fill, and once that 
>>> happens, the CPU cost of putting to sleep and waking up should 
>>> really shadow the above numbers.
>>>
>>> So it's really a tradeoff. Where IMO also code size and 
>>> maintainability should play a role.
>>>
>>> Also I looked at the history of the amdgpu allocator originating 
>>> back to Radeon 2012-ish, but couldn't find any commits mentioning 
>>> fence callbacks nor problem with those. Could you point me to that 
>>> discussion?
>>
>> Uff that was ~10 years ago. I don't think I can find that again.
>
> OK, fair enough. But what was the objective reasoning against using 
> fence callbacks for this sort of stuff, was it unforeseen locking 
> problems, caching issues or something else?

Well, cache line bouncing is one major problem. Also take a look at the 
discussion about using list_head in interrupt handlers; that should be 
easy to find on LWN.

The allocator usually manages enough memory so that it never runs into 
waiting for anything; only in extreme cases like GPU resets do we actually 
wait for allocations to be freed.

So the only cache line which is accessed from more than one CPU should 
be the signaled flag of the fence.

With moving list work into the interrupt handler you have at least 3 
cache lines which start to bounce between different CPUs.

Regards,
Christian.

>
> Thanks,
>
> Thomas
>
>
>
>>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>>
>>>
>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>
>>>> On 2/17/23 14:18, Christian König wrote:
>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>
>>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>
>>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>>> put together a unit test for benchmarking. I think it would be a 
>>>>>>>> failure for the community to end up with three separate 
>>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>>> really.
>>>>>>>
>>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>>> because they handle different problems.
>>>>>>>
>>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>>> with a limited number of fence timelines. The one in amdgpu is a 
>>>>>>> bit more complex because of the added complexity for more fence 
>>>>>>> timelines.
>>>>>>>
>>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>>> remotely match the requirements.
>>>>>>
>>>>>> But again, what *are* those missing requirements exactly? What is 
>>>>>> the pathological case you see for the current code?
>>>>>
>>>>> Well very low CPU overhead and don't do anything in a callback.
>>>>
>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>>> affected fences, although admittedly there is no actual allocator 
>>>> processing in them.
>>>>
>>>>>
>>>>>>
>>>>>> From what I can tell the amdgpu suballocator introduces excessive 
>>>>>> complexity to coalesce waits for fences from the same contexts, 
>>>>>> whereas the present code just frees from the fence callback if 
>>>>>> the fence wasn't already signaled.
>>>>>
>>>>> And this is exactly the design we had previously which we removed 
>>>>> after Dave stumbled over tons of problems with it.
>>>>
>>>> So is the worry that those problems have spilled over in this code 
>>>> then? It's been pretty extensively tested, or is it you should 
>>>> never really use dma-fence callbacks?
>>>>
>>>>>
>>>>>> The fence signalling code that fires that callback is typically 
>>>>>> always run anyway on scheduler fences.
>>>>>>
>>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>>> originally planned was that this complexity made it very hard for 
>>>>>> us to understand it and to fix issues we had with it.
>>>>>
>>>>> Well what are those problems? The idea is actually not that 
>>>>> hard to understand.
>>>>
>>>> We hit memory corruption, and we spent substantially more time 
>>>> trying to debug it than to put together this patch, while never 
>>>> really understanding what  happened, nor why you don't see that 
>>>> with amdgpu.
>>>>
>>>>>
>>>>> We could simplify it massively for the cost of only waiting for 
>>>>> the oldest fence if that helps.
>>>>
>>>> Let me grab the latest version from amdgpu and give it a try again, 
>>>> but yes I think that to make it common code we'll need it simpler 
>>>> (and my personal wish would be to separate the allocator 
>>>> functionality a bit more from the fence waiting, which I guess 
>>>> should be OK if the fence waiting is vastly simplified).
>>>>
>>>> /Thomas
>>>>
>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Thomas
>>>>>
>>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
  2023-02-22 14:20                           ` Christian König
@ 2023-02-22 15:58                             ` Thomas Hellström
  -1 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-22 15:58 UTC (permalink / raw)
  To: Christian König, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie


On 2/22/23 15:20, Christian König wrote:
> Am 22.02.23 um 14:54 schrieb Thomas Hellström:
>> Hi,
>>
>> On 2/22/23 12:39, Christian König wrote:
>>> Hi Thomas,
>>>
>>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>>> Hi, Christian,
>>>>
>>>> So I resurrected Maarten's previous patch series around this (the 
>>>> amdgpu suballocator) slightly modified the code to match the API of 
>>>> this patch series, re-introduced the per-allocation alignment as 
>>>> per a previous review comment from you on that series, and made 
>>>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>>>> and added / fixed some comments. No memory corruption seen so far 
>>>> on limited Xe testing.
>>>>
>>>> To move this forward I suggest starting with that as a common drm 
>>>> suballocator. I'll post the series later today. We can follow up 
>>>> with potential simplifications if needed.
>>>>
>>>> I also made a kunit test also reporting some timing information. 
>>>> Will post that as a follow up. Some interesting preliminary 
>>>> conclusions:
>>>>
>>>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>>>> disabled and the EVICT algorithm is changed from MRU to ring-like 
>>>> LRU traversal, it's more or less just as fast as the ring 
>>>> suballocator.
>>>>
>>>> * With a single ring, and the suballocation buffer never completely 
>>>> filled (no sleeps) the amd suballocator is a bit faster per 
>>>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>>>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>>>> the locking overhead incurred when setting up the fence callbacks, 
>>>> and for avoiding irq-disabled processing on the one I proposed.
>>>
>>> For some more realistic numbers try to signal the fence from another 
>>> CPU. Alternatively, you can invalidate all the CPU read cache lines 
>>> touched by the fence callback so that they need to be read in again 
>>> from the allocating CPU.
>>
>> Fences are signalled using hr-timer driven fake "ring"s, so should 
>> probably be distributed among cpus in a pretty realistic way. But 
>> anyway I agree results obtained from that kunit test can and should 
>> be challenged before we actually use them for improvements.
>
> I would double check that. My expectation is that hr-timers execute by 
> default on the CPU from which they are started.

Hmm, since we're not using the _PINNED hrtimer flag I'd have expected them 
to be more distributed, but you're right, they weren't; only rather few 
timer expiries came from other CPUs. So the figures for signalling on other 
CPUs are around 500 ns for the amdgpu variant and around 900 ns for the 
fence-callback one. Still, sleeping starts around 50-75% fill with the 
amdgpu variant.

>
>>
>>>
>>>>
>>>> * With multiple rings and varying allocation sizes and signalling 
>>>> times creating fragmentation, the picture becomes different as the 
>>>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>>>> fill. The one I proposed between 75% to 90% fill, and once that 
>>>> happens, the CPU cost of putting to sleep and waking up should 
>>>> really shadow the above numbers.
>>>>
>>>> So it's really a tradeoff. Where IMO also code size and 
>>>> maintainability should play a role.
>>>>
>>>> Also I looked at the history of the amdgpu allocator originating 
>>>> back to Radeon 2012-ish, but couldn't find any commits mentioning 
>>>> fence callbacks nor problem with those. Could you point me to that 
>>>> discussion?
>>>
>>> Uff that was ~10 years ago. I don't think I can find that again.
>>
>> OK, fair enough. But what was the objective reasoning against using 
>> fence callbacks for this sort of stuff, was it unforeseen locking 
>> problems, caching issues or something else?
>
> Well, cache line bouncing is one major problem. Also take a look at 
> the discussion about using list_head in interrupt handlers; that 
> should be easy to find on LWN.
>
> The allocator usually manages enough memory so that it never runs into 
> waiting for anything; only in extreme cases like GPU resets do we 
> actually wait for allocations to be freed.

I guess this varies with the application, but can be remedied with just 
adding more managed memory if needed.

/Thomas


>
> So the only cache line which is accessed from more than one CPU 
> should be the signaled flag of the fence.
>
> With moving list work into the interrupt handler you have at least 3 
> cache lines which start to bounce between different CPUs.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>>
>>>>> On 2/17/23 14:18, Christian König wrote:
>>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>>
>>>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>>
>>>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>>>> put together a unit test for benchmarking. I think it would be 
>>>>>>>>> a failure for the community to end up with three separate 
>>>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>>>> really.
>>>>>>>>
>>>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>>>> because they handle different problems.
>>>>>>>>
>>>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>>>> with a limited number of fence timelines. The one in amdgpu is 
>>>>>>>> a bit more complex because of the added complexity for more 
>>>>>>>> fence timelines.
>>>>>>>>
>>>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>>>> remotely match the requirements.
>>>>>>>
>>>>>>> But again, what *are* those missing requirements exactly? What 
>>>>>>> is the pathological case you see for the current code?
>>>>>>
>>>>>> Well very low CPU overhead and don't do anything in a callback.
>>>>>
>>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>>>> affected fences, although admittedly there is no actual allocator 
>>>>> processing in them.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From what I can tell the amdgpu suballocator introduces 
>>>>>>> excessive complexity to coalesce waits for fences from the same 
>>>>>>> contexts, whereas the present code just frees from the fence 
>>>>>>> callback if the fence wasn't already signaled.
>>>>>>
>>>>>> And this is exactly the design we had previously which we removed 
>>>>>> after Dave stumbled over tons of problems with it.
>>>>>
>>>>> So is the worry that those problems have spilled over in this code 
>>>>> then? It's been pretty extensively tested, or is it you should 
>>>>> never really use dma-fence callbacks?
>>>>>
>>>>>>
>>>>>>> The fence signalling code that fires that callback is typically 
>>>>>>> always run anyway on scheduler fences.
>>>>>>>
>>>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>>>> originally planned was that this complexity made it very hard 
>>>>>>> for us to understand it and to fix issues we had with it.
>>>>>>
>>>>>> Well what are those problems? The idea is actually not that 
>>>>>> hard to understand.
>>>>>
>>>>> We hit memory corruption, and we spent substantially more time 
>>>>> trying to debug it than to put together this patch, while never 
>>>>> really understanding what  happened, nor why you don't see that 
>>>>> with amdgpu.
>>>>>
>>>>>>
>>>>>> We could simplify it massively for the cost of only waiting for 
>>>>>> the oldest fence if that helps.
>>>>>
>>>>> Let me grab the latest version from amdgpu and give it a try 
>>>>> again, but yes I think that to make it common code we'll need it 
>>>>> simpler (and my personal wish would be to separate the allocator 
>>>>> functionality a bit more from the fence waiting, which I guess 
>>>>> should be OK if the fence waiting is vastly simplified).
>>>>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Thomas
>>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
@ 2023-02-22 15:58                             ` Thomas Hellström
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Hellström @ 2023-02-22 15:58 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: Daniel Vetter, Maarten Lankhorst, intel-xe, Dave Airlie


On 2/22/23 15:20, Christian König wrote:
> Am 22.02.23 um 14:54 schrieb Thomas Hellström:
>> Hi,
>>
>> On 2/22/23 12:39, Christian König wrote:
>>> Hi Thomas,
>>>
>>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>>> Hi, Christian,
>>>>
>>>> So I resurrected Maarten's previous patch series around this (the 
>>>> amdgpu suballocator) slightly modified the code to match the API of 
>>>> this patch series, re-introduced the per-allocation alignment as 
>>>> per a previous review comment from you on that series, and made 
>>>> checkpatch.pl pass mostly, except for pre-existing style problems, 
>>>> and added / fixed some comments. No memory corruption seen so far 
>>>> on limited Xe testing.
>>>>
>>>> To move this forward I suggest starting with that as a common drm 
>>>> suballocator. I'll post the series later today. We can follow up 
>>>> with potential simplifications if needed.
>>>>
>>>> I also made a kunit test also reporting some timing information. 
>>>> Will post that as a follow up. Some interesting preliminary 
>>>> conclusions:
>>>>
>>>> * drm_mm is per se not a cpu hog. If the rb tree processing is 
>>>> disabled and the EVICT algorithm is changed from MRU to ring-like 
>>>> LRU traversal, it's more or less just as fast as the ring 
>>>> suballocator.
>>>>
>>>> * With a single ring, and the suballocation buffer never completely 
>>>> filled (no sleeps) the amd suballocator is a bit faster per 
>>>> allocation / free. (Around 250 ns instead of 350). Allocation is 
>>>> slightly slower on the amdgpu one, freeing is faster, mostly due to 
>>>> the locking overhead incurred when setting up the fence callbacks, 
>>>> and for avoiding irq-disabled processing on the one I proposed.
>>>
>>> For some more realistic numbers try to signal the fence from another 
>>> CPU. Alternatively, you can invalidate all the CPU read cache lines 
>>> touched by the fence callback so that they need to be read in again 
>>> from the allocating CPU.
>>
>> Fences are signalled using hr-timer driven fake "ring"s, so should 
>> probably be distributed among cpus in a pretty realistic way. But 
>> anyway I agree results obtained from that kunit test can and should 
>> be challenged before we actually use them for improvements.
>
> I would double check that. My expectation is that hr-timers execute by 
> default on the CPU from which they are started.

Hmm, since we're not using the _PINNED hrtimer flag I'd have expected them 
to be more distributed, but you're right, they weren't; only rather few 
timer expiries came from other CPUs. So the figures for signalling on other 
CPUs are around 500 ns for the amdgpu variant and around 900 ns for the 
fence-callback one. Still, sleeping starts around 50-75% fill with the 
amdgpu variant.

>
>>
>>>
>>>>
>>>> * With multiple rings and varying allocation sizes and signalling 
>>>> times creating fragmentation, the picture becomes different as the 
>>>> amdgpu allocator starts to sleep/throttle already around 50% - 75% 
>>>> fill. The one I proposed between 75% to 90% fill, and once that 
>>>> happens, the CPU cost of putting to sleep and waking up should 
>>>> really shadow the above numbers.
>>>>
>>>> So it's really a tradeoff. Where IMO also code size and 
>>>> maintainability should play a role.
>>>>
>>>> Also I looked at the history of the amdgpu allocator originating 
>>>> back to Radeon 2012-ish, but couldn't find any commits mentioning 
>>>> fence callbacks nor problem with those. Could you point me to that 
>>>> discussion?
>>>
>>> Uff that was ~10 years ago. I don't think I can find that again.
>>
>> OK, fair enough. But what was the objective reasoning against using 
>> fence callbacks for this sort of stuff, was it unforeseen locking 
>> problems, caching issues or something else?
>
> Well, cache line bouncing is one major problem. Also take a look at 
> the discussion about using list_head in interrupt handlers; that 
> should be easy to find on LWN.
>
> The allocator usually manages enough memory so that it never runs into 
> waiting for anything; only in extreme cases like GPU resets do we 
> actually wait for allocations to be freed.

I guess this varies with the application, but can be remedied with just 
adding more managed memory if needed.

/Thomas


>
> So the only cache line which is accessed from more than one CPU 
> should be the signaled flag of the fence.
>
> With moving list work into the interrupt handler you have at least 3 
> cache lines which start to bounce between different CPUs.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>>
>>>>> On 2/17/23 14:18, Christian König wrote:
>>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> Any chance you could do a quick performance comparison? If 
>>>>>>>>>>> not, anything against merging this without the amd / radeon 
>>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>>
>>>>>>>>>> Only if you can stick the allocator inside Xe and not drm, 
>>>>>>>>>> cause this seems to be for a different use case than the 
>>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>>
>>>>>>>>> Hmm. No, it's allocating in a ring-like fashion as well. Let me 
>>>>>>>>> put together a unit test for benchmarking. I think it would be 
>>>>>>>>> a failure for the community to end up with three separate 
>>>>>>>>> suballocators doing the exact same thing for the same problem, 
>>>>>>>>> really.
>>>>>>>>
>>>>>>>> Well exactly that's the point. Those allocators aren't the same 
>>>>>>>> because they handle different problems.
>>>>>>>>
>>>>>>>> The allocator in radeon is simpler because it only had to deal 
>>>>>>>> with a limited number of fence timelines. The one in amdgpu is 
>>>>>>>> a bit more complex because of the added complexity for more 
>>>>>>>> fence timelines.
>>>>>>>>
>>>>>>>> We could take the one from amdgpu and use it for radeon and 
>>>>>>>> others as well, but the allocator proposed here doesn't even 
>>>>>>>> remotely match the requirements.
>>>>>>>
>>>>>>> But again, what *are* those missing requirements exactly? What 
>>>>>>> is the pathological case you see for the current code?
>>>>>>
>>>>>> Well very low CPU overhead and don't do anything in a callback.
>>>>>
>>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all 
>>>>> affected fences, although admittedly there is no actual allocator 
>>>>> processing in them.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From what I can tell the amdgpu suballocator introduces 
>>>>>>> excessive complexity to coalesce waits for fences from the same 
>>>>>>> contexts, whereas the present code just frees from the fence 
>>>>>>> callback if the fence wasn't already signaled.
>>>>>>
>>>>>> And this is exactly the design we had previously which we removed 
>>>>>> after Dave stumbled over tons of problems with it.
>>>>>
>>>>> So is the worry that those problems have spilled over in this code 
>>>>> then? It's been pretty extensively tested, or is it you should 
>>>>> never really use dma-fence callbacks?
>>>>>
>>>>>>
>>>>>>> The fence signalling code that fires that callback is typically 
>>>>>>> always run anyway on scheduler fences.
>>>>>>>
>>>>>>> The reason we had for not using the amdgpu suballocator as 
>>>>>>> originally planned was that this complexity made it very hard 
>>>>>>> for us to understand it and to fix issues we had with it.
>>>>>>
>>>>>> Well what are those problems? The idea is actually not that 
>>>>>> hard to understand.
>>>>>
>>>>> We hit memory corruption, and we spent substantially more time 
>>>>> trying to debug it than to put together this patch, while never 
>>>>> really understanding what  happened, nor why you don't see that 
>>>>> with amdgpu.
>>>>>
>>>>>>
>>>>>> We could simplify it massively for the cost of only waiting for 
>>>>>> the oldest fence if that helps.
>>>>>
>>>>> Let me grab the latest version from amdgpu and give it a try 
>>>>> again, but yes I think that to make it common code we'll need it 
>>>>> simpler (and my personal wish would be to separate the allocator 
>>>>> functionality a bit more from the fence waiting, which I guess 
>>>>> should be OK if the fence waiting is vastly simplified).
>>>>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Thomas
>>>>>>
>>>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2023-02-23 10:57 ` [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation Thomas Hellström
@ 2023-02-23 11:18   ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2023-02-23 11:18 UTC (permalink / raw)
  To: Thomas Hellström, dri-devel; +Cc: Daniel Vetter, intel-xe, Dave Airlie

Am 23.02.23 um 11:57 schrieb Thomas Hellström:
> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>
> Use the generic suballocation helper for radeon.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

Reviewed-by: Christian König <christian.koenig@amd.com>
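
(For readers new to the helper: the conversion below essentially wraps the 
generic manager. A rough usage sketch, based on patch 1 of this series, 
follows; the drm_suballoc_new()/drm_suballoc_free() signatures shown here 
are assumptions and may differ from the actual header.)

#include <drm/drm_suballoc.h>
#include <linux/dma-fence.h>

/* Rough shape of the wrappers a driver ends up with (names invented here). */
static struct drm_suballoc *example_sa_alloc(struct drm_suballoc_manager *mgr,
					     size_t size)
{
	/* Assumed interface: may sleep until enough space has been freed. */
	return drm_suballoc_new(mgr, size, GFP_KERNEL, true, 0);
}

static void example_sa_free(struct drm_suballoc **sa, struct dma_fence *fence)
{
	/* The allocation is recycled once @fence signals. */
	drm_suballoc_free(*sa, fence);
	*sa = NULL;
}

static u64 example_sa_gpu_addr(struct drm_suballoc *sa, u64 bo_gpu_base)
{
	/* GPU address = backing BO base + suballocation offset. */
	return bo_gpu_base + drm_suballoc_soffset(sa);
}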

> ---
>   drivers/gpu/drm/radeon/radeon.h           |  55 +---
>   drivers/gpu/drm/radeon/radeon_ib.c        |  12 +-
>   drivers/gpu/drm/radeon/radeon_object.h    |  25 +-
>   drivers/gpu/drm/radeon/radeon_sa.c        | 316 ++--------------------
>   drivers/gpu/drm/radeon/radeon_semaphore.c |   4 +-
>   5 files changed, 56 insertions(+), 356 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index 57e20780a458..d19a4b1c1a8f 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -79,6 +79,7 @@
>   
>   #include <drm/drm_gem.h>
>   #include <drm/drm_audio_component.h>
> +#include <drm/drm_suballoc.h>
>   
>   #include "radeon_family.h"
>   #include "radeon_mode.h"
> @@ -511,52 +512,12 @@ struct radeon_bo {
>   };
>   #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
>   
> -/* sub-allocation manager, it has to be protected by another lock.
> - * By conception this is an helper for other part of the driver
> - * like the indirect buffer or semaphore, which both have their
> - * locking.
> - *
> - * Principe is simple, we keep a list of sub allocation in offset
> - * order (first entry has offset == 0, last entry has the highest
> - * offset).
> - *
> - * When allocating new object we first check if there is room at
> - * the end total_size - (last_object_offset + last_object_size) >=
> - * alloc_size. If so we allocate new object there.
> - *
> - * When there is not enough room at the end, we start waiting for
> - * each sub object until we reach object_offset+object_size >=
> - * alloc_size, this object then become the sub object we return.
> - *
> - * Alignment can't be bigger than page size.
> - *
> - * Hole are not considered for allocation to keep things simple.
> - * Assumption is that there won't be hole (all object on same
> - * alignment).
> - */
>   struct radeon_sa_manager {
> -	wait_queue_head_t	wq;
> -	struct radeon_bo	*bo;
> -	struct list_head	*hole;
> -	struct list_head	flist[RADEON_NUM_RINGS];
> -	struct list_head	olist;
> -	unsigned		size;
> -	uint64_t		gpu_addr;
> -	void			*cpu_ptr;
> -	uint32_t		domain;
> -	uint32_t		align;
> -};
> -
> -struct radeon_sa_bo;
> -
> -/* sub-allocation buffer */
> -struct radeon_sa_bo {
> -	struct list_head		olist;
> -	struct list_head		flist;
> -	struct radeon_sa_manager	*manager;
> -	unsigned			soffset;
> -	unsigned			eoffset;
> -	struct radeon_fence		*fence;
> +	struct drm_suballoc_manager	base;
> +	struct radeon_bo		*bo;
> +	uint64_t			gpu_addr;
> +	void				*cpu_ptr;
> +	u32 domain;
>   };
>   
>   /*
> @@ -587,7 +548,7 @@ int radeon_mode_dumb_mmap(struct drm_file *filp,
>    * Semaphores.
>    */
>   struct radeon_semaphore {
> -	struct radeon_sa_bo	*sa_bo;
> +	struct drm_suballoc	*sa_bo;
>   	signed			waiters;
>   	uint64_t		gpu_addr;
>   };
> @@ -816,7 +777,7 @@ void radeon_irq_kms_disable_hpd(struct radeon_device *rdev, unsigned hpd_mask);
>    */
>   
>   struct radeon_ib {
> -	struct radeon_sa_bo		*sa_bo;
> +	struct drm_suballoc		*sa_bo;
>   	uint32_t			length_dw;
>   	uint64_t			gpu_addr;
>   	uint32_t			*ptr;
> diff --git a/drivers/gpu/drm/radeon/radeon_ib.c b/drivers/gpu/drm/radeon/radeon_ib.c
> index 62b116727b4f..6a45a72488f9 100644
> --- a/drivers/gpu/drm/radeon/radeon_ib.c
> +++ b/drivers/gpu/drm/radeon/radeon_ib.c
> @@ -61,7 +61,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
>   {
>   	int r;
>   
> -	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
> +	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
>   	if (r) {
>   		dev_err(rdev->dev, "failed to get a new IB (%d)\n", r);
>   		return r;
> @@ -77,7 +77,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
>   		/* ib pool is bound at RADEON_VA_IB_OFFSET in virtual address
>   		 * space and soffset is the offset inside the pool bo
>   		 */
> -		ib->gpu_addr = ib->sa_bo->soffset + RADEON_VA_IB_OFFSET;
> +		ib->gpu_addr = drm_suballoc_soffset(ib->sa_bo) + RADEON_VA_IB_OFFSET;
>   	} else {
>   		ib->gpu_addr = radeon_sa_bo_gpu_addr(ib->sa_bo);
>   	}
> @@ -97,7 +97,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
>   void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib *ib)
>   {
>   	radeon_sync_free(rdev, &ib->sync, ib->fence);
> -	radeon_sa_bo_free(rdev, &ib->sa_bo, ib->fence);
> +	radeon_sa_bo_free(&ib->sa_bo, ib->fence);
>   	radeon_fence_unref(&ib->fence);
>   }
>   
> @@ -201,8 +201,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
>   
>   	if (rdev->family >= CHIP_BONAIRE) {
>   		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
> -					      RADEON_IB_POOL_SIZE*64*1024,
> -					      RADEON_GPU_PAGE_SIZE,
> +					      RADEON_IB_POOL_SIZE*64*1024, 256,
>   					      RADEON_GEM_DOMAIN_GTT,
>   					      RADEON_GEM_GTT_WC);
>   	} else {
> @@ -210,8 +209,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
>   		 * to the command stream checking
>   		 */
>   		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
> -					      RADEON_IB_POOL_SIZE*64*1024,
> -					      RADEON_GPU_PAGE_SIZE,
> +					      RADEON_IB_POOL_SIZE*64*1024, 256,
>   					      RADEON_GEM_DOMAIN_GTT, 0);
>   	}
>   	if (r) {
> diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
> index 0a6ef49e990a..39cc87a59a9a 100644
> --- a/drivers/gpu/drm/radeon/radeon_object.h
> +++ b/drivers/gpu/drm/radeon/radeon_object.h
> @@ -169,15 +169,22 @@ extern void radeon_bo_fence(struct radeon_bo *bo, struct radeon_fence *fence,
>   /*
>    * sub allocation
>    */
> +static inline struct radeon_sa_manager *
> +to_radeon_sa_manager(struct drm_suballoc_manager *manager)
> +{
> +	return container_of(manager, struct radeon_sa_manager, base);
> +}
>   
> -static inline uint64_t radeon_sa_bo_gpu_addr(struct radeon_sa_bo *sa_bo)
> +static inline uint64_t radeon_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
>   {
> -	return sa_bo->manager->gpu_addr + sa_bo->soffset;
> +	return to_radeon_sa_manager(sa_bo->manager)->gpu_addr +
> +		drm_suballoc_soffset(sa_bo);
>   }
>   
> -static inline void * radeon_sa_bo_cpu_addr(struct radeon_sa_bo *sa_bo)
> +static inline void *radeon_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
>   {
> -	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
> +	return to_radeon_sa_manager(sa_bo->manager)->cpu_ptr +
> +		drm_suballoc_soffset(sa_bo);
>   }
>   
>   extern int radeon_sa_bo_manager_init(struct radeon_device *rdev,
> @@ -190,12 +197,10 @@ extern int radeon_sa_bo_manager_start(struct radeon_device *rdev,
>   				      struct radeon_sa_manager *sa_manager);
>   extern int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
>   					struct radeon_sa_manager *sa_manager);
> -extern int radeon_sa_bo_new(struct radeon_device *rdev,
> -			    struct radeon_sa_manager *sa_manager,
> -			    struct radeon_sa_bo **sa_bo,
> -			    unsigned size, unsigned align);
> -extern void radeon_sa_bo_free(struct radeon_device *rdev,
> -			      struct radeon_sa_bo **sa_bo,
> +extern int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
> +			    struct drm_suballoc **sa_bo,
> +			    unsigned int size, unsigned int align);
> +extern void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
>   			      struct radeon_fence *fence);
>   #if defined(CONFIG_DEBUG_FS)
>   extern void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
> diff --git a/drivers/gpu/drm/radeon/radeon_sa.c b/drivers/gpu/drm/radeon/radeon_sa.c
> index 0981948bd9ed..c87a57c9c592 100644
> --- a/drivers/gpu/drm/radeon/radeon_sa.c
> +++ b/drivers/gpu/drm/radeon/radeon_sa.c
> @@ -44,53 +44,32 @@
>   
>   #include "radeon.h"
>   
> -static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo);
> -static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager);
> -
>   int radeon_sa_bo_manager_init(struct radeon_device *rdev,
>   			      struct radeon_sa_manager *sa_manager,
> -			      unsigned size, u32 align, u32 domain, u32 flags)
> +			      unsigned int size, u32 sa_align, u32 domain,
> +			      u32 flags)
>   {
> -	int i, r;
> -
> -	init_waitqueue_head(&sa_manager->wq);
> -	sa_manager->bo = NULL;
> -	sa_manager->size = size;
> -	sa_manager->domain = domain;
> -	sa_manager->align = align;
> -	sa_manager->hole = &sa_manager->olist;
> -	INIT_LIST_HEAD(&sa_manager->olist);
> -	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> -		INIT_LIST_HEAD(&sa_manager->flist[i]);
> -	}
> +	int r;
>   
> -	r = radeon_bo_create(rdev, size, align, true,
> +	r = radeon_bo_create(rdev, size, RADEON_GPU_PAGE_SIZE, true,
>   			     domain, flags, NULL, NULL, &sa_manager->bo);
>   	if (r) {
>   		dev_err(rdev->dev, "(%d) failed to allocate bo for manager\n", r);
>   		return r;
>   	}
>   
> +	sa_manager->domain = domain;
> +
> +	drm_suballoc_manager_init(&sa_manager->base, size, sa_align);
> +
>   	return r;
>   }
>   
>   void radeon_sa_bo_manager_fini(struct radeon_device *rdev,
>   			       struct radeon_sa_manager *sa_manager)
>   {
> -	struct radeon_sa_bo *sa_bo, *tmp;
> -
> -	if (!list_empty(&sa_manager->olist)) {
> -		sa_manager->hole = &sa_manager->olist,
> -		radeon_sa_bo_try_free(sa_manager);
> -		if (!list_empty(&sa_manager->olist)) {
> -			dev_err(rdev->dev, "sa_manager is not empty, clearing anyway\n");
> -		}
> -	}
> -	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
> -		radeon_sa_bo_remove_locked(sa_bo);
> -	}
> +	drm_suballoc_manager_fini(&sa_manager->base);
>   	radeon_bo_unref(&sa_manager->bo);
> -	sa_manager->size = 0;
>   }
>   
>   int radeon_sa_bo_manager_start(struct radeon_device *rdev,
> @@ -139,260 +118,34 @@ int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
>   	return r;
>   }
>   
> -static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo)
> +int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
> +		     struct drm_suballoc **sa_bo,
> +		     unsigned int size, unsigned int align)
>   {
> -	struct radeon_sa_manager *sa_manager = sa_bo->manager;
> -	if (sa_manager->hole == &sa_bo->olist) {
> -		sa_manager->hole = sa_bo->olist.prev;
> -	}
> -	list_del_init(&sa_bo->olist);
> -	list_del_init(&sa_bo->flist);
> -	radeon_fence_unref(&sa_bo->fence);
> -	kfree(sa_bo);
> -}
> -
> -static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager)
> -{
> -	struct radeon_sa_bo *sa_bo, *tmp;
> -
> -	if (sa_manager->hole->next == &sa_manager->olist)
> -		return;
> +	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size,
> +						   GFP_KERNEL, true, align);
>   
> -	sa_bo = list_entry(sa_manager->hole->next, struct radeon_sa_bo, olist);
> -	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
> -		if (sa_bo->fence == NULL || !radeon_fence_signaled(sa_bo->fence)) {
> -			return;
> -		}
> -		radeon_sa_bo_remove_locked(sa_bo);
> +	if (IS_ERR(sa)) {
> +		*sa_bo = NULL;
> +		return PTR_ERR(sa);
>   	}
> -}
>   
> -static inline unsigned radeon_sa_bo_hole_soffset(struct radeon_sa_manager *sa_manager)
> -{
> -	struct list_head *hole = sa_manager->hole;
> -
> -	if (hole != &sa_manager->olist) {
> -		return list_entry(hole, struct radeon_sa_bo, olist)->eoffset;
> -	}
> +	*sa_bo = sa;
>   	return 0;
>   }
>   
> -static inline unsigned radeon_sa_bo_hole_eoffset(struct radeon_sa_manager *sa_manager)
> -{
> -	struct list_head *hole = sa_manager->hole;
> -
> -	if (hole->next != &sa_manager->olist) {
> -		return list_entry(hole->next, struct radeon_sa_bo, olist)->soffset;
> -	}
> -	return sa_manager->size;
> -}
> -
> -static bool radeon_sa_bo_try_alloc(struct radeon_sa_manager *sa_manager,
> -				   struct radeon_sa_bo *sa_bo,
> -				   unsigned size, unsigned align)
> -{
> -	unsigned soffset, eoffset, wasted;
> -
> -	soffset = radeon_sa_bo_hole_soffset(sa_manager);
> -	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
> -	wasted = (align - (soffset % align)) % align;
> -
> -	if ((eoffset - soffset) >= (size + wasted)) {
> -		soffset += wasted;
> -
> -		sa_bo->manager = sa_manager;
> -		sa_bo->soffset = soffset;
> -		sa_bo->eoffset = soffset + size;
> -		list_add(&sa_bo->olist, sa_manager->hole);
> -		INIT_LIST_HEAD(&sa_bo->flist);
> -		sa_manager->hole = &sa_bo->olist;
> -		return true;
> -	}
> -	return false;
> -}
> -
> -/**
> - * radeon_sa_event - Check if we can stop waiting
> - *
> - * @sa_manager: pointer to the sa_manager
> - * @size: number of bytes we want to allocate
> - * @align: alignment we need to match
> - *
> - * Check if either there is a fence we can wait for or
> - * enough free memory to satisfy the allocation directly
> - */
> -static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
> -			    unsigned size, unsigned align)
> -{
> -	unsigned soffset, eoffset, wasted;
> -	int i;
> -
> -	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> -		if (!list_empty(&sa_manager->flist[i])) {
> -			return true;
> -		}
> -	}
> -
> -	soffset = radeon_sa_bo_hole_soffset(sa_manager);
> -	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
> -	wasted = (align - (soffset % align)) % align;
> -
> -	if ((eoffset - soffset) >= (size + wasted)) {
> -		return true;
> -	}
> -
> -	return false;
> -}
> -
> -static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
> -				   struct radeon_fence **fences,
> -				   unsigned *tries)
> -{
> -	struct radeon_sa_bo *best_bo = NULL;
> -	unsigned i, soffset, best, tmp;
> -
> -	/* if hole points to the end of the buffer */
> -	if (sa_manager->hole->next == &sa_manager->olist) {
> -		/* try again with its beginning */
> -		sa_manager->hole = &sa_manager->olist;
> -		return true;
> -	}
> -
> -	soffset = radeon_sa_bo_hole_soffset(sa_manager);
> -	/* to handle wrap around we add sa_manager->size */
> -	best = sa_manager->size * 2;
> -	/* go over all fence list and try to find the closest sa_bo
> -	 * of the current last
> -	 */
> -	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> -		struct radeon_sa_bo *sa_bo;
> -
> -		fences[i] = NULL;
> -
> -		if (list_empty(&sa_manager->flist[i])) {
> -			continue;
> -		}
> -
> -		sa_bo = list_first_entry(&sa_manager->flist[i],
> -					 struct radeon_sa_bo, flist);
> -
> -		if (!radeon_fence_signaled(sa_bo->fence)) {
> -			fences[i] = sa_bo->fence;
> -			continue;
> -		}
> -
> -		/* limit the number of tries each ring gets */
> -		if (tries[i] > 2) {
> -			continue;
> -		}
> -
> -		tmp = sa_bo->soffset;
> -		if (tmp < soffset) {
> -			/* wrap around, pretend it's after */
> -			tmp += sa_manager->size;
> -		}
> -		tmp -= soffset;
> -		if (tmp < best) {
> -			/* this sa bo is the closest one */
> -			best = tmp;
> -			best_bo = sa_bo;
> -		}
> -	}
> -
> -	if (best_bo) {
> -		++tries[best_bo->fence->ring];
> -		sa_manager->hole = best_bo->olist.prev;
> -
> -		/* we knew that this one is signaled,
> -		   so it's save to remote it */
> -		radeon_sa_bo_remove_locked(best_bo);
> -		return true;
> -	}
> -	return false;
> -}
> -
> -int radeon_sa_bo_new(struct radeon_device *rdev,
> -		     struct radeon_sa_manager *sa_manager,
> -		     struct radeon_sa_bo **sa_bo,
> -		     unsigned size, unsigned align)
> -{
> -	struct radeon_fence *fences[RADEON_NUM_RINGS];
> -	unsigned tries[RADEON_NUM_RINGS];
> -	int i, r;
> -
> -	BUG_ON(align > sa_manager->align);
> -	BUG_ON(size > sa_manager->size);
> -
> -	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
> -	if ((*sa_bo) == NULL) {
> -		return -ENOMEM;
> -	}
> -	(*sa_bo)->manager = sa_manager;
> -	(*sa_bo)->fence = NULL;
> -	INIT_LIST_HEAD(&(*sa_bo)->olist);
> -	INIT_LIST_HEAD(&(*sa_bo)->flist);
> -
> -	spin_lock(&sa_manager->wq.lock);
> -	do {
> -		for (i = 0; i < RADEON_NUM_RINGS; ++i)
> -			tries[i] = 0;
> -
> -		do {
> -			radeon_sa_bo_try_free(sa_manager);
> -
> -			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
> -						   size, align)) {
> -				spin_unlock(&sa_manager->wq.lock);
> -				return 0;
> -			}
> -
> -			/* see if we can skip over some allocations */
> -		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
> -
> -		for (i = 0; i < RADEON_NUM_RINGS; ++i)
> -			radeon_fence_ref(fences[i]);
> -
> -		spin_unlock(&sa_manager->wq.lock);
> -		r = radeon_fence_wait_any(rdev, fences, false);
> -		for (i = 0; i < RADEON_NUM_RINGS; ++i)
> -			radeon_fence_unref(&fences[i]);
> -		spin_lock(&sa_manager->wq.lock);
> -		/* if we have nothing to wait for block */
> -		if (r == -ENOENT) {
> -			r = wait_event_interruptible_locked(
> -				sa_manager->wq,
> -				radeon_sa_event(sa_manager, size, align)
> -			);
> -		}
> -
> -	} while (!r);
> -
> -	spin_unlock(&sa_manager->wq.lock);
> -	kfree(*sa_bo);
> -	*sa_bo = NULL;
> -	return r;
> -}
> -
> -void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
> +void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
>   		       struct radeon_fence *fence)
>   {
> -	struct radeon_sa_manager *sa_manager;
> -
>   	if (sa_bo == NULL || *sa_bo == NULL) {
>   		return;
>   	}
>   
> -	sa_manager = (*sa_bo)->manager;
> -	spin_lock(&sa_manager->wq.lock);
> -	if (fence && !radeon_fence_signaled(fence)) {
> -		(*sa_bo)->fence = radeon_fence_ref(fence);
> -		list_add_tail(&(*sa_bo)->flist,
> -			      &sa_manager->flist[fence->ring]);
> -	} else {
> -		radeon_sa_bo_remove_locked(*sa_bo);
> -	}
> -	wake_up_all_locked(&sa_manager->wq);
> -	spin_unlock(&sa_manager->wq.lock);
> +	if (fence)
> +		drm_suballoc_free(*sa_bo, &fence->base);
> +	else
> +		drm_suballoc_free(*sa_bo, NULL);
> +
>   	*sa_bo = NULL;
>   }
>   
> @@ -400,25 +153,8 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
>   void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
>   				  struct seq_file *m)
>   {
> -	struct radeon_sa_bo *i;
> +	struct drm_printer p = drm_seq_file_printer(m);
>   
> -	spin_lock(&sa_manager->wq.lock);
> -	list_for_each_entry(i, &sa_manager->olist, olist) {
> -		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
> -		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
> -		if (&i->olist == sa_manager->hole) {
> -			seq_printf(m, ">");
> -		} else {
> -			seq_printf(m, " ");
> -		}
> -		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
> -			   soffset, eoffset, eoffset - soffset);
> -		if (i->fence) {
> -			seq_printf(m, " protected by 0x%016llx on ring %d",
> -				   i->fence->seq, i->fence->ring);
> -		}
> -		seq_printf(m, "\n");
> -	}
> -	spin_unlock(&sa_manager->wq.lock);
> +	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
>   }
>   #endif
> diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
> index 221e59476f64..1f0a9a4ff5ae 100644
> --- a/drivers/gpu/drm/radeon/radeon_semaphore.c
> +++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
> @@ -40,7 +40,7 @@ int radeon_semaphore_create(struct radeon_device *rdev,
>   	if (*semaphore == NULL) {
>   		return -ENOMEM;
>   	}
> -	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
> +	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
>   			     &(*semaphore)->sa_bo, 8, 8);
>   	if (r) {
>   		kfree(*semaphore);
> @@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
>   		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
>   			" hardware lockup imminent!\n", *semaphore);
>   	}
> -	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
> +	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
>   	kfree(*semaphore);
>   	*semaphore = NULL;
>   }


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2023-02-23 10:57 [PATCH 0/3] drm/helpers: Make the suballocation manager drm generic Thomas Hellström
@ 2023-02-23 10:57 ` Thomas Hellström
  2023-02-23 11:18   ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellström @ 2023-02-23 10:57 UTC (permalink / raw)
  To: dri-devel
  Cc: Thomas Hellström, Daniel Vetter, Christian Koenig,
	Dave Airlie, intel-xe

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Use the generic suballocation helper for radeon.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Co-developed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
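For reviewers, a minimal usage sketch of the converted radeon wrappers as they
look after this change (illustrative only, not part of the patch; it assumes
the radeon driver headers touched below and the example function name is made
up). Note that per-allocation alignment is retained in this revision: the
8-byte semaphore allocation still asks for 8-byte alignment while IBs use 256.

/* Illustrative sketch only, not part of the patch. */
static int radeon_suballoc_example(struct radeon_device *rdev,
				   struct radeon_fence *fence,
				   struct drm_suballoc **sa_bo)
{
	uint64_t gpu_addr;
	int r;

	/* 8-byte sub-allocation, 8-byte aligned, from the ring_tmp_bo pool */
	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, sa_bo, 8, 8);
	if (r)
		return r;

	/* CPU and GPU addresses are the manager base plus the suballoc offset */
	*(uint64_t *)radeon_sa_bo_cpu_addr(*sa_bo) = 0;
	gpu_addr = radeon_sa_bo_gpu_addr(*sa_bo);
	(void)gpu_addr;	/* would be emitted into the command stream */

	/* the range is held until @fence signals, then becomes reusable */
	radeon_sa_bo_free(sa_bo, fence);
	return 0;
}
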
 drivers/gpu/drm/radeon/radeon.h           |  55 +---
 drivers/gpu/drm/radeon/radeon_ib.c        |  12 +-
 drivers/gpu/drm/radeon/radeon_object.h    |  25 +-
 drivers/gpu/drm/radeon/radeon_sa.c        | 316 ++--------------------
 drivers/gpu/drm/radeon/radeon_semaphore.c |   4 +-
 5 files changed, 56 insertions(+), 356 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 57e20780a458..d19a4b1c1a8f 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -79,6 +79,7 @@
 
 #include <drm/drm_gem.h>
 #include <drm/drm_audio_component.h>
+#include <drm/drm_suballoc.h>
 
 #include "radeon_family.h"
 #include "radeon_mode.h"
@@ -511,52 +512,12 @@ struct radeon_bo {
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
-/* sub-allocation manager, it has to be protected by another lock.
- * By conception this is an helper for other part of the driver
- * like the indirect buffer or semaphore, which both have their
- * locking.
- *
- * Principe is simple, we keep a list of sub allocation in offset
- * order (first entry has offset == 0, last entry has the highest
- * offset).
- *
- * When allocating new object we first check if there is room at
- * the end total_size - (last_object_offset + last_object_size) >=
- * alloc_size. If so we allocate new object there.
- *
- * When there is not enough room at the end, we start waiting for
- * each sub object until we reach object_offset+object_size >=
- * alloc_size, this object then become the sub object we return.
- *
- * Alignment can't be bigger than page size.
- *
- * Hole are not considered for allocation to keep things simple.
- * Assumption is that there won't be hole (all object on same
- * alignment).
- */
 struct radeon_sa_manager {
-	wait_queue_head_t	wq;
-	struct radeon_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[RADEON_NUM_RINGS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-struct radeon_sa_bo;
-
-/* sub-allocation buffer */
-struct radeon_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct radeon_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct radeon_fence		*fence;
+	struct drm_suballoc_manager	base;
+	struct radeon_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
+	u32 domain;
 };
 
 /*
@@ -587,7 +548,7 @@ int radeon_mode_dumb_mmap(struct drm_file *filp,
  * Semaphores.
  */
 struct radeon_semaphore {
-	struct radeon_sa_bo	*sa_bo;
+	struct drm_suballoc	*sa_bo;
 	signed			waiters;
 	uint64_t		gpu_addr;
 };
@@ -816,7 +777,7 @@ void radeon_irq_kms_disable_hpd(struct radeon_device *rdev, unsigned hpd_mask);
  */
 
 struct radeon_ib {
-	struct radeon_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/radeon/radeon_ib.c b/drivers/gpu/drm/radeon/radeon_ib.c
index 62b116727b4f..6a45a72488f9 100644
--- a/drivers/gpu/drm/radeon/radeon_ib.c
+++ b/drivers/gpu/drm/radeon/radeon_ib.c
@@ -61,7 +61,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 {
 	int r;
 
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
 	if (r) {
 		dev_err(rdev->dev, "failed to get a new IB (%d)\n", r);
 		return r;
@@ -77,7 +77,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 		/* ib pool is bound at RADEON_VA_IB_OFFSET in virtual address
 		 * space and soffset is the offset inside the pool bo
 		 */
-		ib->gpu_addr = ib->sa_bo->soffset + RADEON_VA_IB_OFFSET;
+		ib->gpu_addr = drm_suballoc_soffset(ib->sa_bo) + RADEON_VA_IB_OFFSET;
 	} else {
 		ib->gpu_addr = radeon_sa_bo_gpu_addr(ib->sa_bo);
 	}
@@ -97,7 +97,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib *ib)
 {
 	radeon_sync_free(rdev, &ib->sync, ib->fence);
-	radeon_sa_bo_free(rdev, &ib->sa_bo, ib->fence);
+	radeon_sa_bo_free(&ib->sa_bo, ib->fence);
 	radeon_fence_unref(&ib->fence);
 }
 
@@ -201,8 +201,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 
 	if (rdev->family >= CHIP_BONAIRE) {
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT,
 					      RADEON_GEM_GTT_WC);
 	} else {
@@ -210,8 +209,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 		 * to the command stream checking
 		 */
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT, 0);
 	}
 	if (r) {
diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
index 0a6ef49e990a..39cc87a59a9a 100644
--- a/drivers/gpu/drm/radeon/radeon_object.h
+++ b/drivers/gpu/drm/radeon/radeon_object.h
@@ -169,15 +169,22 @@ extern void radeon_bo_fence(struct radeon_bo *bo, struct radeon_fence *fence,
 /*
  * sub allocation
  */
+static inline struct radeon_sa_manager *
+to_radeon_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct radeon_sa_manager, base);
+}
 
-static inline uint64_t radeon_sa_bo_gpu_addr(struct radeon_sa_bo *sa_bo)
+static inline uint64_t radeon_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->gpu_addr +
+		drm_suballoc_soffset(sa_bo);
 }
 
-static inline void * radeon_sa_bo_cpu_addr(struct radeon_sa_bo *sa_bo)
+static inline void *radeon_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->cpu_ptr +
+		drm_suballoc_soffset(sa_bo);
 }
 
 extern int radeon_sa_bo_manager_init(struct radeon_device *rdev,
@@ -190,12 +197,10 @@ extern int radeon_sa_bo_manager_start(struct radeon_device *rdev,
 				      struct radeon_sa_manager *sa_manager);
 extern int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 					struct radeon_sa_manager *sa_manager);
-extern int radeon_sa_bo_new(struct radeon_device *rdev,
-			    struct radeon_sa_manager *sa_manager,
-			    struct radeon_sa_bo **sa_bo,
-			    unsigned size, unsigned align);
-extern void radeon_sa_bo_free(struct radeon_device *rdev,
-			      struct radeon_sa_bo **sa_bo,
+extern int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+			    struct drm_suballoc **sa_bo,
+			    unsigned int size, unsigned int align);
+extern void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 			      struct radeon_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 extern void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
diff --git a/drivers/gpu/drm/radeon/radeon_sa.c b/drivers/gpu/drm/radeon/radeon_sa.c
index 0981948bd9ed..c87a57c9c592 100644
--- a/drivers/gpu/drm/radeon/radeon_sa.c
+++ b/drivers/gpu/drm/radeon/radeon_sa.c
@@ -44,53 +44,32 @@
 
 #include "radeon.h"
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo);
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager);
-
 int radeon_sa_bo_manager_init(struct radeon_device *rdev,
 			      struct radeon_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain, u32 flags)
+			      unsigned int size, u32 sa_align, u32 domain,
+			      u32 flags)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
-	}
+	int r;
 
-	r = radeon_bo_create(rdev, size, align, true,
+	r = radeon_bo_create(rdev, size, RADEON_GPU_PAGE_SIZE, true,
 			     domain, flags, NULL, NULL, &sa_manager->bo);
 	if (r) {
 		dev_err(rdev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
+	sa_manager->domain = domain;
+
+	drm_suballoc_manager_init(&sa_manager->base, size, sa_align);
+
 	return r;
 }
 
 void radeon_sa_bo_manager_fini(struct radeon_device *rdev,
 			       struct radeon_sa_manager *sa_manager)
 {
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		radeon_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(rdev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		radeon_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 	radeon_bo_unref(&sa_manager->bo);
-	sa_manager->size = 0;
 }
 
 int radeon_sa_bo_manager_start(struct radeon_device *rdev,
@@ -139,260 +118,34 @@ int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 	return r;
 }
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo)
+int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned int size, unsigned int align)
 {
-	struct radeon_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	radeon_fence_unref(&sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager)
-{
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size,
+						   GFP_KERNEL, true, align);
 
-	sa_bo = list_entry(sa_manager->hole->next, struct radeon_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL || !radeon_fence_signaled(sa_bo->fence)) {
-			return;
-		}
-		radeon_sa_bo_remove_locked(sa_bo);
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned radeon_sa_bo_hole_soffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct radeon_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned radeon_sa_bo_hole_eoffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct radeon_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool radeon_sa_bo_try_alloc(struct radeon_sa_manager *sa_manager,
-				   struct radeon_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * radeon_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		if (!list_empty(&sa_manager->flist[i])) {
-			return true;
-		}
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
-				   struct radeon_fence **fences,
-				   unsigned *tries)
-{
-	struct radeon_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		struct radeon_sa_bo *sa_bo;
-
-		fences[i] = NULL;
-
-		if (list_empty(&sa_manager->flist[i])) {
-			continue;
-		}
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct radeon_sa_bo, flist);
-
-		if (!radeon_fence_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		++tries[best_bo->fence->ring];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		radeon_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int radeon_sa_bo_new(struct radeon_device *rdev,
-		     struct radeon_sa_manager *sa_manager,
-		     struct radeon_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct radeon_fence *fences[RADEON_NUM_RINGS];
-	unsigned tries[RADEON_NUM_RINGS];
-	int i, r;
-
-	BUG_ON(align > sa_manager->align);
-	BUG_ON(size > sa_manager->size);
-
-	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
-	if ((*sa_bo) == NULL) {
-		return -ENOMEM;
-	}
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			tries[i] = 0;
-
-		do {
-			radeon_sa_bo_try_free(sa_manager);
-
-			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_ref(fences[i]);
-
-		spin_unlock(&sa_manager->wq.lock);
-		r = radeon_fence_wait_any(rdev, fences, false);
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_unref(&fences[i]);
-		spin_lock(&sa_manager->wq.lock);
-		/* if we have nothing to wait for block */
-		if (r == -ENOENT) {
-			r = wait_event_interruptible_locked(
-				sa_manager->wq, 
-				radeon_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
+void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 		       struct radeon_fence *fence)
 {
-	struct radeon_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !radeon_fence_signaled(fence)) {
-		(*sa_bo)->fence = radeon_fence_ref(fence);
-		list_add_tail(&(*sa_bo)->flist,
-			      &sa_manager->flist[fence->ring]);
-	} else {
-		radeon_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	if (fence)
+		drm_suballoc_free(*sa_bo, &fence->base);
+	else
+		drm_suballoc_free(*sa_bo, NULL);
+
 	*sa_bo = NULL;
 }
 
@@ -400,25 +153,8 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
 void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct radeon_sa_bo *i;
+	struct drm_printer p = drm_seq_file_printer(m);
 
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
-		if (i->fence) {
-			seq_printf(m, " protected by 0x%016llx on ring %d",
-				   i->fence->seq, i->fence->ring);
-		}
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, &p, sa_manager->gpu_addr);
 }
 #endif
diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
index 221e59476f64..1f0a9a4ff5ae 100644
--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
+++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
@@ -40,7 +40,7 @@ int radeon_semaphore_create(struct radeon_device *rdev,
 	if (*semaphore == NULL) {
 		return -ENOMEM;
 	}
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
 			     &(*semaphore)->sa_bo, 8, 8);
 	if (r) {
 		kfree(*semaphore);
@@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
 		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
 			" hardware lockup imminent!\n", *semaphore);
 	}
-	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
+	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
 	kfree(*semaphore);
 	*semaphore = NULL;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation.
  2022-02-23 13:51 [PATCH 0/3] drm/helpers: Make the suballocation manager drm generic Maarten Lankhorst
@ 2022-02-23 13:51 ` Maarten Lankhorst
  0 siblings, 0 replies; 39+ messages in thread
From: Maarten Lankhorst @ 2022-02-23 13:51 UTC (permalink / raw)
  To: dri-devel; +Cc: Alex Deucher, intel-gfx, Xinhui Pan, Christian König

Use the generic suballocation helper lifted from amdgpu.
Note that the generic suballocator only allows a single alignment
per manager, so we may waste a few more bytes for radeon_semaphore;
that shouldn't be a big deal, and per-allocation alignment could be
re-added if needed.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
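To illustrate the alignment trade-off mentioned above (a sketch for reviewers,
not part of the patch; the example function name is made up): with a single
per-manager alignment (256 bytes, as set up in radeon_ib_pool_init() below),
the former 8-byte-aligned semaphore allocation is padded out to the manager
alignment, i.e. up to 248 bytes of padding per live semaphore in the worst
case.

/* Illustrative sketch only, not part of the patch. */
static int radeon_semaphore_suballoc_example(struct radeon_device *rdev,
					     struct drm_suballoc **sa_bo)
{
	/*
	 * No per-allocation alignment argument any more: the 8-byte request
	 * inherits the manager-wide 256-byte alignment chosen at pool init.
	 */
	return radeon_sa_bo_new(&rdev->ring_tmp_bo, sa_bo, 8);
}
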
 drivers/gpu/drm/Kconfig                   |   1 +
 drivers/gpu/drm/radeon/radeon.h           |  55 +---
 drivers/gpu/drm/radeon/radeon_ib.c        |  10 +-
 drivers/gpu/drm/radeon/radeon_object.h    |  23 +-
 drivers/gpu/drm/radeon/radeon_sa.c        | 314 ++--------------------
 drivers/gpu/drm/radeon/radeon_semaphore.c |   6 +-
 6 files changed, 52 insertions(+), 357 deletions(-)

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 666cb4d251b9..16880a41f3d9 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -256,6 +256,7 @@ config DRM_RADEON
 	select FW_LOADER
 	select DRM_DP_HELPER
         select DRM_KMS_HELPER
+        select DRM_SUBALLOC_HELPER
         select DRM_TTM
 	select DRM_TTM_HELPER
 	select POWER_SUPPLY
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 08f83bf2c330..7db39ff11cd1 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -79,6 +79,7 @@
 #include <drm/ttm/ttm_execbuf_util.h>
 
 #include <drm/drm_gem.h>
+#include <drm/drm_suballoc.h>
 
 #include "radeon_family.h"
 #include "radeon_mode.h"
@@ -512,52 +513,12 @@ struct radeon_bo {
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
-/* sub-allocation manager, it has to be protected by another lock.
- * By conception this is an helper for other part of the driver
- * like the indirect buffer or semaphore, which both have their
- * locking.
- *
- * Principe is simple, we keep a list of sub allocation in offset
- * order (first entry has offset == 0, last entry has the highest
- * offset).
- *
- * When allocating new object we first check if there is room at
- * the end total_size - (last_object_offset + last_object_size) >=
- * alloc_size. If so we allocate new object there.
- *
- * When there is not enough room at the end, we start waiting for
- * each sub object until we reach object_offset+object_size >=
- * alloc_size, this object then become the sub object we return.
- *
- * Alignment can't be bigger than page size.
- *
- * Hole are not considered for allocation to keep things simple.
- * Assumption is that there won't be hole (all object on same
- * alignment).
- */
 struct radeon_sa_manager {
-	wait_queue_head_t	wq;
-	struct radeon_bo	*bo;
-	struct list_head	*hole;
-	struct list_head	flist[RADEON_NUM_RINGS];
-	struct list_head	olist;
-	unsigned		size;
-	uint64_t		gpu_addr;
-	void			*cpu_ptr;
-	uint32_t		domain;
-	uint32_t		align;
-};
-
-struct radeon_sa_bo;
-
-/* sub-allocation buffer */
-struct radeon_sa_bo {
-	struct list_head		olist;
-	struct list_head		flist;
-	struct radeon_sa_manager	*manager;
-	unsigned			soffset;
-	unsigned			eoffset;
-	struct radeon_fence		*fence;
+	struct drm_suballoc_manager	base;
+	struct radeon_bo		*bo;
+	uint64_t			gpu_addr;
+	void				*cpu_ptr;
+	u32 domain;
 };
 
 /*
@@ -588,7 +549,7 @@ int radeon_mode_dumb_mmap(struct drm_file *filp,
  * Semaphores.
  */
 struct radeon_semaphore {
-	struct radeon_sa_bo	*sa_bo;
+	struct drm_suballoc	*sa_bo;
 	signed			waiters;
 	uint64_t		gpu_addr;
 };
@@ -817,7 +778,7 @@ void radeon_irq_kms_disable_hpd(struct radeon_device *rdev, unsigned hpd_mask);
  */
 
 struct radeon_ib {
-	struct radeon_sa_bo		*sa_bo;
+	struct drm_suballoc		*sa_bo;
 	uint32_t			length_dw;
 	uint64_t			gpu_addr;
 	uint32_t			*ptr;
diff --git a/drivers/gpu/drm/radeon/radeon_ib.c b/drivers/gpu/drm/radeon/radeon_ib.c
index 62b116727b4f..bca2cbd27abf 100644
--- a/drivers/gpu/drm/radeon/radeon_ib.c
+++ b/drivers/gpu/drm/radeon/radeon_ib.c
@@ -61,7 +61,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 {
 	int r;
 
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo, &ib->sa_bo, size, 256);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo, &ib->sa_bo, size);
 	if (r) {
 		dev_err(rdev->dev, "failed to get a new IB (%d)\n", r);
 		return r;
@@ -97,7 +97,7 @@ int radeon_ib_get(struct radeon_device *rdev, int ring,
 void radeon_ib_free(struct radeon_device *rdev, struct radeon_ib *ib)
 {
 	radeon_sync_free(rdev, &ib->sync, ib->fence);
-	radeon_sa_bo_free(rdev, &ib->sa_bo, ib->fence);
+	radeon_sa_bo_free(&ib->sa_bo, ib->fence);
 	radeon_fence_unref(&ib->fence);
 }
 
@@ -201,8 +201,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 
 	if (rdev->family >= CHIP_BONAIRE) {
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT,
 					      RADEON_GEM_GTT_WC);
 	} else {
@@ -210,8 +209,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 		 * to the command stream checking
 		 */
 		r = radeon_sa_bo_manager_init(rdev, &rdev->ring_tmp_bo,
-					      RADEON_IB_POOL_SIZE*64*1024,
-					      RADEON_GPU_PAGE_SIZE,
+					      RADEON_IB_POOL_SIZE*64*1024, 256,
 					      RADEON_GEM_DOMAIN_GTT, 0);
 	}
 	if (r) {
diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
index 0a6ef49e990a..995d2ee94115 100644
--- a/drivers/gpu/drm/radeon/radeon_object.h
+++ b/drivers/gpu/drm/radeon/radeon_object.h
@@ -169,15 +169,20 @@ extern void radeon_bo_fence(struct radeon_bo *bo, struct radeon_fence *fence,
 /*
  * sub allocation
  */
+static inline struct radeon_sa_manager *
+to_radeon_sa_manager(struct drm_suballoc_manager *manager)
+{
+	return container_of(manager, struct radeon_sa_manager, base);
+}
 
-static inline uint64_t radeon_sa_bo_gpu_addr(struct radeon_sa_bo *sa_bo)
+static inline uint64_t radeon_sa_bo_gpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->gpu_addr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->gpu_addr + sa_bo->soffset;
 }
 
-static inline void * radeon_sa_bo_cpu_addr(struct radeon_sa_bo *sa_bo)
+static inline void * radeon_sa_bo_cpu_addr(struct drm_suballoc *sa_bo)
 {
-	return sa_bo->manager->cpu_ptr + sa_bo->soffset;
+	return to_radeon_sa_manager(sa_bo->manager)->cpu_ptr + sa_bo->soffset;
 }
 
 extern int radeon_sa_bo_manager_init(struct radeon_device *rdev,
@@ -190,12 +195,10 @@ extern int radeon_sa_bo_manager_start(struct radeon_device *rdev,
 				      struct radeon_sa_manager *sa_manager);
 extern int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 					struct radeon_sa_manager *sa_manager);
-extern int radeon_sa_bo_new(struct radeon_device *rdev,
-			    struct radeon_sa_manager *sa_manager,
-			    struct radeon_sa_bo **sa_bo,
-			    unsigned size, unsigned align);
-extern void radeon_sa_bo_free(struct radeon_device *rdev,
-			      struct radeon_sa_bo **sa_bo,
+extern int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+			    struct drm_suballoc **sa_bo,
+			    unsigned size);
+extern void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 			      struct radeon_fence *fence);
 #if defined(CONFIG_DEBUG_FS)
 extern void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
diff --git a/drivers/gpu/drm/radeon/radeon_sa.c b/drivers/gpu/drm/radeon/radeon_sa.c
index 310c322c7112..ec024fa61e92 100644
--- a/drivers/gpu/drm/radeon/radeon_sa.c
+++ b/drivers/gpu/drm/radeon/radeon_sa.c
@@ -44,53 +44,31 @@
 
 #include "radeon.h"
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo);
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager);
-
 int radeon_sa_bo_manager_init(struct radeon_device *rdev,
 			      struct radeon_sa_manager *sa_manager,
-			      unsigned size, u32 align, u32 domain, u32 flags)
+			      unsigned size, u32 sa_align, u32 domain, u32 flags)
 {
-	int i, r;
-
-	init_waitqueue_head(&sa_manager->wq);
-	sa_manager->bo = NULL;
-	sa_manager->size = size;
-	sa_manager->domain = domain;
-	sa_manager->align = align;
-	sa_manager->hole = &sa_manager->olist;
-	INIT_LIST_HEAD(&sa_manager->olist);
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		INIT_LIST_HEAD(&sa_manager->flist[i]);
-	}
+	int r;
 
-	r = radeon_bo_create(rdev, size, align, true,
+	r = radeon_bo_create(rdev, size, RADEON_GPU_PAGE_SIZE, true,
 			     domain, flags, NULL, NULL, &sa_manager->bo);
 	if (r) {
 		dev_err(rdev->dev, "(%d) failed to allocate bo for manager\n", r);
 		return r;
 	}
 
+	sa_manager->domain = domain;
+
+	drm_suballoc_manager_init(&sa_manager->base, size, sa_align);
+
 	return r;
 }
 
 void radeon_sa_bo_manager_fini(struct radeon_device *rdev,
 			       struct radeon_sa_manager *sa_manager)
 {
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (!list_empty(&sa_manager->olist)) {
-		sa_manager->hole = &sa_manager->olist,
-		radeon_sa_bo_try_free(sa_manager);
-		if (!list_empty(&sa_manager->olist)) {
-			dev_err(rdev->dev, "sa_manager is not empty, clearing anyway\n");
-		}
-	}
-	list_for_each_entry_safe(sa_bo, tmp, &sa_manager->olist, olist) {
-		radeon_sa_bo_remove_locked(sa_bo);
-	}
+	drm_suballoc_manager_fini(&sa_manager->base);
 	radeon_bo_unref(&sa_manager->bo);
-	sa_manager->size = 0;
 }
 
 int radeon_sa_bo_manager_start(struct radeon_device *rdev,
@@ -139,260 +117,33 @@ int radeon_sa_bo_manager_suspend(struct radeon_device *rdev,
 	return r;
 }
 
-static void radeon_sa_bo_remove_locked(struct radeon_sa_bo *sa_bo)
+int radeon_sa_bo_new(struct radeon_sa_manager *sa_manager,
+		     struct drm_suballoc **sa_bo,
+		     unsigned size)
 {
-	struct radeon_sa_manager *sa_manager = sa_bo->manager;
-	if (sa_manager->hole == &sa_bo->olist) {
-		sa_manager->hole = sa_bo->olist.prev;
-	}
-	list_del_init(&sa_bo->olist);
-	list_del_init(&sa_bo->flist);
-	radeon_fence_unref(&sa_bo->fence);
-	kfree(sa_bo);
-}
-
-static void radeon_sa_bo_try_free(struct radeon_sa_manager *sa_manager)
-{
-	struct radeon_sa_bo *sa_bo, *tmp;
-
-	if (sa_manager->hole->next == &sa_manager->olist)
-		return;
+	struct drm_suballoc *sa = drm_suballoc_new(&sa_manager->base, size);
 
-	sa_bo = list_entry(sa_manager->hole->next, struct radeon_sa_bo, olist);
-	list_for_each_entry_safe_from(sa_bo, tmp, &sa_manager->olist, olist) {
-		if (sa_bo->fence == NULL || !radeon_fence_signaled(sa_bo->fence)) {
-			return;
-		}
-		radeon_sa_bo_remove_locked(sa_bo);
+	if (IS_ERR(sa)) {
+		*sa_bo = NULL;
+		return PTR_ERR(sa);
 	}
-}
 
-static inline unsigned radeon_sa_bo_hole_soffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole != &sa_manager->olist) {
-		return list_entry(hole, struct radeon_sa_bo, olist)->eoffset;
-	}
+	*sa_bo = sa;
 	return 0;
 }
 
-static inline unsigned radeon_sa_bo_hole_eoffset(struct radeon_sa_manager *sa_manager)
-{
-	struct list_head *hole = sa_manager->hole;
-
-	if (hole->next != &sa_manager->olist) {
-		return list_entry(hole->next, struct radeon_sa_bo, olist)->soffset;
-	}
-	return sa_manager->size;
-}
-
-static bool radeon_sa_bo_try_alloc(struct radeon_sa_manager *sa_manager,
-				   struct radeon_sa_bo *sa_bo,
-				   unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		soffset += wasted;
-
-		sa_bo->manager = sa_manager;
-		sa_bo->soffset = soffset;
-		sa_bo->eoffset = soffset + size;
-		list_add(&sa_bo->olist, sa_manager->hole);
-		INIT_LIST_HEAD(&sa_bo->flist);
-		sa_manager->hole = &sa_bo->olist;
-		return true;
-	}
-	return false;
-}
-
-/**
- * radeon_sa_event - Check if we can stop waiting
- *
- * @sa_manager: pointer to the sa_manager
- * @size: number of bytes we want to allocate
- * @align: alignment we need to match
- *
- * Check if either there is a fence we can wait for or
- * enough free memory to satisfy the allocation directly
- */
-static bool radeon_sa_event(struct radeon_sa_manager *sa_manager,
-			    unsigned size, unsigned align)
-{
-	unsigned soffset, eoffset, wasted;
-	int i;
-
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		if (!list_empty(&sa_manager->flist[i])) {
-			return true;
-		}
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	eoffset = radeon_sa_bo_hole_eoffset(sa_manager);
-	wasted = (align - (soffset % align)) % align;
-
-	if ((eoffset - soffset) >= (size + wasted)) {
-		return true;
-	}
-
-	return false;
-}
-
-static bool radeon_sa_bo_next_hole(struct radeon_sa_manager *sa_manager,
-				   struct radeon_fence **fences,
-				   unsigned *tries)
-{
-	struct radeon_sa_bo *best_bo = NULL;
-	unsigned i, soffset, best, tmp;
-
-	/* if hole points to the end of the buffer */
-	if (sa_manager->hole->next == &sa_manager->olist) {
-		/* try again with its beginning */
-		sa_manager->hole = &sa_manager->olist;
-		return true;
-	}
-
-	soffset = radeon_sa_bo_hole_soffset(sa_manager);
-	/* to handle wrap around we add sa_manager->size */
-	best = sa_manager->size * 2;
-	/* go over all fence list and try to find the closest sa_bo
-	 * of the current last
-	 */
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		struct radeon_sa_bo *sa_bo;
-
-		if (list_empty(&sa_manager->flist[i])) {
-			continue;
-		}
-
-		sa_bo = list_first_entry(&sa_manager->flist[i],
-					 struct radeon_sa_bo, flist);
-
-		if (!radeon_fence_signaled(sa_bo->fence)) {
-			fences[i] = sa_bo->fence;
-			continue;
-		}
-
-		/* limit the number of tries each ring gets */
-		if (tries[i] > 2) {
-			continue;
-		}
-
-		tmp = sa_bo->soffset;
-		if (tmp < soffset) {
-			/* wrap around, pretend it's after */
-			tmp += sa_manager->size;
-		}
-		tmp -= soffset;
-		if (tmp < best) {
-			/* this sa bo is the closest one */
-			best = tmp;
-			best_bo = sa_bo;
-		}
-	}
-
-	if (best_bo) {
-		++tries[best_bo->fence->ring];
-		sa_manager->hole = best_bo->olist.prev;
-
-		/* we knew that this one is signaled,
-		   so it's save to remote it */
-		radeon_sa_bo_remove_locked(best_bo);
-		return true;
-	}
-	return false;
-}
-
-int radeon_sa_bo_new(struct radeon_device *rdev,
-		     struct radeon_sa_manager *sa_manager,
-		     struct radeon_sa_bo **sa_bo,
-		     unsigned size, unsigned align)
-{
-	struct radeon_fence *fences[RADEON_NUM_RINGS];
-	unsigned tries[RADEON_NUM_RINGS];
-	int i, r;
-
-	BUG_ON(align > sa_manager->align);
-	BUG_ON(size > sa_manager->size);
-
-	*sa_bo = kmalloc(sizeof(struct radeon_sa_bo), GFP_KERNEL);
-	if ((*sa_bo) == NULL) {
-		return -ENOMEM;
-	}
-	(*sa_bo)->manager = sa_manager;
-	(*sa_bo)->fence = NULL;
-	INIT_LIST_HEAD(&(*sa_bo)->olist);
-	INIT_LIST_HEAD(&(*sa_bo)->flist);
-
-	spin_lock(&sa_manager->wq.lock);
-	do {
-		for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-			fences[i] = NULL;
-			tries[i] = 0;
-		}
-
-		do {
-			radeon_sa_bo_try_free(sa_manager);
-
-			if (radeon_sa_bo_try_alloc(sa_manager, *sa_bo,
-						   size, align)) {
-				spin_unlock(&sa_manager->wq.lock);
-				return 0;
-			}
-
-			/* see if we can skip over some allocations */
-		} while (radeon_sa_bo_next_hole(sa_manager, fences, tries));
-
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_ref(fences[i]);
-
-		spin_unlock(&sa_manager->wq.lock);
-		r = radeon_fence_wait_any(rdev, fences, false);
-		for (i = 0; i < RADEON_NUM_RINGS; ++i)
-			radeon_fence_unref(&fences[i]);
-		spin_lock(&sa_manager->wq.lock);
-		/* if we have nothing to wait for block */
-		if (r == -ENOENT) {
-			r = wait_event_interruptible_locked(
-				sa_manager->wq, 
-				radeon_sa_event(sa_manager, size, align)
-			);
-		}
-
-	} while (!r);
-
-	spin_unlock(&sa_manager->wq.lock);
-	kfree(*sa_bo);
-	*sa_bo = NULL;
-	return r;
-}
-
-void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
+void radeon_sa_bo_free(struct drm_suballoc **sa_bo,
 		       struct radeon_fence *fence)
 {
-	struct radeon_sa_manager *sa_manager;
-
 	if (sa_bo == NULL || *sa_bo == NULL) {
 		return;
 	}
 
-	sa_manager = (*sa_bo)->manager;
-	spin_lock(&sa_manager->wq.lock);
-	if (fence && !radeon_fence_signaled(fence)) {
-		(*sa_bo)->fence = radeon_fence_ref(fence);
-		list_add_tail(&(*sa_bo)->flist,
-			      &sa_manager->flist[fence->ring]);
-	} else {
-		radeon_sa_bo_remove_locked(*sa_bo);
-	}
-	wake_up_all_locked(&sa_manager->wq);
-	spin_unlock(&sa_manager->wq.lock);
+	if (fence)
+		drm_suballoc_free(*sa_bo, &fence->base, fence->ring);
+	else
+		drm_suballoc_free(*sa_bo, NULL, 0);
+
 	*sa_bo = NULL;
 }
 
@@ -400,25 +151,6 @@ void radeon_sa_bo_free(struct radeon_device *rdev, struct radeon_sa_bo **sa_bo,
 void radeon_sa_bo_dump_debug_info(struct radeon_sa_manager *sa_manager,
 				  struct seq_file *m)
 {
-	struct radeon_sa_bo *i;
-
-	spin_lock(&sa_manager->wq.lock);
-	list_for_each_entry(i, &sa_manager->olist, olist) {
-		uint64_t soffset = i->soffset + sa_manager->gpu_addr;
-		uint64_t eoffset = i->eoffset + sa_manager->gpu_addr;
-		if (&i->olist == sa_manager->hole) {
-			seq_printf(m, ">");
-		} else {
-			seq_printf(m, " ");
-		}
-		seq_printf(m, "[0x%010llx 0x%010llx] size %8lld",
-			   soffset, eoffset, eoffset - soffset);
-		if (i->fence) {
-			seq_printf(m, " protected by 0x%016llx on ring %d",
-				   i->fence->seq, i->fence->ring);
-		}
-		seq_printf(m, "\n");
-	}
-	spin_unlock(&sa_manager->wq.lock);
+	drm_suballoc_dump_debug_info(&sa_manager->base, m, sa_manager->gpu_addr);
 }
 #endif
diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
index 221e59476f64..3e2b0bf0d55d 100644
--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
+++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
@@ -40,8 +40,8 @@ int radeon_semaphore_create(struct radeon_device *rdev,
 	if (*semaphore == NULL) {
 		return -ENOMEM;
 	}
-	r = radeon_sa_bo_new(rdev, &rdev->ring_tmp_bo,
-			     &(*semaphore)->sa_bo, 8, 8);
+	r = radeon_sa_bo_new(&rdev->ring_tmp_bo,
+			     &(*semaphore)->sa_bo, 8);
 	if (r) {
 		kfree(*semaphore);
 		*semaphore = NULL;
@@ -100,7 +100,7 @@ void radeon_semaphore_free(struct radeon_device *rdev,
 		dev_err(rdev->dev, "semaphore %p has more waiters than signalers,"
 			" hardware lockup imminent!\n", *semaphore);
 	}
-	radeon_sa_bo_free(rdev, &(*semaphore)->sa_bo, fence);
+	radeon_sa_bo_free(&(*semaphore)->sa_bo, fence);
 	kfree(*semaphore);
 	*semaphore = NULL;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-02-23 11:18 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-16 14:48 [PATCH 0/3] drm, drm/amd, drm/radeon: Introduce a generic suballocator Thomas Hellström
2023-02-16 14:48 ` [Intel-xe] " Thomas Hellström
2023-02-16 14:48 ` [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager Thomas Hellström
2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
2023-02-17 11:00   ` Christian König
2023-02-17 11:00     ` [Intel-xe] " Christian König
2023-02-17 11:21     ` Thomas Hellström
2023-02-17 11:21       ` [Intel-xe] " Thomas Hellström
2023-02-17 11:28       ` Christian König
2023-02-17 11:28         ` [Intel-xe] " Christian König
2023-02-17 12:24         ` Thomas Hellström
2023-02-17 12:24           ` [Intel-xe] " Thomas Hellström
2023-02-17 12:28           ` Christian König
2023-02-17 12:28             ` [Intel-xe] " Christian König
2023-02-17 13:10             ` Thomas Hellström
2023-02-17 13:10               ` [Intel-xe] " Thomas Hellström
2023-02-17 13:18               ` Christian König
2023-02-17 13:18                 ` [Intel-xe] " Christian König
2023-02-17 13:51                 ` Thomas Hellström
2023-02-17 13:51                   ` [Intel-xe] " Thomas Hellström
2023-02-22 11:00                   ` Thomas Hellström
2023-02-22 11:00                     ` Thomas Hellström
2023-02-22 11:39                     ` Christian König
2023-02-22 11:39                       ` Christian König
2023-02-22 13:54                       ` Thomas Hellström
2023-02-22 13:54                         ` Thomas Hellström
2023-02-22 14:20                         ` Christian König
2023-02-22 14:20                           ` Christian König
2023-02-22 15:58                           ` Thomas Hellström
2023-02-22 15:58                             ` Thomas Hellström
2023-02-16 14:48 ` [PATCH 2/3] drm/amd: Convert amdgpu to use suballocation helper Thomas Hellström
2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
2023-02-16 14:48 ` [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation Thomas Hellström
2023-02-16 14:48   ` [Intel-xe] " Thomas Hellström
2023-02-17  1:52   ` kernel test robot
2023-02-17 12:32   ` kernel test robot
  -- strict thread matches above, loose matches on Subject: below --
2023-02-23 10:57 [PATCH 0/3] drm/helpers: Make the suballocation manager drm generic Thomas Hellström
2023-02-23 10:57 ` [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation Thomas Hellström
2023-02-23 11:18   ` Christian König
2022-02-23 13:51 [PATCH 0/3] drm/helpers: Make the suballocation manager drm generic Maarten Lankhorst
2022-02-23 13:51 ` [PATCH 3/3] drm/radeon: Use the drm suballocation manager implementation Maarten Lankhorst
