linux-media.vger.kernel.org archive mirror
* [PATCH 01/25] dma-fence: basic lockdep annotations
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-08 14:57   ` Christian König
  2020-07-07 20:12 ` [PATCH 02/25] dma-fence: prime " Daniel Vetter
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	Felix Kuehling, Thomas Hellström, Maarten Lankhorst,
	Mika Kuoppala, linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Christian König, Daniel Vetter

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
  this explicit annotation can be more liberally sprinkled around.
  With read locks lockdep isn't going to complain if the read-side
  isn't nested the same way under all circumstances, so ABBA deadlocks
  are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
  read lock mode lockdep does not catch read side hazards. And we
  _very_ much want read side hazards to be caught. For full details of
  this limitation see

  commit e91498589746065e3ae95d9a00b068e525eec34f
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Wed Aug 23 13:13:11 2017 +0200

      locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
  keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
  dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
  to call dma_fence_begin/end_signalling from soft/hardirq context.
  First attempt was using the hardirq locking context for the write
  side in lockdep, but this forces all normal spinlocks nested within
  dma_fence_begin/end_signalling to be spinlocks. That's bollocks.

  The approach now is to simply check in_atomic(), and for these cases
  entirely rely on the might_sleep() check in dma_fence_wait(). That
  will catch any wrong nesting against spinlocks from soft/hardirq
  contexts.
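
The nesting rule from the list above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions:
cookie_begin(), cookie_end() and cookie_section_active() are
hypothetical names, and a plain flag stands in for the lockdep state
that lock_is_held() inspects. The in_atomic() shortcut is not modelled.
The cookie is true whenever begin did not take the annotation itself,
so the matching end has nothing to release.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for "the dma_fence lockdep map is held".
 * The real state is tracked per task; a plain static is enough for
 * this single-threaded model.
 */
static bool section_active;

/* Returns the cookie: true means "nothing acquired, end is a no-op". */
bool cookie_begin(void)
{
	if (section_active)	/* explicit nesting: already annotated */
		return true;

	section_active = true;	/* outermost section takes the read side */
	return false;
}

void cookie_end(bool cookie)
{
	if (cookie)		/* nested call, outer section is still open */
		return;

	section_active = false;	/* outermost section drops the read side */
}

bool cookie_section_active(void)
{
	return section_active;
}
```

Calling cookie_begin() twice and then cookie_end() with the inner
cookie leaves the section open; only the end call that receives the
outer cookie closes it.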

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.
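
To see why treating the fence like a lock lets a checker find this,
here is a minimal userspace sketch of a lock-order graph. It is not how
lockdep is implemented; order_reset(), order_add(), order_has_cycle()
and the two class ids are hypothetical names for illustration only. An
edge X -> Y means "Y was acquired while X was held".

```c
#include <stdbool.h>
#include <string.h>

enum { CLASS_MUTEX_A, CLASS_FENCE, NR_CLASSES };

/* edge[x][y]: class y was acquired while class x was held */
static bool edge[NR_CLASSES][NR_CLASSES];

void order_reset(void)
{
	memset(edge, 0, sizeof(edge));
}

void order_add(int held, int acquired)
{
	edge[held][acquired] = true;
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (edge[from][next] && !seen[next] &&
		    reachable(next, to, seen))
			return true;
	return false;
}

/* A cycle exists if some edge x -> y can be closed by a path y -> x. */
bool order_has_cycle(void)
{
	for (int x = 0; x < NR_CLASSES; x++)
		for (int y = 0; y < NR_CLASSES; y++) {
			bool seen[NR_CLASSES] = { false };

			if (edge[x][y] && reachable(y, x, seen))
				return true;
		}
	return false;
}
```

With Thread A annotated, its mutex_lock(A) happens inside the
signalling section and records CLASS_FENCE -> CLASS_MUTEX_A. Thread B
waits while holding A and records CLASS_MUTEX_A -> CLASS_FENCE. After
both edges are added order_has_cycle() returns true, even if the two
threads never actually ran into each other.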

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and is still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindlessly, to avoid false
positives.
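
The consequence of that upgrade can be sketched with a tiny userspace
model. All names here (upgrade_begin(), upgrade_end(), upgrade_lock(),
upgrade_unlock(), upgrade_wait_is_clean()) are hypothetical and only
count locks; the real check is done by lockdep through the recorded
lock order, not by a counter.

```c
#include <stdbool.h>

static bool read_side_held;	/* inside begin/end_signalling */
static int locks_in_section;	/* taken inside the section, still held */

void upgrade_begin(void)
{
	read_side_held = true;
	locks_in_section = 0;
}

void upgrade_end(void)
{
	read_side_held = false;
}

void upgrade_lock(void)
{
	if (read_side_held)
		locks_in_section++;
}

void upgrade_unlock(void)
{
	if (read_side_held && locks_in_section > 0)
		locks_in_section--;
}

/*
 * Models a nested dma_fence_wait(): the read side is dropped and the
 * write side is taken. Any lock acquired under the read side and still
 * held then sits on both sides of the fence class, which is the
 * immediate complaint described above. Returns false in that case.
 */
bool upgrade_wait_is_clean(void)
{
	return !read_side_held || locks_in_section == 0;
}
```

A wait right after upgrade_begin() is clean, a wait between
upgrade_lock() and upgrade_unlock() is not. Outside of a signalling
section this model never complains; ordering problems there are left to
the regular lock-order checks.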

Originally I hoped that the cross-release lockdep extensions would
alleviate the need for explicit annotations:

https://lwn.net/Articles/709849/

But there's a few reasons why that's not an option:

- It's not happening in upstream, since it got reverted due to too
  many false positives:

	commit e966eaeeb623f09975ef362c2866fae6f86844f9
	Author: Ingo Molnar <mingo@kernel.org>
	Date:   Tue Dec 12 12:31:16 2017 +0100

	    locking/lockdep: Remove the cross-release locking checks

	    This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
	    while it found a number of old bugs initially, was also causing too many
	    false positives that caused people to disable lockdep - which is arguably
	    a worse overall outcome.

- cross-release uses the complete() call to annotate the end of
  critical sections, for dma_fence that would be dma_fence_signal().
  But we do not want all dma_fence_signal() calls to be treated as
  critical, since many are opportunistic cleanup of gpu requests. If
  these get stuck there's still the main completion interrupt and
  workers who can unblock everyone. Automatically annotating all
  dma_fence_signal() calls would hence cause false positives.

- cross-release had some educated guesses for when a critical section
  starts, like fresh syscall or fresh work callback. This would again
  cause false positives without explicit annotations, since for
  dma_fence the critical section only starts when we publish a fence.

- Furthermore there can be cases where a thread never does a
  dma_fence_signal, but is still critical for reaching completion of
  fences. One example would be a scheduler kthread which picks up jobs
  and pushes them into hardware, where the interrupt handler or
  another completion thread calls dma_fence_signal(). But if the
  scheduler thread hangs, then all the fences hang, hence we need to
  manually annotate it. cross-release aimed to solve this by chaining
  cross-release dependencies, but the dependency from scheduler thread
  to the completion interrupt handler goes through hw where
  cross-release code can't observe it.

In short, without manual annotations and careful review of the start
and end of critical sections, cross-release dependency tracking doesn't
work. We need explicit annotations.

v2: handle soft/hardirq ctx better against write side and don't forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

v5: Amend commit message to explain in detail why cross-release isn't
the solution.

v6: Pull out misplaced .rst hunk.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst |   6 +
 drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
 include/linux/dma-fence.h            |  12 ++
 3 files changed, 179 insertions(+)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 7fb7b661febd..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview
 
+DMA Fence Signalling Annotations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
 DMA Fences Functions Reference
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..0005bc002529 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+/**
+ * DOC: fence signalling annotation
+ *
+ * Proving correctness of all the kernel code around &dma_fence through code
+ * review and testing is tricky for a few reasons:
+ *
+ * * It is a cross-driver contract, and therefore all drivers must follow the
+ *   same rules for lock nesting order, calling contexts for various functions
+ *   and anything else significant for in-kernel interfaces. But it is also
+ *   impossible to test all drivers in a single machine, hence brute-force N vs.
+ *   N testing of all combinations is impossible. Even just limiting to the
+ *   possible combinations is infeasible.
+ *
+ * * There is an enormous amount of driver code involved. For render drivers
+ *   there's the tail of command submission, after fences are published,
+ *   scheduler code, interrupt and workers to process job completion,
+ *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
+ *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
+ *   and &shrinker. For modesetting drivers there's the commit tail functions
+ *   between when fences for an atomic modeset are published, and when the
+ *   corresponding vblank completes, including any interrupt processing and
+ *   related workers. Auditing all that code, across all drivers, is not
+ *   feasible.
+ *
+ * * Due to how many other subsystems are involved and the locking hierarchies
+ *   this pulls in there is extremely thin wiggle-room for driver-specific
+ *   differences. &dma_fence interacts with almost all of the core memory
+ *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
+ *   dma_resv_unlock(). On the other side it also interacts through all
+ *   allocation sites through &mmu_notifier and &shrinker.
+ *
+ * Furthermore lockdep does not handle cross-release dependencies, which means
+ * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
+ * at runtime with some quick testing. The simplest example is one thread
+ * waiting on a &dma_fence while holding a lock::
+ *
+ *     lock(A);
+ *     dma_fence_wait(B);
+ *     unlock(A);
+ *
+ * while the other thread is stuck trying to acquire the same lock, which
+ * prevents it from signalling the fence the previous thread is stuck waiting
+ * on::
+ *
+ *     lock(A);
+ *     unlock(A);
+ *     dma_fence_signal(B);
+ *
+ * By manually annotating all code relevant to signalling a &dma_fence we can
+ * teach lockdep about these dependencies, which also helps with the validation
+ * headache since now lockdep can check all the rules for us::
+ *
+ *    cookie = dma_fence_begin_signalling();
+ *    lock(A);
+ *    unlock(A);
+ *    dma_fence_signal(B);
+ *    dma_fence_end_signalling(cookie);
+ *
+ * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
+ * annotate critical sections the following rules need to be observed:
+ *
+ * * All code necessary to complete a &dma_fence must be annotated, from the
+ *   point where a fence is accessible to other threads, to the point where
+ *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
+ *   and due to the very strict rules and many corner cases it is infeasible to
+ *   catch these just with review or normal stress testing.
+ *
+ * * &struct dma_resv deserves a special note, since the readers are only
+ *   protected by rcu. This means the signalling critical section starts as soon
+ *   as the new fences are installed, even before dma_resv_unlock() is called.
+ *
+ * * The only exceptions are fast paths and opportunistic signalling code, which
+ *   calls dma_fence_signal() purely as an optimization, but is not required to
+ *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
+ *   which calls dma_fence_signal(), while the mandatory completion path goes
+ *   through a hardware interrupt and possible job completion worker.
+ *
+ * * To aid composability of code, the annotations can be freely nested, as long
+ *   as the overall locking hierarchy is consistent. The annotations also work
+ *   both in interrupt and process context. Due to implementation details this
+ *   requires that callers pass an opaque cookie from
+ *   dma_fence_begin_signalling() to dma_fence_end_signalling().
+ *
+ * * Validation against the cross driver contract is implemented by priming
+ *   lockdep with the relevant hierarchy at boot-up. This means even just
+ *   testing with a single device is enough to validate a driver, at least as
+ *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
+ *   concerned.
+ */
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+/**
+ * dma_fence_begin_signalling - begin a critical DMA fence signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections is annotated with
+ * dma_fence_end_signalling().
+ *
+ * Returns:
+ *
+ * Opaque cookie needed by the implementation, which needs to be passed to
+ * dma_fence_end_signalling().
+ */
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+/**
+ * dma_fence_end_signalling - end a critical DMA fence signalling section
+ *
+ * Closes a critical section annotation opened by dma_fence_begin_signalling().
+ */
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 
 	might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
-- 
2.27.0



* [PATCH 02/25] dma-fence: prime lockdep annotations
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
  2020-07-07 20:12 ` [PATCH 01/25] dma-fence: basic lockdep annotations Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-09  8:09   ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	Jason Gunthorpe, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter

Two in one go:
- it is allowed to call dma_fence_wait() while holding a
  dma_resv_lock(). This is fundamental to how eviction works with ttm,
  so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts,
  specifically from shrinker callbacks (which i915 does), and from mmu
  notifier callbacks (which amdgpu does, and which i915 sometimes also
  does, and probably always should, but that's kinda a debate). Also
  for stuff like HMM we really need to be able to do this, or things
  get real dicey.

Consequence is that any critical path necessary to get to a
dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
allocate memory with GFP_KERNEL. Also by implication of
dma_resv_lock(), no userspace faulting allowed. Those are some supremely
obnoxious limitations, which is why we need to sprinkle the right
annotations to all relevant paths.
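
How priming turns these rules into something checkable can be sketched
in userspace. This is a simplified model, not lockdep: the class names,
prime_contract() and signalling_may_take() are hypothetical, and the
chain only mirrors the acquisition order in the dma_resv_lockdep() hunk
of this patch (dma_resv lock, then fs_reclaim for GFP_KERNEL, then the
mmu notifier map, then the fence wait).

```c
#include <stdbool.h>

enum primed_class {
	PRIMED_DMA_RESV,
	PRIMED_FS_RECLAIM,	/* entered by GFP_KERNEL allocations */
	PRIMED_MMU_NOTIFIER,
	PRIMED_FENCE,		/* the dma_fence annotation itself */
	PRIMED_DRIVER_LOCK,	/* some private driver lock, never primed */
	NR_PRIMED_CLASSES
};

/* primed[x][y]: class y was acquired while class x was held */
static bool primed[NR_PRIMED_CLASSES][NR_PRIMED_CLASSES];

/* Record the boot-time chain once, before any driver code runs. */
void prime_contract(void)
{
	primed[PRIMED_DMA_RESV][PRIMED_FS_RECLAIM] = true;
	primed[PRIMED_FS_RECLAIM][PRIMED_MMU_NOTIFIER] = true;
	primed[PRIMED_MMU_NOTIFIER][PRIMED_FENCE] = true;
}

static bool leads_to(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_PRIMED_CLASSES; next++)
		if (primed[from][next] && !seen[next] &&
		    leads_to(next, to, seen))
			return true;
	return false;
}

/*
 * Inside a signalling section the fence class is held, so taking 'cls'
 * would add the edge fence -> cls. That closes a cycle whenever the
 * primed chain already leads from cls to the fence.
 */
bool signalling_may_take(enum primed_class cls)
{
	bool seen[NR_PRIMED_CLASSES] = { false };

	if (cls == PRIMED_FENCE)	/* nesting is handled by the cookie */
		return true;

	return !leads_to(cls, PRIMED_FENCE, seen);
}
```

After prime_contract(), signalling_may_take() rejects the dma_resv,
fs_reclaim and mmu notifier classes and accepts the unprimed driver
lock, without any second driver having to be loaded.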

The one big locking context we're leaving out here is mmu notifiers,
added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 22:14:21 2019 +0200

    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to
wait on dma-fences from mmu notifiers. But there's no ready-made
functions exposed to prime this, so I've left it out for now.

v2: Also track against mmu notifier context.

v3: kerneldoc to spec the cross-driver contract. Note that currently
i915 throws in a hard-coded 10s timeout on foreign fences (not sure
why that was done, but it's there), which is why that rule is worded
with SHOULD instead of MUST.

Also some of the mmu_notifier/shrinker rules might surprise SoC
drivers, I haven't fully audited them all. Which is infeasible anyway,
we'll need to run them with lockdep and dma-fence annotations and see
what goes boom.

v4: A spelling fix from Mika

v5: #ifdef for CONFIG_MMU_NOTIFIER. Reported by 0day. Unfortunately
this means lockdep enforcement is slightly inconsistent, it won't spot
GFP_NOIO and GFP_NOFS allocations in the wrong spot if
CONFIG_MMU_NOTIFIER is disabled in the kernel config. Oh well.

v5: Note that only drivers/gpu has a reasonable (or at least
historical) excuse to use dma_fence_wait() from shrinker and mmu
notifier callbacks. Everyone else should either have a better memory
manager model, or better hardware. This reflects discussions with
Jason Gunthorpe.

Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: kernel test robot <lkp@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com> (v4)
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst |  6 ++++
 drivers/dma-buf/dma-fence.c          | 46 ++++++++++++++++++++++++++++
 drivers/dma-buf/dma-resv.c           |  8 +++++
 include/linux/dma-fence.h            |  1 +
 4 files changed, 61 insertions(+)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 05d856131140..f8f6decde359 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -133,6 +133,12 @@ DMA Fences
 .. kernel-doc:: drivers/dma-buf/dma-fence.c
    :doc: DMA fences overview
 
+DMA Fence Cross-Driver Contract
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence cross-driver contract
+
 DMA Fence Signalling Annotations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 0005bc002529..af1d8ea926b3 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -64,6 +64,52 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
  *   &dma_buf.resv pointer.
  */
 
+/**
+ * DOC: fence cross-driver contract
+ *
+ * Since &dma_fence provides a cross driver contract, all drivers must follow the
+ * same rules:
+ *
+ * * Fences must complete in a reasonable time. Fences which represent kernels
+ *   and shaders submitted by userspace, which could run forever, must be backed
+ *   up by timeout and gpu hang recovery code. Minimally that code must prevent
+ *   further command submission and force complete all in-flight fences, e.g.
+ *   when the driver or hardware do not support gpu reset, or if the gpu reset
+ *   failed for some reason. Ideally the driver supports gpu recovery which only
+ *   affects the offending userspace context, and no other userspace
+ *   submissions.
+ *
+ * * Drivers may have different ideas of what completion within a reasonable
+ *   time means. Some hang recovery code uses a fixed timeout, others a mix
+ *   between observing forward progress and increasingly strict timeouts.
+ *   Drivers should not try to second guess timeout handling of fences from
+ *   other drivers.
+ *
+ * * To ensure there's no deadlocks of dma_fence_wait() against other locks
+ *   drivers should annotate all code required to reach dma_fence_signal(),
+ *   which completes the fences, with dma_fence_begin_signalling() and
+ *   dma_fence_end_signalling().
+ *
+ * * Drivers are allowed to call dma_fence_wait() while holding dma_resv_lock().
+ *   This means any code required for fence completion cannot acquire a
+ *   &dma_resv lock. Note that this also pulls in the entire established
+ *   locking hierarchy around dma_resv_lock() and dma_resv_unlock().
+ *
+ * * Drivers are allowed to call dma_fence_wait() from their &shrinker
+ *   callbacks. This means any code required for fence completion cannot
+ *   allocate memory with GFP_KERNEL.
+ *
+ * * Drivers are allowed to call dma_fence_wait() from their &mmu_notifier
+ *   respectively &mmu_interval_notifier callbacks. This means any code required
+ *   for fence completion cannot allocate memory with GFP_NOFS or GFP_NOIO.
+ *   Only GFP_ATOMIC is permissible, which might fail.
+ *
+ * Note that only GPU drivers have a reasonable excuse for both requiring
+ * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
+ * track asynchronous compute work using &dma_fence. No driver outside of
+ * drivers/gpu should ever call dma_fence_wait() in such contexts.
+ */
+
 static const char *dma_fence_stub_get_name(struct dma_fence *fence)
 {
         return "stub";
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index e7d7197d48ce..0e6675ec1d11 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -36,6 +36,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/sched/mm.h>
+#include <linux/mmu_notifier.h>
 
 /**
  * DOC: Reservation Object Overview
@@ -116,6 +117,13 @@ static int __init dma_resv_lockdep(void)
 	if (ret == -EDEADLK)
 		dma_resv_lock_slow(&obj, &ctx);
 	fs_reclaim_acquire(GFP_KERNEL);
+#ifdef CONFIG_MMU_NOTIFIER
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	__dma_fence_might_wait();
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+#else
+	__dma_fence_might_wait();
+#endif
 	fs_reclaim_release(GFP_KERNEL);
 	ww_mutex_unlock(&obj.lock);
 	ww_acquire_fini(&ctx);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3f288f7db2ef..09e23adb351d 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 #ifdef CONFIG_LOCKDEP
 bool dma_fence_begin_signalling(void);
 void dma_fence_end_signalling(bool cookie);
+void __dma_fence_might_wait(void);
 #else
 static inline bool dma_fence_begin_signalling(void)
 {
-- 
2.27.0



* [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
  2020-07-07 20:12 ` [PATCH 01/25] dma-fence: basic lockdep annotations Daniel Vetter
  2020-07-07 20:12 ` [PATCH 02/25] dma-fence: prime " Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-09  7:36   ` [Intel-gfx] " Daniel Stone
                     ` (2 more replies)
  2020-07-07 20:12 ` [PATCH 04/25] drm/vkms: Annotate vblank timer Daniel Vetter
                   ` (12 subsequent siblings)
  15 siblings, 3 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	Jesse Natalie, Steve Pronovost, Jason Ekstrand, Felix Kuehling,
	Mika Kuoppala, Thomas Hellstrom, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

v2: Now with dot graph!

Cc: Jesse Natalie <jenatali@microsoft.com>
Cc: Steve Pronovost <spronovo@microsoft.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst     | 70 ++++++++++++++++++++++++
 drivers/gpu/drm/virtio/virtgpu_display.c | 20 -------
 2 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index f8f6decde359..037ba0078bb4 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+Idefinite DMA Fences
+~~~~~~~~~~~~~~~~~~~~
+
+At various times &dma_fence with an indefinite time until dma_fence_wait()
+finishes have been proposed. Examples include:
+
+* Future fences, used in HWC1 to signal when a buffer isn't used by the display
+  any longer, and created with the screen update that makes the buffer visible.
+  The time this fence completes is entirely under userspace's control.
+
+* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
+  been set. Used to asynchronously delay command submission.
+
+* Userspace fences or gpu futexes, fine-grained locking within a command buffer
+  that userspace uses for synchronization across engines or with the CPU, which
+  are then imported as a DMA fence for integration into existing winsys
+  protocols.
+
+* Long-running compute command buffers, while still using traditional end of
+  batch DMA fences for memory management instead of context preemption DMA
+  fences which get reattached when the compute job is rescheduled.
+
+Common to all these schemes is that userspace controls the dependencies of these
+fences and controls when they fire. Mixing indefinite fences with normal
+in-kernel DMA fences does not work, even when a fallback timeout is included to
+protect against malicious userspace:
+
+* Only the kernel knows about all DMA fence dependencies, userspace is not aware
+  of dependencies injected due to memory management or scheduler decisions.
+
+* Only userspace knows about all dependencies in indefinite fences and when
+  exactly they will complete, the kernel has no visibility.
+
+Furthermore the kernel has to be able to hold up userspace command submission
+for memory management needs, which means we must support indefinite fences being
+dependent upon DMA fences. If the kernel also supports indefinite fences in the
+kernel like a DMA fence, as any of the above proposals would, there is the
+potential for deadlocks.
+
+.. kernel-render:: DOT
+   :alt: Indefinite Fencing Dependency Cycle
+   :caption: Indefinite Fencing Dependency Cycle
+
+   digraph "Fencing Cycle" {
+      node [shape=box bgcolor=grey style=filled]
+      kernel [label="Kernel DMA Fences"]
+      userspace [label="userspace controlled fences"]
+      kernel -> userspace [label="memory management"]
+      userspace -> kernel [label="Future fence, fence proxy, ..."]
+
+      { rank=same; kernel userspace }
+   }
+
+This means that the kernel might accidentally create deadlocks through memory
+management dependencies which userspace is unaware of, which randomly hangs
+workloads until the timeout kicks in, even though these workloads do not
+contain a deadlock from userspace's perspective. In such a mixed fencing
+architecture there is no single entity with knowledge of all dependencies.
+Therefore preventing such deadlocks from within the kernel is not possible.
+
+The only solution to avoid dependency loops is by not allowing indefinite
+fences in the kernel. This means:
+
+* No future fences, proxy fences or userspace fences imported as DMA fences,
+  with or without a timeout.
+
+* No DMA fences that signal end of batchbuffer for command submission where
+  userspace is allowed to use userspace fencing or long running compute
+  workloads. This also means no implicit fencing for shared buffers in these
+  cases.
diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
index f3ce49c5a34c..af55b334be2f 100644
--- a/drivers/gpu/drm/virtio/virtgpu_display.c
+++ b/drivers/gpu/drm/virtio/virtgpu_display.c
@@ -314,25 +314,6 @@ virtio_gpu_user_framebuffer_create(struct drm_device *dev,
 	return &virtio_gpu_fb->base;
 }
 
-static void vgdev_atomic_commit_tail(struct drm_atomic_state *state)
-{
-	struct drm_device *dev = state->dev;
-
-	drm_atomic_helper_commit_modeset_disables(dev, state);
-	drm_atomic_helper_commit_modeset_enables(dev, state);
-	drm_atomic_helper_commit_planes(dev, state, 0);
-
-	drm_atomic_helper_fake_vblank(state);
-	drm_atomic_helper_commit_hw_done(state);
-
-	drm_atomic_helper_wait_for_vblanks(dev, state);
-	drm_atomic_helper_cleanup_planes(dev, state);
-}
-
-static const struct drm_mode_config_helper_funcs virtio_mode_config_helpers = {
-	.atomic_commit_tail = vgdev_atomic_commit_tail,
-};
-
 static const struct drm_mode_config_funcs virtio_gpu_mode_funcs = {
 	.fb_create = virtio_gpu_user_framebuffer_create,
 	.atomic_check = drm_atomic_helper_check,
@@ -346,7 +327,6 @@ void virtio_gpu_modeset_init(struct virtio_gpu_device *vgdev)
 	drm_mode_config_init(vgdev->ddev);
 	vgdev->ddev->mode_config.quirk_addfb_prefer_host_byte_order = true;
 	vgdev->ddev->mode_config.funcs = &virtio_gpu_mode_funcs;
-	vgdev->ddev->mode_config.helper_private = &virtio_mode_config_helpers;
 
 	/* modes will be validated against the framebuffer size */
 	vgdev->ddev->mode_config.min_width = XRES_MIN;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 04/25] drm/vkms: Annotate vblank timer
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (2 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-12 22:27   ` Rodrigo Siqueira
  2020-07-07 20:12 ` [PATCH 05/25] drm/vblank: Annotate with dma-fence signalling section Daniel Vetter
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter,
	Rodrigo Siqueira, Haneen Mohammed, Daniel Vetter

This is needed to signal the fences from page flips, so annotate it
accordingly. We need to annotate the entire timer callback, since if we
get stuck anywhere in there, the timer stops, and hence fences stop.
Just annotating the top part that does the vblank handling isn't
enough.
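
To illustrate why annotating the whole callback still composes with
annotations further down the call chain, here is a minimal userspace
model of the cookie convention. It is only a sketch: every model_*
name is hypothetical, the depth counter stands in for what lockdep
tracks, and none of this is the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lockdep state; not kernel code. */
static int signalling_depth;

/* Returns true if we were already inside a signalling section. */
bool model_begin_signalling(void)
{
	if (signalling_depth > 0)
		return true;	/* nested: the outer section owns the annotation */
	signalling_depth = 1;
	return false;
}

void model_end_signalling(bool cookie)
{
	if (cookie)
		return;		/* nested call: leave the outer section open */
	assert(signalling_depth == 1);
	signalling_depth = 0;
}

/* Inner helper that carries its own annotation, like a vblank handler. */
void model_handle_vblank(void)
{
	bool cookie = model_begin_signalling();
	/* ... signal fences here ... */
	model_end_signalling(cookie);
}

/* Timer callback: the whole body is one signalling section. */
int model_timer_callback(void)
{
	bool cookie = model_begin_signalling();
	int depth_inside;

	model_handle_vblank();
	depth_inside = signalling_depth;	/* inner end was a no-op */

	model_end_signalling(cookie);
	return depth_inside;
}
```

In this model the inner begin/end pair does nothing once the outer
section is open, so the section stays open until the callback's own
end call. That is the property the annotation relies on.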

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
index ac85e17428f8..a53a40848a72 100644
--- a/drivers/gpu/drm/vkms/vkms_crtc.c
+++ b/drivers/gpu/drm/vkms/vkms_crtc.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
 
+#include <linux/dma-fence.h>
+
 #include <drm/drm_atomic.h>
 #include <drm/drm_atomic_helper.h>
 #include <drm/drm_probe_helper.h>
@@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 	struct drm_crtc *crtc = &output->crtc;
 	struct vkms_crtc_state *state;
 	u64 ret_overrun;
-	bool ret;
+	bool ret, fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
 					  output->period_ns);
@@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 			DRM_DEBUG_DRIVER("Composer worker already queued\n");
 	}
 
+	dma_fence_end_signalling(fence_cookie);
+
 	return HRTIMER_RESTART;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 05/25] drm/vblank: Annotate with dma-fence signalling section
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (3 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 04/25] drm/vkms: Annotate vblank timer Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 06/25] drm/amdgpu: add dma-fence annotations to atomic commit path Daniel Vetter
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

This is rather overkill since currently all drivers call this from
hardirq (or at least timers). But maybe in the future we're going to
have threaded irq handlers and whatnot, so it doesn't hurt to be
prepared. Plus this is an easy start for sprinkling these fence
annotations into shared code.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_vblank.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 42a84eb4cc8c..d681ab09963c 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -24,6 +24,7 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/dma-fence.h>
 #include <linux/export.h>
 #include <linux/moduleparam.h>
 
@@ -1909,7 +1910,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
 	unsigned long irqflags;
-	bool disable_irq;
+	bool disable_irq, fence_cookie;
 
 	if (drm_WARN_ON_ONCE(dev, !drm_dev_has_vblank(dev)))
 		return false;
@@ -1917,6 +1918,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (drm_WARN_ON(dev, pipe >= dev->num_crtcs))
 		return false;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(&dev->event_lock, irqflags);
 
 	/* Need timestamp lock to prevent concurrent execution with
@@ -1929,6 +1932,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (!vblank->enabled) {
 		spin_unlock(&dev->vblank_time_lock);
 		spin_unlock_irqrestore(&dev->event_lock, irqflags);
+		dma_fence_end_signalling(fence_cookie);
 		return false;
 	}
 
@@ -1954,6 +1958,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (disable_irq)
 		vblank_disable_fn(&vblank->disable_timer);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	return true;
 }
 EXPORT_SYMBOL(drm_handle_vblank);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 06/25] drm/amdgpu: add dma-fence annotations to atomic commit path
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (4 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 05/25] drm/vblank: Annotate with dma-fence signalling section Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 16/25] drm/atomic-helper: Add dma-fence annotations Daniel Vetter
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

I need a canary in a ttm-based atomic driver to make sure the
dma_fence_begin/end_signalling annotations actually work.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 5b0f708dd8c5..6afcc33ff846 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -57,6 +57,7 @@
 
 #include "ivsrcid/ivsrcid_vislands30.h"
 
+#include <linux/dma-fence.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/version.h>
@@ -7359,6 +7360,9 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
 	struct drm_connector_state *old_con_state, *new_con_state;
 	struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state;
 	int crtc_disable_count = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_update_legacy_modeset_state(dev, state);
 
@@ -7639,6 +7643,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
 	/* Signal HW programming completion */
 	drm_atomic_helper_commit_hw_done(state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (wait_for_vblank)
 		drm_atomic_helper_wait_for_flip_done(dev, state);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 16/25] drm/atomic-helper: Add dma-fence annotations
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (5 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 06/25] drm/amdgpu: add dma-fence annotations to atomic commit path Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 17/25] drm/scheduler: use dma-fence annotations in main thread Daniel Vetter
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

This is a bit disappointing since we need to split the annotations
over all the different parts.

I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so a bad flag day. It would
also be a bit of a leaky abstraction.

Hence just do it function-by-function.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index c6bf9722b51b..f67ee513a7cc 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1550,6 +1550,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
 void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1561,6 +1562,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1580,6 +1583,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
 void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1592,6 +1596,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1607,6 +1613,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
 	ktime_t start;
 	s64 commit_time_ms;
 	unsigned int i, new_self_refresh_mask = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	funcs = dev->mode_config.helper_private;
 
@@ -1635,6 +1644,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
 		if (new_crtc_state->self_refresh_active)
 			new_self_refresh_mask |= BIT(i);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (funcs && funcs->atomic_commit_tail)
 		funcs->atomic_commit_tail(old_state);
 	else
@@ -1790,6 +1801,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 			     bool nonblock)
 {
 	int ret;
+	bool fence_cookie;
 
 	if (state->async_update) {
 		ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1812,6 +1824,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	if (ret)
 		return ret;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	if (!nonblock) {
 		ret = drm_atomic_helper_wait_for_fences(dev, state, true);
 		if (ret)
@@ -1849,6 +1863,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 */
 
 	drm_atomic_state_get(state);
+	dma_fence_end_signalling(fence_cookie);
 	if (nonblock)
 		queue_work(system_unbound_wq, &state->commit_work);
 	else
@@ -1857,6 +1872,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	return 0;
 
 err:
+	dma_fence_end_signalling(fence_cookie);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 17/25] drm/scheduler: use dma-fence annotations in main thread
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (6 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 16/25] drm/atomic-helper: Add dma-fence annotations Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 18/25] drm/amdgpu: use dma-fence annotations in cs_submit() Daniel Vetter
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

If the scheduler rt thread gets stuck on a mutex that we're holding
while waiting for gpu workloads to complete, we have a problem.

Add dma-fence annotations so that lockdep can check this for us.

I've tried to review this quite carefully, and I think it's at the
right spot. But obviously I'm no expert on the drm scheduler.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index d6eaa23ad746..52f1ab4bc922 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -765,9 +765,12 @@ static int drm_sched_main(void *param)
 	struct sched_param sparam = {.sched_priority = 1};
 	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
 	int r;
+	bool fence_cookie;
 
 	sched_setscheduler(current, SCHED_FIFO, &sparam);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	while (!kthread_should_stop()) {
 		struct drm_sched_entity *entity = NULL;
 		struct drm_sched_fence *s_fence;
@@ -825,6 +828,9 @@ static int drm_sched_main(void *param)
 
 		wake_up(&sched->job_scheduled);
 	}
+
+	dma_fence_end_signalling(fence_cookie);
+
 	return 0;
 }
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 18/25] drm/amdgpu: use dma-fence annotations in cs_submit()
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (7 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 17/25] drm/scheduler: use dma-fence annotations in main thread Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code Daniel Vetter
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

This is a bit tricky: since ->notifier_lock is held while calling
dma_fence_wait, we must ensure that the read side (i.e.
dma_fence_begin_signalling) is also on the same side of that lock. If
we mix this up lockdep complains, and that's again why we want to have
these annotations.

A nice side effect of this is that because of the fs_reclaim priming
for dma_fence_enable lockdep now automatically checks for us that
nothing in here allocates memory, without even running any userptr
workloads.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index a512ccbc4dea..858528a06fe7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1212,6 +1212,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	struct amdgpu_job *job;
 	uint64_t seq;
 	int r;
+	bool fence_cookie;
 
 	job = p->job;
 	p->job = NULL;
@@ -1226,6 +1227,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	 */
 	mutex_lock(&p->adev->notifier_lock);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
 	 */
@@ -1262,12 +1265,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
+	dma_fence_end_signalling(fence_cookie);
 	mutex_unlock(&p->adev->notifier_lock);
 
 	return 0;
 
 error_abort:
 	drm_sched_job_cleanup(&job->base);
+	dma_fence_end_signalling(fence_cookie);
 	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (8 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 18/25] drm/amdgpu: use dma-fence annotations in cs_submit() Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-14 10:49   ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't Daniel Vetter
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

My dma-fence lockdep annotations caught an inversion because we
allocate memory where we really shouldn't:

	kmem_cache_alloc+0x2b/0x6d0
	amdgpu_fence_emit+0x30/0x330 [amdgpu]
	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
	amdgpu_job_run+0x10f/0x260 [amdgpu]
	drm_sched_main+0x1b9/0x490 [gpu_sched]
	kthread+0x12e/0x150

Trouble right now is that lockdep only validates against GFP_FS, which
would be good enough for shrinkers. But for mmu_notifiers we actually
need !GFP_ATOMIC, since they can be called from any page laundering,
even if GFP_NOFS or GFP_NOIO are set.

I guess we should improve the lockdep annotations for
fs_reclaim_acquire/release.

Of course the real fix is to properly preallocate this fence and stuff
it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat
out of the way.
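
For reference, the preallocation idea could look roughly like the
userspace sketch below. The model_* structs and functions are
hypothetical and only show the ownership hand-off; they are not the
amdgpu job or fence structures, and calloc() stands in for a sleeping
kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

struct model_fence {
	unsigned int seq;
};

struct model_job {
	struct model_fence *fence;	/* preallocated at submit time */
};

/* Submit path: may sleep and reclaim, so allocating here is fine. */
int model_job_init(struct model_job *job)
{
	job->fence = calloc(1, sizeof(*job->fence));
	return job->fence ? 0 : -1;
}

/* Scheduler path: inside the signalling section, must not allocate. */
struct model_fence *model_job_run(struct model_job *job, unsigned int seq)
{
	struct model_fence *fence = job->fence;

	assert(fence != NULL);	/* model_job_init() must have succeeded */
	job->fence = NULL;	/* ownership moves to the caller */
	fence->seq = seq;
	return fence;
}
```

The point of the split is that the only allocation happens before the
job reaches the scheduler thread, so the run path cannot recurse into
reclaim at all, which is a stronger guarantee than GFP_ATOMIC gives.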

v2: Two more allocations in scheduler paths.

First one:

	__kmalloc+0x58/0x720
	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
	drm_sched_main+0xf9/0x490 [gpu_sched]

Second one:

	kmem_cache_alloc+0x2b/0x6d0
	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
	drm_sched_main+0xf9/0x490 [gpu_sched]

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8d84975885cd..a089a827fdfe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
 	uint32_t seq;
 	int r;
 
-	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
 	if (fence == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index 267fa45ddb66..a333ca2d4ddd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
 	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
 		return amdgpu_sync_fence(sync, ring->vmid_wait);
 
-	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
 	if (!fences)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index 8ea6c49529e7..af22b526cec9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
 	if (amdgpu_sync_add_later(sync, f))
 		return 0;
 
-	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
 	if (!e)
 		return -ENOMEM;
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (9 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-14 11:12   ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 21/25] drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail Daniel Vetter
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

Not going to bother with a complete & pretty commit message, just the
offending backtrace:

        kvmalloc_node+0x47/0x80
        dc_create_state+0x1f/0x60 [amdgpu]
        dc_commit_state+0xcb/0x9b0 [amdgpu]
        amdgpu_dm_atomic_commit_tail+0xd31/0x2010 [amdgpu]
        commit_tail+0xa4/0x140 [drm_kms_helper]
        drm_atomic_helper_commit+0x152/0x180 [drm_kms_helper]
        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
        drm_client_modeset_commit_locked+0x55/0x190 [drm]
        drm_client_modeset_commit+0x24/0x40 [drm]

v2: Found more in DC code, I'm just going to pile them all up.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/atom.c                 | 2 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c          | 4 +++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
index 4cfc786699c7..1b0c674fab25 100644
--- a/drivers/gpu/drm/amd/amdgpu/atom.c
+++ b/drivers/gpu/drm/amd/amdgpu/atom.c
@@ -1226,7 +1226,7 @@ static int amdgpu_atom_execute_table_locked(struct atom_context *ctx, int index,
 	ectx.abort = false;
 	ectx.last_jump = 0;
 	if (ws)
-		ectx.ws = kcalloc(4, ws, GFP_KERNEL);
+		ectx.ws = kcalloc(4, ws, GFP_ATOMIC);
 	else
 		ectx.ws = NULL;
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 6afcc33ff846..3d41eddc7908 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6872,7 +6872,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		struct dc_stream_update stream_update;
 	} *bundle;
 
-	bundle = kzalloc(sizeof(*bundle), GFP_KERNEL);
+	bundle = kzalloc(sizeof(*bundle), GFP_ATOMIC);
 
 	if (!bundle) {
 		dm_error("Failed to allocate update bundle\n");
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 942ceb0f6383..f9a58509efb2 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1475,8 +1475,10 @@ bool dc_post_update_surfaces_to_stream(struct dc *dc)
 
 struct dc_state *dc_create_state(struct dc *dc)
 {
+	/* No, you really can't allocate random crap here this late in
+	 * atomic_commit_tail. */
 	struct dc_state *context = kvzalloc(sizeof(struct dc_state),
-					    GFP_KERNEL);
+					    GFP_ATOMIC);
 
 	if (!context)
 		return NULL;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 21/25] drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (10 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 22/25] drm/scheduler: use dma-fence annotations in tdr work Daniel Vetter
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

Trying to grab dma_resv_lock while in commit_tail before we've done
all the code that leads to the eventual signalling of the vblank event
(which can be a dma_fence) is deadlock-y. Don't do that.

Here the solution is easy, because grabbing a lock just to read
something races anyway. We don't need to bother: READ_ONCE is
equivalent and avoids the locking issue.
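
As a rough illustration of the lockless read, here is a userspace
approximation. MODEL_READ_ONCE is a hypothetical macro built on the GNU
__typeof__ extension, not the kernel's READ_ONCE(), and struct model_bo
only assumes a 64-bit tiling_flags field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace approximation of a READ_ONCE()-style access: the volatile
 * cast stops the compiler from caching or refetching the load.
 * Relies on the GNU __typeof__ extension (gcc, clang).
 */
#define MODEL_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct model_bo {
	uint64_t tiling_flags;	/* may be updated concurrently elsewhere */
};

/* Lockless snapshot: no lock is taken, so no lock can be inverted. */
uint64_t model_get_tiling_flags(struct model_bo *bo)
{
	assert(bo != NULL);
	return MODEL_READ_ONCE(bo->tiling_flags);
}
```

The snapshot can be stale as soon as it is returned, which is equally
true of the lock/read/unlock sequence it replaces. On targets where a
64-bit load is not a single access, the value could additionally tear.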

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 3d41eddc7908..d6bb876a74e5 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6949,7 +6949,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		 * explicitly on fences instead
 		 * and in general should be called for
 		 * blocking commit to as per framework helpers
+		 *
+		 * Yes, this deadlocks, since you're calling dma_resv_lock in a
+		 * path that leads to a dma_fence_signal(). Don't do that.
 		 */
+#if 0
 		r = amdgpu_bo_reserve(abo, true);
 		if (unlikely(r != 0))
 			DRM_ERROR("failed to reserve buffer before flip\n");
@@ -6959,6 +6963,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		tmz_surface = amdgpu_bo_encrypted(abo);
 
 		amdgpu_bo_unreserve(abo);
+#endif
+		/*
+		 * this races anyway, so READ_ONCE isn't any better or worse
+		 * than the stuff above. Except the stuff above can deadlock.
+		 */
+		tiling_flags = READ_ONCE(abo->tiling_flags);
 
 		fill_dc_plane_info_and_addr(
 			dm->adev, new_plane_state, tiling_flags,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 22/25] drm/scheduler: use dma-fence annotations in tdr work
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (11 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 21/25] drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 23/25] drm/amdgpu: use dma-fence annotations for gpu reset code Daniel Vetter
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

In the face of unprivileged userspace being able to submit bogus gpu
workloads the kernel needs gpu timeout and reset (tdr) to guarantee
that dma_fences actually complete. Annotate this worker to make sure
we don't have any accidental locking inversions or other problems
lurking.

Originally this was part of the overall scheduler annotation patch.
But amdgpu has some glorious inversions here:

- grabs console_lock
- does a full modeset, which grabs all kinds of locks
  (drm_modeset_lock, dma_resv_lock) which can deadlock with
  dma_fence_wait held inside them.
- almost minor at that point, but the modeset code also allocates
  memory

These all look like they'll be very hard to fix properly, the hardware
seems to require a full display reset with any gpu recovery.

Hence split out as a separate patch.

Since amdgpu isn't the only hardware driver that needs to reset the
display (at least gen2/3 on intel have the same problem) we need a
generic solution for this. There are two tricks we could steal from
drm/i915 and lift to dma-fence:

- The big whack, aka force-complete all fences. i915 does this for all
  pending jobs if the reset is somehow stuck. Trouble is we'd need to
  do this for all fences in the entire system, and just the
  book-keeping for that will be fun. Plus lots of drivers use fences
  for all kinds of internal stuff like memory management, so
  unconditionally resetting all of them doesn't work.

  I'm also hoping that with these fence annotations we could enlist
  lockdep in finding the last offenders causing deadlocks, and we
  could remove this get-out-of-jail trick.

- The more feasible approach (across drivers at least as part of the
  dma_fence contract) is what drm/i915 does for gen2/3: When we need
  to reset the display we wake up all dma_fence_wait_interruptible
  calls, or well at least the equivalent of those in i915 internally.

  Relying on ioctl restart we force all other threads to release their
  locks, which means the tdr thread is guaranteed to be able to get
  them. I think we could implement this at the dma_fence level,
  including proper lockdep annotations.

  dma_fence_begin_tdr():
  - must be nested within a dma_fence_begin/end_signalling section
  - will wake up all interruptible (but not the non-interruptible)
    dma_fence_wait() calls and force them to complete with a
    -ERESTARTSYS errno code. All new interruptible calls to
    dma_fence_wait() will immediately fail with the same error code.

  dma_fence_end_tdr():
  - this will convert dma_fence_wait() calls back to normal.

  Of course interrupting dma_fence_wait is only ok if the caller
  specified that, which means we need to split the annotations into
  interruptible and non-interruptible versions. If we then make sure
  that we only use interruptible dma_fence_wait() calls while holding
  drm_modeset_lock we can grab them in tdr code, and allow display
  resets. Doing the same for dma_resv_lock might be a lot harder, so
  buffer updates must be avoided.

  What's worse, we're not going to be able to make the dma_fence_wait
  calls in mmu-notifiers interruptible, that doesn't work. So
  allocating memory still won't be allowed, even in tdr sections. Plus
  obviously we can use this trick only in tdr, it is rather intrusive.
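
A minimal userspace sketch of the proposed dma_fence_begin_tdr() /
dma_fence_end_tdr() contract, to make the intended semantics concrete.
Nothing here exists in the kernel: every model_* name is hypothetical,
the wait is reduced to a single polling step, and returning 0 for an
already signalled fence even during tdr is an assumption of the sketch.
The value 512 mirrors the kernel-internal ERESTARTSYS.

```c
#include <assert.h>
#include <stdbool.h>

/* Kernel-internal errno value; not exposed by userspace headers. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical stand-in for struct dma_fence. */
struct model_fence {
	bool signalled;
};

/* True while inside a begin/end_signalling section (model only). */
static bool model_in_signalling;
/* True between model_begin_tdr() and model_end_tdr(). */
static bool model_tdr_active;

void model_begin_tdr(void)
{
	/* The proposal requires nesting within a signalling section. */
	assert(model_in_signalling);
	model_tdr_active = true;
}

void model_end_tdr(void)
{
	/* Converts waits back to normal behaviour. */
	model_tdr_active = false;
}

/*
 * One step of a fence wait:
 *   0                   fence has signalled
 *   -MODEL_ERESTARTSYS  interruptible wait kicked out by tdr
 *   1                   keep waiting
 */
int model_fence_wait_step(const struct model_fence *fence, bool interruptible)
{
	if (fence->signalled)
		return 0;
	/* Only interruptible waiters are woken; the others keep waiting. */
	if (interruptible && model_tdr_active)
		return -MODEL_ERESTARTSYS;
	return 1;
}
```

In this model a waiter holding drm_modeset_lock would see the error,
drop its locks and restart the ioctl, which is what lets the tdr thread
acquire them.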

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 52f1ab4bc922..a1c091e11ffd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -281,9 +281,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	bool fence_cookie;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -315,6 +318,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+
+	dma_fence_end_signalling(fence_cookie);
 }
 
  /**
-- 
2.27.0



* [PATCH 23/25] drm/amdgpu: use dma-fence annotations for gpu reset code
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (12 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 22/25] drm/scheduler: use dma-fence annotations in tdr work Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 24/25] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset" Daniel Vetter
  2020-07-07 20:12 ` [PATCH 25/25] drm/amdgpu: gpu recovery does full modesets Daniel Vetter
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König

To improve coverage also annotate the gpu reset code itself, since
that's called from other places than drm/scheduler (which is already
annotated). Annotations nest, so this doesn't break anything, and
allows easier testing.
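
Why nesting is harmless can be shown with a small userspace model. It
assumes the bool cookie records whether a signalling section was
already open when begin was called, so that only the outermost end
actually releases it. The model_* names are hypothetical, and a plain
flag stands in for the lockdep state that the real annotations query
via lock_is_held().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for "the annotation read lock is held" (per thread in reality). */
static bool model_signalling_held;

/* Returns true if a section was already open, i.e. nothing was acquired. */
bool model_begin_signalling(void)
{
	if (model_signalling_held)
		return true;
	model_signalling_held = true;
	return false;
}

/* Only the outermost caller (cookie == false) closes the section. */
void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	assert(model_signalling_held);
	model_signalling_held = false;
}

/* Inner annotated function, in the role of amdgpu_device_gpu_recover(). */
int model_gpu_recover(void)
{
	bool cookie = model_begin_signalling();
	int r = 0;

	/* ... reset work would happen here ... */
	model_end_signalling(cookie);
	return r;
}

/*
 * Outer annotated worker, in the role of drm_sched_job_timedout().
 * Returns whether the section was still open after the inner call.
 */
bool model_tdr_work(void)
{
	bool cookie = model_begin_signalling();
	bool still_held;

	model_gpu_recover();
	still_held = model_signalling_held;
	model_end_signalling(cookie);
	return still_held;
}
```

The inner end call leaves the outer section intact, which is also why
every early return in the patch below has to pass the cookie back.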

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a649e40fd96f..3a3bccd7f1c7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4261,6 +4261,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		(amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) ?
 		true : false;
 	bool audio_suspended = false;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	/*
 	 * Flush RAM to disk so that after reboot
@@ -4289,6 +4292,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 			  job ? job->base.id : -1, hive->hive_id);
 		mutex_unlock(&hive->hive_lock);
+		dma_fence_end_signalling(fence_cookie);
 		return 0;
 	}
 
@@ -4299,8 +4303,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 */
 	INIT_LIST_HEAD(&device_list);
 	if (adev->gmc.xgmi.num_physical_nodes > 1) {
-		if (!hive)
+		if (!hive) {
+			dma_fence_end_signalling(fence_cookie);
 			return -ENODEV;
+		}
 		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
 			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 		device_list_handle = &hive->device_list;
@@ -4315,6 +4321,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
 				  job ? job->base.id : -1);
 			mutex_unlock(&hive->hive_lock);
+			dma_fence_end_signalling(fence_cookie);
 			return 0;
 		}
 
@@ -4455,6 +4462,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+	dma_fence_end_signalling(fence_cookie);
 	return r;
 }
 
-- 
2.27.0



* [PATCH 24/25] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (13 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 23/25] drm/amdgpu: use dma-fence annotations for gpu reset code Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  2020-07-07 20:12 ` [PATCH 25/25] drm/amdgpu: gpu recovery does full modesets Daniel Vetter
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

This is one from the department of "maybe play lottery if you hit
this, karma compensation might work". Or at least lockdep ftw!

This reverts commit 565d1941557756a584ac357d945bc374d5fcd1d0.

It's not quite as low-risk as the commit message claims, because this
grabs console_lock, which might be held when we allocate memory, which
might never happen because the dma_fence_wait() is stuck waiting on
our gpu reset:

[  136.763714] ======================================================
[  136.763714] WARNING: possible circular locking dependency detected
[  136.763715] 5.7.0-rc3+ #346 Tainted: G        W
[  136.763716] ------------------------------------------------------
[  136.763716] kworker/2:3/682 is trying to acquire lock:
[  136.763716] ffffffff8226f140 (console_lock){+.+.}-{0:0}, at: drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763723]
               but task is already holding lock:
[  136.763724] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763726]
               which lock already depends on the new lock.

[  136.763726]
               the existing dependency chain (in reverse order) is:
[  136.763727]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  136.763730]        __dma_fence_might_wait+0x41/0xb0
[  136.763732]        dma_resv_lockdep+0x171/0x202
[  136.763734]        do_one_initcall+0x5d/0x2f0
[  136.763736]        kernel_init_freeable+0x20d/0x26d
[  136.763738]        kernel_init+0xa/0xfb
[  136.763740]        ret_from_fork+0x27/0x50
[  136.763740]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  136.763743]        fs_reclaim_acquire.part.0+0x25/0x30
[  136.763745]        kmem_cache_alloc_trace+0x2e/0x6e0
[  136.763747]        device_create_groups_vargs+0x52/0xf0
[  136.763747]        device_create+0x49/0x60
[  136.763749]        fb_console_init+0x25/0x145
[  136.763750]        fbmem_init+0xcc/0xe2
[  136.763750]        do_one_initcall+0x5d/0x2f0
[  136.763751]        kernel_init_freeable+0x20d/0x26d
[  136.763752]        kernel_init+0xa/0xfb
[  136.763753]        ret_from_fork+0x27/0x50
[  136.763753]
               -> #0 (console_lock){+.+.}-{0:0}:
[  136.763755]        __lock_acquire+0x1241/0x23f0
[  136.763756]        lock_acquire+0xad/0x370
[  136.763757]        console_lock+0x47/0x70
[  136.763761]        drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763809]        amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.763850]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.763851]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.763852]        process_one_work+0x23c/0x580
[  136.763853]        worker_thread+0x50/0x3b0
[  136.763854]        kthread+0x12e/0x150
[  136.763855]        ret_from_fork+0x27/0x50
[  136.763855]
               other info that might help us debug this:

[  136.763856] Chain exists of:
                 console_lock --> fs_reclaim --> dma_fence_map

[  136.763857]  Possible unsafe locking scenario:

[  136.763857]        CPU0                    CPU1
[  136.763857]        ----                    ----
[  136.763857]   lock(dma_fence_map);
[  136.763858]                                lock(fs_reclaim);
[  136.763858]                                lock(dma_fence_map);
[  136.763858]   lock(console_lock);
[  136.763859]
                *** DEADLOCK ***

[  136.763860] 4 locks held by kworker/2:3/682:
[  136.763860]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763862]  #1: ffffc90000cafe58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763863]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763865]  #3: ffff8887ab621748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x5ab/0xe7b [amdgpu]
[  136.763914]
               stack backtrace:
[  136.763915] CPU: 2 PID: 682 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
[  136.763916] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[  136.763918] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  136.763919] Call Trace:
[  136.763922]  dump_stack+0x8f/0xd0
[  136.763924]  check_noncircular+0x162/0x180
[  136.763926]  __lock_acquire+0x1241/0x23f0
[  136.763927]  lock_acquire+0xad/0x370
[  136.763932]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763933]  ? mark_held_locks+0x2d/0x80
[  136.763934]  ? _raw_spin_unlock_irqrestore+0x46/0x60
[  136.763936]  console_lock+0x47/0x70
[  136.763940]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763944]  drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763993]  amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.764036]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.764038]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.764040]  process_one_work+0x23c/0x580
[  136.764041]  worker_thread+0x50/0x3b0
[  136.764042]  ? process_one_work+0x580/0x580
[  136.764044]  kthread+0x12e/0x150
[  136.764045]  ? kthread_create_worker_on_cpu+0x70/0x70
[  136.764046]  ret_from_fork+0x27/0x50
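
The report above boils down to a cycle in a lock-order graph. As a
rough illustration only (this is not lockdep's implementation, and the
model_* and enum names are made up for the sketch), recording the two
existing dependencies and then attempting the acquisition from the tdr
worker reproduces the complaint:

```c
#include <stdbool.h>

enum { CONSOLE_LOCK, FS_RECLAIM, DMA_FENCE_MAP, NR_LOCKS };

/* edge[a][b] is true once b has been acquired while a was held. */
static bool edge[NR_LOCKS][NR_LOCKS];

/* Depth-first search: can `to` be reached from `from` along edges? */
static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++)
		if (edge[from][next] && !seen[next] &&
		    reachable(next, to, seen))
			return true;
	return false;
}

/*
 * Record "wanted is taken while held is held".
 * Returns false, without recording, if that would close a cycle.
 */
bool model_acquire_order(int held, int wanted)
{
	bool seen[NR_LOCKS] = { false };

	if (reachable(wanted, held, seen))
		return false;
	edge[held][wanted] = true;
	return true;
}
```

Feeding it console_lock -> fs_reclaim and fs_reclaim -> dma_fence_map
succeeds; the final dma_fence_map -> console_lock is rejected, matching
the "Chain exists of" line in the splat.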

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3a3bccd7f1c7..44b321eecc3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4109,8 +4109,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 				if (r)
 					goto out;
 
-				amdgpu_fbdev_set_suspend(tmp_adev, 0);
-
 				/* must succeed. */
 				amdgpu_ras_resume(tmp_adev);
 
@@ -4351,8 +4349,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		 */
 		amdgpu_unregister_gpu_instance(tmp_adev);
 
-		amdgpu_fbdev_set_suspend(tmp_adev, 1);
-
 		/* disable ras on ALL IPs */
 		if (!(in_ras_intr && !use_baco) &&
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
-- 
2.27.0



* [PATCH 25/25] drm/amdgpu: gpu recovery does full modesets
       [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
                   ` (14 preceding siblings ...)
  2020-07-07 20:12 ` [PATCH 24/25] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset" Daniel Vetter
@ 2020-07-07 20:12 ` Daniel Vetter
  15 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-07 20:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

...

I think it's time to stop this little exercise.

The lockdep splat, for the record:

[  132.583381] ======================================================
[  132.584091] WARNING: possible circular locking dependency detected
[  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
[  132.585461] ------------------------------------------------------
[  132.586184] kworker/2:3/865 is trying to acquire lock:
[  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.587569]
               but task is already holding lock:
[  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  132.589803]
               which lock already depends on the new lock.

[  132.592009]
               the existing dependency chain (in reverse order) is:
[  132.593507]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  132.595019]        dma_fence_begin_signalling+0x50/0x60
[  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
[  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
[  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
[  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
[  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
[  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
[  132.600539]        fbcon_init+0x2e8/0x660
[  132.601344]        visual_init+0xce/0x130
[  132.602156]        do_bind_con_driver+0x1bc/0x2b0
[  132.602970]        do_take_over_console+0x115/0x180
[  132.603763]        do_fbcon_takeover+0x58/0xb0
[  132.604564]        register_framebuffer+0x1ee/0x300
[  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
[  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
[  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
[  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[  132.609511]        local_pci_probe+0x42/0x80
[  132.610324]        pci_device_probe+0x104/0x1a0
[  132.611130]        really_probe+0x147/0x3c0
[  132.611939]        driver_probe_device+0xb6/0x100
[  132.612766]        device_driver_attach+0x53/0x60
[  132.613593]        __driver_attach+0x8c/0x150
[  132.614419]        bus_for_each_dev+0x7b/0xc0
[  132.615249]        bus_add_driver+0x14c/0x1f0
[  132.616071]        driver_register+0x6c/0xc0
[  132.616902]        do_one_initcall+0x5d/0x2f0
[  132.617731]        do_init_module+0x5c/0x230
[  132.618560]        load_module+0x2981/0x2bc0
[  132.619391]        __do_sys_finit_module+0xaa/0x110
[  132.620228]        do_syscall_64+0x5a/0x250
[  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  132.621903]
               -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
[  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
[  132.624448]        ww_mutex_lock+0x43/0xb0
[  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
[  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
[  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
[  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[  132.629804]        local_pci_probe+0x42/0x80
[  132.630690]        pci_device_probe+0x104/0x1a0
[  132.631583]        really_probe+0x147/0x3c0
[  132.632479]        driver_probe_device+0xb6/0x100
[  132.633379]        device_driver_attach+0x53/0x60
[  132.634275]        __driver_attach+0x8c/0x150
[  132.635170]        bus_for_each_dev+0x7b/0xc0
[  132.636069]        bus_add_driver+0x14c/0x1f0
[  132.636974]        driver_register+0x6c/0xc0
[  132.637870]        do_one_initcall+0x5d/0x2f0
[  132.638765]        do_init_module+0x5c/0x230
[  132.639654]        load_module+0x2981/0x2bc0
[  132.640522]        __do_sys_finit_module+0xaa/0x110
[  132.641372]        do_syscall_64+0x5a/0x250
[  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  132.643022]
               -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
[  132.644643]        __lock_acquire+0x1241/0x23f0
[  132.645469]        lock_acquire+0xad/0x370
[  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
[  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
[  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  132.652594]        process_one_work+0x23c/0x580
[  132.653402]        worker_thread+0x50/0x3b0
[  132.654139]        kthread+0x12e/0x150
[  132.654868]        ret_from_fork+0x27/0x50
[  132.655598]
               other info that might help us debug this:

[  132.657739] Chain exists of:
                 crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map

[  132.659877]  Possible unsafe locking scenario:

[  132.661416]        CPU0                    CPU1
[  132.662126]        ----                    ----
[  132.662847]   lock(dma_fence_map);
[  132.663574]                                lock(crtc_ww_class_mutex);
[  132.664319]                                lock(dma_fence_map);
[  132.665063]   lock(crtc_ww_class_acquire);
[  132.665799]
                *** DEADLOCK ***

[  132.667965] 4 locks held by kworker/2:3/865:
[  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
[  132.671902]
               stack backtrace:
[  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
[  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  132.676046] Call Trace:
[  132.676897]  dump_stack+0x8f/0xd0
[  132.677748]  check_noncircular+0x162/0x180
[  132.678604]  ? stack_trace_save+0x4b/0x70
[  132.679459]  __lock_acquire+0x1241/0x23f0
[  132.680311]  lock_acquire+0xad/0x370
[  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.682021]  ? cpumask_next+0x16/0x20
[  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
[  132.683737]  ? __module_address+0x28/0xf0
[  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
[  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
[  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
[  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  132.693597]  process_one_work+0x23c/0x580
[  132.694487]  worker_thread+0x50/0x3b0
[  132.695373]  ? process_one_work+0x580/0x580
[  132.696264]  kthread+0x12e/0x150
[  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
[  132.698057]  ret_from_fork+0x27/0x50

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 44b321eecc3d..910c86f577b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2477,6 +2477,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
 		/* displays are handled separately */
 		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
 			/* XXX handle errors */
+
+			/*
+			 * This is dm_suspend, which calls modeset locks, and
+			 * that's a pretty good inversion against dma_fence_signal
+			 * which gpu recovery is supposed to guarantee.
+			 *
+			 * Don't ask me how to fix this.
+			 */
 			r = adev->ip_blocks[i].version->funcs->suspend(adev);
 			/* XXX handle errors */
 			if (r) {
-- 
2.27.0



* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-07 20:12 ` [PATCH 01/25] dma-fence: basic lockdep annotations Daniel Vetter
@ 2020-07-08 14:57   ` Christian König
  2020-07-08 15:12     ` Daniel Vetter
  2020-07-13 16:26     ` Daniel Vetter
  0 siblings, 2 replies; 83+ messages in thread
From: Christian König @ 2020-07-08 14:57 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: Intel Graphics Development, linux-rdma, Felix Kuehling,
	Thomas Hellström, Maarten Lankhorst, Mika Kuoppala,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson, Daniel Vetter

Could we merge this controlled by a separate config option?

This way we could have the checks upstream without having to fix all the 
stuff before we do this?

Thanks,
Christian.

Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> Design is similar to the lockdep annotations for workers, but with
> some twists:
>
> - We use a read-lock for the execution/worker/completion side, so that
>    this explicit annotation can be more liberally sprinkled around.
>    With read locks lockdep isn't going to complain if the read-side
>    isn't nested the same way under all circumstances, so ABBA deadlocks
>    are ok. Which they are, since this is an annotation only.
>
> - We're using non-recursive lockdep read lock mode, since in recursive
>    read lock mode lockdep does not catch read side hazards. And we
>    _very_ much want read side hazards to be caught. For full details of
>    this limitation see
>
>    commit e91498589746065e3ae95d9a00b068e525eec34f
>    Author: Peter Zijlstra <peterz@infradead.org>
>    Date:   Wed Aug 23 13:13:11 2017 +0200
>
>        locking/lockdep/selftests: Add mixed read-write ABBA tests
>
> - To allow nesting of the read-side explicit annotations we explicitly
>    keep track of the nesting. lock_is_held() allows us to do that.
>
> - The wait-side annotation is a write lock, and entirely done within
>    dma_fence_wait() for everyone by default.
>
> - To be able to freely annotate helper functions I want to make it ok
>    to call dma_fence_begin/end_signalling from soft/hardirq context.
>    First attempt was using the hardirq locking context for the write
>    side in lockdep, but this forces all normal spinlocks nested within
>    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>
>    The approach now is to simply check in_atomic(), and for these cases
>    entirely rely on the might_sleep() check in dma_fence_wait(). That
>    will catch any wrong nesting against spinlocks from soft/hardirq
>    contexts.
>
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
>
> The main class of deadlocks this is supposed to catch are:
>
> Thread A:
>
> 	mutex_lock(A);
> 	mutex_unlock(A);
>
> 	dma_fence_signal();
>
> Thread B:
>
> 	mutex_lock(A);
> 	dma_fence_wait();
> 	mutex_unlock(A);
>
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
>
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that if any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and is still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindlessly, or they will
> produce false positives.
>
> Originally I hoped that the cross-release lockdep extensions would
> alleviate the need for explicit annotations:
>
> https://lwn.net/Articles/709849/
>
> But there's a few reasons why that's not an option:
>
> - It's not happening in upstream, since it got reverted due to too
>    many false positives:
>
> 	commit e966eaeeb623f09975ef362c2866fae6f86844f9
> 	Author: Ingo Molnar <mingo@kernel.org>
> 	Date:   Tue Dec 12 12:31:16 2017 +0100
>
> 	    locking/lockdep: Remove the cross-release locking checks
>
> 	    This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
> 	    while it found a number of old bugs initially, was also causing too many
> 	    false positives that caused people to disable lockdep - which is arguably
> 	    a worse overall outcome.
>
> - cross-release uses the complete() call to annotate the end of
>    critical sections, for dma_fence that would be dma_fence_signal().
>    But we do not want all dma_fence_signal() calls to be treated as
>    critical, since many are opportunistic cleanup of gpu requests. If
>    these get stuck there's still the main completion interrupt and
>    workers who can unblock everyone. Automatically annotating all
>    dma_fence_signal() calls would hence cause false positives.
>
> - cross-release had some educated guesses for when a critical section
>    starts, like fresh syscall or fresh work callback. This would again
>    cause false positives without explicit annotations, since for
>    dma_fence the critical sections only starts when we publish a fence.
>
> - Furthermore there can be cases where a thread never does a
>    dma_fence_signal, but is still critical for reaching completion of
>    fences. One example would be a scheduler kthread which picks up jobs
>    and pushes them into hardware, where the interrupt handler or
>    another completion thread calls dma_fence_signal(). But if the
>    scheduler thread hangs, then all the fences hang, hence we need to
>    manually annotate it. cross-release aimed to solve this by chaining
>    cross-release dependencies, but the dependency from scheduler thread
>    to the completion interrupt handler goes through hw where
>    cross-release code can't observe it.
>
> In short, without manual annotations and careful review of the start
> and end of critical sections, cross-release dependency tracking doesn't
> work. We need explicit annotations.
>
> v2: handle soft/hardirq ctx better against write side and don't forget
> EXPORT_SYMBOL, drivers can't use this otherwise.
>
> v3: Kerneldoc.
>
> v4: Some spelling fixes from Mika
>
> v5: Amend commit message to explain in detail why cross-release isn't
> the solution.
>
> v6: Pull out misplaced .rst hunk.
>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>   Documentation/driver-api/dma-buf.rst |   6 +
>   drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
>   include/linux/dma-fence.h            |  12 ++
>   3 files changed, 179 insertions(+)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index 7fb7b661febd..05d856131140 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -133,6 +133,12 @@ DMA Fences
>   .. kernel-doc:: drivers/dma-buf/dma-fence.c
>      :doc: DMA fences overview
>   
> +DMA Fence Signalling Annotations
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> +   :doc: fence signalling annotation
> +
>   DMA Fences Functions Reference
>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>   
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 656e9ac2d028..0005bc002529 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
>   }
>   EXPORT_SYMBOL(dma_fence_context_alloc);
>   
> +/**
> + * DOC: fence signalling annotation
> + *
> + * Proving correctness of all the kernel code around &dma_fence through code
> + * review and testing is tricky for a few reasons:
> + *
> + * * It is a cross-driver contract, and therefore all drivers must follow the
> + *   same rules for lock nesting order, calling contexts for various functions
> + *   and anything else significant for in-kernel interfaces. But it is also
> + *   impossible to test all drivers in a single machine, hence brute-force N vs.
> + *   N testing of all combinations is impossible. Even just limiting to the
> + *   possible combinations is infeasible.
> + *
> + * * There is an enormous amount of driver code involved. For render drivers
> + *   there's the tail of command submission, after fences are published,
> + *   scheduler code, interrupt and workers to process job completion,
> + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
> + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
> + *   and &shrinker. For modesetting drivers there's the commit tail functions
> + *   between when fences for an atomic modeset are published, and when the
> + *   corresponding vblank completes, including any interrupt processing and
> + *   related workers. Auditing all that code, across all drivers, is not
> + *   feasible.
> + *
> + * * Due to how many other subsystems are involved and the locking hierarchies
> + *   this pulls in there is extremely thin wiggle-room for driver-specific
> + *   differences. &dma_fence interacts with almost all of the core memory
> + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
> + *   dma_resv_unlock(). On the other side it also interacts through all
> + *   allocation sites through &mmu_notifier and &shrinker.
> + *
> + * Furthermore lockdep does not handle cross-release dependencies, which means
> + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
> + * at runtime with some quick testing. The simplest example is one thread
> + * waiting on a &dma_fence while holding a lock::
> + *
> + *     lock(A);
> + *     dma_fence_wait(B);
> + *     unlock(A);
> + *
> + * while the other thread is stuck trying to acquire the same lock, which
> + * prevents it from signalling the fence the previous thread is stuck waiting
> + * on::
> + *
> + *     lock(A);
> + *     unlock(A);
> + *     dma_fence_signal(B);
> + *
> + * By manually annotating all code relevant to signalling a &dma_fence we can
> + * teach lockdep about these dependencies, which also helps with the validation
> + * headache since now lockdep can check all the rules for us::
> + *
> + *    cookie = dma_fence_begin_signalling();
> + *    lock(A);
> + *    unlock(A);
> + *    dma_fence_signal(B);
> + *    dma_fence_end_signalling(cookie);
> + *
> + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
> + * annotate critical sections the following rules need to be observed:
> + *
> + * * All code necessary to complete a &dma_fence must be annotated, from the
> + *   point where a fence is accessible to other threads, to the point where
> + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
> + *   and due to the very strict rules and many corner cases it is infeasible to
> + *   catch these just with review or normal stress testing.
> + *
> + * * &struct dma_resv deserves a special note, since the readers are only
> + *   protected by rcu. This means the signalling critical section starts as soon
> + *   as the new fences are installed, even before dma_resv_unlock() is called.
> + *
> + * * The only exception are fast paths and opportunistic signalling code, which
> + *   calls dma_fence_signal() purely as an optimization, but is not required to
> + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
> + *   which calls dma_fence_signal(), while the mandatory completion path goes
> + *   through a hardware interrupt and possible job completion worker.
> + *
> + * * To aid composability of code, the annotations can be freely nested, as long
> + *   as the overall locking hierarchy is consistent. The annotations also work
> + *   both in interrupt and process context. Due to implementation details this
> + *   requires that callers pass an opaque cookie from
> + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
> + *
> + * * Validation against the cross driver contract is implemented by priming
> + *   lockdep with the relevant hierarchy at boot-up. This means even just
> + *   testing with a single device is enough to validate a driver, at least as
> + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
> + *   concerned.
> + */
> +#ifdef CONFIG_LOCKDEP
> +struct lockdep_map	dma_fence_lockdep_map = {
> +	.name = "dma_fence_map"
> +};
> +
> +/**
> + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
> + *
> + * Drivers should use this to annotate the beginning of any code section
> + * required to eventually complete &dma_fence by calling dma_fence_signal().
> + *
> + * The end of these critical sections is annotated with
> + * dma_fence_end_signalling().
> + *
> + * Returns:
> + *
> + * Opaque cookie needed by the implementation, which needs to be passed to
> + * dma_fence_end_signalling().
> + */
> +bool dma_fence_begin_signalling(void)
> +{
> +	/* explicitly nesting ... */
> +	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> +		return true;
> +
> +	/* rely on might_sleep check for soft/hardirq locks */
> +	if (in_atomic())
> +		return true;
> +
> +	/* ... and non-recursive readlock */
> +	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> +
> +	return false;
> +}
> +EXPORT_SYMBOL(dma_fence_begin_signalling);
> +
> +/**
> + * dma_fence_end_signalling - end a critical DMA fence signalling section
> + *
> + * Closes a critical section annotation opened by dma_fence_begin_signalling().
> + */
> +void dma_fence_end_signalling(bool cookie)
> +{
> +	if (cookie)
> +		return;
> +
> +	lock_release(&dma_fence_lockdep_map, _RET_IP_);
> +}
> +EXPORT_SYMBOL(dma_fence_end_signalling);
> +
> +void __dma_fence_might_wait(void)
> +{
> +	bool tmp;
> +
> +	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> +	if (tmp)
> +		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> +	lock_map_acquire(&dma_fence_lockdep_map);
> +	lock_map_release(&dma_fence_lockdep_map);
> +	if (tmp)
> +		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> +}
> +#endif
> +
> +
>   /**
>    * dma_fence_signal_locked - signal completion of a fence
>    * @fence: the fence to signal
> @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
>   {
>   	unsigned long flags;
>   	int ret;
> +	bool tmp;
>   
>   	if (!fence)
>   		return -EINVAL;
>   
> +	tmp = dma_fence_begin_signalling();
> +
>   	spin_lock_irqsave(fence->lock, flags);
>   	ret = dma_fence_signal_locked(fence);
>   	spin_unlock_irqrestore(fence->lock, flags);
>   
> +	dma_fence_end_signalling(tmp);
> +
>   	return ret;
>   }
>   EXPORT_SYMBOL(dma_fence_signal);
> @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>   
>   	might_sleep();
>   
> +	__dma_fence_might_wait();
> +
>   	trace_dma_fence_wait_start(fence);
>   	if (fence->ops->wait)
>   		ret = fence->ops->wait(fence, intr, timeout);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index 3347c54f3a87..3f288f7db2ef 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>   	} while (1);
>   }
>   
> +#ifdef CONFIG_LOCKDEP
> +bool dma_fence_begin_signalling(void);
> +void dma_fence_end_signalling(bool cookie);
> +#else
> +static inline bool dma_fence_begin_signalling(void)
> +{
> +	return true;
> +}
> +static inline void dma_fence_end_signalling(bool cookie) {}
> +static inline void __dma_fence_might_wait(void) {}
> +#endif
> +
>   int dma_fence_signal(struct dma_fence *fence);
>   int dma_fence_signal_locked(struct dma_fence *fence);
>   signed long dma_fence_default_wait(struct dma_fence *fence,
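
The cookie handling in dma_fence_begin_signalling() and
dma_fence_end_signalling() quoted above is easy to misread, so here is a
small userspace model of just that logic. It is a sketch under stated
assumptions: map_held stands in for lock_is_held_type() on
dma_fence_lockdep_map, atomic_ctx stands in for in_atomic(), and all
model_* names are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
static bool map_held;   /* models lock_is_held_type(&dma_fence_lockdep_map, 1) */
static bool atomic_ctx; /* models in_atomic() */

/* Returns the cookie: true means "end_signalling has nothing to release". */
bool model_begin_signalling(void)
{
	if (map_held)    /* nested section: the outer call owns the annotation */
		return true;
	if (atomic_ctx)  /* irq context: rely on the might_sleep() check instead */
		return true;
	map_held = true; /* outermost section in process context */
	return false;
}

void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	map_held = false;
}

bool model_map_held(void)
{
	return map_held;
}

void model_set_atomic(bool in_atomic)
{
	atomic_ctx = in_atomic;
}

/* Nested sections: only the outermost end drops the annotation. */
void demo(void)
{
	bool outer = model_begin_signalling();
	bool inner = model_begin_signalling();

	model_end_signalling(inner);
	assert(model_map_held());
	model_end_signalling(outer);
	assert(!model_map_held());
}
```

This also matches the !CONFIG_LOCKDEP stubs in the header hunk, where
begin always returns true and end is therefore a no-op.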



* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 14:57   ` Christian König
@ 2020-07-08 15:12     ` Daniel Vetter
  2020-07-08 15:19       ` Alex Deucher
  2020-07-09  7:32       ` [Intel-gfx] " Daniel Stone
  2020-07-13 16:26     ` Daniel Vetter
  1 sibling, 2 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-08 15:12 UTC (permalink / raw)
  To: Christian König
  Cc: DRI Development, Intel Graphics Development, linux-rdma,
	Felix Kuehling, Thomas Hellström, Maarten Lankhorst,
	Mika Kuoppala, open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Daniel Vetter

On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
>
> Could we merge this controlled by a separate config option?
>
> This way we could have the checks upstream without having to fix all the
> stuff before we do this?

Since it's fully opt-in annotations nothing blows up if we don't merge
any annotations. So we could start merging the first 3 patches. After
that the fun starts ...

My rough idea was that first I'd try to tackle display, thus far
there's 2 actual issues in drivers:
- amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
think those should be fairly easy to fix (I'd even take a stab at them)
- vmwgfx has a full on locking inversion with dma_resv_lock in
commit_tail, and that one is functional. Not just reading something
which we can safely assume to be invariant anyway (like the tmz flag
for amdgpu, or whatever it was).

I've done a pile more annotations patches for other atomic drivers
now, so hopefully that flushes out any remaining offenders here. Since
some of the annotations are in helper code worst case we might need a
dev->mode_config.broken_atomic_commit flag to disable them. At least
for now I have 0 plans to merge any of these while there's known
unsolved issues. Maybe if some drivers take forever to get fixed we
can then apply some duct-tape for the atomic helper annotation patch.
Instead of a flag we can also copypasta the atomic_commit_tail hook,
leaving the annotations out and adding a huge warning about that.

Next big chunk is the drm/scheduler annotations:
- amdgpu needs a full rework of display reset (but apparently in the works)
- I read all the drivers, they all have the fairly cosmetic issue of
doing small allocations in their callbacks.

I might end up typing the mempool we need for the latter issue, but
first still hoping for some actual test feedback from other drivers
using drm/scheduler. Again no intentions of merging these annotations
without the drivers being fixed first, or at least some duct-tape
applied.
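
For readers wondering what the mempool idea amounts to, a minimal
userspace sketch follows. It is not the kernel's mempool API and not the
code that might eventually be written; struct job_pool and its helpers
are hypothetical. The assumption is that the objects are small and
fixed-size, so they can be reserved before a fence is published and the
callback inside the signalling section never has to call the allocator
(which, in the kernel, could end up in reclaim).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SLOTS 4

/* Hypothetical fixed-size reserve of preallocated objects. */
struct job_pool {
	void *slot[POOL_SLOTS];
	size_t nr_free; /* number of valid entries in slot[] */
};

/* Frees whatever is currently sitting in the reserve. */
void job_pool_destroy(struct job_pool *pool)
{
	while (pool->nr_free)
		free(pool->slot[--pool->nr_free]);
}

/* May allocate: call this before any fence becomes visible. */
int job_pool_init(struct job_pool *pool, size_t size)
{
	pool->nr_free = 0;
	for (size_t i = 0; i < POOL_SLOTS; i++) {
		void *mem = malloc(size);

		if (!mem) {
			job_pool_destroy(pool);
			return -1;
		}
		pool->slot[pool->nr_free++] = mem;
	}
	return 0;
}

/* Never allocates: in this model it is safe inside a signalling section. */
void *job_pool_take(struct job_pool *pool)
{
	return pool->nr_free ? pool->slot[--pool->nr_free] : NULL;
}

/* Returns an object to the reserve once the job is done. */
void job_pool_put(struct job_pool *pool, void *mem)
{
	assert(pool->nr_free < POOL_SLOTS);
	pool->slot[pool->nr_free++] = mem;
}
```

The trade-off is that an empty reserve yields NULL instead of blocking,
so the caller has to size the pool for the worst case it can have in
flight.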

Another option I've been thinking about, if there's cases where fixing
things properly is a lot of effort: We could do annotations for broken
sections (just the broken part, so we still catch bugs everywhere
else). They'd simply drop&reacquire the lock. We could then e.g. use
that in the amdgpu display reset code, and so still make sure that
everything else in reset doesn't get worse. But I think adding that
shouldn't be our first option.
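
A rough userspace model of that drop&reacquire idea, purely as a sketch:
nothing like model_broken_begin()/model_broken_end() exists in the
posted series, the names are hypothetical, and map_held is only a
stand-in for "the signalling annotation is currently held".

```c
#include <assert.h>
#include <stdbool.h>

static bool map_held; /* stand-in for the lockdep annotation being held */

bool model_begin_signalling(void)
{
	if (map_held)
		return true;
	map_held = true;
	return false;
}

void model_end_signalling(bool cookie)
{
	if (!cookie)
		map_held = false;
}

/*
 * Hypothetical: suspend the annotation around a known-broken region.
 * Returns whether it was held, so the end call can restore it.
 */
bool model_broken_begin(void)
{
	bool was_held = map_held;

	map_held = false;
	return was_held;
}

/* Hypothetical: restore the annotation after the broken region. */
void model_broken_end(bool was_held)
{
	map_held = was_held;
}

bool model_map_held(void)
{
	return map_held;
}

/* Code before and after the broken region is still checked. */
void demo(void)
{
	bool cookie = model_begin_signalling();
	bool was_held = model_broken_begin();

	assert(!model_map_held()); /* broken region: nothing is checked */
	model_broken_end(was_held);
	assert(model_map_held());  /* rest of the section is checked again */
	model_end_signalling(cookie);
}
```

The point of the shape is that only the marked region escapes checking,
while everything around it in the same signalling section stays covered.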

I'm not personally a big fan of the Kconfig or runtime option, only
upsets people since it breaks lockdep for them. Or they ignore it, and
we don't catch bugs, making it fairly pointless to merge.

Cheers, Daniel


>
> Thanks,
> Christian.


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 15:12     ` Daniel Vetter
@ 2020-07-08 15:19       ` Alex Deucher
  2020-07-08 15:37         ` Daniel Vetter
  2020-07-09  7:32       ` [Intel-gfx] " Daniel Stone
  1 sibling, 1 reply; 83+ messages in thread
From: Alex Deucher @ 2020-07-08 15:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx list, Daniel Vetter,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Wed, Jul 8, 2020 at 11:13 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
> >
> > Could we merge this controlled by a separate config option?
> >
> > This way we could have the checks upstream without having to fix all the
> > stuff before we do this?
>
> Since it's fully opt-in annotations nothing blows up if we don't merge
> any annotations. So we could start merging the first 3 patches. After
> that the fun starts ...
>
> My rough idea was that first I'd try to tackle display, thus far
> there's 2 actual issues in drivers:
> - amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
> think those should be fairly easy to fix (I'd even take a stab at them)
> - vmwgfx has a full on locking inversion with dma_resv_lock in
> commit_tail, and that one is functional. Not just reading something
> which we can safely assume to be invariant anyway (like the tmz flag
> for amdgpu, or whatever it was).
>
> I've done a pile more annotations patches for other atomic drivers
> now, so hopefully that flushes out any remaining offenders here. Since
> some of the annotations are in helper code worst case we might need a
> dev->mode_config.broken_atomic_commit flag to disable them. At least
> for now I have 0 plans to merge any of these while there's known
> unsolved issues. Maybe if some drivers take forever to get fixed we
> can then apply some duct-tape for the atomic helper annotation patch.
> Instead of a flag we can also copypasta the atomic_commit_tail hook,
> leaving the annotations out and adding a huge warning about that.
>
> Next big chunk is the drm/scheduler annotations:
> - amdgpu needs a full rework of display reset (but apparently in the works)

I think the display deadlock issues should be fixed in:
https://cgit.freedesktop.org/drm/drm/commit/?id=cdaae8371aa9d4ea1648a299b1a75946b9556944

Alex

> - I read all the drivers, they all have the fairly cosmetic issue of
> doing small allocations in their callbacks.
>
> I might end up typing the mempool we need for the latter issue, but
> first still hoping for some actual test feedback from other drivers
> using drm/scheduler. Again no intentions of merging these annotations
> without the drivers being fixed first, or at least some duct-tape
> applied.
>
> Another option I've been thinking about, if there's cases where fixing
> things properly is a lot of effort: We could do annotations for broken
> sections (just the broken part, so we still catch bugs everywhere
> else). They'd simply drop&reacquire the lock. We could then e.g. use
> that in the amdgpu display reset code, and so still make sure that
> everything else in reset doesn't get worse. But I think adding that
> shouldn't be our first option.
>
> I'm not personally a big fan of the Kconfig or runtime option, only
> upsets people since it breaks lockdep for them. Or they ignore it, and
> we don't catch bugs, making it fairly pointless to merge.
>
> Cheers, Daniel
>
>
> >
> > Thanks,
> > Christian.
> >
> > Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> > > Design is similar to the lockdep annotations for workers, but with
> > > some twists:
> > >
> > > - We use a read-lock for the execution/worker/completion side, so that
> > >    this explicit annotation can be more liberally sprinkled around.
> > >    With read locks lockdep isn't going to complain if the read-side
> > >    isn't nested the same way under all circumstances, so ABBA deadlocks
> > >    are ok. Which they are, since this is an annotation only.
> > >
> > > - We're using non-recursive lockdep read lock mode, since in recursive
> > >    read lock mode lockdep does not catch read side hazards. And we
> > >    _very_ much want read side hazards to be caught. For full details of
> > >    this limitation see
> > >
> > >    commit e91498589746065e3ae95d9a00b068e525eec34f
> > >    Author: Peter Zijlstra <peterz@infradead.org>
> > >    Date:   Wed Aug 23 13:13:11 2017 +0200
> > >
> > >        locking/lockdep/selftests: Add mixed read-write ABBA tests
> > >
> > > - To allow nesting of the read-side explicit annotations we explicitly
> > >    keep track of the nesting. lock_is_held() allows us to do that.
> > >
> > > - The wait-side annotation is a write lock, and entirely done within
> > >    dma_fence_wait() for everyone by default.
> > >
> > > - To be able to freely annotate helper functions I want to make it ok
> > >    to call dma_fence_begin/end_signalling from soft/hardirq context.
> > >    First attempt was using the hardirq locking context for the write
> > >    side in lockdep, but this forces all normal spinlocks nested within
> > >    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> > >
> > >    The approach now is to simply check in_atomic(), and for these cases
> > >    entirely rely on the might_sleep() check in dma_fence_wait(). That
> > >    will catch any wrong nesting against spinlocks from soft/hardirq
> > >    contexts.
> > >
> > > The idea here is that every code path that's critical for eventually
> > > signalling a dma_fence should be annotated with
> > > dma_fence_begin/end_signalling. The annotation ideally starts right
> > > after a dma_fence is published (added to a dma_resv, exposed as a
> > > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > > makes the dma_fence visible to other kernel threads), up to and
> > > including the dma_fence_wait(). Examples are irq handlers, the
> > > scheduler rt threads, the tail of execbuf (after the corresponding
> > > fences are visible), any workers that end up signalling dma_fences and
> > > really anything else. Not annotated should be code paths that only
> > > complete fences opportunistically as the gpu progresses, like e.g.
> > > shrinker/eviction code.
> > >
> > > The main class of deadlocks this is supposed to catch are:
> > >
> > > Thread A:
> > >
> > >       mutex_lock(A);
> > >       mutex_unlock(A);
> > >
> > >       dma_fence_signal();
> > >
> > > Thread B:
> > >
> > >       mutex_lock(A);
> > >       dma_fence_wait();
> > >       mutex_unlock(A);
> > >
> > > Thread B is blocked on A signalling the fence, but A never gets around
> > > to that because it cannot acquire the lock A.
> > >
> > > Note that dma_fence_wait() is allowed to be nested within
> > > dma_fence_begin/end_signalling sections. To allow this to happen the
> > > read lock needs to be upgraded to a write lock, which means that if any
> > > other lock is acquired between the dma_fence_begin_signalling() call and
> > > the call to dma_fence_wait(), and is still held, this will result in an
> > > immediate lockdep complaint. The only other option would be to not
> > > annotate such calls, defeating the point. Therefore these annotations
> > > cannot be sprinkled over the code entirely mindlessly if false positives
> > > are to be avoided.
> > >
> > > Originally I hoped that the cross-release lockdep extensions would
> > > alleviate the need for explicit annotations:
> > >
> > > https://lwn.net/Articles/709849/
> > >
> > > But there's a few reasons why that's not an option:
> > >
> > > - It's not happening in upstream, since it got reverted due to too
> > >    many false positives:
> > >
> > >       commit e966eaeeb623f09975ef362c2866fae6f86844f9
> > >       Author: Ingo Molnar <mingo@kernel.org>
> > >       Date:   Tue Dec 12 12:31:16 2017 +0100
> > >
> > >           locking/lockdep: Remove the cross-release locking checks
> > >
> > >           This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
> > >           while it found a number of old bugs initially, was also causing too many
> > >           false positives that caused people to disable lockdep - which is arguably
> > >           a worse overall outcome.
> > >
> > > - cross-release uses the complete() call to annotate the end of
> > >    critical sections, for dma_fence that would be dma_fence_signal().
> > >    But we do not want all dma_fence_signal() calls to be treated as
> > >    critical, since many are opportunistic cleanup of gpu requests. If
> > >    these get stuck there's still the main completion interrupt and
> > >    workers who can unblock everyone. Automatically annotating all
> > >    dma_fence_signal() calls would hence cause false positives.
> > >
> > > - cross-release had some educated guesses for when a critical section
> > >    starts, like fresh syscall or fresh work callback. This would again
> > >    cause false positives without explicit annotations, since for
> > >    dma_fence the critical section only starts when we publish a fence.
> > >
> > > - Furthermore there can be cases where a thread never does a
> > >    dma_fence_signal, but is still critical for reaching completion of
> > >    fences. One example would be a scheduler kthread which picks up jobs
> > >    and pushes them into hardware, where the interrupt handler or
> > >    another completion thread calls dma_fence_signal(). But if the
> > >    scheduler thread hangs, then all the fences hang, hence we need to
> > >    manually annotate it. cross-release aimed to solve this by chaining
> > >    cross-release dependencies, but the dependency from scheduler thread
> > >    to the completion interrupt handler goes through hw where
> > >    cross-release code can't observe it.
> > >
> > > In short, without manual annotations and careful review of the start
> > > and end of critical sections, cross-release dependency tracking doesn't
> > > work. We need explicit annotations.
> > >
> > > v2: handle soft/hardirq ctx better against write side and don't forget
> > > EXPORT_SYMBOL, drivers can't use this otherwise.
> > >
> > > v3: Kerneldoc.
> > >
> > > v4: Some spelling fixes from Mika
> > >
> > > v5: Amend commit message to explain in detail why cross-release isn't
> > > the solution.
> > >
> > > v6: Pull out misplaced .rst hunk.
> > >
> > > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > > Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
> > > Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >   Documentation/driver-api/dma-buf.rst |   6 +
> > >   drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
> > >   include/linux/dma-fence.h            |  12 ++
> > >   3 files changed, 179 insertions(+)
> > >
> > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > > index 7fb7b661febd..05d856131140 100644
> > > --- a/Documentation/driver-api/dma-buf.rst
> > > +++ b/Documentation/driver-api/dma-buf.rst
> > > @@ -133,6 +133,12 @@ DMA Fences
> > >   .. kernel-doc:: drivers/dma-buf/dma-fence.c
> > >      :doc: DMA fences overview
> > >
> > > +DMA Fence Signalling Annotations
> > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > +   :doc: fence signalling annotation
> > > +
> > >   DMA Fences Functions Reference
> > >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >
> > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > index 656e9ac2d028..0005bc002529 100644
> > > --- a/drivers/dma-buf/dma-fence.c
> > > +++ b/drivers/dma-buf/dma-fence.c
> > > @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
> > >   }
> > >   EXPORT_SYMBOL(dma_fence_context_alloc);
> > >
> > > +/**
> > > + * DOC: fence signalling annotation
> > > + *
> > > + * Proving correctness of all the kernel code around &dma_fence through code
> > > + * review and testing is tricky for a few reasons:
> > > + *
> > > + * * It is a cross-driver contract, and therefore all drivers must follow the
> > > + *   same rules for lock nesting order, calling contexts for various functions
> > > + *   and anything else significant for in-kernel interfaces. But it is also
> > > + *   impossible to test all drivers in a single machine, hence brute-force N vs.
> > > + *   N testing of all combinations is impossible. Even just limiting to the
> > > + *   possible combinations is infeasible.
> > > + *
> > > + * * There is an enormous amount of driver code involved. For render drivers
> > > + *   there's the tail of command submission, after fences are published,
> > > + *   scheduler code, interrupt and workers to process job completion,
> > > + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
> > > + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
> > > + *   and &shrinker. For modesetting drivers there's the commit tail functions
> > > + *   between when fences for an atomic modeset are published, and when the
> > > + *   corresponding vblank completes, including any interrupt processing and
> > > + *   related workers. Auditing all that code, across all drivers, is not
> > > + *   feasible.
> > > + *
> > > + * * Due to how many other subsystems are involved and the locking hierarchies
> > > + *   this pulls in there is extremely thin wiggle-room for driver-specific
> > > + *   differences. &dma_fence interacts with almost all of the core memory
> > > + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
> > > + *   dma_resv_unlock(). On the other side it also interacts through all
> > > + *   allocation sites through &mmu_notifier and &shrinker.
> > > + *
> > > + * Furthermore lockdep does not handle cross-release dependencies, which means
> > > + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
> > > + * at runtime with some quick testing. The simplest example is one thread
> > > + * waiting on a &dma_fence while holding a lock::
> > > + *
> > > + *     lock(A);
> > > + *     dma_fence_wait(B);
> > > + *     unlock(A);
> > > + *
> > > + * while the other thread is stuck trying to acquire the same lock, which
> > > + * prevents it from signalling the fence the previous thread is stuck waiting
> > > + * on::
> > > + *
> > > + *     lock(A);
> > > + *     unlock(A);
> > > + *     dma_fence_signal(B);
> > > + *
> > > + * By manually annotating all code relevant to signalling a &dma_fence we can
> > > + * teach lockdep about these dependencies, which also helps with the validation
> > > + * headache since now lockdep can check all the rules for us::
> > > + *
> > > + *    cookie = dma_fence_begin_signalling();
> > > + *    lock(A);
> > > + *    unlock(A);
> > > + *    dma_fence_signal(B);
> > > + *    dma_fence_end_signalling(cookie);
> > > + *
> > > + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
> > > + * annotate critical sections the following rules need to be observed:
> > > + *
> > > + * * All code necessary to complete a &dma_fence must be annotated, from the
> > > + *   point where a fence is accessible to other threads, to the point where
> > > + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
> > > + *   and due to the very strict rules and many corner cases it is infeasible to
> > > + *   catch these just with review or normal stress testing.
> > > + *
> > > + * * &struct dma_resv deserves a special note, since the readers are only
> > > + *   protected by rcu. This means the signalling critical section starts as soon
> > > + *   as the new fences are installed, even before dma_resv_unlock() is called.
> > > + *
> > > + * * The only exception are fast paths and opportunistic signalling code, which
> > > + *   calls dma_fence_signal() purely as an optimization, but is not required to
> > > + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
> > > + *   which calls dma_fence_signal(), while the mandatory completion path goes
> > > + *   through a hardware interrupt and possible job completion worker.
> > > + *
> > > + * * To aid composability of code, the annotations can be freely nested, as long
> > > + *   as the overall locking hierarchy is consistent. The annotations also work
> > > + *   both in interrupt and process context. Due to implementation details this
> > > + *   requires that callers pass an opaque cookie from
> > > + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
> > > + *
> > > + * * Validation against the cross driver contract is implemented by priming
> > > + *   lockdep with the relevant hierarchy at boot-up. This means even just
> > > + *   testing with a single device is enough to validate a driver, at least as
> > > + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
> > > + *   concerned.
> > > + */
> > > +#ifdef CONFIG_LOCKDEP
> > > +struct lockdep_map   dma_fence_lockdep_map = {
> > > +     .name = "dma_fence_map"
> > > +};
> > > +
> > > +/**
> > > + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
> > > + *
> > > + * Drivers should use this to annotate the beginning of any code section
> > > + * required to eventually complete &dma_fence by calling dma_fence_signal().
> > > + *
> > > + * The end of these critical sections is annotated with
> > > + * dma_fence_end_signalling().
> > > + *
> > > + * Returns:
> > > + *
> > > + * Opaque cookie needed by the implementation, which needs to be passed to
> > > + * dma_fence_end_signalling().
> > > + */
> > > +bool dma_fence_begin_signalling(void)
> > > +{
> > > +     /* explicitly nesting ... */
> > > +     if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> > > +             return true;
> > > +
> > > +     /* rely on might_sleep check for soft/hardirq locks */
> > > +     if (in_atomic())
> > > +             return true;
> > > +
> > > +     /* ... and non-recursive readlock */
> > > +     lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> > > +
> > > +     return false;
> > > +}
> > > +EXPORT_SYMBOL(dma_fence_begin_signalling);
> > > +
> > > +/**
> > > + * dma_fence_end_signalling - end a critical DMA fence signalling section
> > > + *
> > > + * Closes a critical section annotation opened by dma_fence_begin_signalling().
> > > + */
> > > +void dma_fence_end_signalling(bool cookie)
> > > +{
> > > +     if (cookie)
> > > +             return;
> > > +
> > > +     lock_release(&dma_fence_lockdep_map, _RET_IP_);
> > > +}
> > > +EXPORT_SYMBOL(dma_fence_end_signalling);
> > > +
> > > +void __dma_fence_might_wait(void)
> > > +{
> > > +     bool tmp;
> > > +
> > > +     tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> > > +     if (tmp)
> > > +             lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> > > +     lock_map_acquire(&dma_fence_lockdep_map);
> > > +     lock_map_release(&dma_fence_lockdep_map);
> > > +     if (tmp)
> > > +             lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> > > +}
> > > +#endif
> > > +
> > > +
> > >   /**
> > >    * dma_fence_signal_locked - signal completion of a fence
> > >    * @fence: the fence to signal
> > > @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
> > >   {
> > >       unsigned long flags;
> > >       int ret;
> > > +     bool tmp;
> > >
> > >       if (!fence)
> > >               return -EINVAL;
> > >
> > > +     tmp = dma_fence_begin_signalling();
> > > +
> > >       spin_lock_irqsave(fence->lock, flags);
> > >       ret = dma_fence_signal_locked(fence);
> > >       spin_unlock_irqrestore(fence->lock, flags);
> > >
> > > +     dma_fence_end_signalling(tmp);
> > > +
> > >       return ret;
> > >   }
> > >   EXPORT_SYMBOL(dma_fence_signal);
> > > @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
> > >
> > >       might_sleep();
> > >
> > > +     __dma_fence_might_wait();
> > > +
> > >       trace_dma_fence_wait_start(fence);
> > >       if (fence->ops->wait)
> > >               ret = fence->ops->wait(fence, intr, timeout);
> > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> > > index 3347c54f3a87..3f288f7db2ef 100644
> > > --- a/include/linux/dma-fence.h
> > > +++ b/include/linux/dma-fence.h
> > > @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
> > >       } while (1);
> > >   }
> > >
> > > +#ifdef CONFIG_LOCKDEP
> > > +bool dma_fence_begin_signalling(void);
> > > +void dma_fence_end_signalling(bool cookie);
> > > +#else
> > > +static inline bool dma_fence_begin_signalling(void)
> > > +{
> > > +     return true;
> > > +}
> > > +static inline void dma_fence_end_signalling(bool cookie) {}
> > > +static inline void __dma_fence_might_wait(void) {}
> > > +#endif
> > > +
> > >   int dma_fence_signal(struct dma_fence *fence);
> > >   int dma_fence_signal_locked(struct dma_fence *fence);
> > >   signed long dma_fence_default_wait(struct dma_fence *fence,
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
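
For readers following the quoted patch, the cookie protocol of
dma_fence_begin_signalling() / dma_fence_end_signalling() can be modelled in
plain userspace C. This is only a sketch under assumptions: model_read_held
and model_in_atomic are made-up stand-ins for lock_is_held_type() and
in_atomic(), and nothing here talks to real lockdep. It just shows why the
"true" cookie means "do not release on exit".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state, for illustration only. */
static int model_read_held;   /* 1 while the read-side annotation is held */
static bool model_in_atomic;  /* plays the role of in_atomic() */

/*
 * Mirrors the shape of dma_fence_begin_signalling() in the quoted patch:
 * returns true when the caller must NOT release on exit, either because
 * the section is nested or because we are in atomic context.
 */
static bool model_begin_signalling(void)
{
	if (model_read_held)    /* explicit nesting */
		return true;
	if (model_in_atomic)    /* rely on might_sleep() instead */
		return true;
	model_read_held = 1;    /* outermost section takes the read lock */
	return false;
}

/* Mirrors dma_fence_end_signalling(): only the outermost section releases. */
static void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	assert(model_read_held == 1);
	model_read_held = 0;
}

/*
 * Nest two sections and encode the held state at three points as digits:
 * inside both, after the inner end, after the outer end.
 */
static int model_nested_depth_check(void)
{
	bool outer = model_begin_signalling();
	bool inner = model_begin_signalling();
	int held_inside = model_read_held;

	model_end_signalling(inner);
	int held_after_inner = model_read_held;

	model_end_signalling(outer);
	return held_inside * 100 + held_after_inner * 10 + model_read_held;
}
```

Calling model_nested_depth_check() from a fresh state shows the annotation
staying held across the inner end and being dropped only by the outer end,
which is the property the opaque cookie exists to provide.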

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 15:19       ` Alex Deucher
@ 2020-07-08 15:37         ` Daniel Vetter
  2020-07-14 11:09           ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-08 15:37 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx list, Daniel Vetter,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Wed, Jul 8, 2020 at 5:19 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 11:13 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Could we merge this controlled by a separate config option?
> > >
> > > This way we could have the checks upstream without having to fix all the
> > > stuff before we do this?
> >
> > Since it's fully opt-in annotations nothing blows up if we don't merge
> > any annotations. So we could start merging the first 3 patches. After
> > that the fun starts ...
> >
> > My rough idea was that first I'd try to tackle display, thus far
> > there's 2 actual issues in drivers:
> > - amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
> > think those should be fairly easy to fix (I'd try a stab at them even)
> > - vmwgfx has a full on locking inversion with dma_resv_lock in
> > commit_tail, and that one is functional. Not just reading something
> > which we can safely assume to be invariant anyway (like the tmz flag
> > for amdgpu, or whatever it was).
> >
> > I've done a pile more annotations patches for other atomic drivers
> > now, so hopefully that flushes out any remaining offenders here. Since
> > some of the annotations are in helper code worst case we might need a
> > dev->mode_config.broken_atomic_commit flag to disable them. At least
> > for now I have 0 plans to merge any of these while there's known
> > unsolved issues. Maybe if some drivers take forever to get fixed we
> > can then apply some duct-tape for the atomic helper annotation patch.
> > Instead of a flag we can also copypasta the atomic_commit_tail hook,
> > leaving the annotations out and adding a huge warning about that.
> >
> > Next big chunk is the drm/scheduler annotations:
> > - amdgpu needs a full rework of display reset (but apparently in the works)
>
> I think the display deadlock issues should be fixed in:
> https://cgit.freedesktop.org/drm/drm/commit/?id=cdaae8371aa9d4ea1648a299b1a75946b9556944

That's the reset/tdr inversion, there are three more:
- kmalloc, see https://cgit.freedesktop.org/~danvet/drm/commit/?id=d9353cc3bf6111430a24188b92412dc49e7ead79
- ttm_bo_reserve in the wrong place
https://cgit.freedesktop.org/~danvet/drm/commit/?id=a6c03176152625a2f9cf1e499aceb8b2217dc2a2
- console_lock in the wrong spot
https://cgit.freedesktop.org/~danvet/drm/commit/?id=a6c03176152625a2f9cf1e499aceb8b2217dc2a2

Especially the last one I have no idea how to address really.
-Daniel
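
To make the kmalloc item concrete: the kerneldoc in the patch notes that
dma_fence interacts with allocation sites through mmu_notifier and shrinker,
so an allocation inside a signalling section can presumably end up depending
on the very fence being signalled. A rough userspace sketch of the usual way
out, allocating before the section starts, might look like this. All names
(model_alloc, in_signalling_section, alloc_violations) are hypothetical and
only count how often the rule would be broken; this is not driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model state, not kernel API. */
static bool in_signalling_section;
static int alloc_violations;

/* Stand-in for an allocation that may recurse into reclaim. */
static void *model_alloc(size_t size)
{
	if (in_signalling_section)
		alloc_violations++;     /* what the annotation would flag */
	return malloc(size);
}

/* Problematic shape: allocate while inside the signalling section. */
static void model_signal_path_bad(void)
{
	in_signalling_section = true;   /* begin_signalling */
	void *tmp = model_alloc(64);
	free(tmp);
	in_signalling_section = false;  /* end_signalling */
}

/* Reworked shape: allocate up front, only use the memory inside. */
static void model_signal_path_prealloc(void)
{
	void *tmp = model_alloc(64);    /* outside the critical section */

	in_signalling_section = true;   /* begin_signalling */
	/* ... work that only uses tmp, no further allocation ... */
	in_signalling_section = false;  /* end_signalling */
	free(tmp);
}
```

Running the bad path once bumps alloc_violations to 1, and the preallocating
path leaves it unchanged. A mempool, as mentioned elsewhere in the thread,
would be the same idea with the reserve set up once instead of per call.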


>
> Alex
>
> > - I read all the drivers, they all have the fairly cosmetic issue of
> > doing small allocations in their callbacks.
> >
> > I might end up typing the mempool we need for the latter issue, but
> > first still hoping for some actual test feedback from other drivers
> > using drm/scheduler. Again no intentions of merging these annotations
> > without the drivers being fixed first, or at least some duct-tape
> > applied.
> >
> > Another option I've been thinking about, if there's cases where fixing
> > things properly is a lot of effort: We could do annotations for broken
> > sections (just the broken part, so we still catch bugs everywhere
> > else). They'd simply drop&reacquire the lock. We could then e.g. use
> > that in the amdgpu display reset code, and so still make sure that
> > everything else in reset doesn't get worse. But I think adding that
> > shouldn't be our first option.
> >
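
The "drop&reacquire" idea for known-broken sections could be pictured roughly
as below. This is a userspace sketch under assumptions, not a proposed API:
model_begin_broken_section() and model_end_broken_section() are invented
names, and the counters simply record which lock acquisitions would still be
validated against the fence annotation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, for illustration only. */
static bool annotation_held;
static int checked_acquires;    /* acquisitions seen under the annotation */
static int unchecked_acquires;  /* acquisitions inside a broken section */

static void model_begin_signalling(void) { annotation_held = true; }
static void model_end_signalling(void)   { annotation_held = false; }

/* Drop the annotation; remember whether it was held so it can be restored. */
static bool model_begin_broken_section(void)
{
	bool was_held = annotation_held;

	annotation_held = false;
	return was_held;
}

/* Reacquire the annotation if it was held before the broken section. */
static void model_end_broken_section(bool was_held)
{
	annotation_held = was_held;
}

/* Stand-in for taking some lock that lockdep would order against the fence. */
static void model_lock_acquire(void)
{
	if (annotation_held)
		checked_acquires++;
	else
		unchecked_acquires++;
}

/* A reset path where only the middle part is known to be broken. */
static void model_reset_path(void)
{
	model_begin_signalling();
	model_lock_acquire();                   /* still validated */

	bool cookie = model_begin_broken_section();
	model_lock_acquire();                   /* exempted, known broken */
	model_end_broken_section(cookie);

	model_lock_acquire();                   /* validated again */
	model_end_signalling();
}
```

After one call to model_reset_path() two acquisitions are checked and one is
not, which matches the stated goal of still catching bugs everywhere outside
the exempted part.
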
> > I'm not personally a big fan of the Kconfig or runtime option, only
> > upsets people since it breaks lockdep for them. Or they ignore it, and
> > we don't catch bugs, making it fairly pointless to merge.
> >
> > Cheers, Daniel
> >
> >
> > >
> > > Thanks,
> > > Christian.
> > >
> > > Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> > > > Design is similar to the lockdep annotations for workers, but with
> > > > some twists:
> > > >
> > > > - We use a read-lock for the execution/worker/completion side, so that
> > > >    this explicit annotation can be more liberally sprinkled around.
> > > >    With read locks lockdep isn't going to complain if the read-side
> > > >    isn't nested the same way under all circumstances, so ABBA deadlocks
> > > >    are ok. Which they are, since this is an annotation only.
> > > >
> > > > - We're using non-recursive lockdep read lock mode, since in recursive
> > > >    read lock mode lockdep does not catch read side hazards. And we
> > > >    _very_ much want read side hazards to be caught. For full details of
> > > >    this limitation see
> > > >
> > > >    commit e91498589746065e3ae95d9a00b068e525eec34f
> > > >    Author: Peter Zijlstra <peterz@infradead.org>
> > > >    Date:   Wed Aug 23 13:13:11 2017 +0200
> > > >
> > > >        locking/lockdep/selftests: Add mixed read-write ABBA tests
> > > >
> > > > - To allow nesting of the read-side explicit annotations we explicitly
> > > >    keep track of the nesting. lock_is_held() allows us to do that.
> > > >
> > > > - The wait-side annotation is a write lock, and entirely done within
> > > >    dma_fence_wait() for everyone by default.
> > > >
> > > > - To be able to freely annotate helper functions I want to make it ok
> > > >    to call dma_fence_begin/end_signalling from soft/hardirq context.
> > > >    First attempt was using the hardirq locking context for the write
> > > >    side in lockdep, but this forces all normal spinlocks nested within
> > > >    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> > > >
> > > >    The approach now is to simply check in_atomic(), and for these cases
> > > >    entirely rely on the might_sleep() check in dma_fence_wait(). That
> > > >    will catch any wrong nesting against spinlocks from soft/hardirq
> > > >    contexts.
> > > >
> > > > The idea here is that every code path that's critical for eventually
> > > > signalling a dma_fence should be annotated with
> > > > dma_fence_begin/end_signalling. The annotation ideally starts right
> > > > after a dma_fence is published (added to a dma_resv, exposed as a
> > > > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > > > makes the dma_fence visible to other kernel threads), up to and
> > > > including the dma_fence_wait(). Examples are irq handlers, the
> > > > scheduler rt threads, the tail of execbuf (after the corresponding
> > > > fences are visible), any workers that end up signalling dma_fences and
> > > > really anything else. Not annotated should be code paths that only
> > > > complete fences opportunistically as the gpu progresses, like e.g.
> > > > shrinker/eviction code.
> > > >
> > > > The main class of deadlocks this is supposed to catch are:
> > > >
> > > > Thread A:
> > > >
> > > >       mutex_lock(A);
> > > >       mutex_unlock(A);
> > > >
> > > >       dma_fence_signal();
> > > >
> > > > Thread B:
> > > >
> > > >       mutex_lock(A);
> > > >       dma_fence_wait();
> > > >       mutex_unlock(A);
> > > >
> > > > Thread B is blocked on A signalling the fence, but A never gets around
> > > > to that because it cannot acquire the lock A.
> > > >
> > > > Note that dma_fence_wait() is allowed to be nested within
> > > > dma_fence_begin/end_signalling sections. To allow this to happen the
> > > > read lock needs to be upgraded to a write lock, which means that if any
> > > > other lock is acquired between the dma_fence_begin_signalling() call and
> > > > the call to dma_fence_wait(), and is still held, this will result in an
> > > > immediate lockdep complaint. The only other option would be to not
> > > > annotate such calls, defeating the point. Therefore these annotations
> > > > cannot be sprinkled over the code entirely mindlessly if false positives
> > > > are to be avoided.
> > > >
> > > > Originally I hoped that the cross-release lockdep extensions would
> > > > alleviate the need for explicit annotations:
> > > >
> > > > https://lwn.net/Articles/709849/
> > > >
> > > > But there's a few reasons why that's not an option:
> > > >
> > > > - It's not happening in upstream, since it got reverted due to too
> > > >    many false positives:
> > > >
> > > >       commit e966eaeeb623f09975ef362c2866fae6f86844f9
> > > >       Author: Ingo Molnar <mingo@kernel.org>
> > > >       Date:   Tue Dec 12 12:31:16 2017 +0100
> > > >
> > > >           locking/lockdep: Remove the cross-release locking checks
> > > >
> > > >           This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
> > > >           while it found a number of old bugs initially, was also causing too many
> > > >           false positives that caused people to disable lockdep - which is arguably
> > > >           a worse overall outcome.
> > > >
> > > > - cross-release uses the complete() call to annotate the end of
> > > >    critical sections, for dma_fence that would be dma_fence_signal().
> > > >    But we do not want all dma_fence_signal() calls to be treated as
> > > >    critical, since many are opportunistic cleanup of gpu requests. If
> > > >    these get stuck there's still the main completion interrupt and
> > > >    workers who can unblock everyone. Automatically annotating all
> > > >    dma_fence_signal() calls would hence cause false positives.
> > > >
> > > > - cross-release had some educated guesses for when a critical section
> > > >    starts, like fresh syscall or fresh work callback. This would again
> > > >    cause false positives without explicit annotations, since for
> > > >    dma_fence the critical section only starts when we publish a fence.
> > > >
> > > > - Furthermore there can be cases where a thread never does a
> > > >    dma_fence_signal, but is still critical for reaching completion of
> > > >    fences. One example would be a scheduler kthread which picks up jobs
> > > >    and pushes them into hardware, where the interrupt handler or
> > > >    another completion thread calls dma_fence_signal(). But if the
> > > >    scheduler thread hangs, then all the fences hang, hence we need to
> > > >    manually annotate it. cross-release aimed to solve this by chaining
> > > >    cross-release dependencies, but the dependency from scheduler thread
> > > >    to the completion interrupt handler goes through hw where
> > > >    cross-release code can't observe it.
> > > >
> > > > In short, without manual annotations and careful review of the start
> > > > and end of critical sections, cross-release dependency tracking doesn't
> > > > work. We need explicit annotations.
> > > >
> > > > v2: handle soft/hardirq ctx better against write side and don't forget
> > > > EXPORT_SYMBOL, drivers can't use this otherwise.
> > > >
> > > > v3: Kerneldoc.
> > > >
> > > > v4: Some spelling fixes from Mika
> > > >
> > > > v5: Amend commit message to explain in detail why cross-release isn't
> > > > the solution.
> > > >
> > > > v6: Pull out misplaced .rst hunk.
> > > >
> > > > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > > > Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > > Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Cc: linux-rdma@vger.kernel.org
> > > > Cc: amd-gfx@lists.freedesktop.org
> > > > Cc: intel-gfx@lists.freedesktop.org
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >   Documentation/driver-api/dma-buf.rst |   6 +
> > > >   drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
> > > >   include/linux/dma-fence.h            |  12 ++
> > > >   3 files changed, 179 insertions(+)
> > > >
> > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > > > index 7fb7b661febd..05d856131140 100644
> > > > --- a/Documentation/driver-api/dma-buf.rst
> > > > +++ b/Documentation/driver-api/dma-buf.rst
> > > > @@ -133,6 +133,12 @@ DMA Fences
> > > >   .. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > >      :doc: DMA fences overview
> > > >
> > > > +DMA Fence Signalling Annotations
> > > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > +
> > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > +   :doc: fence signalling annotation
> > > > +
> > > >   DMA Fences Functions Reference
> > > >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > >
> > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > index 656e9ac2d028..0005bc002529 100644
> > > > --- a/drivers/dma-buf/dma-fence.c
> > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
> > > >   }
> > > >   EXPORT_SYMBOL(dma_fence_context_alloc);
> > > >
> > > > +/**
> > > > + * DOC: fence signalling annotation
> > > > + *
> > > > + * Proving correctness of all the kernel code around &dma_fence through code
> > > > + * review and testing is tricky for a few reasons:
> > > > + *
> > > > + * * It is a cross-driver contract, and therefore all drivers must follow the
> > > > + *   same rules for lock nesting order, calling contexts for various functions
> > > > + *   and anything else significant for in-kernel interfaces. But it is also
> > > > + *   impossible to test all drivers in a single machine, hence brute-force N vs.
> > > > + *   N testing of all combinations is impossible. Even just limiting to the
> > > > + *   possible combinations is infeasible.
> > > > + *
> > > > + * * There is an enormous amount of driver code involved. For render drivers
> > > > + *   there's the tail of command submission, after fences are published,
> > > > + *   scheduler code, interrupt and workers to process job completion,
> > > > + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
> > > > + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
> > > > + *   and &shrinker. For modesetting drivers there's the commit tail functions
> > > > + *   between when fences for an atomic modeset are published, and when the
> > > > + *   corresponding vblank completes, including any interrupt processing and
> > > > + *   related workers. Auditing all that code, across all drivers, is not
> > > > + *   feasible.
> > > > + *
> > > > + * * Due to how many other subsystems are involved and the locking hierarchies
> > > > + *   this pulls in there is extremely thin wiggle-room for driver-specific
> > > > + *   differences. &dma_fence interacts with almost all of the core memory
> > > > + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
> > > > + *   dma_resv_unlock(). On the other side it also interacts through all
> > > > + *   allocation sites through &mmu_notifier and &shrinker.
> > > > + *
> > > > + * Furthermore lockdep does not handle cross-release dependencies, which means
> > > > + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
> > > > + * at runtime with some quick testing. The simplest example is one thread
> > > > + * waiting on a &dma_fence while holding a lock::
> > > > + *
> > > > + *     lock(A);
> > > > + *     dma_fence_wait(B);
> > > > + *     unlock(A);
> > > > + *
> > > > + * while the other thread is stuck trying to acquire the same lock, which
> > > > + * prevents it from signalling the fence the previous thread is stuck waiting
> > > > + * on::
> > > > + *
> > > > + *     lock(A);
> > > > + *     unlock(A);
> > > > + *     dma_fence_signal(B);
> > > > + *
> > > > + * By manually annotating all code relevant to signalling a &dma_fence we can
> > > > + * teach lockdep about these dependencies, which also helps with the validation
> > > > + * headache since now lockdep can check all the rules for us::
> > > > + *
> > > > + *    cookie = dma_fence_begin_signalling();
> > > > + *    lock(A);
> > > > + *    unlock(A);
> > > > + *    dma_fence_signal(B);
> > > > + *    dma_fence_end_signalling(cookie);
> > > > + *
> > > > + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
> > > > + * annotate critical sections the following rules need to be observed:
> > > > + *
> > > > + * * All code necessary to complete a &dma_fence must be annotated, from the
> > > > + *   point where a fence is accessible to other threads, to the point where
> > > > + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
> > > > + *   and due to the very strict rules and many corner cases it is infeasible to
> > > > + *   catch these just with review or normal stress testing.
> > > > + *
> > > > + * * &struct dma_resv deserves a special note, since the readers are only
> > > > + *   protected by rcu. This means the signalling critical section starts as soon
> > > > + *   as the new fences are installed, even before dma_resv_unlock() is called.
> > > > + *
> > > > + * * The only exception are fast paths and opportunistic signalling code, which
> > > > + *   calls dma_fence_signal() purely as an optimization, but is not required to
> > > > + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
> > > > + *   which calls dma_fence_signal(), while the mandatory completion path goes
> > > > + *   through a hardware interrupt and possible job completion worker.
> > > > + *
> > > > + * * To aid composability of code, the annotations can be freely nested, as long
> > > > + *   as the overall locking hierarchy is consistent. The annotations also work
> > > > + *   both in interrupt and process context. Due to implementation details this
> > > > + *   requires that callers pass an opaque cookie from
> > > > + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
> > > > + *
> > > > + * * Validation against the cross driver contract is implemented by priming
> > > > + *   lockdep with the relevant hierarchy at boot-up. This means even just
> > > > + *   testing with a single device is enough to validate a driver, at least as
> > > > + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
> > > > + *   concerned.
> > > > + */
> > > > +#ifdef CONFIG_LOCKDEP
> > > > +struct lockdep_map   dma_fence_lockdep_map = {
> > > > +     .name = "dma_fence_map"
> > > > +};
> > > > +
> > > > +/**
> > > > + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
> > > > + *
> > > > + * Drivers should use this to annotate the beginning of any code section
> > > > + * required to eventually complete &dma_fence by calling dma_fence_signal().
> > > > + *
> > > > + * The end of these critical sections is annotated with
> > > > + * dma_fence_end_signalling().
> > > > + *
> > > > + * Returns:
> > > > + *
> > > > + * Opaque cookie needed by the implementation, which needs to be passed to
> > > > + * dma_fence_end_signalling().
> > > > + */
> > > > +bool dma_fence_begin_signalling(void)
> > > > +{
> > > > +     /* explicitly nesting ... */
> > > > +     if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> > > > +             return true;
> > > > +
> > > > +     /* rely on might_sleep check for soft/hardirq locks */
> > > > +     if (in_atomic())
> > > > +             return true;
> > > > +
> > > > +     /* ... and non-recursive readlock */
> > > > +     lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> > > > +
> > > > +     return false;
> > > > +}
> > > > +EXPORT_SYMBOL(dma_fence_begin_signalling);
> > > > +
> > > > +/**
> > > > + * dma_fence_end_signalling - end a critical DMA fence signalling section
> > > > + *
> > > > + * Closes a critical section annotation opened by dma_fence_begin_signalling().
> > > > + */
> > > > +void dma_fence_end_signalling(bool cookie)
> > > > +{
> > > > +     if (cookie)
> > > > +             return;
> > > > +
> > > > +     lock_release(&dma_fence_lockdep_map, _RET_IP_);
> > > > +}
> > > > +EXPORT_SYMBOL(dma_fence_end_signalling);
> > > > +
> > > > +void __dma_fence_might_wait(void)
> > > > +{
> > > > +     bool tmp;
> > > > +
> > > > +     tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> > > > +     if (tmp)
> > > > +             lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> > > > +     lock_map_acquire(&dma_fence_lockdep_map);
> > > > +     lock_map_release(&dma_fence_lockdep_map);
> > > > +     if (tmp)
> > > > +             lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> > > > +}
> > > > +#endif
> > > > +
> > > > +
> > > >   /**
> > > >    * dma_fence_signal_locked - signal completion of a fence
> > > >    * @fence: the fence to signal
> > > > @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
> > > >   {
> > > >       unsigned long flags;
> > > >       int ret;
> > > > +     bool tmp;
> > > >
> > > >       if (!fence)
> > > >               return -EINVAL;
> > > >
> > > > +     tmp = dma_fence_begin_signalling();
> > > > +
> > > >       spin_lock_irqsave(fence->lock, flags);
> > > >       ret = dma_fence_signal_locked(fence);
> > > >       spin_unlock_irqrestore(fence->lock, flags);
> > > >
> > > > +     dma_fence_end_signalling(tmp);
> > > > +
> > > >       return ret;
> > > >   }
> > > >   EXPORT_SYMBOL(dma_fence_signal);
> > > > @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
> > > >
> > > >       might_sleep();
> > > >
> > > > +     __dma_fence_might_wait();
> > > > +
> > > >       trace_dma_fence_wait_start(fence);
> > > >       if (fence->ops->wait)
> > > >               ret = fence->ops->wait(fence, intr, timeout);
> > > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> > > > index 3347c54f3a87..3f288f7db2ef 100644
> > > > --- a/include/linux/dma-fence.h
> > > > +++ b/include/linux/dma-fence.h
> > > > @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
> > > >       } while (1);
> > > >   }
> > > >
> > > > +#ifdef CONFIG_LOCKDEP
> > > > +bool dma_fence_begin_signalling(void);
> > > > +void dma_fence_end_signalling(bool cookie);
> > > > +#else
> > > > +static inline bool dma_fence_begin_signalling(void)
> > > > +{
> > > > +     return true;
> > > > +}
> > > > +static inline void dma_fence_end_signalling(bool cookie) {}
> > > > +static inline void __dma_fence_might_wait(void) {}
> > > > +#endif
> > > > +
> > > >   int dma_fence_signal(struct dma_fence *fence);
> > > >   int dma_fence_signal_locked(struct dma_fence *fence);
> > > >   signed long dma_fence_default_wait(struct dma_fence *fence,
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
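
The cookie handling in the quoted dma_fence_begin_signalling() and
dma_fence_end_signalling() can be modelled outside the kernel. The block
below is only a rough userspace sketch, not the kernel code: the two
booleans stand in for lock_is_held_type() and in_atomic(), and every
model_* name is made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the lockdep state in the patch. */
static bool model_map_held;  /* models lock_is_held_type(&dma_fence_lockdep_map, 1) */
static bool model_in_atomic; /* models in_atomic() */

/*
 * Returns the cookie. true means "nothing was acquired here", so the
 * matching end call must not release anything.
 */
static inline bool model_begin_signalling(void)
{
	if (model_map_held)     /* nested inside an annotated section */
		return true;
	if (model_in_atomic)    /* atomic context: annotation is skipped */
		return true;
	model_map_held = true;  /* models lock_acquire() of the read side */
	return false;
}

static inline void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	model_map_held = false; /* models lock_release() */
}
```

Only the outermost begin/end pair toggles the state in this model. Nested
pairs, and pairs opened while the atomic flag is set, get a true cookie and
leave it alone.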



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 15:12     ` Daniel Vetter
  2020-07-08 15:19       ` Alex Deucher
@ 2020-07-09  7:32       ` Daniel Stone
  2020-07-09  7:52         ` Daniel Vetter
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Stone @ 2020-07-09  7:32 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, linux-rdma, Intel Graphics Development,
	DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx list, Daniel Vetter,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

Hi,

On Wed, 8 Jul 2020 at 16:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
> > Could we merge this controlled by a separate config option?
> >
> > This way we could have the checks upstream without having to fix all the
> > stuff before we do this?
>
> Since it's fully opt-in annotations nothing blows up if we don't merge
> any annotations. So we could start merging the first 3 patches. After
> that the fun starts ...
>
> My rough idea was that first I'd try to tackle display, thus far
> there's 2 actual issues in drivers:
> - amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
> think those should be fairly easy to fix (I'd try a stab at them even)
> - vmwgfx has a full on locking inversion with dma_resv_lock in
> commit_tail, and that one is functional. Not just reading something
> which we can safely assume to be invariant anyway (like the tmz flag
> for amdgpu, or whatever it was).
>
> I've done a pile more annotations patches for other atomic drivers
> now, so hopefully that flushes out any remaining offenders here. Since
> some of the annotations are in helper code worst case we might need a
> dev->mode_config.broken_atomic_commit flag to disable them. At least
> for now I have 0 plans to merge any of these while there's known
> unsolved issues. Maybe if some drivers take forever to get fixed we
> can then apply some duct-tape for the atomic helper annotation patch.
> Instead of a flag we can also copypasta the atomic_commit_tail hook,
> leaving the annotations out and adding a huge warning about that.

How about an opt-in drm_driver DRIVER_DEADLOCK_HAPPY flag? At first
this could just disable the annotations and nothing else, but as we
see the annotations gaining real-world testing and maturity, we could
eventually make it taint the kernel.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
@ 2020-07-09  7:36   ` Daniel Stone
  2020-07-09  8:04     ` Daniel Vetter
  2020-07-09 11:53   ` Christian König
  2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
  2 siblings, 1 reply; 83+ messages in thread
From: Daniel Stone @ 2020-07-09  7:36 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, Christian König, linux-rdma,
	Intel Graphics Development, amd-gfx mailing list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

Hi,

On Tue, 7 Jul 2020 at 21:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.

Thanks for writing this up! I wonder if any of the notes from my reply
to the previous-version thread would be helpful to more explicitly
encode the carrot of dma-fence's positive guarantees, rather than just
the stick of 'don't do this'. ;) Either way, this is:
Acked-by: Daniel Stone <daniels@collabora.com>

> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.

... or whether we just say that you can never use dma-fence in
conjunction with userptr.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-09  7:32       ` [Intel-gfx] " Daniel Stone
@ 2020-07-09  7:52         ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-09  7:52 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Daniel Vetter, Christian König, linux-rdma,
	Intel Graphics Development, DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx list, Daniel Vetter,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Thu, Jul 09, 2020 at 08:32:41AM +0100, Daniel Stone wrote:
> Hi,
> 
> On Wed, 8 Jul 2020 at 16:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
> > > Could we merge this controlled by a separate config option?
> > >
> > > This way we could have the checks upstream without having to fix all the
> > > stuff before we do this?
> >
> > Since it's fully opt-in annotations nothing blows up if we don't merge
> > any annotations. So we could start merging the first 3 patches. After
> > that the fun starts ...
> >
> > My rough idea was that first I'd try to tackle display, thus far
> > there's 2 actual issues in drivers:
> > - amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
> > think those should be fairly easy to fix (I'd try a stab at them even)
> > - vmwgfx has a full on locking inversion with dma_resv_lock in
> > commit_tail, and that one is functional. Not just reading something
> > which we can safely assume to be invariant anyway (like the tmz flag
> > for amdgpu, or whatever it was).
> >
> > I've done a pile more annotations patches for other atomic drivers
> > now, so hopefully that flushes out any remaining offenders here. Since
> > some of the annotations are in helper code worst case we might need a
> > dev->mode_config.broken_atomic_commit flag to disable them. At least
> > for now I have 0 plans to merge any of these while there's known
> > unsolved issues. Maybe if some drivers take forever to get fixed we
> > can then apply some duct-tape for the atomic helper annotation patch.
> > Instead of a flag we can also copypasta the atomic_commit_tail hook,
> > leaving the annotations out and adding a huge warning about that.
> 
> How about an opt-in drm_driver DRIVER_DEADLOCK_HAPPY flag? At first
> this could just disable the annotations and nothing else, but as we
> see the annotations gaining real-world testing and maturity, we could
> eventually make it taint the kernel.

You can do that pretty much per-driver, since the annotations are pretty
much per-driver. No annotations in your code, no lockdep splat. Only if
there's some dma_fence_begin/end_signalling() calls is there even the
chance of a problem.

E.g. this round has the i915 patch dropped and *traraaaa* intel-gfx-ci is
happy (or well at least a lot happier, there's some noise in there that's
probably not from my stuff).

So I guess if amd wants this, we could do a DRM_AMDGPU_MOAR_LOCKDEP
Kconfig or similar. I haven't tested, but I think as long as we don't
merge any of the amdgpu specific patches, there's no splat in amdgpu. So
with that I think that's plenty enough opt-in for each driver. The only
problem is the bit of shared helper code like atomic helpers and drm scheduler.
There we might need some opt-out (I don't think merging makes sense when
most of the users are still broken).
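
A per-driver switch like that could be a pair of thin wrappers that compile
down to nothing unless the option is set. What follows is a hedged userspace
sketch only: DRIVER_MOAR_LOCKDEP, the driver_* wrappers and the stub_*
functions are hypothetical names, and the stubs merely count calls in place
of the real annotation functions.

```c
#include <stdbool.h>

/* Call-counting stubs standing in for the real annotations (assumption). */
static int stub_annotation_calls;

static inline bool stub_begin_signalling(void)
{
	stub_annotation_calls++;
	return false;
}

static inline void stub_end_signalling(bool cookie)
{
	(void)cookie;
	stub_annotation_calls++;
}

#ifdef DRIVER_MOAR_LOCKDEP /* hypothetical per-driver opt-in switch */
#define driver_begin_signalling()   stub_begin_signalling()
#define driver_end_signalling(c)    stub_end_signalling(c)
#else
/* Opted out: same shape as the !CONFIG_LOCKDEP fallbacks in the patch. */
static inline bool driver_begin_signalling(void) { return true; }
static inline void driver_end_signalling(bool cookie) { (void)cookie; }
#endif

/* A driver path that is required to reach dma_fence_signal(). */
static inline int driver_complete_job(void)
{
	bool cookie = driver_begin_signalling();
	/* ... work needed to signal the fence would go here ... */
	driver_end_signalling(cookie);
	return 0;
}
```

With the switch left undefined the wrappers never reach the annotation code,
which matches the "no annotations in your code, no lockdep splat" point
above.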
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-09  7:36   ` [Intel-gfx] " Daniel Stone
@ 2020-07-09  8:04     ` Daniel Vetter
  2020-07-09 12:11       ` Daniel Stone
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-09  8:04 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Daniel Vetter, DRI Development, Christian König, linux-rdma,
	Intel Graphics Development, amd-gfx mailing list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Thu, Jul 09, 2020 at 08:36:43AM +0100, Daniel Stone wrote:
> Hi,
> 
> On Tue, 7 Jul 2020 at 21:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > Comes up every few years, gets somewhat tedious to discuss, let's
> > write this down once and for all.
> 
> Thanks for writing this up! I wonder if any of the notes from my reply
> to the previous-version thread would be helpful to more explicitly
> encode the carrot of dma-fence's positive guarantees, rather than just
> the stick of 'don't do this'. ;) Either way, this is:

I think the carrot should go into the intro section for dma-fence, this
section here is very much just the "don't do this" part. The previous
patches have an attempt at encoding this a bit, maybe see whether there's
a place for your reply (or parts of it) to fit?

> Acked-by: Daniel Stone <daniels@collabora.com>
> 
> > What I'm not sure about is whether the text should be more explicit in
> > flat out mandating the amdkfd eviction fences for long running compute
> > workloads or workloads where userspace fencing is allowed.
> 
> ... or whether we just say that you can never use dma-fence in
> conjunction with userptr.

Uh userptr is an entirely different thing. That one is ok. It's userspace
fences or gpu futexes or future fences or whatever we want to call them.
Or is there some other confusion here?
-Daniel


> 
> Cheers,
> Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-07 20:12 ` [PATCH 02/25] dma-fence: prime " Daniel Vetter
@ 2020-07-09  8:09   ` Daniel Vetter
  2020-07-10 12:43     ` Jason Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-09  8:09 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	Jason Gunthorpe, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter

Hi Jason,

Below the paragraph I've added after our discussions around dma-fences
outside of drivers/gpu. Good enough for an ack on this, or want something
changed?

Thanks, Daniel

> + * Note that only GPU drivers have a reasonable excuse for both requiring
> + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> + * track asynchronous compute work using &dma_fence. No driver outside of
> + * drivers/gpu should ever call dma_fence_wait() in such contexts.


On Tue, Jul 07, 2020 at 10:12:06PM +0200, Daniel Vetter wrote:
> Two in one go:
> - it is allowed to call dma_fence_wait() while holding a
>   dma_resv_lock(). This is fundamental to how eviction works with ttm,
>   so required.
> 
> - it is allowed to call dma_fence_wait() from memory reclaim contexts,
>   specifically from shrinker callbacks (which i915 does), and from mmu
>   notifier callbacks (which amdgpu does, and which i915 sometimes also
>   does, and probably always should, but that's kinda a debate). Also
>   for stuff like HMM we really need to be able to do this, or things
>   get real dicey.
> 
> Consequence is that any critical path necessary to get to a
> dma_fence_signal for a fence must never a) call dma_resv_lock nor b)
> allocate memory with GFP_KERNEL. Also by implication of
> dma_resv_lock(), no userspace faulting allowed. That's some supremely
> obnoxious limitations, which is why we need to sprinkle the right
> annotations to all relevant paths.
> 
> The one big locking context we're leaving out here is mmu notifiers,
> added in
> 
> commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Mon Aug 26 22:14:21 2019 +0200
> 
>     mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
> 
> that one covers a lot of other callsites, and it's also allowed to
> wait on dma-fences from mmu notifiers. But there's no ready-made
> functions exposed to prime this, so I've left it out for now.
> 
> v2: Also track against mmu notifier context.
> 
> v3: kerneldoc to spec the cross-driver contract. Note that currently
> i915 throws in a hard-coded 10s timeout on foreign fences (not sure
> why that was done, but it's there), which is why that rule is worded
> with SHOULD instead of MUST.
> 
> Also some of the mmu_notifier/shrinker rules might surprise SoC
> drivers, I haven't fully audited them all. Which is infeasible anyway,
> we'll need to run them with lockdep and dma-fence annotations and see
> what goes boom.
> 
> v4: A spelling fix from Mika
> 
> v5: #ifdef for CONFIG_MMU_NOTIFIER. Reported by 0day. Unfortunately
> this means lockdep enforcement is slightly inconsistent, it won't spot
> GFP_NOIO and GFP_NOFS allocations in the wrong spot if
> CONFIG_MMU_NOTIFIER is disabled in the kernel config. Oh well.
> 
> v6: Note that only drivers/gpu has a reasonable (or at least
> historical) excuse to use dma_fence_wait() from shrinker and mmu
> notifier callbacks. Everyone else should either have a better memory
> manager model, or better hardware. This reflects discussions with
> Jason Gunthorpe.
> 
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: kernel test robot <lkp@intel.com>
> Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com> (v4)
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst |  6 ++++
>  drivers/dma-buf/dma-fence.c          | 46 ++++++++++++++++++++++++++++
>  drivers/dma-buf/dma-resv.c           |  8 +++++
>  include/linux/dma-fence.h            |  1 +
>  4 files changed, 61 insertions(+)
> 
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index 05d856131140..f8f6decde359 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -133,6 +133,12 @@ DMA Fences
>  .. kernel-doc:: drivers/dma-buf/dma-fence.c
>     :doc: DMA fences overview
>  
> +DMA Fence Cross-Driver Contract
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> +   :doc: fence cross-driver contract
> +
>  DMA Fence Signalling Annotations
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 0005bc002529..af1d8ea926b3 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -64,6 +64,52 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
>   *   &dma_buf.resv pointer.
>   */
>  
> +/**
> + * DOC: fence cross-driver contract
> + *
> + * Since &dma_fence provides a cross driver contract, all drivers must follow the
> + * same rules:
> + *
> + * * Fences must complete in a reasonable time. Fences which represent kernels
> + *   and shaders submitted by userspace, which could run forever, must be backed
> + *   up by timeout and gpu hang recovery code. Minimally that code must prevent
> + *   further command submission and force complete all in-flight fences, e.g.
> + *   when the driver or hardware do not support gpu reset, or if the gpu reset
> + *   failed for some reason. Ideally the driver supports gpu recovery which only
> + *   affects the offending userspace context, and no other userspace
> + *   submissions.
> + *
> + * * Drivers may have different ideas of what completion within a reasonable
> + *   time means. Some hang recovery code uses a fixed timeout, others a mix
> + *   between observing forward progress and increasingly strict timeouts.
> + *   Drivers should not try to second guess timeout handling of fences from
> + *   other drivers.
> + *
> + * * To ensure there's no deadlocks of dma_fence_wait() against other locks
> + *   drivers should annotate all code required to reach dma_fence_signal(),
> + *   which completes the fences, with dma_fence_begin_signalling() and
> + *   dma_fence_end_signalling().
> + *
> + * * Drivers are allowed to call dma_fence_wait() while holding dma_resv_lock().
> + *   This means any code required for fence completion cannot acquire a
> + *   &dma_resv lock. Note that this also pulls in the entire established
> + *   locking hierarchy around dma_resv_lock() and dma_resv_unlock().
> + *
> + * * Drivers are allowed to call dma_fence_wait() from their &shrinker
> + *   callbacks. This means any code required for fence completion cannot
> + *   allocate memory with GFP_KERNEL.
> + *
> + * * Drivers are allowed to call dma_fence_wait() from their &mmu_notifier
> + *   respectively &mmu_interval_notifier callbacks. This means any code required
> + *   for fence completion cannot allocate memory with GFP_NOFS or GFP_NOIO.
> + *   Only GFP_ATOMIC is permissible, which might fail.
> + *
> + * Note that only GPU drivers have a reasonable excuse for both requiring
> + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> + * track asynchronous compute work using &dma_fence. No driver outside of
> + * drivers/gpu should ever call dma_fence_wait() in such contexts.
> + */
> +
>  static const char *dma_fence_stub_get_name(struct dma_fence *fence)
>  {
>          return "stub";
> diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
> index e7d7197d48ce..0e6675ec1d11 100644
> --- a/drivers/dma-buf/dma-resv.c
> +++ b/drivers/dma-buf/dma-resv.c
> @@ -36,6 +36,7 @@
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/sched/mm.h>
> +#include <linux/mmu_notifier.h>
>  
>  /**
>   * DOC: Reservation Object Overview
> @@ -116,6 +117,13 @@ static int __init dma_resv_lockdep(void)
>  	if (ret == -EDEADLK)
>  		dma_resv_lock_slow(&obj, &ctx);
>  	fs_reclaim_acquire(GFP_KERNEL);
> +#ifdef CONFIG_MMU_NOTIFIER
> +	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> +	__dma_fence_might_wait();
> +	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> +#else
> +	__dma_fence_might_wait();
> +#endif
>  	fs_reclaim_release(GFP_KERNEL);
>  	ww_mutex_unlock(&obj.lock);
>  	ww_acquire_fini(&ctx);
> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> index 3f288f7db2ef..09e23adb351d 100644
> --- a/include/linux/dma-fence.h
> +++ b/include/linux/dma-fence.h
> @@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>  #ifdef CONFIG_LOCKDEP
>  bool dma_fence_begin_signalling(void);
>  void dma_fence_end_signalling(bool cookie);
> +void __dma_fence_might_wait(void);
>  #else
>  static inline bool dma_fence_begin_signalling(void)
>  {
> -- 
> 2.27.0
> 
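
One way a driver might live with the allocation rules in the contract above
is to allocate everything the completion path needs before the fence is
published. This is only a userspace sketch of that idea, and an assumption
on my part rather than something the patch prescribes: malloc() stands in
for a sleeping kernel allocation, and all model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

struct model_job {
	int *result;    /* allocated before the fence is published */
	bool signalled; /* models the fence state */
};

/* Runs before the fence is visible to anyone, so it may allocate. */
static inline int model_job_prepare(struct model_job *job)
{
	job->result = malloc(sizeof(*job->result));
	job->signalled = false;
	return job->result ? 0 : -1;
}

/*
 * Models the signalling critical section: it only touches memory that
 * was set up in model_job_prepare() and never allocates.
 */
static inline void model_job_complete(struct model_job *job, int value)
{
	*job->result = value;
	job->signalled = true; /* models dma_fence_signal() */
}

static inline void model_job_free(struct model_job *job)
{
	free(job->result);
	job->result = NULL;
}
```

The allocation can fail or block in model_job_prepare() without any waiter
being stuck on it, because nothing can wait on the fence yet at that point.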

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
  2020-07-09  7:36   ` [Intel-gfx] " Daniel Stone
@ 2020-07-09 11:53   ` Christian König
  2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
  2 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2020-07-09 11:53 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: Intel Graphics Development, linux-rdma, Jesse Natalie,
	Steve Pronovost, Jason Ekstrand, Felix Kuehling, Mika Kuoppala,
	Thomas Hellstrom, linux-media, linaro-mm-sig, amd-gfx,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.
>
> v2: Now with dot graph!
>
> Cc: Jesse Natalie <jenatali@microsoft.com>
> Cc: Steve Pronovost <spronovo@microsoft.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   Documentation/driver-api/dma-buf.rst     | 70 ++++++++++++++++++++++++
>   drivers/gpu/drm/virtio/virtgpu_display.c | 20 -------
>   2 files changed, 70 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index f8f6decde359..037ba0078bb4 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
>   .. kernel-doc:: include/linux/sync_file.h
>      :internal:
>   
> +Indefinite DMA Fences
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +At various times &dma_fence with an indefinite time until dma_fence_wait()
> +finishes have been proposed. Examples include:
> +
> +* Future fences, used in HWC1 to signal when a buffer isn't used by the display
> +  any longer, and created with the screen update that makes the buffer visible.
> +  The time this fence completes is entirely under userspace's control.
> +
> +* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
> +  been set. Used to asynchronously delay command submission.
> +
> +* Userspace fences or gpu futexes, fine-grained locking within a command buffer
> +  that userspace uses for synchronization across engines or with the CPU, which
> +  are then imported as a DMA fence for integration into existing winsys
> +  protocols.
> +
> +* Long-running compute command buffers, while still using traditional end of
> +  batch DMA fences for memory management instead of context preemption DMA
> +  fences which get reattached when the compute job is rescheduled.
> +
> +Common to all these schemes is that userspace controls the dependencies of these
> +fences and controls when they fire. Mixing indefinite fences with normal
> +in-kernel DMA fences does not work, even when a fallback timeout is included to
> +protect against malicious userspace:
> +
> +* Only the kernel knows about all DMA fence dependencies, userspace is not aware
> +  of dependencies injected due to memory management or scheduler decisions.
> +
> +* Only userspace knows about all dependencies in indefinite fences and when
> +  exactly they will complete, the kernel has no visibility.
> +
> +Furthermore the kernel has to be able to hold up userspace command submission
> +for memory management needs, which means we must support indefinite fences being
> +dependent upon DMA fences. If the kernel also supports indefinite fences in the
> +kernel like a DMA fence, as any of the above proposals would, there is the
> +potential for deadlocks.
> +
> +.. kernel-render:: DOT
> +   :alt: Indefinite Fencing Dependency Cycle
> +   :caption: Indefinite Fencing Dependency Cycle
> +
> +   digraph "Fencing Cycle" {
> +      node [shape=box bgcolor=grey style=filled]
> +      kernel [label="Kernel DMA Fences"]
> +      userspace [label="userspace controlled fences"]
> +      kernel -> userspace [label="memory management"]
> +      userspace -> kernel [label="Future fence, fence proxy, ..."]
> +
> +      { rank=same; kernel userspace }
> +   }
> +
> +This means that the kernel might accidentally create deadlocks
> +through memory management dependencies which userspace is unaware of, which
> +randomly hangs workloads until the timeout kicks in. These are workloads which,
> +from userspace's perspective, do not contain a deadlock. In such a mixed fencing
> +architecture there is no single entity with knowledge of all dependencies.
> +Therefore preventing such deadlocks from within the kernel is not possible.
> +
> +The only solution to avoid dependency loops is by not allowing indefinite
> +fences in the kernel. This means:
> +
> +* No future fences, proxy fences or userspace fences imported as DMA fences,
> +  with or without a timeout.
> +
> +* No DMA fences that signal end of batchbuffer for command submission where
> +  userspace is allowed to use userspace fencing or long running compute
> +  workloads. This also means no implicit fencing for shared buffers in these
> +  cases.
> diff --git a/drivers/gpu/drm/virtio/virtgpu_display.c b/drivers/gpu/drm/virtio/virtgpu_display.c
> index f3ce49c5a34c..af55b334be2f 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_display.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_display.c
> @@ -314,25 +314,6 @@ virtio_gpu_user_framebuffer_create(struct drm_device *dev,
>   	return &virtio_gpu_fb->base;
>   }
>   
> -static void vgdev_atomic_commit_tail(struct drm_atomic_state *state)
> -{
> -	struct drm_device *dev = state->dev;
> -
> -	drm_atomic_helper_commit_modeset_disables(dev, state);
> -	drm_atomic_helper_commit_modeset_enables(dev, state);
> -	drm_atomic_helper_commit_planes(dev, state, 0);
> -
> -	drm_atomic_helper_fake_vblank(state);
> -	drm_atomic_helper_commit_hw_done(state);
> -
> -	drm_atomic_helper_wait_for_vblanks(dev, state);
> -	drm_atomic_helper_cleanup_planes(dev, state);
> -}
> -
> -static const struct drm_mode_config_helper_funcs virtio_mode_config_helpers = {
> -	.atomic_commit_tail = vgdev_atomic_commit_tail,
> -};
> -
>   static const struct drm_mode_config_funcs virtio_gpu_mode_funcs = {
>   	.fb_create = virtio_gpu_user_framebuffer_create,
>   	.atomic_check = drm_atomic_helper_check,
> @@ -346,7 +327,6 @@ void virtio_gpu_modeset_init(struct virtio_gpu_device *vgdev)
>   	drm_mode_config_init(vgdev->ddev);
>   	vgdev->ddev->mode_config.quirk_addfb_prefer_host_byte_order = true;
>   	vgdev->ddev->mode_config.funcs = &virtio_gpu_mode_funcs;
> -	vgdev->ddev->mode_config.helper_private = &virtio_mode_config_helpers;
>   
>   	/* modes will be validated against the framebuffer size */
>   	vgdev->ddev->mode_config.min_width = XRES_MIN;


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-09  8:04     ` Daniel Vetter
@ 2020-07-09 12:11       ` Daniel Stone
  2020-07-09 12:31         ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Stone @ 2020-07-09 12:11 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, DRI Development, Christian König, linux-rdma,
	Intel Graphics Development, amd-gfx mailing list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Thu, 9 Jul 2020 at 09:05, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Jul 09, 2020 at 08:36:43AM +0100, Daniel Stone wrote:
> > On Tue, 7 Jul 2020 at 21:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > Comes up every few years, gets somewhat tedious to discuss, let's
> > > write this down once and for all.
> >
> > Thanks for writing this up! I wonder if any of the notes from my reply
> > to the previous-version thread would be helpful to more explicitly
> > encode the carrot of dma-fence's positive guarantees, rather than just
> > the stick of 'don't do this'. ;) Either way, this is:
>
> I think the carrot should go into the intro section for dma-fence, this
> section here is very much just the "don't do this" part. The previous
> patches have an attempt at encoding this a bit, maybe see whether there's
> a place for your reply (or parts of it) to fit?

Sounds good to me.

> > Acked-by: Daniel Stone <daniels@collabora.com>
> >
> > > What I'm not sure about is whether the text should be more explicit in
> > > flat out mandating the amdkfd eviction fences for long running compute
> > > workloads or workloads where userspace fencing is allowed.
> >
> > ... or whether we just say that you can never use dma-fence in
> > conjunction with userptr.
>
> Uh userptr is an entirely different thing. That one is ok. It's userspace
> fences or gpu futexes or future fences or whatever we want to call them.
> Or is there some other confusion here?

I mean generating a dma_fence from a batch which will try to page in
userptr. Given that userptr could be backed by absolutely anything at
all, it doesn't seem smart to allow fences to rely on a pointer to an
mmap'ed NFS file. So it seems like batches should be mutually
exclusive between arbitrary SVM userptr and generating a dma-fence?

Speaking of entirely different things ... the virtio-gpu bit really
doesn't belong in this patch.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-09 12:11       ` Daniel Stone
@ 2020-07-09 12:31         ` Daniel Vetter
  2020-07-09 14:28           ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-09 12:31 UTC (permalink / raw)
  To: Daniel Stone
  Cc: DRI Development, Christian König, linux-rdma,
	Intel Graphics Development, amd-gfx mailing list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Thu, Jul 9, 2020 at 2:11 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> On Thu, 9 Jul 2020 at 09:05, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Thu, Jul 09, 2020 at 08:36:43AM +0100, Daniel Stone wrote:
> > > On Tue, 7 Jul 2020 at 21:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > Comes up every few years, gets somewhat tedious to discuss, let's
> > > > write this down once and for all.
> > >
> > > Thanks for writing this up! I wonder if any of the notes from my reply
> > > to the previous-version thread would be helpful to more explicitly
> > > encode the carrot of dma-fence's positive guarantees, rather than just
> > > the stick of 'don't do this'. ;) Either way, this is:
> >
> > I think the carrot should go into the intro section for dma-fence, this
> > section here is very much just the "don't do this" part. The previous
> > patches have an attempt at encoding this a bit, maybe see whether there's
> > a place for your reply (or parts of it) to fit?
>
> Sounds good to me.
>
> > > Acked-by: Daniel Stone <daniels@collabora.com>
> > >
> > > > What I'm not sure about is whether the text should be more explicit in
> > > > flat out mandating the amdkfd eviction fences for long running compute
> > > > workloads or workloads where userspace fencing is allowed.
> > >
> > > ... or whether we just say that you can never use dma-fence in
> > > conjunction with userptr.
> >
> > > Uh userptr is an entirely different thing. That one is ok. It's userspace
> > > fences or gpu futexes or future fences or whatever we want to call them.
> > > Or is there some other confusion here?
>
> I mean generating a dma_fence from a batch which will try to page in
> userptr. Given that userptr could be backed by absolutely anything at
> all, it doesn't seem smart to allow fences to rely on a pointer to an
> mmap'ed NFS file. So it seems like batches should be mutually
> exclusive between arbitrary SVM userptr and generating a dma-fence?

Locking is Tricky (tm) but essentially what at least amdgpu does is
pull in the backing storage before we publish any dma-fence. And then
some serious locking magic to make sure that doesn't race with a core
mm invalidation event. So for your case here the cs ioctl just blocks
until the nfs pages are pulled in.

Once we've committed for the dma-fence it's only the other way round,
i.e. core mm will stall on the dma-fence if it wants to throw out
these pages again. More or less at least. That way we never have a
dma-fence depending upon any core mm operations. The only pain here is
that this severely limits what you can do in the critical path towards
signalling a dma-fence, because the tldr is "no interacting with core
mm at all allowed".

> Speaking of entirely different things ... the virtio-gpu bit really
> doesn't belong in this patch.

Oops, dunno where I lost that as a separate patch. Will split out again :-(
-Daniel

>
> Cheers,
> Daniel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
  2020-07-09  7:36   ` [Intel-gfx] " Daniel Stone
  2020-07-09 11:53   ` Christian König
@ 2020-07-09 12:33   ` Daniel Vetter
  2020-07-10 12:30     ` Maarten Lankhorst
                       ` (2 more replies)
  2 siblings, 3 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-09 12:33 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	Christian König, Daniel Stone, Jesse Natalie,
	Steve Pronovost, Jason Ekstrand, Felix Kuehling, Mika Kuoppala,
	Thomas Hellstrom, linux-media, linaro-mm-sig, amd-gfx,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

Comes up every few years, gets somewhat tedious to discuss, let's
write this down once and for all.

What I'm not sure about is whether the text should be more explicit in
flat out mandating the amdkfd eviction fences for long running compute
workloads or workloads where userspace fencing is allowed.

v2: Now with dot graph!

v3: Typo (Dave Airlie)

Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Stone <daniels@collabora.com>
Cc: Jesse Natalie <jenatali@microsoft.com>
Cc: Steve Pronovost <spronovo@microsoft.com>
Cc: Jason Ekstrand <jason@jlekstrand.net>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index f8f6decde359..100bfd227265 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
 .. kernel-doc:: include/linux/sync_file.h
    :internal:
 
+Indefinite DMA Fences
+~~~~~~~~~~~~~~~~~~~~~
+
+At various times &dma_fence with an indefinite time until dma_fence_wait()
+finishes have been proposed. Examples include:
+
+* Future fences, used in HWC1 to signal when a buffer isn't used by the display
+  any longer, and created with the screen update that makes the buffer visible.
+  The time this fence completes is entirely under userspace's control.
+
+* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
+  been set. Used to asynchronously delay command submission.
+
+* Userspace fences or gpu futexes, fine-grained locking within a command buffer
+  that userspace uses for synchronization across engines or with the CPU, which
+  are then imported as a DMA fence for integration into existing winsys
+  protocols.
+
+* Long-running compute command buffers, while still using traditional end of
+  batch DMA fences for memory management instead of context preemption DMA
+  fences which get reattached when the compute job is rescheduled.
+
+Common to all these schemes is that userspace controls the dependencies of these
+fences and controls when they fire. Mixing indefinite fences with normal
+in-kernel DMA fences does not work, even when a fallback timeout is included to
+protect against malicious userspace:
+
+* Only the kernel knows about all DMA fence dependencies, userspace is not aware
+  of dependencies injected due to memory management or scheduler decisions.
+
+* Only userspace knows about all dependencies in indefinite fences and when
+  exactly they will complete, the kernel has no visibility.
+
+Furthermore the kernel has to be able to hold up userspace command submission
+for memory management needs, which means we must support indefinite fences being
+dependent upon DMA fences. If the kernel also supports indefinite fences in the
+kernel like a DMA fence, as any of the above proposals would, there is the
+potential for deadlocks.
+
+.. kernel-render:: DOT
+   :alt: Indefinite Fencing Dependency Cycle
+   :caption: Indefinite Fencing Dependency Cycle
+
+   digraph "Fencing Cycle" {
+      node [shape=box bgcolor=grey style=filled]
+      kernel [label="Kernel DMA Fences"]
+      userspace [label="userspace controlled fences"]
+      kernel -> userspace [label="memory management"]
+      userspace -> kernel [label="Future fence, fence proxy, ..."]
+
+      { rank=same; kernel userspace }
+   }
+
+This means that the kernel might accidentally create deadlocks
+through memory management dependencies which userspace is unaware of, which
+randomly hangs workloads until the timeout kicks in. These are workloads which,
+from userspace's perspective, do not contain a deadlock. In such a mixed fencing
+architecture there is no single entity with knowledge of all dependencies.
+Therefore preventing such deadlocks from within the kernel is not possible.
+
+The only solution to avoid dependency loops is by not allowing indefinite
+fences in the kernel. This means:
+
+* No future fences, proxy fences or userspace fences imported as DMA fences,
+  with or without a timeout.
+
+* No DMA fences that signal end of batchbuffer for command submission where
+  userspace is allowed to use userspace fencing or long running compute
+  workloads. This also means no implicit fencing for shared buffers in these
+  cases.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [Intel-gfx] [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea
  2020-07-09 12:31         ` Daniel Vetter
@ 2020-07-09 14:28           ` Christian König
  0 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2020-07-09 14:28 UTC (permalink / raw)
  To: Daniel Vetter, Daniel Stone
  Cc: Felix Kuehling, linux-rdma, Intel Graphics Development,
	amd-gfx mailing list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	DRI Development, Jesse Natalie, Daniel Vetter, Thomas Hellstrom,
	Mika Kuoppala, Christian König,
	open list:DMA BUFFER SHARING FRAMEWORK

Am 09.07.20 um 14:31 schrieb Daniel Vetter:
> On Thu, Jul 9, 2020 at 2:11 PM Daniel Stone <daniel@fooishbar.org> wrote:
>> On Thu, 9 Jul 2020 at 09:05, Daniel Vetter <daniel@ffwll.ch> wrote:
>>> On Thu, Jul 09, 2020 at 08:36:43AM +0100, Daniel Stone wrote:
>>>> On Tue, 7 Jul 2020 at 21:13, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>> write this down once and for all.
>>>> Thanks for writing this up! I wonder if any of the notes from my reply
>>>> to the previous-version thread would be helpful to more explicitly
>>>> encode the carrot of dma-fence's positive guarantees, rather than just
>>>> the stick of 'don't do this'. ;) Either way, this is:
>>> I think the carrot should go into the intro section for dma-fence, this
>>> section here is very much just the "don't do this" part. The previous
>>> patches have an attempt at encoding this a bit, maybe see whether there's
>>> a place for your reply (or parts of it) to fit?
>> Sounds good to me.
>>
>>>> Acked-by: Daniel Stone <daniels@collabora.com>
>>>>
>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>> workloads or workloads where userspace fencing is allowed.
>>>> ... or whether we just say that you can never use dma-fence in
>>>> conjunction with userptr.
>>> Uh userptr is an entirely different thing. That one is ok. It's userspace
>>> fences or gpu futexes or future fences or whatever we want to call them.
>>> Or is there some other confusion here?
>> I mean generating a dma_fence from a batch which will try to page in
>> userptr. Given that userptr could be backed by absolutely anything at
>> all, it doesn't seem smart to allow fences to rely on a pointer to an
>> mmap'ed NFS file. So it seems like batches should be mutually
>> exclusive between arbitrary SVM userptr and generating a dma-fence?
> Locking is Tricky (tm) but essentially what at least amdgpu does is
> pull in the backing storage before we publish any dma-fence. And then
> some serious locking magic to make sure that doesn't race with a core
> mm invalidation event. So for your case here the cs ioctl just blocks
> until the nfs pages are pulled in.

Yeah, we had some iterations until all was settled.

Basic idea is the following:
1. Have a sequence counter increased whenever a change to the page 
tables happens.
2. During CS grab the current value of this counter.
3. Get all the pages you need in an array.
4. Prepare CS, grab the low level lock the MM notifier waits for and 
double check the counter.
5. If the counter is still the same all is well and the DMA-fence is pushed
to the hardware.
6. If the counter has changed repeat.

Can result in a nice live lock when you constantly page things in/out, 
but that is expected behavior.
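
The six steps above can be sketched in user space roughly as follows. This
is only an illustration under stated assumptions, not the amdgpu code: every
name here (fake_vm, fake_submit, fake_collect_pages, demo_racing_submit) is
hypothetical, a pthread mutex stands in for the low level lock the MM
notifier waits for, and "pushing the fence to the hardware" is reduced to
recording the sequence number the submission was validated against.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for per-VM state; not a real kernel structure. */
struct fake_vm {
	pthread_mutex_t notifier_lock;  /* lock the MM notifier also takes */
	unsigned long pt_seq;           /* step 1: bumped on page table changes */
	unsigned long published_seq;    /* seq the last "fence" was validated against */
	int pending_invalidations;      /* demo only: number of races to inject */
};

/* Invalidation side: what an MM notifier callback would do in this model. */
static void fake_invalidate(struct fake_vm *vm)
{
	pthread_mutex_lock(&vm->notifier_lock);
	vm->pt_seq++;
	pthread_mutex_unlock(&vm->notifier_lock);
}

static unsigned long fake_read_seq(struct fake_vm *vm)
{
	unsigned long seq;

	pthread_mutex_lock(&vm->notifier_lock);
	seq = vm->pt_seq;
	pthread_mutex_unlock(&vm->notifier_lock);
	return seq;
}

/*
 * Step 3: gather the pages. Real code may sleep here (page faults, NFS
 * reads), so the lock is not held. The demo injects an invalidation to
 * simulate losing the race.
 */
static void fake_collect_pages(struct fake_vm *vm)
{
	if (vm->pending_invalidations > 0) {
		vm->pending_invalidations--;
		fake_invalidate(vm);
	}
}

/* Returns how many attempts were needed before the submission went through. */
static int fake_submit(struct fake_vm *vm)
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = fake_read_seq(vm);   /* step 2 */

		attempts++;
		fake_collect_pages(vm);                  /* step 3 */

		pthread_mutex_lock(&vm->notifier_lock);  /* step 4 */
		if (seq == vm->pt_seq) {
			/* step 5: nothing changed, publish under the lock */
			vm->published_seq = seq;
			pthread_mutex_unlock(&vm->notifier_lock);
			return attempts;
		}
		pthread_mutex_unlock(&vm->notifier_lock);
		/* step 6: counter moved, start over */
	}
}

/* Runs one submission that loses the race 'races' times before succeeding. */
int demo_racing_submit(int races)
{
	struct fake_vm vm = { .pt_seq = 0, .pending_invalidations = races };
	int attempts;

	pthread_mutex_init(&vm.notifier_lock, NULL);
	attempts = fake_submit(&vm);
	/* The fence was only published against an up-to-date counter. */
	assert(vm.published_seq == vm.pt_seq);
	pthread_mutex_destroy(&vm.notifier_lock);
	return attempts;
}
```

In this model the counter is only compared while holding the lock, so an
invalidation cannot slip in between the check and the publish. The retry loop
has no upper bound, which is the livelock mentioned above when invalidations
keep arriving.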

Christian.

>
> Once we've committed for the dma-fence it's only the other way round,
> i.e. core mm will stall on the dma-fence if it wants to throw out
> these pages again. More or less at least. That way we never have a
> dma-fence depending upon any core mm operations. The only pain here is
> that this severely limits what you can do in the critical path towards
> signalling a dma-fence, because the tldr is "no interacting with core
> mm at all allowed".
>
>> Speaking of entirely different things ... the virtio-gpu bit really
>> doesn't belong in this patch.
> Oops, dunno where I lost that as a separate patch. Will split out again :-(
> -Daniel
>
>> Cheers,
>> Daniel
>
>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
@ 2020-07-10 12:30     ` Maarten Lankhorst
  2020-07-14 17:46     ` Jason Ekstrand
  2020-07-20 11:15     ` [Linaro-mm-sig] " Thomas Hellström (Intel)
  2 siblings, 0 replies; 83+ messages in thread
From: Maarten Lankhorst @ 2020-07-10 12:30 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: Intel Graphics Development, linux-rdma, Christian König,
	Daniel Stone, Jesse Natalie, Steve Pronovost, Jason Ekstrand,
	Felix Kuehling, Mika Kuoppala, Thomas Hellstrom, linux-media,
	linaro-mm-sig, amd-gfx, Chris Wilson, Daniel Vetter

Op 09-07-2020 om 14:33 schreef Daniel Vetter:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.
>
> v2: Now with dot graph!
>
> v3: Typo (Dave Airlie)

For the first 5 patches, and patches 16 and 17:

Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

> Acked-by: Christian König <christian.koenig@amd.com>
> Acked-by: Daniel Stone <daniels@collabora.com>
> Cc: Jesse Natalie <jenatali@microsoft.com>
> Cc: Steve Pronovost <spronovo@microsoft.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index f8f6decde359..100bfd227265 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
>  
> +Indefinite DMA Fences
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +At various times &dma_fence with an indefinite time until dma_fence_wait()
> +finishes have been proposed. Examples include:
> +
> +* Future fences, used in HWC1 to signal when a buffer isn't used by the display
> +  any longer, and created with the screen update that makes the buffer visible.
> +  The time this fence completes is entirely under userspace's control.
> +
> +* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
> +  been set. Used to asynchronously delay command submission.
> +
> +* Userspace fences or gpu futexes, fine-grained locking within a command buffer
> +  that userspace uses for synchronization across engines or with the CPU, which
> +  are then imported as a DMA fence for integration into existing winsys
> +  protocols.
> +
> +* Long-running compute command buffers, while still using traditional end of
> +  batch DMA fences for memory management instead of context preemption DMA
> +  fences which get reattached when the compute job is rescheduled.
> +
> +Common to all these schemes is that userspace controls the dependencies of these
> +fences and controls when they fire. Mixing indefinite fences with normal
> +in-kernel DMA fences does not work, even when a fallback timeout is included to
> +protect against malicious userspace:
> +
> +* Only the kernel knows about all DMA fence dependencies, userspace is not aware
> +  of dependencies injected due to memory management or scheduler decisions.
> +
> +* Only userspace knows about all dependencies in indefinite fences and when
> +  exactly they will complete, the kernel has no visibility.
> +
> +Furthermore the kernel has to be able to hold up userspace command submission
> +for memory management needs, which means we must support indefinite fences being
> +dependent upon DMA fences. If the kernel also supports indefinite fences in the
> +kernel like a DMA fence, as any of the above proposals would, there is the
> +potential for deadlocks.
> +
> +.. kernel-render:: DOT
> +   :alt: Indefinite Fencing Dependency Cycle
> +   :caption: Indefinite Fencing Dependency Cycle
> +
> +   digraph "Fencing Cycle" {
> +      node [shape=box bgcolor=grey style=filled]
> +      kernel [label="Kernel DMA Fences"]
> +      userspace [label="userspace controlled fences"]
> +      kernel -> userspace [label="memory management"]
> +      userspace -> kernel [label="Future fence, fence proxy, ..."]
> +
> +      { rank=same; kernel userspace }
> +   }
> +
> +This means that the kernel might accidentally create deadlocks
> +through memory management dependencies which userspace is unaware of, which
> +randomly hangs workloads until the timeout kicks in. These are workloads which,
> +from userspace's perspective, do not contain a deadlock. In such a mixed fencing
> +architecture there is no single entity with knowledge of all dependencies.
> +Therefore preventing such deadlocks from within the kernel is not possible.
> +
> +The only solution to avoid dependency loops is by not allowing indefinite
> +fences in the kernel. This means:
> +
> +* No future fences, proxy fences or userspace fences imported as DMA fences,
> +  with or without a timeout.
> +
> +* No DMA fences that signal end of batchbuffer for command submission where
> +  userspace is allowed to use userspace fencing or long running compute
> +  workloads. This also means no implicit fencing for shared buffers in these
> +  cases.



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-09  8:09   ` Daniel Vetter
@ 2020-07-10 12:43     ` Jason Gunthorpe
  2020-07-10 12:48       ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Jason Gunthorpe @ 2020-07-10 12:43 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, Intel Graphics Development, linux-rdma,
	Daniel Vetter, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter

On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
> Hi Jason,
> 
> Below the paragraph I've added after our discussions around dma-fences
> outside of drivers/gpu. Good enough for an ack on this, or want something
> changed?
> 
> Thanks, Daniel
> 
> > + * Note that only GPU drivers have a reasonable excuse for both requiring
> > + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> > + * track asynchronous compute work using &dma_fence. No driver outside of
> > + * drivers/gpu should ever call dma_fence_wait() in such contexts.

I was hoping we'd get to 'no driver outside GPU should even use
dma_fence()'

Is that not reasonable?

With your annotations, once anything uses dma_fence it has to assume
the worst cases, right?

Jason

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 12:43     ` Jason Gunthorpe
@ 2020-07-10 12:48       ` Christian König
  2020-07-10 12:54         ` Jason Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-10 12:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Daniel Vetter
  Cc: DRI Development, Intel Graphics Development, linux-rdma,
	Daniel Vetter, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Daniel Vetter

Am 10.07.20 um 14:43 schrieb Jason Gunthorpe:
> On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
>> Hi Jason,
>>
>> Below the paragraph I've added after our discussions around dma-fences
>> outside of drivers/gpu. Good enough for an ack on this, or want something
>> changed?
>>
>> Thanks, Daniel
>>
>>> + * Note that only GPU drivers have a reasonable excuse for both requiring
>>> + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
>>> + * track asynchronous compute work using &dma_fence. No driver outside of
>>> + * drivers/gpu should ever call dma_fence_wait() in such contexts.
> I was hoping we'd get to 'no driver outside GPU should even use
> dma_fence()'

My last status was that V4L could come to use dma_fences as well.

I'm not 100% sure, but wouldn't MMU notifier + dma_fence be a valid use 
case for things like custom FPGA interfaces as well?

> Is that not reasonable?
>
> With your annotations, once anything uses dma_fence it has to assume
> the worst cases, right?

Well a defensive approach is usually the best idea, yes.

Christian.

>
> Jason


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 12:48       ` Christian König
@ 2020-07-10 12:54         ` Jason Gunthorpe
  2020-07-10 13:01           ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Jason Gunthorpe @ 2020-07-10 12:54 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Daniel Vetter, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Fri, Jul 10, 2020 at 02:48:16PM +0200, Christian König wrote:
> Am 10.07.20 um 14:43 schrieb Jason Gunthorpe:
> > On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
> > > Hi Jason,
> > > 
> > > Below the paragraph I've added after our discussions around dma-fences
> > > outside of drivers/gpu. Good enough for an ack on this, or want something
> > > changed?
> > > 
> > > Thanks, Daniel
> > > 
> > > > + * Note that only GPU drivers have a reasonable excuse for both requiring
> > > > + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> > > > + * track asynchronous compute work using &dma_fence. No driver outside of
> > > > + * drivers/gpu should ever call dma_fence_wait() in such contexts.
> > I was hoping we'd get to 'no driver outside GPU should even use
> > dma_fence()'
> 
> My last status was that V4L could come use dma_fences as well.

I'm sure lots of places *could* use it, but I think I understood that
it is a bad idea unless you have to fit into the DRM uAPI?

You are better to do something contained in the single driver where
locking can be analyzed.

> I'm not 100% sure, but wouldn't MMU notifier + dma_fence be a valid use case
> for things like custom FPGA interfaces as well?

I don't think we should expand the list of drivers that use this
technique. 

Drivers that can't suspend should pin memory, not use blocked
notifiers to create pinned memory.

Jason

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 12:54         ` Jason Gunthorpe
@ 2020-07-10 13:01           ` Christian König
  2020-07-10 13:48             ` Jason Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-10 13:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Daniel Vetter, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Daniel Vetter

Am 10.07.20 um 14:54 schrieb Jason Gunthorpe:
> On Fri, Jul 10, 2020 at 02:48:16PM +0200, Christian König wrote:
>> Am 10.07.20 um 14:43 schrieb Jason Gunthorpe:
>>> On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
>>>> Hi Jason,
>>>>
>>>> Below the paragraph I've added after our discussions around dma-fences
>>>> outside of drivers/gpu. Good enough for an ack on this, or want something
>>>> changed?
>>>>
>>>> Thanks, Daniel
>>>>
>>>>> + * Note that only GPU drivers have a reasonable excuse for both requiring
>>>>> + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
>>>>> + * track asynchronous compute work using &dma_fence. No driver outside of
>>>>> + * drivers/gpu should ever call dma_fence_wait() in such contexts.
>>> I was hoping we'd get to 'no driver outside GPU should even use
>>> dma_fence()'
>> My last status was that V4L could come use dma_fences as well.
> I'm sure lots of places *could* use it, but I think I understood that
> it is a bad idea unless you have to fit into the DRM uAPI?

It would be a bit questionable if you use the container objects we came 
up with in the DRM subsystem outside of it.

But using the dma_fence itself makes sense for everything which could do 
async DMA in general.

> You are better to do something contained in the single driver where
> locking can be analyzed.
>
>> I'm not 100% sure, but wouldn't MMU notifier + dma_fence be a valid use case
>> for things like custom FPGA interfaces as well?
> I don't think we should expand the list of drivers that use this
> technique.
> Drivers that can't suspend should pin memory, not use blocked
> notifiers to create pinned memory.

Agreed totally, it's a complete pain to maintain even for the GPU drivers.

Unfortunately that doesn't stop users from requesting it. So I'm
pretty sure we are going to see more of this.

Christian.

>
> Jason


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 13:01           ` Christian König
@ 2020-07-10 13:48             ` Jason Gunthorpe
  2020-07-10 14:02               ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Jason Gunthorpe @ 2020-07-10 13:48 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Daniel Vetter, Felix Kuehling, kernel test robot,
	Thomas Hellström, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Fri, Jul 10, 2020 at 03:01:10PM +0200, Christian König wrote:
> Am 10.07.20 um 14:54 schrieb Jason Gunthorpe:
> > On Fri, Jul 10, 2020 at 02:48:16PM +0200, Christian König wrote:
> > > Am 10.07.20 um 14:43 schrieb Jason Gunthorpe:
> > > > On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
> > > > > Hi Jason,
> > > > > 
> > > > > Below the paragraph I've added after our discussions around dma-fences
> > > > > outside of drivers/gpu. Good enough for an ack on this, or want something
> > > > > changed?
> > > > > 
> > > > > Thanks, Daniel
> > > > > 
> > > > > > + * Note that only GPU drivers have a reasonable excuse for both requiring
> > > > > > + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> > > > > > + * track asynchronous compute work using &dma_fence. No driver outside of
> > > > > > + * drivers/gpu should ever call dma_fence_wait() in such contexts.
> > > > I was hoping we'd get to 'no driver outside GPU should even use
> > > > dma_fence()'
> > > My last status was that V4L could come use dma_fences as well.
> > I'm sure lots of places *could* use it, but I think I understood that
> > it is a bad idea unless you have to fit into the DRM uAPI?
> 
> It would be a bit questionable if you use the container objects we came up
> with in the DRM subsystem outside of it.
> 
> But using the dma_fence itself makes sense for everything which could do
> async DMA in general.

dma_fence only possibly makes some sense if you intend to expose the
completion outside a single driver. 

The preferred kernel design pattern for this is to connect things with
a function callback.

So the actual use case of dma_fence is quite narrow and tightly linked
to DRM.

I don't think we should spread this beyond DRM, I can't see a reason.
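
A minimal sketch of the callback pattern mentioned above, assuming a
single driver owns both the DMA producer and the consumer. All names
here (dma_job, dma_done_fn, consumer_on_done) are hypothetical and not
an existing kernel API, and this is plain userspace C, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion callback type: invoked once when DMA finishes. */
typedef void (*dma_done_fn)(void *ctx, int status);

/* Hypothetical per-transfer state, private to one driver. */
struct dma_job {
	dma_done_fn done;	/* consumer's completion callback, may be NULL */
	void *ctx;		/* opaque consumer context handed back to done() */
	bool completed;		/* guards against completing twice */
};

static void dma_job_init(struct dma_job *job, dma_done_fn done, void *ctx)
{
	assert(job != NULL);
	job->done = done;
	job->ctx = ctx;
	job->completed = false;
}

/*
 * Called by the producer side, e.g. from its completion handler.
 * Returns true if this call completed the job, false if it already was.
 */
static bool dma_job_complete(struct dma_job *job, int status)
{
	if (job->completed)
		return false;
	job->completed = true;
	if (job->done)
		job->done(job->ctx, status);
	return true;
}

/* Hypothetical consumer that just records what it was told. */
struct consumer {
	int calls;
	int last_status;
};

static void consumer_on_done(void *ctx, int status)
{
	struct consumer *c = ctx;

	c->calls++;
	c->last_status = status;
}
```

The point of the sketch is that the dependency is a direct edge between
two known pieces of code, so the locking around done() can be reviewed
inside one driver. Nothing is published for unrelated code to wait on.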

> > You are better to do something contained in the single driver where
> > locking can be analyzed.
> > 
> > > I'm not 100% sure, but wouldn't MMU notifier + dma_fence be a valid use case
> > > for things like custom FPGA interfaces as well?
> > I don't think we should expand the list of drivers that use this
> > technique.
> > Drivers that can't suspend should pin memory, not use blocked
> > notifiers to create pinned memory.
> 
> Agreed totally, it's a complete pain to maintain even for the GPU drivers.
> 
> Unfortunately that doesn't change users from requesting it. So I'm pretty
> sure we are going to see more of this.

Kernel maintainers need to say no.

The proper way to do DMA on no-faulting hardware is page pinning.

Trying to improve performance of limited HW by using sketchy
techniques at the cost of general system stability should be a NAK.

Jason

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 13:48             ` Jason Gunthorpe
@ 2020-07-10 14:02               ` Daniel Vetter
  2020-07-10 14:23                 ` Jason Gunthorpe
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-10 14:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, DRI Development,
	Intel Graphics Development, linux-rdma, Felix Kuehling,
	kernel test robot, Thomas Hellström, Mika Kuoppala,
	open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Fri, Jul 10, 2020 at 3:48 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Fri, Jul 10, 2020 at 03:01:10PM +0200, Christian König wrote:
> > Am 10.07.20 um 14:54 schrieb Jason Gunthorpe:
> > > On Fri, Jul 10, 2020 at 02:48:16PM +0200, Christian König wrote:
> > > > Am 10.07.20 um 14:43 schrieb Jason Gunthorpe:
> > > > > On Thu, Jul 09, 2020 at 10:09:11AM +0200, Daniel Vetter wrote:
> > > > > > Hi Jason,
> > > > > >
> > > > > > Below the paragraph I've added after our discussions around dma-fences
> > > > > > outside of drivers/gpu. Good enough for an ack on this, or want something
> > > > > > changed?
> > > > > >
> > > > > > Thanks, Daniel
> > > > > >
> > > > > > > + * Note that only GPU drivers have a reasonable excuse for both requiring
> > > > > > > + * &mmu_interval_notifier and &shrinker callbacks at the same time as having to
> > > > > > > + * track asynchronous compute work using &dma_fence. No driver outside of
> > > > > > > + * drivers/gpu should ever call dma_fence_wait() in such contexts.
> > > > > I was hoping we'd get to 'no driver outside GPU should even use
> > > > > dma_fence()'
> > > > My last status was that V4L could come use dma_fences as well.
> > > I'm sure lots of places *could* use it, but I think I understood that
> > > it is a bad idea unless you have to fit into the DRM uAPI?
> >
> > It would be a bit questionable if you use the container objects we came up
> > with in the DRM subsystem outside of it.
> >
> > But using the dma_fence itself makes sense for everything which could do
> > async DMA in general.
>
> dma_fence only possibly makes some sense if you intend to expose the
> completion outside a single driver.
>
> The preferred kernel design pattern for this is to connect things with
> a function callback.
>
> So the actual use case of dma_fence is quite narrow and tightly linked
> to DRM.
>
> I don't think we should spread this beyond DRM, I can't see a reason.

Yeah v4l has a legit reason to use dma_fence, android wants that
there. There's even been patches proposed years ago, but never landed
because android is using some vendor hack horror show for camera
drivers right now.

But there is an effort going on to fix that (under the libcamera
heading), and I expect that once we have that, it'll want dma_fence
support. So outright excluding everyone from dma_fence is a bit too
much. They definitely shouldn't be used though for entirely
independent stuff.

> > > You are better to do something contained in the single driver where
> > > locking can be analyzed.
> > >
> > > > I'm not 100% sure, but wouldn't MMU notifier + dma_fence be a valid use case
> > > > for things like custom FPGA interfaces as well?
> > > I don't think we should expand the list of drivers that use this
> > > technique.
> > > Drivers that can't suspend should pin memory, not use blocked
> > > notifiers to create pinned memory.
> >
> > Agreed totally, it's a complete pain to maintain even for the GPU drivers.
> >
> > Unfortunately that doesn't change users from requesting it. So I'm pretty
> > sure we are going to see more of this.
>
> Kernel maintainers need to say no.
>
> The proper way to do DMA on no-faulting hardware is page pinning.
>
> Trying to improve performance of limited HW by using sketchy
> techniques at the cost of general system stability should be a NAK.

Well that's pretty much gpu drivers, all the horrors for a bit more speed :-)

On the text itself, should I upgrade to "must not" instead of "should
not"? Or more needed?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 14:02               ` Daniel Vetter
@ 2020-07-10 14:23                 ` Jason Gunthorpe
  2020-07-10 20:02                   ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Jason Gunthorpe @ 2020-07-10 14:23 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, DRI Development,
	Intel Graphics Development, linux-rdma, Felix Kuehling,
	kernel test robot, Thomas Hellström, Mika Kuoppala,
	open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Fri, Jul 10, 2020 at 04:02:35PM +0200, Daniel Vetter wrote:

> > dma_fence only possibly makes some sense if you intend to expose the
> > completion outside a single driver.
> >
> > The preferred kernel design pattern for this is to connect things with
> > a function callback.
> >
> > So the actual use case of dma_fence is quite narrow and tightly linked
> > to DRM.
> >
> > I don't think we should spread this beyond DRM, I can't see a reason.
> 
> Yeah v4l has a legit reason to use dma_fence, android wants that
> there. 

'legit' in the sense the v4l is supposed to trigger stuff in DRM when
V4L DMA completes? I would still see that as part of DRM

Or is it building a parallel DRM like DMA completion graph?

> > Trying to improve performance of limited HW by using sketchy
> > techniques at the cost of general system stability should be a NAK.
>
> Well that's pretty much gpu drivers, all the horrors for a bit more speed :-)
> 
> On the text itself, should I upgrade to "must not" instead of "should
> not"? Or more needed?

Fundamentally having some unknowable graph of dependencies where parts
of the graph can be placed in critical regions like notifiers is a
complete maintenance nightmare.

I think building systems like this should be discouraged :\

Jason

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 02/25] dma-fence: prime lockdep annotations
  2020-07-10 14:23                 ` Jason Gunthorpe
@ 2020-07-10 20:02                   ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-10 20:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian König, DRI Development,
	Intel Graphics Development, linux-rdma, Felix Kuehling,
	kernel test robot, Thomas Hellström, Mika Kuoppala,
	open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Fri, Jul 10, 2020 at 4:23 PM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Fri, Jul 10, 2020 at 04:02:35PM +0200, Daniel Vetter wrote:
>
> > > dma_fence only possibly makes some sense if you intend to expose the
> > > completion outside a single driver.
> > >
> > > The preferred kernel design pattern for this is to connect things with
> > > a function callback.
> > >
> > > So the actual use case of dma_fence is quite narrow and tightly linked
> > > to DRM.
> > >
> > > I don't think we should spread this beyond DRM, I can't see a reason.
> >
> > Yeah v4l has a legit reason to use dma_fence, android wants that
> > there.
>
> 'legit' in the sense the v4l is supposed to trigger stuff in DRM when
> V4L DMA completes? I would still see that as part of DRM

Yes, and also the other way around. But thus far it didn't land.
-Daniel

> Or is it building a parallel DRM like DMA completion graph?
>
> > > Trying to improve performance of limited HW by using sketchy
> > > techniques at the cost of general system stability should be a NAK.
> >
> > Well that's pretty much gpu drivers, all the horrors for a bit more speed :-)
> >
> > On the text itself, should I upgrade to "must not" instead of "should
> > not"? Or more needed?
>
> Fundamentally having some unknowable graph of dependencies where parts
> of the graph can be placed in critical regions like notifiers is a
> complete maintenance nightmare.
>
> I think building systems like this should be discouraged :\

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/25] drm/vkms: Annotate vblank timer
  2020-07-07 20:12 ` [PATCH 04/25] drm/vkms: Annotate vblank timer Daniel Vetter
@ 2020-07-12 22:27   ` Rodrigo Siqueira
  2020-07-14  9:57     ` Melissa Wen
  0 siblings, 1 reply; 83+ messages in thread
From: Rodrigo Siqueira @ 2020-07-12 22:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, Intel Graphics Development, linux-rdma,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter,
	Haneen Mohammed, Daniel Vetter, Melissa Wen, Trevor Woerner

[-- Attachment #1: Type: text/plain, Size: 2486 bytes --]

Hi,

Everything looks fine to me, I just noticed that the amdgpu patches did
not apply smoothly, however it was trivial to fix the issues.

Reviewed-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>

Melissa,
Since you are using vkms regularly, could you test this patch and review
it? Remember to add your Tested-by when you finish.

Thanks

On 07/07, Daniel Vetter wrote:
> This is needed to signal the fences from page flips, annotate it
> accordingly. We need to annotate entire timer callback since if we get
> stuck anywhere in there, then the timer stops, and hence fences stop.
> Just annotating the top part that does the vblank handling isn't
> enough.
> 
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> ---
>  drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
> index ac85e17428f8..a53a40848a72 100644
> --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> @@ -1,5 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0+
>  
> +#include <linux/dma-fence.h>
> +
>  #include <drm/drm_atomic.h>
>  #include <drm/drm_atomic_helper.h>
>  #include <drm/drm_probe_helper.h>
> @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
>  	struct drm_crtc *crtc = &output->crtc;
>  	struct vkms_crtc_state *state;
>  	u64 ret_overrun;
> -	bool ret;
> +	bool ret, fence_cookie;
> +
> +	fence_cookie = dma_fence_begin_signalling();
>  
>  	ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
>  					  output->period_ns);
> @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
>  			DRM_DEBUG_DRIVER("Composer worker already queued\n");
>  	}
>  
> +	dma_fence_end_signalling(fence_cookie);
> +
>  	return HRTIMER_RESTART;
>  }
>  
> -- 
> 2.27.0
> 
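
For readers following the cookie handling in the hunk above, here is a
toy userspace model of the nesting rule from patch 01's commit message
("we explicitly keep track of the nesting"). It assumes the cookie
simply records whether the begin call was nested inside an already open
section. The model_* names are hypothetical, and the real helpers
(which use lockdep and also check in_atomic()) are not reproduced here:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy stand-in for "this context already holds the signalling
 * annotation". The kernel tracks this per task through lockdep; a
 * single flag is enough for a single-threaded illustration.
 */
static bool model_section_held;

/* Returns the cookie: true if we were already inside a section. */
static bool model_begin_signalling(void)
{
	if (model_section_held)
		return true;	/* nested: the outer section owns the annotation */
	model_section_held = true;
	return false;
}

/* Only the outermost end (cookie == false) closes the section. */
static void model_end_signalling(bool cookie)
{
	if (cookie)
		return;		/* nested call: nothing to release */
	assert(model_section_held);
	model_section_held = false;
}

/* Helper for inspection in tests. */
static bool model_in_section(void)
{
	return model_section_held;
}
```

Under this model a helper can annotate itself unconditionally: when it
is called from an already annotated path, such as the timer callback in
this patch, its begin/end pair is a no-op and the outer section stays
open until the outermost end call.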

-- 
Rodrigo Siqueira
https://siqueira.tech

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 14:57   ` Christian König
  2020-07-08 15:12     ` Daniel Vetter
@ 2020-07-13 16:26     ` Daniel Vetter
  2020-07-13 16:39       ` Christian König
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-13 16:26 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Felix Kuehling, Thomas Hellström,
	Maarten Lankhorst, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Daniel Vetter

Hi Christian,

On Wed, Jul 08, 2020 at 04:57:21PM +0200, Christian König wrote:
> Could we merge this controlled by a separate config option?
> 
> This way we could have the checks upstream without having to fix all the
> stuff before we do this?

Discussions died out a bit, do you consider this a blocker for the first
two patches, or good for an ack on these?

Like I said I don't plan to merge patches where I know it causes a lockdep
splat with a driver still. At least for now.

Thanks, Daniel

> 
> Thanks,
> Christian.
> 
> Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> > 
> > - We use a read-lock for the execution/worker/completion side, so that
> >    this explicit annotation can be more liberally sprinkled around.
> >    With read locks lockdep isn't going to complain if the read-side
> >    isn't nested the same way under all circumstances, so ABBA deadlocks
> >    are ok. Which they are, since this is an annotation only.
> > 
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >    read lock mode lockdep does not catch read side hazards. And we
> >    _very_ much want read side hazards to be caught. For full details of
> >    this limitation see
> > 
> >    commit e91498589746065e3ae95d9a00b068e525eec34f
> >    Author: Peter Zijlstra <peterz@infradead.org>
> >    Date:   Wed Aug 23 13:13:11 2017 +0200
> > 
> >        locking/lockdep/selftests: Add mixed read-write ABBA tests
> > 
> > - To allow nesting of the read-side explicit annotations we explicitly
> >    keep track of the nesting. lock_is_held() allows us to do that.
> > 
> > - The wait-side annotation is a write lock, and entirely done within
> >    dma_fence_wait() for everyone by default.
> > 
> > - To be able to freely annotate helper functions I want to make it ok
> >    to call dma_fence_begin/end_signalling from soft/hardirq context.
> >    First attempt was using the hardirq locking context for the write
> >    side in lockdep, but this forces all normal spinlocks nested within
> >    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> > 
> >    The approach now is to simply check in_atomic(), and for these cases
> >    entirely rely on the might_sleep() check in dma_fence_wait(). That
> >    will catch any wrong nesting against spinlocks from soft/hardirq
> >    contexts.
> > 
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> > 
> > The main class of deadlocks this is supposed to catch are:
> > 
> > Thread A:
> > 
> > 	mutex_lock(A);
> > 	mutex_unlock(A);
> > 
> > 	dma_fence_signal();
> > 
> > Thread B:
> > 
> > 	mutex_lock(A);
> > 	dma_fence_wait();
> > 	mutex_unlock(A);
> > 
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> > 
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that if any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindlessly to avoid false
> > positives.
> > 
> > Originally I hoped that the cross-release lockdep extensions would
> > alleviate the need for explicit annotations:
> > 
> > https://lwn.net/Articles/709849/
> > 
> > But there's a few reasons why that's not an option:
> > 
> > - It's not happening in upstream, since it got reverted due to too
> >    many false positives:
> > 
> > 	commit e966eaeeb623f09975ef362c2866fae6f86844f9
> > 	Author: Ingo Molnar <mingo@kernel.org>
> > 	Date:   Tue Dec 12 12:31:16 2017 +0100
> > 
> > 	    locking/lockdep: Remove the cross-release locking checks
> > 
> > 	    This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
> > 	    while it found a number of old bugs initially, was also causing too many
> > 	    false positives that caused people to disable lockdep - which is arguably
> > 	    a worse overall outcome.
> > 
> > - cross-release uses the complete() call to annotate the end of
> >    critical sections, for dma_fence that would be dma_fence_signal().
> >    But we do not want all dma_fence_signal() calls to be treated as
> >    critical, since many are opportunistic cleanup of gpu requests. If
> >    these get stuck there's still the main completion interrupt and
> >    workers who can unblock everyone. Automatically annotating all
> >    dma_fence_signal() calls would hence cause false positives.
> > 
> > - cross-release had some educated guesses for when a critical section
> >    starts, like fresh syscall or fresh work callback. This would again
> >    cause false positives without explicit annotations, since for
> >    dma_fence the critical sections only starts when we publish a fence.
> > 
> > - Furthermore there can be cases where a thread never does a
> >    dma_fence_signal, but is still critical for reaching completion of
> >    fences. One example would be a scheduler kthread which picks up jobs
> >    and pushes them into hardware, where the interrupt handler or
> >    another completion thread calls dma_fence_signal(). But if the
> >    scheduler thread hangs, then all the fences hang, hence we need to
> >    manually annotate it. cross-release aimed to solve this by chaining
> >    cross-release dependencies, but the dependency from scheduler thread
> >    to the completion interrupt handler goes through hw where
> >    cross-release code can't observe it.
> > 
> > In short, without manual annotations and careful review of the start
> > and end of critical sections, cross-release dependency tracking doesn't
> > work. We need explicit annotations.
> > 
> > v2: handle soft/hardirq ctx better against write side and don't forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> > 
> > v3: Kerneldoc.
> > 
> > v4: Some spelling fixes from Mika
> > 
> > v5: Amend commit message to explain in detail why cross-release isn't
> > the solution.
> > 
> > v6: Pull out misplaced .rst hunk.
> > 
> > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
> > Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >   Documentation/driver-api/dma-buf.rst |   6 +
> >   drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
> >   include/linux/dma-fence.h            |  12 ++
> >   3 files changed, 179 insertions(+)
> > 
> > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > index 7fb7b661febd..05d856131140 100644
> > --- a/Documentation/driver-api/dma-buf.rst
> > +++ b/Documentation/driver-api/dma-buf.rst
> > @@ -133,6 +133,12 @@ DMA Fences
> >   .. kernel-doc:: drivers/dma-buf/dma-fence.c
> >      :doc: DMA fences overview
> > +DMA Fence Signalling Annotations
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > +   :doc: fence signalling annotation
> > +
> >   DMA Fences Functions Reference
> >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 656e9ac2d028..0005bc002529 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
> >   }
> >   EXPORT_SYMBOL(dma_fence_context_alloc);
> > +/**
> > + * DOC: fence signalling annotation
> > + *
> > + * Proving correctness of all the kernel code around &dma_fence through code
> > + * review and testing is tricky for a few reasons:
> > + *
> > + * * It is a cross-driver contract, and therefore all drivers must follow the
> > + *   same rules for lock nesting order, calling contexts for various functions
> > + *   and anything else significant for in-kernel interfaces. But it is also
> > + *   impossible to test all drivers in a single machine, hence brute-force N vs.
> > + *   N testing of all combinations is impossible. Even just limiting to the
> > + *   possible combinations is infeasible.
> > + *
> > + * * There is an enormous amount of driver code involved. For render drivers
> > + *   there's the tail of command submission, after fences are published,
> > + *   scheduler code, interrupt and workers to process job completion,
> > + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
> > + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
> > + *   and &shrinker. For modesetting drivers there's the commit tail functions
> > + *   between when fences for an atomic modeset are published, and when the
> > + *   corresponding vblank completes, including any interrupt processing and
> > + *   related workers. Auditing all that code, across all drivers, is not
> > + *   feasible.
> > + *
> > + * * Due to how many other subsystems are involved and the locking hierarchies
> > + *   this pulls in there is extremely thin wiggle-room for driver-specific
> > + *   differences. &dma_fence interacts with almost all of the core memory
> > + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
> > + *   dma_resv_unlock(). On the other side it also interacts through all
> > + *   allocation sites through &mmu_notifier and &shrinker.
> > + *
> > + * Furthermore lockdep does not handle cross-release dependencies, which means
> > + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
> > + * at runtime with some quick testing. The simplest example is one thread
> > + * waiting on a &dma_fence while holding a lock::
> > + *
> > + *     lock(A);
> > + *     dma_fence_wait(B);
> > + *     unlock(A);
> > + *
> > + * while the other thread is stuck trying to acquire the same lock, which
> > + * prevents it from signalling the fence the previous thread is stuck waiting
> > + * on::
> > + *
> > + *     lock(A);
> > + *     unlock(A);
> > + *     dma_fence_signal(B);
> > + *
> > + * By manually annotating all code relevant to signalling a &dma_fence we can
> > + * teach lockdep about these dependencies, which also helps with the validation
> > + * headache since now lockdep can check all the rules for us::
> > + *
> > + *    cookie = dma_fence_begin_signalling();
> > + *    lock(A);
> > + *    unlock(A);
> > + *    dma_fence_signal(B);
> > + *    dma_fence_end_signalling(cookie);
> > + *
> > + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
> > + * annotate critical sections the following rules need to be observed:
> > + *
> > + * * All code necessary to complete a &dma_fence must be annotated, from the
> > + *   point where a fence is accessible to other threads, to the point where
> > + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
> > + *   and due to the very strict rules and many corner cases it is infeasible to
> > + *   catch these just with review or normal stress testing.
> > + *
> > + * * &struct dma_resv deserves a special note, since the readers are only
> > + *   protected by rcu. This means the signalling critical section starts as soon
> > + *   as the new fences are installed, even before dma_resv_unlock() is called.
> > + *
> > + * * The only exceptions are fast paths and opportunistic signalling code, which
> > + *   call dma_fence_signal() purely as an optimization, but are not required to
> > + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
> > + *   which calls dma_fence_signal(), while the mandatory completion path goes
> > + *   through a hardware interrupt and possible job completion worker.
> > + *
> > + * * To aid composability of code, the annotations can be freely nested, as long
> > + *   as the overall locking hierarchy is consistent. The annotations also work
> > + *   both in interrupt and process context. Due to implementation details this
> > + *   requires that callers pass an opaque cookie from
> > + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
> > + *
> > + * * Validation against the cross driver contract is implemented by priming
> > + *   lockdep with the relevant hierarchy at boot-up. This means even just
> > + *   testing with a single device is enough to validate a driver, at least as
> > + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
> > + *   concerned.
> > + */
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map	dma_fence_lockdep_map = {
> > +	.name = "dma_fence_map"
> > +};
> > +
> > +/**
> > + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
> > + *
> > + * Drivers should use this to annotate the beginning of any code section
> > + * required to eventually complete &dma_fence by calling dma_fence_signal().
> > + *
> > + * The end of each such critical section is annotated with
> > + * dma_fence_end_signalling().
> > + *
> > + * Returns:
> > + *
> > + * Opaque cookie needed by the implementation, which needs to be passed to
> > + * dma_fence_end_signalling().
> > + */
> > +bool dma_fence_begin_signalling(void)
> > +{
> > +	/* explicitly nesting ... */
> > +	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> > +		return true;
> > +
> > +	/* rely on might_sleep check for soft/hardirq locks */
> > +	if (in_atomic())
> > +		return true;
> > +
> > +	/* ... and non-recursive readlock */
> > +	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL(dma_fence_begin_signalling);
> > +
> > +/**
> > + * dma_fence_end_signalling - end a critical DMA fence signalling section
> > + *
> > + * Closes a critical section annotation opened by dma_fence_begin_signalling().
> > + */
> > +void dma_fence_end_signalling(bool cookie)
> > +{
> > +	if (cookie)
> > +		return;
> > +
> > +	lock_release(&dma_fence_lockdep_map, _RET_IP_);
> > +}
> > +EXPORT_SYMBOL(dma_fence_end_signalling);
> > +
> > +void __dma_fence_might_wait(void)
> > +{
> > +	bool tmp;
> > +
> > +	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> > +	if (tmp)
> > +		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> > +	lock_map_acquire(&dma_fence_lockdep_map);
> > +	lock_map_release(&dma_fence_lockdep_map);
> > +	if (tmp)
> > +		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> > +}
> > +#endif
> > +
> > +
> >   /**
> >    * dma_fence_signal_locked - signal completion of a fence
> >    * @fence: the fence to signal
> > @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
> >   {
> >   	unsigned long flags;
> >   	int ret;
> > +	bool tmp;
> >   	if (!fence)
> >   		return -EINVAL;
> > +	tmp = dma_fence_begin_signalling();
> > +
> >   	spin_lock_irqsave(fence->lock, flags);
> >   	ret = dma_fence_signal_locked(fence);
> >   	spin_unlock_irqrestore(fence->lock, flags);
> > +	dma_fence_end_signalling(tmp);
> > +
> >   	return ret;
> >   }
> >   EXPORT_SYMBOL(dma_fence_signal);
> > @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
> >   	might_sleep();
> > +	__dma_fence_might_wait();
> > +
> >   	trace_dma_fence_wait_start(fence);
> >   	if (fence->ops->wait)
> >   		ret = fence->ops->wait(fence, intr, timeout);
> > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> > index 3347c54f3a87..3f288f7db2ef 100644
> > --- a/include/linux/dma-fence.h
> > +++ b/include/linux/dma-fence.h
> > @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
> >   	} while (1);
> >   }
> > +#ifdef CONFIG_LOCKDEP
> > +bool dma_fence_begin_signalling(void);
> > +void dma_fence_end_signalling(bool cookie);
> > +#else
> > +static inline bool dma_fence_begin_signalling(void)
> > +{
> > +	return true;
> > +}
> > +static inline void dma_fence_end_signalling(bool cookie) {}
> > +static inline void __dma_fence_might_wait(void) {}
> > +#endif
> > +
> >   int dma_fence_signal(struct dma_fence *fence);
> >   int dma_fence_signal_locked(struct dma_fence *fence);
> >   signed long dma_fence_default_wait(struct dma_fence *fence,
> 
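
As a rough illustration of the lock(A)/dma_fence_wait(B) example in the
kerneldoc quoted above, here is a toy userspace model of why the annotation
matters. It is only a sketch under stated assumptions: model_record(),
model_unannotated() and model_annotated() are hypothetical names, not kernel
APIs, and the check only spots a direct A->B versus B->A inversion. Real
lockdep walks a full dependency graph and distinguishes read from write
acquisitions, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two lock classes: the driver lock A and the dma_fence annotation map. */
enum { LOCK_A, FENCE_MAP, NCLASS };

/* dep[x][y] is true once y has been acquired while x was held. */
static bool dep[NCLASS][NCLASS];

static void model_reset(void)
{
	memset(dep, 0, sizeof(dep));
}

/*
 * Record the ordering "held, then next". Returns true if the opposite
 * ordering was recorded earlier, which is the point where this toy
 * model would report a splat.
 */
static bool model_record(int held, int next)
{
	bool inversion = dep[next][held];

	dep[held][next] = true;
	return inversion;
}

/*
 * Waiter: lock(A); dma_fence_wait(B); unlock(A);   gives A -> FENCE_MAP
 * Signaller without annotation: lock(A) is taken with nothing held, so
 * no second ordering is ever recorded and nothing can be reported.
 */
bool model_unannotated(void)
{
	model_reset();
	return model_record(LOCK_A, FENCE_MAP);
}

/*
 * Same waiter, but the signaller now enters the annotated section
 * before lock(A), which records FENCE_MAP -> A and closes the cycle.
 */
bool model_annotated(void)
{
	bool splat;

	model_reset();
	splat = model_record(LOCK_A, FENCE_MAP);
	splat |= model_record(FENCE_MAP, LOCK_A);
	return splat;
}
```

Calling model_unannotated() reports nothing, while model_annotated() reports
the inversion, and neither thread has to actually get stuck for that to
happen.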

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-13 16:26     ` Daniel Vetter
@ 2020-07-13 16:39       ` Christian König
  2020-07-13 20:31         ` Dave Airlie
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-13 16:39 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Felix Kuehling, Thomas Hellström,
	Maarten Lankhorst, Mika Kuoppala, linux-media, linaro-mm-sig,
	amd-gfx, Chris Wilson, Daniel Vetter

Am 13.07.20 um 18:26 schrieb Daniel Vetter:
> Hi Christian,
>
> On Wed, Jul 08, 2020 at 04:57:21PM +0200, Christian König wrote:
>> Could we merge this controlled by a separate config option?
>>
>> This way we could have the checks upstream without having to fix all the
>> stuff before we do this?
> Discussions died out a bit, do you consider this a blocker for the first
> two patches, or good for an ack on these?

Yes, I think the first two can be merged without causing any pain. Feel
free to add my Acked-by to them.

And the third one can go in immediately as well.

Thanks,
Christian.

>
> Like I said I don't plan to merge patches where I know it causes a lockdep
> splat with a driver still. At least for now.
>
> Thanks, Daniel
>
>> Thanks,
>> Christian.
>>
>> Am 07.07.20 um 22:12 schrieb Daniel Vetter:
>>> Design is similar to the lockdep annotations for workers, but with
>>> some twists:
>>>
>>> - We use a read-lock for the execution/worker/completion side, so that
>>>     this explicit annotation can be more liberally sprinkled around.
>>>     With read locks lockdep isn't going to complain if the read-side
>>>     isn't nested the same way under all circumstances, so ABBA deadlocks
>>>     are ok. Which they are, since this is an annotation only.
>>>
>>> - We're using non-recursive lockdep read lock mode, since in recursive
>>>     read lock mode lockdep does not catch read side hazards. And we
>>>     _very_ much want read side hazards to be caught. For full details of
>>>     this limitation see
>>>
>>>     commit e91498589746065e3ae95d9a00b068e525eec34f
>>>     Author: Peter Zijlstra <peterz@infradead.org>
>>>     Date:   Wed Aug 23 13:13:11 2017 +0200
>>>
>>>         locking/lockdep/selftests: Add mixed read-write ABBA tests
>>>
>>> - To allow nesting of the read-side explicit annotations we explicitly
>>>     keep track of the nesting. lock_is_held() allows us to do that.
>>>
>>> - The wait-side annotation is a write lock, and entirely done within
>>>     dma_fence_wait() for everyone by default.
>>>
>>> - To be able to freely annotate helper functions I want to make it ok
>>>     to call dma_fence_begin/end_signalling from soft/hardirq context.
>>>     First attempt was using the hardirq locking context for the write
>>>     side in lockdep, but this forces all normal spinlocks nested within
>>>     dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
>>>
>>>     The approach now is to simply check in_atomic(), and for these cases
>>>     entirely rely on the might_sleep() check in dma_fence_wait(). That
>>>     will catch any wrong nesting against spinlocks from soft/hardirq
>>>     contexts.
>>>
>>> The idea here is that every code path that's critical for eventually
>>> signalling a dma_fence should be annotated with
>>> dma_fence_begin/end_signalling. The annotation ideally starts right
>>> after a dma_fence is published (added to a dma_resv, exposed as a
>>> sync_file fd, attached to a drm_syncobj fd, or anything else that
>>> makes the dma_fence visible to other kernel threads), up to and
>>> including the dma_fence_wait(). Examples are irq handlers, the
>>> scheduler rt threads, the tail of execbuf (after the corresponding
>>> fences are visible), any workers that end up signalling dma_fences and
>>> really anything else. Not annotated should be code paths that only
>>> complete fences opportunistically as the gpu progresses, like e.g.
>>> shrinker/eviction code.
>>>
>>> The main class of deadlocks this is supposed to catch are:
>>>
>>> Thread A:
>>>
>>> 	mutex_lock(A);
>>> 	mutex_unlock(A);
>>>
>>> 	dma_fence_signal();
>>>
>>> Thread B:
>>>
>>> 	mutex_lock(A);
>>> 	dma_fence_wait();
>>> 	mutex_unlock(A);
>>>
>>> Thread B is blocked on A signalling the fence, but A never gets around
>>> to that because it cannot acquire the lock A.
>>>
>>> Note that dma_fence_wait() is allowed to be nested within
>>> dma_fence_begin/end_signalling sections. To allow this to happen the
>>> read lock needs to be upgraded to a write lock, which means that if any
>>> other lock is acquired between the dma_fence_begin_signalling() call and
>>> the call to dma_fence_wait(), and is still held, this will result in an
>>> immediate lockdep complaint. The only other option would be to not
>>> annotate such calls, defeating the point. Therefore, to avoid false
>>> positives, these annotations cannot be sprinkled over the code entirely
>>> mindlessly.
>>>
>>> Originally I hoped that the cross-release lockdep extensions would
>>> alleviate the need for explicit annotations:
>>>
>>> https://lwn.net/Articles/709849/
>>>
>>> But there's a few reasons why that's not an option:
>>>
>>> - It's not happening in upstream, since it got reverted due to too
>>>     many false positives:
>>>
>>> 	commit e966eaeeb623f09975ef362c2866fae6f86844f9
>>> 	Author: Ingo Molnar <mingo@kernel.org>
>>> 	Date:   Tue Dec 12 12:31:16 2017 +0100
>>>
>>> 	    locking/lockdep: Remove the cross-release locking checks
>>>
>>> 	    This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
>>> 	    while it found a number of old bugs initially, was also causing too many
>>> 	    false positives that caused people to disable lockdep - which is arguably
>>> 	    a worse overall outcome.
>>>
>>> - cross-release uses the complete() call to annotate the end of
>>>     critical sections, for dma_fence that would be dma_fence_signal().
>>>     But we do not want all dma_fence_signal() calls to be treated as
>>>     critical, since many are opportunistic cleanup of gpu requests. If
>>>     these get stuck there's still the main completion interrupt and
>>>     workers who can unblock everyone. Automatically annotating all
>>>     dma_fence_signal() calls would hence cause false positives.
>>>
>>> - cross-release had some educated guesses for when a critical section
>>>     starts, like fresh syscall or fresh work callback. This would again
>>>     cause false positives without explicit annotations, since for
>>>     dma_fence the critical section only starts when we publish a fence.
>>>
>>> - Furthermore there can be cases where a thread never does a
>>>     dma_fence_signal, but is still critical for reaching completion of
>>>     fences. One example would be a scheduler kthread which picks up jobs
>>>     and pushes them into hardware, where the interrupt handler or
>>>     another completion thread calls dma_fence_signal(). But if the
>>>     scheduler thread hangs, then all the fences hang, hence we need to
>>>     manually annotate it. cross-release aimed to solve this by chaining
>>>     cross-release dependencies, but the dependency from scheduler thread
>>>     to the completion interrupt handler goes through hw where
>>>     cross-release code can't observe it.
>>>
>>> In short, without manual annotations and careful review of the start
>>> and end of critical sections, cross-release dependency tracking doesn't
>>> work. We need explicit annotations.
>>>
>>> v2: handle soft/hardirq ctx better against write side and don't forget
>>> EXPORT_SYMBOL, drivers can't use this otherwise.
>>>
>>> v3: Kerneldoc.
>>>
>>> v4: Some spelling fixes from Mika
>>>
>>> v5: Amend commit message to explain in detail why cross-release isn't
>>> the solution.
>>>
>>> v6: Pull out misplaced .rst hunk.
>>>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
>>> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
>>> Cc: linux-media@vger.kernel.org
>>> Cc: linaro-mm-sig@lists.linaro.org
>>> Cc: linux-rdma@vger.kernel.org
>>> Cc: amd-gfx@lists.freedesktop.org
>>> Cc: intel-gfx@lists.freedesktop.org
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> ---
>>>    Documentation/driver-api/dma-buf.rst |   6 +
>>>    drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
>>>    include/linux/dma-fence.h            |  12 ++
>>>    3 files changed, 179 insertions(+)
>>>
>>> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
>>> index 7fb7b661febd..05d856131140 100644
>>> --- a/Documentation/driver-api/dma-buf.rst
>>> +++ b/Documentation/driver-api/dma-buf.rst
>>> @@ -133,6 +133,12 @@ DMA Fences
>>>    .. kernel-doc:: drivers/dma-buf/dma-fence.c
>>>       :doc: DMA fences overview
>>> +DMA Fence Signalling Annotations
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +.. kernel-doc:: drivers/dma-buf/dma-fence.c
>>> +   :doc: fence signalling annotation
>>> +
>>>    DMA Fences Functions Reference
>>>    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>>> index 656e9ac2d028..0005bc002529 100644
>>> --- a/drivers/dma-buf/dma-fence.c
>>> +++ b/drivers/dma-buf/dma-fence.c
>>> @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
>>>    }
>>>    EXPORT_SYMBOL(dma_fence_context_alloc);
>>> +/**
>>> + * DOC: fence signalling annotation
>>> + *
>>> + * Proving correctness of all the kernel code around &dma_fence through code
>>> + * review and testing is tricky for a few reasons:
>>> + *
>>> + * * It is a cross-driver contract, and therefore all drivers must follow the
>>> + *   same rules for lock nesting order, calling contexts for various functions
>>> + *   and anything else significant for in-kernel interfaces. But it is also
>>> + *   impossible to test all drivers in a single machine, hence brute-force N vs.
>>> + *   N testing of all combinations is impossible. Even just limiting to the
>>> + *   possible combinations is infeasible.
>>> + *
>>> + * * There is an enormous amount of driver code involved. For render drivers
>>> + *   there's the tail of command submission, after fences are published,
>>> + *   scheduler code, interrupt and workers to process job completion,
>>> + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
>>> + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
>>> + *   and &shrinker. For modesetting drivers there's the commit tail functions
>>> + *   between when fences for an atomic modeset are published, and when the
>>> + *   corresponding vblank completes, including any interrupt processing and
>>> + *   related workers. Auditing all that code, across all drivers, is not
>>> + *   feasible.
>>> + *
>>> + * * Due to how many other subsystems are involved and the locking hierarchies
>>> + *   this pulls in there is extremely thin wiggle-room for driver-specific
>>> + *   differences. &dma_fence interacts with almost all of the core memory
>>> + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
>>> + *   dma_resv_unlock(). On the other side it also interacts through all
>>> + *   allocation sites through &mmu_notifier and &shrinker.
>>> + *
>>> + * Furthermore lockdep does not handle cross-release dependencies, which means
>>> + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
>>> + * at runtime with some quick testing. The simplest example is one thread
>>> + * waiting on a &dma_fence while holding a lock::
>>> + *
>>> + *     lock(A);
>>> + *     dma_fence_wait(B);
>>> + *     unlock(A);
>>> + *
>>> + * while the other thread is stuck trying to acquire the same lock, which
>>> + * prevents it from signalling the fence the previous thread is stuck waiting
>>> + * on::
>>> + *
>>> + *     lock(A);
>>> + *     unlock(A);
>>> + *     dma_fence_signal(B);
>>> + *
>>> + * By manually annotating all code relevant to signalling a &dma_fence we can
>>> + * teach lockdep about these dependencies, which also helps with the validation
>>> + * headache since now lockdep can check all the rules for us::
>>> + *
>>> + *    cookie = dma_fence_begin_signalling();
>>> + *    lock(A);
>>> + *    unlock(A);
>>> + *    dma_fence_signal(B);
>>> + *    dma_fence_end_signalling(cookie);
>>> + *
>>> + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
>>> + * annotate critical sections the following rules need to be observed:
>>> + *
>>> + * * All code necessary to complete a &dma_fence must be annotated, from the
>>> + *   point where a fence is accessible to other threads, to the point where
>>> + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
>>> + *   and due to the very strict rules and many corner cases it is infeasible to
>>> + *   catch these just with review or normal stress testing.
>>> + *
>>> + * * &struct dma_resv deserves a special note, since the readers are only
>>> + *   protected by rcu. This means the signalling critical section starts as soon
>>> + *   as the new fences are installed, even before dma_resv_unlock() is called.
>>> + *
>>> + * * The only exceptions are fast paths and opportunistic signalling code, which
>>> + *   call dma_fence_signal() purely as an optimization, but are not required to
>>> + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
>>> + *   which calls dma_fence_signal(), while the mandatory completion path goes
>>> + *   through a hardware interrupt and possible job completion worker.
>>> + *
>>> + * * To aid composability of code, the annotations can be freely nested, as long
>>> + *   as the overall locking hierarchy is consistent. The annotations also work
>>> + *   both in interrupt and process context. Due to implementation details this
>>> + *   requires that callers pass an opaque cookie from
>>> + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
>>> + *
>>> + * * Validation against the cross driver contract is implemented by priming
>>> + *   lockdep with the relevant hierarchy at boot-up. This means even just
>>> + *   testing with a single device is enough to validate a driver, at least as
>>> + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
>>> + *   concerned.
>>> + */
>>> +#ifdef CONFIG_LOCKDEP
>>> +struct lockdep_map	dma_fence_lockdep_map = {
>>> +	.name = "dma_fence_map"
>>> +};
>>> +
>>> +/**
>>> + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
>>> + *
>>> + * Drivers should use this to annotate the beginning of any code section
>>> + * required to eventually complete &dma_fence by calling dma_fence_signal().
>>> + *
>>> + * The end of each such critical section is annotated with
>>> + * dma_fence_end_signalling().
>>> + *
>>> + * Returns:
>>> + *
>>> + * Opaque cookie needed by the implementation, which needs to be passed to
>>> + * dma_fence_end_signalling().
>>> + */
>>> +bool dma_fence_begin_signalling(void)
>>> +{
>>> +	/* explicitly nesting ... */
>>> +	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
>>> +		return true;
>>> +
>>> +	/* rely on might_sleep check for soft/hardirq locks */
>>> +	if (in_atomic())
>>> +		return true;
>>> +
>>> +	/* ... and non-recursive readlock */
>>> +	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
>>> +
>>> +	return false;
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_begin_signalling);
>>> +
>>> +/**
>>> + * dma_fence_end_signalling - end a critical DMA fence signalling section
>>> + *
>>> + * Closes a critical section annotation opened by dma_fence_begin_signalling().
>>> + */
>>> +void dma_fence_end_signalling(bool cookie)
>>> +{
>>> +	if (cookie)
>>> +		return;
>>> +
>>> +	lock_release(&dma_fence_lockdep_map, _RET_IP_);
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_end_signalling);
>>> +
>>> +void __dma_fence_might_wait(void)
>>> +{
>>> +	bool tmp;
>>> +
>>> +	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
>>> +	if (tmp)
>>> +		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
>>> +	lock_map_acquire(&dma_fence_lockdep_map);
>>> +	lock_map_release(&dma_fence_lockdep_map);
>>> +	if (tmp)
>>> +		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
>>> +}
>>> +#endif
>>> +
>>> +
>>>    /**
>>>     * dma_fence_signal_locked - signal completion of a fence
>>>     * @fence: the fence to signal
>>> @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
>>>    {
>>>    	unsigned long flags;
>>>    	int ret;
>>> +	bool tmp;
>>>    	if (!fence)
>>>    		return -EINVAL;
>>> +	tmp = dma_fence_begin_signalling();
>>> +
>>>    	spin_lock_irqsave(fence->lock, flags);
>>>    	ret = dma_fence_signal_locked(fence);
>>>    	spin_unlock_irqrestore(fence->lock, flags);
>>> +	dma_fence_end_signalling(tmp);
>>> +
>>>    	return ret;
>>>    }
>>>    EXPORT_SYMBOL(dma_fence_signal);
>>> @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>>>    	might_sleep();
>>> +	__dma_fence_might_wait();
>>> +
>>>    	trace_dma_fence_wait_start(fence);
>>>    	if (fence->ops->wait)
>>>    		ret = fence->ops->wait(fence, intr, timeout);
>>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>>> index 3347c54f3a87..3f288f7db2ef 100644
>>> --- a/include/linux/dma-fence.h
>>> +++ b/include/linux/dma-fence.h
>>> @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
>>>    	} while (1);
>>>    }
>>> +#ifdef CONFIG_LOCKDEP
>>> +bool dma_fence_begin_signalling(void);
>>> +void dma_fence_end_signalling(bool cookie);
>>> +#else
>>> +static inline bool dma_fence_begin_signalling(void)
>>> +{
>>> +	return true;
>>> +}
>>> +static inline void dma_fence_end_signalling(bool cookie) {}
>>> +static inline void __dma_fence_might_wait(void) {}
>>> +#endif
>>> +
>>>    int dma_fence_signal(struct dma_fence *fence);
>>>    int dma_fence_signal_locked(struct dma_fence *fence);
>>>    signed long dma_fence_default_wait(struct dma_fence *fence,
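
For anyone following the thread, the cookie handling in the quoted patch can
be modelled in plain userspace C. This is a minimal sketch, not the kernel
implementation: model_held and model_atomic are hypothetical stand-ins for
lock_is_held_type(&dma_fence_lockdep_map, 1) and in_atomic(), and the
lock_acquire()/lock_release() calls are reduced to flipping a flag. It only
shows why the cookie lets the annotations nest: only the outermost section
in process context owns the annotation, and every other begin/end pair is a
no-op.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; these are not kernel symbols. */
static bool model_held;    /* is the annotation map currently "held"? */
static bool model_atomic;  /* are we pretending to be in atomic context? */

/* Mirrors the control flow of dma_fence_begin_signalling() in the patch. */
static bool model_begin_signalling(void)
{
	if (model_held)     /* nested: the outer section owns the annotation */
		return true;
	if (model_atomic)   /* atomic: rely on the might_sleep() check instead */
		return true;
	model_held = true;  /* stands in for lock_acquire(...) */
	return false;
}

/* Mirrors dma_fence_end_signalling(): release only if we acquired. */
static void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	model_held = false; /* stands in for lock_release(...) */
}

/* Walks through a nested pair and an atomic-context pair. */
static void model_demo(void)
{
	bool outer, inner, irq;

	outer = model_begin_signalling();
	inner = model_begin_signalling();
	assert(!outer && inner);

	model_end_signalling(inner);
	assert(model_held);         /* inner end must not drop the outer section */
	model_end_signalling(outer);
	assert(!model_held);

	model_atomic = true;
	irq = model_begin_signalling();
	assert(irq && !model_held); /* nothing acquired in atomic context */
	model_end_signalling(irq);
	model_atomic = false;
}
```

Calling model_demo() from a main() runs through both cases. The same shape
is why the !CONFIG_LOCKDEP stub in the header can simply return true: a true
cookie always means "nothing to release".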


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-13 16:39       ` Christian König
@ 2020-07-13 20:31         ` Dave Airlie
  0 siblings, 0 replies; 83+ messages in thread
From: Dave Airlie @ 2020-07-13 20:31 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, linux-rdma, Daniel Vetter,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	Chris Wilson, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx mailing list, Daniel Vetter,
	Linux Media Mailing List, Felix Kuehling, Mika Kuoppala

On Tue, 14 Jul 2020 at 02:39, Christian König <christian.koenig@amd.com> wrote:
>
> Am 13.07.20 um 18:26 schrieb Daniel Vetter:
> > Hi Christian,
> >
> > On Wed, Jul 08, 2020 at 04:57:21PM +0200, Christian König wrote:
> >> Could we merge this controlled by a separate config option?
> >>
> >> This way we could have the checks upstream without having to fix all the
> >> stuff before we do this?
> > Discussions died out a bit, do you consider this a blocker for the first
> > two patches, or good for an ack on these?
>
> Yes, I think the first two can be merged without causing any pain. Feel
> free to add my Acked-by to them.
>
> And the third one can go in immediately as well.

Acked-by: Dave Airlie <airlied@redhat.com> for the first 2 +
indefinite explains.

Dave.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/25] drm/vkms: Annotate vblank timer
  2020-07-12 22:27   ` Rodrigo Siqueira
@ 2020-07-14  9:57     ` Melissa Wen
  2020-07-14  9:59       ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Melissa Wen @ 2020-07-14  9:57 UTC (permalink / raw)
  To: Rodrigo Siqueira
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter,
	Haneen Mohammed, Daniel Vetter, Trevor Woerner

On 07/12, Rodrigo Siqueira wrote:
> Hi,
> 
> Everything looks fine to me, I just noticed that the amdgpu patches did
> not apply smoothly, however it was trivial to fix the issues.
> 
> Reviewed-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> 
> Melissa,
> Since you are using vkms regularly, could you test this patch and review
> it? Remember to add your Tested-by when you finish.
>
Hi,

I've applied the patch series, ran some tests on vkms, and found no
issues. I mean, things have remained stable.

Tested-by: Melissa Wen <melissa.srw@gmail.com>

> Thanks
> 
> On 07/07, Daniel Vetter wrote:
> > This is needed to signal the fences from page flips, so annotate it
> > accordingly. We need to annotate the entire timer callback since if we get
> > stuck anywhere in there, then the timer stops, and hence fences stop.
> > Just annotating the top part that does the vblank handling isn't
> > enough.
> > 
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > ---
> >  drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
> > index ac85e17428f8..a53a40848a72 100644
> > --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> > +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> > @@ -1,5 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0+
> >  
> > +#include <linux/dma-fence.h>
> > +
> >  #include <drm/drm_atomic.h>
> >  #include <drm/drm_atomic_helper.h>
> >  #include <drm/drm_probe_helper.h>
> > @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> >  	struct drm_crtc *crtc = &output->crtc;
> >  	struct vkms_crtc_state *state;
> >  	u64 ret_overrun;
> > -	bool ret;
> > +	bool ret, fence_cookie;
> > +
> > +	fence_cookie = dma_fence_begin_signalling();
> >  
> >  	ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
> >  					  output->period_ns);
> > @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> >  			DRM_DEBUG_DRIVER("Composer worker already queued\n");
> >  	}
> >  
> > +	dma_fence_end_signalling(fence_cookie);
> > +
> >  	return HRTIMER_RESTART;
> >  }
> >  
> > -- 
> > 2.27.0
> > 
> 
> -- 
> Rodrigo Siqueira
> https://siqueira.tech



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/25] drm/vkms: Annotate vblank timer
  2020-07-14  9:57     ` Melissa Wen
@ 2020-07-14  9:59       ` Daniel Vetter
  2020-07-14 14:55         ` Melissa Wen
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14  9:59 UTC (permalink / raw)
  To: Melissa Wen
  Cc: Rodrigo Siqueira, DRI Development, Intel Graphics Development,
	linux-rdma, open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter, Haneen Mohammed, Trevor Woerner

On Tue, Jul 14, 2020 at 11:57 AM Melissa Wen <melissa.srw@gmail.com> wrote:
>
> On 07/12, Rodrigo Siqueira wrote:
> > Hi,
> >
> > Everything looks fine to me, I just noticed that the amdgpu patches did
> > not apply smoothly, however it was trivial to fix the issues.
> >
> > Reviewed-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> >
> > Melissa,
> > Since you are using vkms regularly, could you test this patch and review
> > it? Remember to add your Tested-by when you finish.
> >
> Hi,
>
> I've applied the patch series, ran some tests on vkms, and found no
> issues. I mean, things have remained stable.
>
> Tested-by: Melissa Wen <melissa.srw@gmail.com>

Did you test with CONFIG_PROVE_LOCKING enabled in the kernel .config?
Without that enabled, there's not really any change here, but with
that enabled there might be some lockdep splats in dmesg indicating a
problem.

Thanks, Daniel
>
> > Thanks
> >
> > On 07/07, Daniel Vetter wrote:
> > > This is needed to signal the fences from page flips, annotate it
> > > accordingly. We need to annotate entire timer callback since if we get
> > > stuck anywhere in there, then the timer stops, and hence fences stop.
> > > Just annotating the top part that does the vblank handling isn't
> > > enough.
> > >
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > > Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
> > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > ---
> > >  drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > index ac85e17428f8..a53a40848a72 100644
> > > --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> > > +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > @@ -1,5 +1,7 @@
> > >  // SPDX-License-Identifier: GPL-2.0+
> > >
> > > +#include <linux/dma-fence.h>
> > > +
> > >  #include <drm/drm_atomic.h>
> > >  #include <drm/drm_atomic_helper.h>
> > >  #include <drm/drm_probe_helper.h>
> > > @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > >     struct drm_crtc *crtc = &output->crtc;
> > >     struct vkms_crtc_state *state;
> > >     u64 ret_overrun;
> > > -   bool ret;
> > > +   bool ret, fence_cookie;
> > > +
> > > +   fence_cookie = dma_fence_begin_signalling();
> > >
> > >     ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
> > >                                       output->period_ns);
> > > @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > >                     DRM_DEBUG_DRIVER("Composer worker already queued\n");
> > >     }
> > >
> > > +   dma_fence_end_signalling(fence_cookie);
> > > +
> > >     return HRTIMER_RESTART;
> > >  }
> > >
> > > --
> > > 2.27.0
> > >
> >
> > --
> > Rodrigo Siqueira
> > https://siqueira.tech
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-07-07 20:12 ` [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code Daniel Vetter
@ 2020-07-14 10:49   ` Daniel Vetter
  2020-07-14 11:40     ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14 10:49 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion because we
> allocate memory where we really shouldn't:
> 
> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> 	kthread+0x12e/0x150
> 
> Trouble right now is that lockdep only validates against GFP_FS, which
> would be good enough for shrinkers. But for mmu_notifiers we actually
> need !GFP_ATOMIC, since they can be called from any page laundering,
> even if GFP_NOFS or GFP_NOIO are set.
> 
> I guess we should improve the lockdep annotations for
> fs_reclaim_acquire/release.
> 
> Ofc real fix is to properly preallocate this fence and stuff it into
> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> the way.
> 
> v2: Two more allocations in scheduler paths.
> 
> First one:
> 
> 	__kmalloc+0x58/0x720
> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> 
> Second one:
> 
> 	kmem_cache_alloc+0x2b/0x6d0
> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> 	drm_sched_main+0xf9/0x490 [gpu_sched]
> 
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from amd side started looking into how to fix this properly?
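
To make "properly" concrete: the commit message suggests preallocating the
fence and stuffing it into the job structure. A minimal userspace sketch
of that shape follows. The struct layout and all sketch_* names are
hypothetical and are not the amdgpu data structures; the only point is
that every allocation happens at submission time, where failing or
sleeping is fine, and the run path only consumes what is already there.

```c
#include <stdlib.h>

struct sketch_fence {
	unsigned int seqno;
};

struct sketch_job {
	struct sketch_fence *hw_fence;	/* preallocated at submission time */
};

static unsigned int sketch_alloc_calls;	/* counts allocator invocations */

/* Submission path: may allocate, and may fail cleanly with NULL. */
static struct sketch_job *sketch_job_create(void)
{
	struct sketch_job *job = malloc(sizeof(*job));

	sketch_alloc_calls++;
	if (!job)
		return NULL;

	job->hw_fence = malloc(sizeof(*job->hw_fence));
	sketch_alloc_calls++;
	if (!job->hw_fence) {
		free(job);
		return NULL;
	}
	return job;
}

/* Run path (fence signalling critical section): no allocation at all. */
static struct sketch_fence *sketch_job_run(struct sketch_job *job,
					   unsigned int seqno)
{
	struct sketch_fence *fence = job->hw_fence;

	job->hw_fence = NULL;	/* ownership moves to the caller */
	fence->seqno = seqno;
	return fence;
}
```

This only works if the number of objects needed per job is known up
front, which is exactly what the vmid and sync paths make hard.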

I looked a bit into fixing this with mempool, and the big guarantees we
need are:
- there's a hard upper limit on how many allocations we minimally need to
  guarantee forward progress. And the entire vmid allocation and
  amdgpu_sync_fence stuff kinda makes me question whether that's a valid
  assumption.

- mempool_free must be called without any locks in the way which are held
  while we call mempool_alloc. Otherwise we again have a nice deadlock
  with no forward progress. I tried auditing that, but got lost in amdgpu
  and scheduler code. Some lockdep annotations for mempool.c might help,
  but they're not going to catch everything. Plus it would be again manual
  annotations because this is yet another cross-release issue. So not sure
  that helps at all.
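
The first requirement above can be illustrated with a toy reserve pool in
userspace C. This is only a model of the idea (try the normal allocator,
fall back to a fixed reserve), not the kernel's mempool implementation;
SKETCH_POOL_MIN and all sketch_* names are made up for illustration, and
the allocator_ok flag merely simulates the regular allocator failing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_POOL_MIN 4	/* hypothetical bound on in-flight objects */

struct sketch_pool {
	void *slot[SKETCH_POOL_MIN];	/* reserved elements */
	int nr_free;			/* reserved elements still available */
	size_t size;			/* size of one element */
};

/* Fill the reserve up front, where failing is still harmless. */
static int sketch_pool_init(struct sketch_pool *p, size_t size)
{
	p->size = size;
	p->nr_free = 0;
	while (p->nr_free < SKETCH_POOL_MIN) {
		void *e = malloc(size);

		if (!e)
			return -1;	/* sketch: no cleanup of earlier slots */
		p->slot[p->nr_free++] = e;
	}
	return 0;
}

/* Try the normal allocator first, then dip into the reserve. */
static void *sketch_pool_alloc(struct sketch_pool *p, bool allocator_ok)
{
	if (allocator_ok) {
		void *e = malloc(p->size);

		if (e)
			return e;
	}
	if (p->nr_free > 0)
		return p->slot[--p->nr_free];
	return NULL;	/* a sleeping pool would wait for a free here */
}

/* Refill the reserve first; only surplus goes back to the allocator. */
static void sketch_pool_free(struct sketch_pool *p, void *e)
{
	if (p->nr_free < SKETCH_POOL_MIN)
		p->slot[p->nr_free++] = e;
	else
		free(e);
}
```

With the allocator failing, the pool hands out at most SKETCH_POOL_MIN
objects and then depends on a free. So forward progress holds only if the
code never needs more than that many at once, and only if the free can
run without a lock that the stuck allocating thread is holding.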

iow, not sure what to do here. Ideas?

Cheers, Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8d84975885cd..a089a827fdfe 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>  	uint32_t seq;
>  	int r;
>  
> -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>  	if (fence == NULL)
>  		return -ENOMEM;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> index 267fa45ddb66..a333ca2d4ddd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>  	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>  		return amdgpu_sync_fence(sync, ring->vmid_wait);
>  
> -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>  	if (!fences)
>  		return -ENOMEM;
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> index 8ea6c49529e7..af22b526cec9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
>  	if (amdgpu_sync_add_later(sync, f))
>  		return 0;
>  
> -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>  	if (!e)
>  		return -ENOMEM;
>  
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 01/25] dma-fence: basic lockdep annotations
  2020-07-08 15:37         ` Daniel Vetter
@ 2020-07-14 11:09           ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14 11:09 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK,
	Thomas Hellström, amd-gfx list, Daniel Vetter,
	open list:DMA BUFFER SHARING FRAMEWORK, Felix Kuehling,
	Mika Kuoppala

On Wed, Jul 08, 2020 at 05:37:19PM +0200, Daniel Vetter wrote:
> On Wed, Jul 8, 2020 at 5:19 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Jul 8, 2020 at 11:13 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Wed, Jul 8, 2020 at 4:57 PM Christian König <christian.koenig@amd.com> wrote:
> > > >
> > > > Could we merge this controlled by a separate config option?
> > > >
> > > > This way we could have the checks upstream without having to fix all the
> > > > stuff before we do this?
> > >
> > > Since it's fully opt-in annotations nothing blows up if we don't merge
> > > any annotations. So we could start merging the first 3 patches. After
> > > that the fun starts ...
> > >
> > > My rough idea was that first I'd try to tackle display, thus far
> > > there's 2 actual issues in drivers:
> > > - amdgpu has some dma_resv_lock in commit_tail, plus a kmalloc. I
> > > think those should be fairly easy to fix (I'd try a stab at them even)
> > > - vmwgfx has a full on locking inversion with dma_resv_lock in
> > > commit_tail, and that one is functional. Not just reading something
> > > which we can safely assume to be invariant anyway (like the tmz flag
> > > for amdgpu, or whatever it was).
> > >
> > > I've done a pile more annotations patches for other atomic drivers
> > > now, so hopefully that flushes out any remaining offenders here. Since
> > > some of the annotations are in helper code worst case we might need a
> > > dev->mode_config.broken_atomic_commit flag to disable them. At least
> > > for now I have 0 plans to merge any of these while there's known
> > > unsolved issues. Maybe if some drivers take forever to get fixed we
> > > can then apply some duct-tape for the atomic helper annotation patch.
> > > Instead of a flag we can also copypasta the atomic_commit_tail hook,
> > > leaving the annotations out and adding a huge warning about that.
> > >
> > > Next big chunk is the drm/scheduler annotations:
> > > - amdgpu needs a full rework of display reset (but apparently in the works)
> >
> > I think the display deadlock issues should be fixed in:
> > https://cgit.freedesktop.org/drm/drm/commit/?id=cdaae8371aa9d4ea1648a299b1a75946b9556944

Oh btw you have some more memory allocations in that commit, so you just
traded one deadlock for another one :-)
-Daniel

> 
> That's the reset/tdr inversion, there's two more:
> - kmalloc, see https://cgit.freedesktop.org/~danvet/drm/commit/?id=d9353cc3bf6111430a24188b92412dc49e7ead79
> - ttm_bo_reserve in the wrong place
> https://cgit.freedesktop.org/~danvet/drm/commit/?id=a6c03176152625a2f9cf1e499aceb8b2217dc2a2
> - console_lock in the wrong spot
> https://cgit.freedesktop.org/~danvet/drm/commit/?id=a6c03176152625a2f9cf1e499aceb8b2217dc2a2
> 
> Especially the last one I have no idea how to address really.
> -Daniel
> 
> 
> >
> > Alex
> >
> > > - I read all the drivers, they all have the fairly cosmetic issue of
> > > doing small allocations in their callbacks.
> > >
> > > I might end up typing the mempool we need for the latter issue, but
> > > first still hoping for some actual test feedback from other drivers
> > > using drm/scheduler. Again no intentions of merging these annotations
> > > > without the drivers being fixed first, or at least some duct-tape
> > > applied.
> > >
> > > Another option I've been thinking about, if there's cases where fixing
> > > things properly is a lot of effort: We could do annotations for broken
> > > sections (just the broken part, so we still catch bugs everywhere
> > > else). They'd simply drop&reacquire the lock. We could then e.g. use
> > > that in the amdgpu display reset code, and so still make sure that
> > > everything else in reset doesn't get worse. But I think adding that
> > > shouldn't be our first option.
> > >
> > > I'm not personally a big fan of the Kconfig or runtime option, only
> > > upsets people since it breaks lockdep for them. Or they ignore it, and
> > > we don't catch bugs, making it fairly pointless to merge.
> > >
> > > Cheers, Daniel
> > >
> > >
> > > >
> > > > Thanks,
> > > > Christian.
> > > >
> > > > Am 07.07.20 um 22:12 schrieb Daniel Vetter:
> > > > > Design is similar to the lockdep annotations for workers, but with
> > > > > some twists:
> > > > >
> > > > > - We use a read-lock for the execution/worker/completion side, so that
> > > > >    this explicit annotation can be more liberally sprinkled around.
> > > > >    With read locks lockdep isn't going to complain if the read-side
> > > > >    isn't nested the same way under all circumstances, so ABBA deadlocks
> > > > >    are ok. Which they are, since this is an annotation only.
> > > > >
> > > > > - We're using non-recursive lockdep read lock mode, since in recursive
> > > > >    read lock mode lockdep does not catch read side hazards. And we
> > > > >    _very_ much want read side hazards to be caught. For full details of
> > > > >    this limitation see
> > > > >
> > > > >    commit e91498589746065e3ae95d9a00b068e525eec34f
> > > > >    Author: Peter Zijlstra <peterz@infradead.org>
> > > > >    Date:   Wed Aug 23 13:13:11 2017 +0200
> > > > >
> > > > >        locking/lockdep/selftests: Add mixed read-write ABBA tests
> > > > >
> > > > > - To allow nesting of the read-side explicit annotations we explicitly
> > > > >    keep track of the nesting. lock_is_held() allows us to do that.
> > > > >
> > > > > - The wait-side annotation is a write lock, and entirely done within
> > > > >    dma_fence_wait() for everyone by default.
> > > > >
> > > > > - To be able to freely annotate helper functions I want to make it ok
> > > > >    to call dma_fence_begin/end_signalling from soft/hardirq context.
> > > > >    First attempt was using the hardirq locking context for the write
> > > > >    side in lockdep, but this forces all normal spinlocks nested within
> > > > > >    dma_fence_begin/end_signalling to be spinlocks. That's bollocks.
> > > > > >
> > > > > >    The approach now is to simply check in_atomic(), and for these cases
> > > > >    entirely rely on the might_sleep() check in dma_fence_wait(). That
> > > > >    will catch any wrong nesting against spinlocks from soft/hardirq
> > > > >    contexts.
> > > > >
> > > > > The idea here is that every code path that's critical for eventually
> > > > > signalling a dma_fence should be annotated with
> > > > > dma_fence_begin/end_signalling. The annotation ideally starts right
> > > > > after a dma_fence is published (added to a dma_resv, exposed as a
> > > > > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > > > > makes the dma_fence visible to other kernel threads), up to and
> > > > > including the dma_fence_wait(). Examples are irq handlers, the
> > > > > scheduler rt threads, the tail of execbuf (after the corresponding
> > > > > fences are visible), any workers that end up signalling dma_fences and
> > > > > really anything else. Not annotated should be code paths that only
> > > > > complete fences opportunistically as the gpu progresses, like e.g.
> > > > > shrinker/eviction code.
> > > > >
> > > > > The main class of deadlocks this is supposed to catch are:
> > > > >
> > > > > Thread A:
> > > > >
> > > > >       mutex_lock(A);
> > > > >       mutex_unlock(A);
> > > > >
> > > > >       dma_fence_signal();
> > > > >
> > > > > Thread B:
> > > > >
> > > > >       mutex_lock(A);
> > > > >       dma_fence_wait();
> > > > >       mutex_unlock(A);
> > > > >
> > > > > Thread B is blocked on A signalling the fence, but A never gets around
> > > > > to that because it cannot acquire the lock A.
> > > > >
> > > > > Note that dma_fence_wait() is allowed to be nested within
> > > > > dma_fence_begin/end_signalling sections. To allow this to happen the
> > > > > > read lock needs to be upgraded to a write lock, which means that if any
> > > > > > other lock is acquired between the dma_fence_begin_signalling() call and
> > > > > > the call to dma_fence_wait(), and is still held, this will result in an
> > > > > > immediate lockdep complaint. The only other option would be to not
> > > > > > annotate such calls, defeating the point. Therefore, to avoid false
> > > > > > positives, these annotations cannot be sprinkled over the code entirely
> > > > > > mindlessly.
> > > > >
> > > > > > Originally I hoped that the cross-release lockdep extensions would
> > > > > alleviate the need for explicit annotations:
> > > > >
> > > > > > https://lwn.net/Articles/709849/
> > > > >
> > > > > But there's a few reasons why that's not an option:
> > > > >
> > > > > - It's not happening in upstream, since it got reverted due to too
> > > > >    many false positives:
> > > > >
> > > > >       commit e966eaeeb623f09975ef362c2866fae6f86844f9
> > > > >       Author: Ingo Molnar <mingo@kernel.org>
> > > > >       Date:   Tue Dec 12 12:31:16 2017 +0100
> > > > >
> > > > >           locking/lockdep: Remove the cross-release locking checks
> > > > >
> > > > >           This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
> > > > >           while it found a number of old bugs initially, was also causing too many
> > > > >           false positives that caused people to disable lockdep - which is arguably
> > > > >           a worse overall outcome.
> > > > >
> > > > > - cross-release uses the complete() call to annotate the end of
> > > > >    critical sections, for dma_fence that would be dma_fence_signal().
> > > > >    But we do not want all dma_fence_signal() calls to be treated as
> > > > >    critical, since many are opportunistic cleanup of gpu requests. If
> > > > >    these get stuck there's still the main completion interrupt and
> > > > >    workers who can unblock everyone. Automatically annotating all
> > > > >    dma_fence_signal() calls would hence cause false positives.
> > > > >
> > > > > - cross-release had some educated guesses for when a critical section
> > > > >    starts, like fresh syscall or fresh work callback. This would again
> > > > >    cause false positives without explicit annotations, since for
> > > > >    dma_fence the critical sections only starts when we publish a fence.
> > > > >
> > > > > - Furthermore there can be cases where a thread never does a
> > > > >    dma_fence_signal, but is still critical for reaching completion of
> > > > >    fences. One example would be a scheduler kthread which picks up jobs
> > > > >    and pushes them into hardware, where the interrupt handler or
> > > > >    another completion thread calls dma_fence_signal(). But if the
> > > > >    scheduler thread hangs, then all the fences hang, hence we need to
> > > > >    manually annotate it. cross-release aimed to solve this by chaining
> > > > >    cross-release dependencies, but the dependency from scheduler thread
> > > > >    to the completion interrupt handler goes through hw where
> > > > >    cross-release code can't observe it.
> > > > >
> > > > > In short, without manual annotations and careful review of the start
> > > > > > and end of critical sections, cross-release dependency tracking doesn't
> > > > > work. We need explicit annotations.
> > > > >
> > > > > > v2: handle soft/hardirq ctx better against write side and don't forget
> > > > > EXPORT_SYMBOL, drivers can't use this otherwise.
> > > > >
> > > > > v3: Kerneldoc.
> > > > >
> > > > > v4: Some spelling fixes from Mika
> > > > >
> > > > > v5: Amend commit message to explain in detail why cross-release isn't
> > > > > the solution.
> > > > >
> > > > > v6: Pull out misplaced .rst hunk.
> > > > >
> > > > > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > > > > Reviewed-by: Thomas Hellström <thomas.hellstrom@intel.com>
> > > > > Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > > > Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Cc: linux-rdma@vger.kernel.org
> > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > ---
> > > > >   Documentation/driver-api/dma-buf.rst |   6 +
> > > > >   drivers/dma-buf/dma-fence.c          | 161 +++++++++++++++++++++++++++
> > > > >   include/linux/dma-fence.h            |  12 ++
> > > > >   3 files changed, 179 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > > > > index 7fb7b661febd..05d856131140 100644
> > > > > --- a/Documentation/driver-api/dma-buf.rst
> > > > > +++ b/Documentation/driver-api/dma-buf.rst
> > > > > @@ -133,6 +133,12 @@ DMA Fences
> > > > >   .. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > >      :doc: DMA fences overview
> > > > >
> > > > > +DMA Fence Signalling Annotations
> > > > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > +
> > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > > +   :doc: fence signalling annotation
> > > > > +
> > > > >   DMA Fences Functions Reference
> > > > >   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > >
> > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > index 656e9ac2d028..0005bc002529 100644
> > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > @@ -110,6 +110,160 @@ u64 dma_fence_context_alloc(unsigned num)
> > > > >   }
> > > > >   EXPORT_SYMBOL(dma_fence_context_alloc);
> > > > >
> > > > > +/**
> > > > > + * DOC: fence signalling annotation
> > > > > + *
> > > > > + * Proving correctness of all the kernel code around &dma_fence through code
> > > > > + * review and testing is tricky for a few reasons:
> > > > > + *
> > > > > + * * It is a cross-driver contract, and therefore all drivers must follow the
> > > > > + *   same rules for lock nesting order, calling contexts for various functions
> > > > > + *   and anything else significant for in-kernel interfaces. But it is also
> > > > > + *   impossible to test all drivers in a single machine, hence brute-force N vs.
> > > > > + *   N testing of all combinations is impossible. Even just limiting to the
> > > > > + *   possible combinations is infeasible.
> > > > > + *
> > > > > + * * There is an enormous amount of driver code involved. For render drivers
> > > > > + *   there's the tail of command submission, after fences are published,
> > > > > + *   scheduler code, interrupt and workers to process job completion,
> > > > > + *   and timeout, gpu reset and gpu hang recovery code. Plus for integration
> > > > > > + *   with core mm we have &mmu_notifier, respectively &mmu_interval_notifier,
> > > > > + *   and &shrinker. For modesetting drivers there's the commit tail functions
> > > > > + *   between when fences for an atomic modeset are published, and when the
> > > > > + *   corresponding vblank completes, including any interrupt processing and
> > > > > + *   related workers. Auditing all that code, across all drivers, is not
> > > > > + *   feasible.
> > > > > + *
> > > > > + * * Due to how many other subsystems are involved and the locking hierarchies
> > > > > + *   this pulls in there is extremely thin wiggle-room for driver-specific
> > > > > + *   differences. &dma_fence interacts with almost all of the core memory
> > > > > + *   handling through page fault handlers via &dma_resv, dma_resv_lock() and
> > > > > + *   dma_resv_unlock(). On the other side it also interacts through all
> > > > > + *   allocation sites through &mmu_notifier and &shrinker.
> > > > > + *
> > > > > + * Furthermore lockdep does not handle cross-release dependencies, which means
> > > > > + * any deadlocks between dma_fence_wait() and dma_fence_signal() can't be caught
> > > > > + * at runtime with some quick testing. The simplest example is one thread
> > > > > + * waiting on a &dma_fence while holding a lock::
> > > > > + *
> > > > > + *     lock(A);
> > > > > + *     dma_fence_wait(B);
> > > > > + *     unlock(A);
> > > > > + *
> > > > > + * while the other thread is stuck trying to acquire the same lock, which
> > > > > + * prevents it from signalling the fence the previous thread is stuck waiting
> > > > > + * on::
> > > > > + *
> > > > > + *     lock(A);
> > > > > + *     unlock(A);
> > > > > + *     dma_fence_signal(B);
> > > > > + *
> > > > > + * By manually annotating all code relevant to signalling a &dma_fence we can
> > > > > + * teach lockdep about these dependencies, which also helps with the validation
> > > > > + * headache since now lockdep can check all the rules for us::
> > > > > + *
> > > > > + *    cookie = dma_fence_begin_signalling();
> > > > > + *    lock(A);
> > > > > + *    unlock(A);
> > > > > + *    dma_fence_signal(B);
> > > > > + *    dma_fence_end_signalling(cookie);
> > > > > + *
> > > > > + * For using dma_fence_begin_signalling() and dma_fence_end_signalling() to
> > > > > + * annotate critical sections the following rules need to be observed:
> > > > > + *
> > > > > + * * All code necessary to complete a &dma_fence must be annotated, from the
> > > > > + *   point where a fence is accessible to other threads, to the point where
> > > > > + *   dma_fence_signal() is called. Un-annotated code can contain deadlock issues,
> > > > > + *   and due to the very strict rules and many corner cases it is infeasible to
> > > > > + *   catch these just with review or normal stress testing.
> > > > > + *
> > > > > + * * &struct dma_resv deserves a special note, since the readers are only
> > > > > + *   protected by rcu. This means the signalling critical section starts as soon
> > > > > + *   as the new fences are installed, even before dma_resv_unlock() is called.
> > > > > + *
> > > > > + * * The only exception are fast paths and opportunistic signalling code, which
> > > > > + *   calls dma_fence_signal() purely as an optimization, but is not required to
> > > > > + *   guarantee completion of a &dma_fence. The usual example is a wait IOCTL
> > > > > + *   which calls dma_fence_signal(), while the mandatory completion path goes
> > > > > + *   through a hardware interrupt and possible job completion worker.
> > > > > + *
> > > > > + * * To aid composability of code, the annotations can be freely nested, as long
> > > > > + *   as the overall locking hierarchy is consistent. The annotations also work
> > > > > + *   both in interrupt and process context. Due to implementation details this
> > > > > + *   requires that callers pass an opaque cookie from
> > > > > + *   dma_fence_begin_signalling() to dma_fence_end_signalling().
> > > > > + *
> > > > > + * * Validation against the cross driver contract is implemented by priming
> > > > > + *   lockdep with the relevant hierarchy at boot-up. This means even just
> > > > > + *   testing with a single device is enough to validate a driver, at least as
> > > > > + *   far as deadlocks with dma_fence_wait() against dma_fence_signal() are
> > > > > + *   concerned.
> > > > > + */
> > > > > +#ifdef CONFIG_LOCKDEP
> > > > > +struct lockdep_map   dma_fence_lockdep_map = {
> > > > > +     .name = "dma_fence_map"
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * dma_fence_begin_signalling - begin a critical DMA fence signalling section
> > > > > + *
> > > > > + * Drivers should use this to annotate the beginning of any code section
> > > > > + * required to eventually complete &dma_fence by calling dma_fence_signal().
> > > > > + *
> > > > > > + * The end of these critical sections is annotated with
> > > > > + * dma_fence_end_signalling().
> > > > > + *
> > > > > + * Returns:
> > > > > + *
> > > > > + * Opaque cookie needed by the implementation, which needs to be passed to
> > > > > + * dma_fence_end_signalling().
> > > > > + */
> > > > > +bool dma_fence_begin_signalling(void)
> > > > > +{
> > > > > +     /* explicitly nesting ... */
> > > > > +     if (lock_is_held_type(&dma_fence_lockdep_map, 1))
> > > > > +             return true;
> > > > > +
> > > > > +     /* rely on might_sleep check for soft/hardirq locks */
> > > > > +     if (in_atomic())
> > > > > +             return true;
> > > > > +
> > > > > +     /* ... and non-recursive readlock */
> > > > > +     lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
> > > > > +
> > > > > +     return false;
> > > > > +}
> > > > > +EXPORT_SYMBOL(dma_fence_begin_signalling);
> > > > > +
> > > > > +/**
> > > > > + * dma_fence_end_signalling - end a critical DMA fence signalling section
> > > > > + *
> > > > > + * Closes a critical section annotation opened by dma_fence_begin_signalling().
> > > > > + */
> > > > > +void dma_fence_end_signalling(bool cookie)
> > > > > +{
> > > > > +     if (cookie)
> > > > > +             return;
> > > > > +
> > > > > +     lock_release(&dma_fence_lockdep_map, _RET_IP_);
> > > > > +}
> > > > > +EXPORT_SYMBOL(dma_fence_end_signalling);
> > > > > +
> > > > > +void __dma_fence_might_wait(void)
> > > > > +{
> > > > > +     bool tmp;
> > > > > +
> > > > > +     tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
> > > > > +     if (tmp)
> > > > > +             lock_release(&dma_fence_lockdep_map, _THIS_IP_);
> > > > > +     lock_map_acquire(&dma_fence_lockdep_map);
> > > > > +     lock_map_release(&dma_fence_lockdep_map);
> > > > > +     if (tmp)
> > > > > +             lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > > +
> > > > >   /**
> > > > >    * dma_fence_signal_locked - signal completion of a fence
> > > > >    * @fence: the fence to signal
> > > > > @@ -170,14 +324,19 @@ int dma_fence_signal(struct dma_fence *fence)
> > > > >   {
> > > > >       unsigned long flags;
> > > > >       int ret;
> > > > > +     bool tmp;
> > > > >
> > > > >       if (!fence)
> > > > >               return -EINVAL;
> > > > >
> > > > > +     tmp = dma_fence_begin_signalling();
> > > > > +
> > > > >       spin_lock_irqsave(fence->lock, flags);
> > > > >       ret = dma_fence_signal_locked(fence);
> > > > >       spin_unlock_irqrestore(fence->lock, flags);
> > > > >
> > > > > +     dma_fence_end_signalling(tmp);
> > > > > +
> > > > >       return ret;
> > > > >   }
> > > > >   EXPORT_SYMBOL(dma_fence_signal);
> > > > > @@ -210,6 +369,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
> > > > >
> > > > >       might_sleep();
> > > > >
> > > > > +     __dma_fence_might_wait();
> > > > > +
> > > > >       trace_dma_fence_wait_start(fence);
> > > > >       if (fence->ops->wait)
> > > > >               ret = fence->ops->wait(fence, intr, timeout);
> > > > > diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
> > > > > index 3347c54f3a87..3f288f7db2ef 100644
> > > > > --- a/include/linux/dma-fence.h
> > > > > +++ b/include/linux/dma-fence.h
> > > > > @@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
> > > > >       } while (1);
> > > > >   }
> > > > >
> > > > > +#ifdef CONFIG_LOCKDEP
> > > > > +bool dma_fence_begin_signalling(void);
> > > > > +void dma_fence_end_signalling(bool cookie);
> > > > > +#else
> > > > > +static inline bool dma_fence_begin_signalling(void)
> > > > > +{
> > > > > +     return true;
> > > > > +}
> > > > > +static inline void dma_fence_end_signalling(bool cookie) {}
> > > > > +static inline void __dma_fence_might_wait(void) {}
> > > > > +#endif
> > > > > +
> > > > >   int dma_fence_signal(struct dma_fence *fence);
> > > > >   int dma_fence_signal_locked(struct dma_fence *fence);
> > > > >   signed long dma_fence_default_wait(struct dma_fence *fence,
> > > >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
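
For readers following the quoted diff: the bool returned by
dma_fence_begin_signalling() is handed back to dma_fence_end_signalling()
so that annotated sections can nest. Below is only a userspace model of
that cookie idea, not the kernel code. The model_* names and the plain
bool standing in for "the lockdep map is held" are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "the annotation is held by this context". */
static bool model_signalling_held;

/* Returns true when there is nothing for the caller to release later. */
static bool model_begin_signalling(void)
{
	if (model_signalling_held)
		return true;	/* nested section: the outer caller releases */

	model_signalling_held = true;
	return false;
}

static void model_end_signalling(bool cookie)
{
	if (cookie)
		return;		/* inner section, nothing to undo */

	assert(model_signalling_held);
	model_signalling_held = false;
}

/* A helper that may run inside or outside an already annotated section. */
static int model_signal(int *fence_signalled)
{
	bool cookie = model_begin_signalling();

	*fence_signalled = 1;
	model_end_signalling(cookie);
	return 0;
}
```

In this model a helper such as model_signal() can be called from an
already annotated section without dropping the outer annotation early,
which is the property the cookie exists to provide.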

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't
  2020-07-07 20:12 ` [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't Daniel Vetter
@ 2020-07-14 11:12   ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14 11:12 UTC (permalink / raw)
  To: DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Christian König, Daniel Vetter

On Tue, Jul 07, 2020 at 10:12:24PM +0200, Daniel Vetter wrote:
> Not going to bother with a complete&pretty commit message, just
> offending backtrace:
> 
>         kvmalloc_node+0x47/0x80
>         dc_create_state+0x1f/0x60 [amdgpu]
>         dc_commit_state+0xcb/0x9b0 [amdgpu]
>         amdgpu_dm_atomic_commit_tail+0xd31/0x2010 [amdgpu]
>         commit_tail+0xa4/0x140 [drm_kms_helper]
>         drm_atomic_helper_commit+0x152/0x180 [drm_kms_helper]
>         drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
>         drm_client_modeset_commit_locked+0x55/0x190 [drm]
>         drm_client_modeset_commit+0x24/0x40 [drm]
> 
> v2: Found more in DC code, I'm just going to pile them all up.
> 
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Has anyone from the amdgpu DC team started to look into this and the
subsequent patches in DC? Note that the last one isn't needed anymore
because it's now fixed upstream with

commit cdaae8371aa9d4ea1648a299b1a75946b9556944
Author: Bhawanpreet Lakha <Bhawanpreet.Lakha@amd.com>
Date:   Mon May 11 14:21:17 2020 -0400

    drm/amd/display: Handle GPU reset for DC block

But that patch has a ton of memory allocations in the reset path now, so
you just replaced one deadlock with another one ...

Note that since amdgpu has its private atomic_commit_tail implementation
this won't hold up the generic atomic annotations, but I think it will
hold up the tdr annotations at least. Plus it would be nice to fix this
somehow.
-Daniel

> ---
>  drivers/gpu/drm/amd/amdgpu/atom.c                 | 2 +-
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
>  drivers/gpu/drm/amd/display/dc/core/dc.c          | 4 +++-
>  3 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
> index 4cfc786699c7..1b0c674fab25 100644
> --- a/drivers/gpu/drm/amd/amdgpu/atom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/atom.c
> @@ -1226,7 +1226,7 @@ static int amdgpu_atom_execute_table_locked(struct atom_context *ctx, int index,
>  	ectx.abort = false;
>  	ectx.last_jump = 0;
>  	if (ws)
> -		ectx.ws = kcalloc(4, ws, GFP_KERNEL);
> +		ectx.ws = kcalloc(4, ws, GFP_ATOMIC);
>  	else
>  		ectx.ws = NULL;
>  
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 6afcc33ff846..3d41eddc7908 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -6872,7 +6872,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>  		struct dc_stream_update stream_update;
>  	} *bundle;
>  
> -	bundle = kzalloc(sizeof(*bundle), GFP_KERNEL);
> +	bundle = kzalloc(sizeof(*bundle), GFP_ATOMIC);
>  
>  	if (!bundle) {
>  		dm_error("Failed to allocate update bundle\n");
> diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c
> index 942ceb0f6383..f9a58509efb2 100644
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -1475,8 +1475,10 @@ bool dc_post_update_surfaces_to_stream(struct dc *dc)
>  
>  struct dc_state *dc_create_state(struct dc *dc)
>  {
> +	/* No you really cant allocate random crap here this late in
> +	 * atomic_commit_tail. */
>  	struct dc_state *context = kvzalloc(sizeof(struct dc_state),
> -					    GFP_KERNEL);
> +					    GFP_ATOMIC);
>  
>  	if (!context)
>  		return NULL;
> -- 
> 2.27.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-07-14 10:49   ` Daniel Vetter
@ 2020-07-14 11:40     ` Christian König
  2020-07-14 14:31       ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-14 11:40 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: Intel Graphics Development, linux-rdma, Daniel Vetter,
	linux-media, linaro-mm-sig, amd-gfx, Chris Wilson,
	Maarten Lankhorst, Daniel Vetter

Am 14.07.20 um 12:49 schrieb Daniel Vetter:
> On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
>> My dma-fence lockdep annotations caught an inversion because we
>> allocate memory where we really shouldn't:
>>
>> 	kmem_cache_alloc+0x2b/0x6d0
>> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
>> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
>> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
>> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
>> 	kthread+0x12e/0x150
>>
>> Trouble right now is that lockdep only validates against GFP_FS, which
>> would be good enough for shrinkers. But for mmu_notifiers we actually
>> need !GFP_ATOMIC, since they can be called from any page laundering,
>> even if GFP_NOFS or GFP_NOIO are set.
>>
>> I guess we should improve the lockdep annotations for
>> fs_reclaim_acquire/release.
>>
>> Ofc real fix is to properly preallocate this fence and stuff it into
>> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
>> the way.
>>
>> v2: Two more allocations in scheduler paths.
>>
>> First one:
>>
>> 	__kmalloc+0x58/0x720
>> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
>> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
>> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
>> 	drm_sched_main+0xf9/0x490 [gpu_sched]
>>
>> Second one:
>>
>> 	kmem_cache_alloc+0x2b/0x6d0
>> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
>> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
>> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
>> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
>> 	drm_sched_main+0xf9/0x490 [gpu_sched]
>>
>> Cc: linux-media@vger.kernel.org
>> Cc: linaro-mm-sig@lists.linaro.org
>> Cc: linux-rdma@vger.kernel.org
>> Cc: amd-gfx@lists.freedesktop.org
>> Cc: intel-gfx@lists.freedesktop.org
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Has anyone from amd side started looking into how to fix this properly?

Yeah I checked both and neither are any real problem.

> I looked a bit into fixing this with mempool, and the big guarantee we
> need is that
> - there's a hard upper limit on how many allocations we minimally need to
>    guarantee forward progress. And the entire vmid allocation and
>    amdgpu_sync_fence stuff kinda makes me question that's a valid
>    assumption.

We do have hard upper limits for those.

The VMID allocation could as well just return the fence instead of 
putting it into the sync object IIRC. So that just needs some cleanup 
and can avoid the allocation entirely.

The hardware fence is limited by the number of submissions we can have 
concurrently on the ring buffers, so also not a problem at all.

Regards,
Christian.
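
The "hard upper limit" point above can be illustrated with a rough
userspace sketch: if the number of in-flight submissions on a ring is
bounded, the fence slots can be reserved once and the run path never has
to call into an allocator. This is not amdgpu code. MODEL_MAX_INFLIGHT
and all model_* names are hypothetical, and the real driver may well
solve this differently.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_INFLIGHT 4	/* hypothetical ring depth */

struct model_fence {
	unsigned int seqno;
	int in_use;
};

struct model_ring {
	struct model_fence slots[MODEL_MAX_INFLIGHT];	/* reserved at init */
	unsigned int next_seqno;
};

/* Takes a reserved slot; never calls into an allocator. */
static struct model_fence *model_fence_emit(struct model_ring *ring)
{
	for (size_t i = 0; i < MODEL_MAX_INFLIGHT; i++) {
		struct model_fence *f = &ring->slots[i];

		if (!f->in_use) {
			f->in_use = 1;
			f->seqno = ++ring->next_seqno;
			return f;
		}
	}
	return NULL;	/* ring full: the caller has to wait for a retire */
}

/* Gives the slot back once the submission has completed. */
static void model_fence_retire(struct model_fence *f)
{
	assert(f->in_use);
	f->in_use = 0;
}
```

The sketch leaves out locking entirely, so the question of which locks
are held on the release side is not modelled here.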

>
> - mempool_free must be called without any locks in the way which are held
>    while we call mempool_alloc. Otherwise we again have a nice deadlock
>    with no forward progress. I tried auditing that, but got lost in amdgpu
>    and scheduler code. Some lockdep annotations for mempool.c might help,
>    but they're not going to catch everything. Plus it would be again manual
>    annotations because this is yet another cross-release issue. So not sure
>    that helps at all.
>
> iow, not sure what to do here. Ideas?
>
> Cheers, Daniel
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>>   3 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8d84975885cd..a089a827fdfe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>>   	uint32_t seq;
>>   	int r;
>>   
>> -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
>> +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>>   	if (fence == NULL)
>>   		return -ENOMEM;
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>> index 267fa45ddb66..a333ca2d4ddd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>>   	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>>   		return amdgpu_sync_fence(sync, ring->vmid_wait);
>>   
>> -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
>> +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>>   	if (!fences)
>>   		return -ENOMEM;
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> index 8ea6c49529e7..af22b526cec9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
>>   	if (amdgpu_sync_add_later(sync, f))
>>   		return 0;
>>   
>> -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
>> +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>>   	if (!e)
>>   		return -ENOMEM;
>>   
>> -- 
>> 2.27.0
>>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-07-14 11:40     ` Christian König
@ 2020-07-14 14:31       ` Daniel Vetter
  2020-07-15  9:17         ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14 14:31 UTC (permalink / raw)
  To: Christian König
  Cc: Daniel Vetter, DRI Development, Intel Graphics Development,
	linux-rdma, Daniel Vetter, linux-media, linaro-mm-sig, amd-gfx,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

On Tue, Jul 14, 2020 at 01:40:11PM +0200, Christian König wrote:
> Am 14.07.20 um 12:49 schrieb Daniel Vetter:
> > On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> > > My dma-fence lockdep annotations caught an inversion because we
> > > allocate memory where we really shouldn't:
> > > 
> > > 	kmem_cache_alloc+0x2b/0x6d0
> > > 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
> > > 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> > > 	amdgpu_job_run+0x10f/0x260 [amdgpu]
> > > 	drm_sched_main+0x1b9/0x490 [gpu_sched]
> > > 	kthread+0x12e/0x150
> > > 
> > > Trouble right now is that lockdep only validates against GFP_FS, which
> > > would be good enough for shrinkers. But for mmu_notifiers we actually
> > > need !GFP_ATOMIC, since they can be called from any page laundering,
> > > even if GFP_NOFS or GFP_NOIO are set.
> > > 
> > > I guess we should improve the lockdep annotations for
> > > fs_reclaim_acquire/release.
> > > 
> > > Ofc real fix is to properly preallocate this fence and stuff it into
> > > the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> > > the way.
> > > 
> > > v2: Two more allocations in scheduler paths.
> > > 
> > > > First one:
> > > 
> > > 	__kmalloc+0x58/0x720
> > > 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> > > 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> > > 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> > > 	drm_sched_main+0xf9/0x490 [gpu_sched]
> > > 
> > > Second one:
> > > 
> > > 	kmem_cache_alloc+0x2b/0x6d0
> > > 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> > > 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> > > 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> > > 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> > > 	drm_sched_main+0xf9/0x490 [gpu_sched]
> > > 
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Has anyone from amd side started looking into how to fix this properly?
> 
> Yeah I checked both and neither are any real problem.

I'm confused ... do you mean "no real problem fixing them" or "not
actually a real problem"?

> > I looked a bit into fixing this with mempool, and the big guarantee we
> > need is that
> > - there's a hard upper limit on how many allocations we minimally need to
> >    guarantee forward progress. And the entire vmid allocation and
> >    amdgpu_sync_fence stuff kinda makes me question that's a valid
> >    assumption.
> 
> We do have hard upper limits for those.
> 
> The VMID allocation could as well just return the fence instead of putting
> it into the sync object IIRC. So that just needs some cleanup and can avoid
> the allocation entirely.

Yeah, embedding should be the simplest solution of all.

> The hardware fence is limited by the number of submissions we can have
> concurrently on the ring buffers, so also not a problem at all.

Ok that sounds good. Wrt releasing the memory again, is that also done
without any of the allocation-side locks held? I've seen some vmid manager
somewhere ...
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > - mempool_free must be called without any locks in the way which are held
> >    while we call mempool_alloc. Otherwise we again have a nice deadlock
> >    with no forward progress. I tried auditing that, but got lost in amdgpu
> >    and scheduler code. Some lockdep annotations for mempool.c might help,
> >    but they're not going to catch everything. Plus it would be again manual
> >    annotations because this is yet another cross-release issue. So not sure
> >    that helps at all.
> > 
> > iow, not sure what to do here. Ideas?
> > 
> > Cheers, Daniel
> > 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
> > >   3 files changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > index 8d84975885cd..a089a827fdfe 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
> > >   	uint32_t seq;
> > >   	int r;
> > > -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> > > +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
> > >   	if (fence == NULL)
> > >   		return -ENOMEM;
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > > index 267fa45ddb66..a333ca2d4ddd 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> > > @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
> > >   	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
> > >   		return amdgpu_sync_fence(sync, ring->vmid_wait);
> > > -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> > > +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
> > >   	if (!fences)
> > >   		return -ENOMEM;
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > > index 8ea6c49529e7..af22b526cec9 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> > > @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
> > >   	if (amdgpu_sync_add_later(sync, f))
> > >   		return 0;
> > > -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> > > +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
> > >   	if (!e)
> > >   		return -ENOMEM;
> > > -- 
> > > 2.27.0
> > > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/25] drm/vkms: Annotate vblank timer
  2020-07-14  9:59       ` Daniel Vetter
@ 2020-07-14 14:55         ` Melissa Wen
  2020-07-14 15:23           ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Melissa Wen @ 2020-07-14 14:55 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rodrigo Siqueira, DRI Development, Intel Graphics Development,
	linux-rdma, open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter, Haneen Mohammed, Trevor Woerner

Hi,

On 07/14, Daniel Vetter wrote:
> On Tue, Jul 14, 2020 at 11:57 AM Melissa Wen <melissa.srw@gmail.com> wrote:
> >
> > On 07/12, Rodrigo Siqueira wrote:
> > > Hi,
> > >
> > > Everything looks fine to me, I just noticed that the amdgpu patches did
> > > not apply smoothly, however it was trivial to fix the issues.
> > >
> > > Reviewed-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > >
> > > Melissa,
> > > Since you are using vkms regularly, could you test this patch and review
> > > it? Remember to add your Tested-by when you finish.
> > >
> > Hi,
> >
> > I've applied the patch series, ran some tests on vkms, and found no
> > issues. I mean, things have remained stable.
> >
> > Tested-by: Melissa Wen <melissa.srw@gmail.com>
> 
> Did you test with CONFIG_PROVE_LOCKING enabled in the kernel .config?
> Without that enabled, there's not really any change here, but with
> that enabled there might be some lockdep splats in dmesg indicating a
> problem.
>

Even with the lock debugging config enabled, no new issue arose in dmesg
during my tests using vkms.

Melissa

> Thanks, Daniel
> >
> > > Thanks
> > >
> > > On 07/07, Daniel Vetter wrote:
> > > > This is needed to signal the fences from page flips, annotate it
> > > > accordingly. We need to annotate entire timer callback since if we get
> > > > stuck anywhere in there, then the timer stops, and hence fences stop.
> > > > Just annotating the top part that does the vblank handling isn't
> > > > enough.
> > > >
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Cc: linux-rdma@vger.kernel.org
> > > > Cc: amd-gfx@lists.freedesktop.org
> > > > Cc: intel-gfx@lists.freedesktop.org
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > > > Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
> > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > ---
> > > >  drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > index ac85e17428f8..a53a40848a72 100644
> > > > --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > @@ -1,5 +1,7 @@
> > > >  // SPDX-License-Identifier: GPL-2.0+
> > > >
> > > > +#include <linux/dma-fence.h>
> > > > +
> > > >  #include <drm/drm_atomic.h>
> > > >  #include <drm/drm_atomic_helper.h>
> > > >  #include <drm/drm_probe_helper.h>
> > > > @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > > >     struct drm_crtc *crtc = &output->crtc;
> > > >     struct vkms_crtc_state *state;
> > > >     u64 ret_overrun;
> > > > -   bool ret;
> > > > +   bool ret, fence_cookie;
> > > > +
> > > > +   fence_cookie = dma_fence_begin_signalling();
> > > >
> > > >     ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
> > > >                                       output->period_ns);
> > > > @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > > >                     DRM_DEBUG_DRIVER("Composer worker already queued\n");
> > > >     }
> > > >
> > > > +   dma_fence_end_signalling(fence_cookie);
> > > > +
> > > >     return HRTIMER_RESTART;
> > > >  }
> > > >
> > > > --
> > > > 2.27.0
> > > >
> > >
> > > --
> > > Rodrigo Siqueira
> > > https://siqueira.tech
> >
> >

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 04/25] drm/vkms: Annotate vblank timer
  2020-07-14 14:55         ` Melissa Wen
@ 2020-07-14 15:23           ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-14 15:23 UTC (permalink / raw)
  To: Melissa Wen
  Cc: Rodrigo Siqueira, DRI Development, Intel Graphics Development,
	linux-rdma, open list:DMA BUFFER SHARING FRAMEWORK,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Chris Wilson, Maarten Lankhorst, Christian König,
	Daniel Vetter, Haneen Mohammed, Trevor Woerner

On Tue, Jul 14, 2020 at 4:56 PM Melissa Wen <melissa.srw@gmail.com> wrote:
>
> Hi,
>
> On 07/14, Daniel Vetter wrote:
> > On Tue, Jul 14, 2020 at 11:57 AM Melissa Wen <melissa.srw@gmail.com> wrote:
> > >
> > > On 07/12, Rodrigo Siqueira wrote:
> > > > Hi,
> > > >
> > > > Everything looks fine to me, I just noticed that the amdgpu patches did
> > > > not apply smoothly, however it was trivial to fix the issues.
> > > >
> > > > Reviewed-by: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > > >
> > > > Melissa,
> > > > Since you are using vkms regularly, could you test this patch and review
> > > > it? Remember to add your Tested-by when you finish.
> > > >
> > > Hi,
> > >
> > > I've applied the patch series, ran some tests on vkms, and found no
> > > issues. I mean, things have remained stable.
> > >
> > > Tested-by: Melissa Wen <melissa.srw@gmail.com>
> >
> > Did you test with CONFIG_PROVE_LOCKING enabled in the kernel .config?
> > Without that enabled, there's not really any change here, but with
> > that enabled there might be some lockdep splats in dmesg indicating a
> > problem.
> >
>
> Even with the lock debugging config enabled, no new issue arose in dmesg
> during my tests using vkms.

Excellent, thanks a lot for confirming this.
-Daniel

>
> Melissa
>
> > Thanks, Daniel
> > >
> > > > Thanks
> > > >
> > > > On 07/07, Daniel Vetter wrote:
> > > > > This is needed to signal the fences from page flips, annotate it
> > > > > accordingly. We need to annotate entire timer callback since if we get
> > > > > stuck anywhere in there, then the timer stops, and hence fences stop.
> > > > > Just annotating the top part that does the vblank handling isn't
> > > > > enough.
> > > > >
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Cc: linux-rdma@vger.kernel.org
> > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
> > > > > Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
> > > > > Cc: Daniel Vetter <daniel@ffwll.ch>
> > > > > ---
> > > > >  drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > index ac85e17428f8..a53a40848a72 100644
> > > > > --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > @@ -1,5 +1,7 @@
> > > > >  // SPDX-License-Identifier: GPL-2.0+
> > > > >
> > > > > +#include <linux/dma-fence.h>
> > > > > +
> > > > >  #include <drm/drm_atomic.h>
> > > > >  #include <drm/drm_atomic_helper.h>
> > > > >  #include <drm/drm_probe_helper.h>
> > > > > @@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > > > >     struct drm_crtc *crtc = &output->crtc;
> > > > >     struct vkms_crtc_state *state;
> > > > >     u64 ret_overrun;
> > > > > -   bool ret;
> > > > > +   bool ret, fence_cookie;
> > > > > +
> > > > > +   fence_cookie = dma_fence_begin_signalling();
> > > > >
> > > > >     ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
> > > > >                                       output->period_ns);
> > > > > @@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > > > >                     DRM_DEBUG_DRIVER("Composer worker already queued\n");
> > > > >     }
> > > > >
> > > > > +   dma_fence_end_signalling(fence_cookie);
> > > > > +
> > > > >     return HRTIMER_RESTART;
> > > > >  }
> > > > >
> > > > > --
> > > > > 2.27.0
> > > > >
> > > >
> > > > --
> > > > Rodrigo Siqueira
> > > > https://siqueira.tech
> > >
> > >



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
  2020-07-10 12:30     ` Maarten Lankhorst
@ 2020-07-14 17:46     ` Jason Ekstrand
  2020-07-20 11:15     ` [Linaro-mm-sig] " Thomas Hellström (Intel)
  2 siblings, 0 replies; 83+ messages in thread
From: Jason Ekstrand @ 2020-07-14 17:46 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, Intel Graphics Development, linux-rdma,
	Christian König, Daniel Stone, Jesse Natalie,
	Steve Pronovost, Felix Kuehling, Mika Kuoppala, Thomas Hellstrom,
	open list:DMA BUFFER SHARING FRAMEWORK, linaro-mm-sig,
	amd-gfx mailing list, Chris Wilson, Maarten Lankhorst,
	Daniel Vetter

This matches my understanding for what it's worth.  In my little bit
of synchronization work in drm, I've gone out of my way to ensure we
can maintain this constraint.

Acked-by: Jason Ekstrand <jason@jlekstrand.net>

On Thu, Jul 9, 2020 at 7:33 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.
>
> v2: Now with dot graph!
>
> v3: Typo (Dave Airlie)
>
> Acked-by: Christian König <christian.koenig@amd.com>
> Acked-by: Daniel Stone <daniels@collabora.com>
> Cc: Jesse Natalie <jenatali@microsoft.com>
> Cc: Steve Pronovost <spronovo@microsoft.com>
> Cc: Jason Ekstrand <jason@jlekstrand.net>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  Documentation/driver-api/dma-buf.rst | 70 ++++++++++++++++++++++++++++
>  1 file changed, 70 insertions(+)
>
> diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> index f8f6decde359..100bfd227265 100644
> --- a/Documentation/driver-api/dma-buf.rst
> +++ b/Documentation/driver-api/dma-buf.rst
> @@ -178,3 +178,73 @@ DMA Fence uABI/Sync File
>  .. kernel-doc:: include/linux/sync_file.h
>     :internal:
>
> +Indefinite DMA Fences
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +At various times &dma_fence with an indefinite time until dma_fence_wait()
> +finishes have been proposed. Examples include:
> +
> +* Future fences, used in HWC1 to signal when a buffer isn't used by the display
> +  any longer, and created with the screen update that makes the buffer visible.
> +  The time this fence completes is entirely under userspace's control.
> +
> +* Proxy fences, proposed to handle &drm_syncobj for which the fence has not yet
> +  been set. Used to asynchronously delay command submission.
> +
> +* Userspace fences or gpu futexes, fine-grained locking within a command buffer
> +  that userspace uses for synchronization across engines or with the CPU, which
> +  are then imported as a DMA fence for integration into existing winsys
> +  protocols.
> +
> +* Long-running compute command buffers, while still using traditional end of
> +  batch DMA fences for memory management instead of context preemption DMA
> +  fences which get reattached when the compute job is rescheduled.
> +
> +Common to all these schemes is that userspace controls the dependencies of these
> +fences and controls when they fire. Mixing indefinite fences with normal
> +in-kernel DMA fences does not work, even when a fallback timeout is included to
> +protect against malicious userspace:
> +
> +* Only the kernel knows about all DMA fence dependencies, userspace is not aware
> +  of dependencies injected due to memory management or scheduler decisions.
> +
> +* Only userspace knows about all dependencies in indefinite fences and when
> +  exactly they will complete, the kernel has no visibility.
> +
> +Furthermore the kernel has to be able to hold up userspace command submission
> +for memory management needs, which means we must support indefinite fences being
> +dependent upon DMA fences. If the kernel also supports indefinite fences in the
> +kernel like a DMA fence, like any of the above proposals would, there is the
> +potential for deadlocks.
> +
> +.. kernel-render:: DOT
> +   :alt: Indefinite Fencing Dependency Cycle
> +   :caption: Indefinite Fencing Dependency Cycle
> +
> +   digraph "Fencing Cycle" {
> +      node [shape=box bgcolor=grey style=filled]
> +      kernel [label="Kernel DMA Fences"]
> +      userspace [label="userspace controlled fences"]
> +      kernel -> userspace [label="memory management"]
> +      userspace -> kernel [label="Future fence, fence proxy, ..."]
> +
> +      { rank=same; kernel userspace }
> +   }
> +
> +This means that the kernel might accidentally create deadlocks
> +through memory management dependencies which userspace is unaware of, which
> +randomly hangs workloads until the timeout kicks in. From userspace's
> +perspective these workloads do not contain a deadlock. In such a mixed fencing
> +architecture there is no single entity with knowledge of all dependencies.
> +Therefore preventing such deadlocks from within the kernel is not possible.
> +
> +The only way to avoid dependency loops is to not allow indefinite
> +fences in the kernel. This means:
> +
> +* No future fences, proxy fences or userspace fences imported as DMA fences,
> +  with or without a timeout.
> +
> +* No DMA fences that signal end of batchbuffer for command submission where
> +  userspace is allowed to use userspace fencing or long running compute
> +  workloads. This also means no implicit fencing for shared buffers in these
> +  cases.
> --
> 2.27.0
>
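
The dependency cycle from the quoted documentation can be modelled in a few
lines of userspace C. This is only a sketch under assumptions: the graph
type, the node names and wait_graph_has_cycle() are made up for illustration
and are not kernel API. Edges mirror the arrows in the DOT graph above. The
point is that the cycle is only visible to someone who sees both edges, and
in a mixed fencing architecture neither the kernel nor userspace does.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical node ids, named after the two boxes in the DOT graph. */
enum { KERNEL_FENCES, USERSPACE_FENCES };

/* Hypothetical dependency graph: edge[a][b] is an arrow from a to b. */
struct wait_graph {
	size_t n;
	bool edge[MAX_NODES][MAX_NODES];
};

/* Depth-first search: can we get from 'from' to 'target' along edges? */
static bool reaches(const struct wait_graph *g, size_t from, size_t target,
		    bool *visited)
{
	for (size_t next = 0; next < g->n; next++) {
		if (!g->edge[from][next])
			continue;
		if (next == target)
			return true;
		if (visited[next])
			continue;
		visited[next] = true;
		if (reaches(g, next, target, visited))
			return true;
	}
	return false;
}

/* A cycle exists if any node can reach itself again. */
static bool wait_graph_has_cycle(const struct wait_graph *g)
{
	for (size_t start = 0; start < g->n; start++) {
		bool visited[MAX_NODES] = { false };

		if (reaches(g, start, start, visited))
			return true;
	}
	return false;
}
```

With only the "memory management" edge (KERNEL_FENCES to USERSPACE_FENCES)
the graph is acyclic. Adding the "future fence, fence proxy" edge in the
other direction closes the loop, which is the deadlock potential the text
describes.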


* Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-07-14 14:31       ` Daniel Vetter
@ 2020-07-15  9:17         ` Christian König
  2020-07-15 11:53           ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-15  9:17 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: linux-rdma, Daniel Vetter, Intel Graphics Development,
	Maarten Lankhorst, DRI Development, Chris Wilson, linaro-mm-sig,
	amd-gfx, Daniel Vetter, linux-media

Am 14.07.20 um 16:31 schrieb Daniel Vetter:
> On Tue, Jul 14, 2020 at 01:40:11PM +0200, Christian König wrote:
>> Am 14.07.20 um 12:49 schrieb Daniel Vetter:
>>> On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
>>>> My dma-fence lockdep annotations caught an inversion because we
>>>> allocate memory where we really shouldn't:
>>>>
>>>> 	kmem_cache_alloc+0x2b/0x6d0
>>>> 	amdgpu_fence_emit+0x30/0x330 [amdgpu]
>>>> 	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
>>>> 	amdgpu_job_run+0x10f/0x260 [amdgpu]
>>>> 	drm_sched_main+0x1b9/0x490 [gpu_sched]
>>>> 	kthread+0x12e/0x150
>>>>
>>>> Trouble right now is that lockdep only validates against GFP_FS, which
>>>> would be good enough for shrinkers. But for mmu_notifiers we actually
>>>> need !GFP_ATOMIC, since they can be called from any page laundering,
>>>> even if GFP_NOFS or GFP_NOIO are set.
>>>>
>>>> I guess we should improve the lockdep annotations for
>>>> fs_reclaim_acquire/release.
>>>>
>>>> Ofc real fix is to properly preallocate this fence and stuff it into
>>>> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
>>>> the way.
>>>>
>>>> v2: Two more allocations in scheduler paths.
>>>>
>>>> First one:
>>>>
>>>> 	__kmalloc+0x58/0x720
>>>> 	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
>>>> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
>>>> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
>>>> 	drm_sched_main+0xf9/0x490 [gpu_sched]
>>>>
>>>> Second one:
>>>>
>>>> 	kmem_cache_alloc+0x2b/0x6d0
>>>> 	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
>>>> 	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
>>>> 	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
>>>> 	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
>>>> 	drm_sched_main+0xf9/0x490 [gpu_sched]
>>>>
>>>> Cc: linux-media@vger.kernel.org
>>>> Cc: linaro-mm-sig@lists.linaro.org
>>>> Cc: linux-rdma@vger.kernel.org
>>>> Cc: amd-gfx@lists.freedesktop.org
>>>> Cc: intel-gfx@lists.freedesktop.org
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> Has anyone from amd side started looking into how to fix this properly?
>> Yeah I checked both and neither are any real problem.
> I'm confused ... do you mean "no real problem fixing them" or "not
> actually a real problem"?

Both, at least the VMID stuff is trivial to avoid.

And the fence allocation failing is extremely unlikely. E.g. when we allocate a
new one we most likely just freed one beforehand.

>
>>> I looked a bit into fixing this with mempool, and the big guarantee we
>>> need is that
>>> - there's a hard upper limit on how many allocations we minimally need to
>>>     guarantee forward progress. And the entire vmid allocation and
>>>     amdgpu_sync_fence stuff kinda makes me question that's a valid
>>>     assumption.
>> We do have hard upper limits for those.
>>
>> The VMID allocation could as well just return the fence instead of putting
>> it into the sync object IIRC. So that just needs some cleanup and can avoid
>> the allocation entirely.
> Yeah embedding should be simplest solution of all.
>
>> The hardware fence is limited by the number of submissions we can have
>> concurrently on the ring buffers, so also not a problem at all.
> Ok that sounds good. Wrt releasing the memory again, is that also done
> without any of the allocation-side locks held? I've seen some vmid manager
> somewhere ...

Well that's the issue. We can't guarantee that for the hardware fence 
memory since it could be that we hold another reference during debugging 
IIRC.

Still looking if and how we could fix this. But as I said this problem 
is so extremely unlikely.

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> - mempool_free must be called without any locks in the way which are held
>>>     while we call mempool_alloc. Otherwise we again have a nice deadlock
>>>     with no forward progress. I tried auditing that, but got lost in amdgpu
>>>     and scheduler code. Some lockdep annotations for mempool.c might help,
>>>     but they're not going to catch everything. Plus it would be again manual
>>>     annotations because this is yet another cross-release issue. So not sure
>>>     that helps at all.
>>>
>>> iow, not sure what to do here. Ideas?
>>>
>>> Cheers, Daniel
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
>>>>    3 files changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8d84975885cd..a089a827fdfe 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>>>>    	uint32_t seq;
>>>>    	int r;
>>>> -	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
>>>> +	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
>>>>    	if (fence == NULL)
>>>>    		return -ENOMEM;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>> index 267fa45ddb66..a333ca2d4ddd 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
>>>> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
>>>>    	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
>>>>    		return amdgpu_sync_fence(sync, ring->vmid_wait);
>>>> -	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
>>>> +	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
>>>>    	if (!fences)
>>>>    		return -ENOMEM;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> index 8ea6c49529e7..af22b526cec9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
>>>> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
>>>>    	if (amdgpu_sync_add_later(sync, f))
>>>>    		return 0;
>>>> -	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
>>>> +	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
>>>>    	if (!e)
>>>>    		return -ENOMEM;
>>>> -- 
>>>> 2.27.0
>>>>



* Re: [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-07-15  9:17         ` Christian König
@ 2020-07-15 11:53           ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-15 11:53 UTC (permalink / raw)
  To: Christian König
  Cc: linux-rdma, Intel Graphics Development, Maarten Lankhorst,
	DRI Development, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, amd-gfx list,
	Daniel Vetter, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, Jul 15, 2020 at 11:17 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 14.07.20 um 16:31 schrieb Daniel Vetter:
> > On Tue, Jul 14, 2020 at 01:40:11PM +0200, Christian König wrote:
> >> Am 14.07.20 um 12:49 schrieb Daniel Vetter:
> >>> On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
> >>>> My dma-fence lockdep annotations caught an inversion because we
> >>>> allocate memory where we really shouldn't:
> >>>>
> >>>>    kmem_cache_alloc+0x2b/0x6d0
> >>>>    amdgpu_fence_emit+0x30/0x330 [amdgpu]
> >>>>    amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> >>>>    amdgpu_job_run+0x10f/0x260 [amdgpu]
> >>>>    drm_sched_main+0x1b9/0x490 [gpu_sched]
> >>>>    kthread+0x12e/0x150
> >>>>
> >>>> Trouble right now is that lockdep only validates against GFP_FS, which
> >>>> would be good enough for shrinkers. But for mmu_notifiers we actually
> >>>> need !GFP_ATOMIC, since they can be called from any page laundering,
> >>>> even if GFP_NOFS or GFP_NOIO are set.
> >>>>
> >>>> I guess we should improve the lockdep annotations for
> >>>> fs_reclaim_acquire/release.
> >>>>
> >>>> Ofc real fix is to properly preallocate this fence and stuff it into
> >>>> the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
> >>>> the way.
> >>>>
> >>>> v2: Two more allocations in scheduler paths.
> >>>>
> >>>> First one:
> >>>>
> >>>>    __kmalloc+0x58/0x720
> >>>>    amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> >>>>    amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> >>>>    drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> >>>>    drm_sched_main+0xf9/0x490 [gpu_sched]
> >>>>
> >>>> Second one:
> >>>>
> >>>>    kmem_cache_alloc+0x2b/0x6d0
> >>>>    amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> >>>>    amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> >>>>    amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> >>>>    drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> >>>>    drm_sched_main+0xf9/0x490 [gpu_sched]
> >>>>
> >>>> Cc: linux-media@vger.kernel.org
> >>>> Cc: linaro-mm-sig@lists.linaro.org
> >>>> Cc: linux-rdma@vger.kernel.org
> >>>> Cc: amd-gfx@lists.freedesktop.org
> >>>> Cc: intel-gfx@lists.freedesktop.org
> >>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> >>>> Cc: Christian König <christian.koenig@amd.com>
> >>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>> Has anyone from amd side started looking into how to fix this properly?
> >> Yeah I checked both and neither are any real problem.
> > I'm confused ... do you mean "no real problem fixing them" or "not
> > actually a real problem"?
>
> Both, at least the VMID stuff is trivial to avoid.
>
> And the fence allocation is extremely unlikely. E.g. when we allocate a
> new one we previously most likely just freed one already.

Yeah I think debugging we can avoid, just stop debugging if things get
hung up like that. So mempool for the hw fences should be perfectly
fine.

The vmid stuff I don't really understand enough, but the hw fence
stuff I think I grok, plus other scheduler users need that too from a
quick look. I might be tackling that one (maybe put the mempool
outright into drm_scheduler code as a helper), except if you have
patches already in the works. vmid I'll leave to you guys :-)
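
To illustrate the forward-progress argument (hard upper limit on in-flight
hw fences, so a preallocated pool can never run dry as long as fences get
released), here is a minimal userspace model. It is a sketch only and not
the kernel mempool API: struct hw_fence, fence_pool_get(), fence_pool_put()
and the MAX_INFLIGHT value are hypothetical names picked for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed hard upper limit on concurrent submissions on one ring. */
#define MAX_INFLIGHT 4

/* Hypothetical stand-in for a hardware fence object. */
struct hw_fence {
	unsigned int seqno;
	bool in_use;
};

/* All fences are preallocated up front, sized to the upper limit. */
struct fence_pool {
	struct hw_fence slots[MAX_INFLIGHT];
};

/*
 * Hand out a free slot. Never calls into an allocator, so it cannot
 * recurse into reclaim. Returns NULL only if the caller exceeded the
 * limit the pool was sized for.
 */
static struct hw_fence *fence_pool_get(struct fence_pool *pool,
				       unsigned int seqno)
{
	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		if (!pool->slots[i].in_use) {
			pool->slots[i].in_use = true;
			pool->slots[i].seqno = seqno;
			return &pool->slots[i];
		}
	}
	return NULL;
}

/* Return a slot to the pool once the fence is no longer referenced. */
static void fence_pool_put(struct hw_fence *fence)
{
	assert(fence != NULL && fence->in_use);
	fence->in_use = false;
}
```

The model leaves out locking on purpose. In the real thing the release
side must not need any lock that is held around the allocation side,
otherwise the pool itself can deadlock, and an extra reference held for
debugging keeps a slot busy longer than the ring limit assumes.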

-Daniel

>
> >
> >>> I looked a bit into fixing this with mempool, and the big guarantee we
> >>> need is that
> >>> - there's a hard upper limit on how many allocations we minimally need to
> >>>     guarantee forward progress. And the entire vmid allocation and
> >>>     amdgpu_sync_fence stuff kinda makes me question that's a valid
> >>>     assumption.
> >> We do have hard upper limits for those.
> >>
> >> The VMID allocation could as well just return the fence instead of putting
> >> it into the sync object IIRC. So that just needs some cleanup and can avoid
> >> the allocation entirely.
> > Yeah embedding should be simplest solution of all.
> >
> >> The hardware fence is limited by the number of submissions we can have
> >> concurrently on the ring buffers, so also not a problem at all.
> > Ok that sounds good. Wrt releasing the memory again, is that also done
> > without any of the allocation-side locks held? I've seen some vmid manager
> > somewhere ...
>
> Well that's the issue. We can't guarantee that for the hardware fence
> memory since it could be that we hold another reference during debugging
> IIRC.
>
> Still looking if and how we could fix this. But as I said this problem
> is so extremely unlikely.
>
> Christian.
>
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> - mempool_free must be called without any locks in the way which are held
> >>>     while we call mempool_alloc. Otherwise we again have a nice deadlock
> >>>     with no forward progress. I tried auditing that, but got lost in amdgpu
> >>>     and scheduler code. Some lockdep annotations for mempool.c might help,
> >>>     but they're not going to catch everything. Plus it would be again manual
> >>>     annotations because this is yet another cross-release issue. So not sure
> >>>     that helps at all.
> >>>
> >>> iow, not sure what to do here. Ideas?
> >>>
> >>> Cheers, Daniel
> >>>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
> >>>>    3 files changed, 3 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> index 8d84975885cd..a089a827fdfe 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>>> @@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
> >>>>            uint32_t seq;
> >>>>            int r;
> >>>> -  fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
> >>>> +  fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
> >>>>            if (fence == NULL)
> >>>>                    return -ENOMEM;
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> >>>> index 267fa45ddb66..a333ca2d4ddd 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
> >>>> @@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
> >>>>            if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
> >>>>                    return amdgpu_sync_fence(sync, ring->vmid_wait);
> >>>> -  fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
> >>>> +  fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
> >>>>            if (!fences)
> >>>>                    return -ENOMEM;
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> >>>> index 8ea6c49529e7..af22b526cec9 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
> >>>> @@ -160,7 +160,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f)
> >>>>            if (amdgpu_sync_add_later(sync, f))
> >>>>                    return 0;
> >>>> -  e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
> >>>> +  e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
> >>>>            if (!e)
> >>>>                    return -ENOMEM;
> >>>> --
> >>>> 2.27.0
> >>>>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
  2020-07-10 12:30     ` Maarten Lankhorst
  2020-07-14 17:46     ` Jason Ekstrand
@ 2020-07-20 11:15     ` Thomas Hellström (Intel)
  2020-07-21  7:41       ` Daniel Vetter
  2 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-20 11:15 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: Daniel Stone, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, amd-gfx, linaro-mm-sig, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, linux-media, Christian König,
	Mika Kuoppala

Hi,

On 7/9/20 2:33 PM, Daniel Vetter wrote:
> Comes up every few years, gets somewhat tedious to discuss, let's
> write this down once and for all.
>
> What I'm not sure about is whether the text should be more explicit in
> flat out mandating the amdkfd eviction fences for long running compute
> workloads or workloads where userspace fencing is allowed.

Although (in my humble opinion) it might be possible to completely 
untangle kernel-introduced fences for resource management and dma-fences 
used for completion- and dependency tracking and lift a lot of 
restrictions for the dma-fences, including prohibiting infinite ones, I 
think this makes sense describing the current state.

Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>




* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-20 11:15     ` [Linaro-mm-sig] " Thomas Hellström (Intel)
@ 2020-07-21  7:41       ` Daniel Vetter
  2020-07-21  7:45         ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-21  7:41 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Daniel Vetter, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx,
	linaro-mm-sig, Steve Pronovost, Daniel Vetter, Jason Ekstrand,
	Jesse Natalie, Felix Kuehling, Thomas Hellstrom, linux-media,
	Christian König, Mika Kuoppala

On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote:
> Hi,
> 
> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> > Comes up every few years, gets somewhat tedious to discuss, let's
> > write this down once and for all.
> > 
> > What I'm not sure about is whether the text should be more explicit in
> > flat out mandating the amdkfd eviction fences for long running compute
> > workloads or workloads where userspace fencing is allowed.
> 
> Although (in my humble opinion) it might be possible to completely untangle
> kernel-introduced fences for resource management and dma-fences used for
> completion- and dependency tracking and lift a lot of restrictions for the
> dma-fences, including prohibiting infinite ones, I think this makes sense
> describing the current state.

Yeah I think a future patch needs to type up how we want to make that
happen (for some cross driver consistency) and what needs to be
considered. Some of the necessary parts are already there (with like the
preemption fences amdkfd has as an example), but I think some clear docs
on what's required from both hw, drivers and userspace would be really
good.
>
> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>

Thanks for taking a look, first 3 patches here with annotations and docs
merged to drm-misc-next. I'll ask Maarten/Dave whether another pull is ok
for 5.9 so that everyone can use this asap.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  7:41       ` Daniel Vetter
@ 2020-07-21  7:45         ` Christian König
  2020-07-21  8:47           ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-21  7:45 UTC (permalink / raw)
  To: Daniel Vetter, Thomas Hellström (Intel)
  Cc: Daniel Vetter, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx,
	linaro-mm-sig, Steve Pronovost, Daniel Vetter, Jason Ekstrand,
	Jesse Natalie, Felix Kuehling, Thomas Hellstrom, linux-media,
	Mika Kuoppala

Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) wrote:
>> Hi,
>>
>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>> write this down once and for all.
>>>
>>> What I'm not sure about is whether the text should be more explicit in
>>> flat out mandating the amdkfd eviction fences for long running compute
>>> workloads or workloads where userspace fencing is allowed.
>> Although (in my humble opinion) it might be possible to completely untangle
>> kernel-introduced fences for resource management and dma-fences used for
>> completion- and dependency tracking and lift a lot of restrictions for the
>> dma-fences, including prohibiting infinite ones, I think this makes sense
>> describing the current state.
> Yeah I think a future patch needs to type up how we want to make that
> happen (for some cross driver consistency) and what needs to be
> considered. Some of the necessary parts are already there (with like the
> preemption fences amdkfd has as an example), but I think some clear docs
> on what's required from both hw, drivers and userspace would be really
> good.

I'm currently writing that up, but probably still need a few days for this.

Christian.

>> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@intel.com>
> Thanks for taking a look, first 3 patches here with annotations and docs
> merged to drm-misc-next. I'll ask Maarten/Dave whether another pull is ok
> for 5.9 so that everyone can use this asap.
> -Daniel



* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  7:45         ` Christian König
@ 2020-07-21  8:47           ` Thomas Hellström (Intel)
  2020-07-21  8:55             ` Christian König
  2020-07-21 22:45             ` Dave Airlie
  0 siblings, 2 replies; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21  8:47 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: Daniel Vetter, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx,
	linaro-mm-sig, Steve Pronovost, Daniel Vetter, Jason Ekstrand,
	Jesse Natalie, Felix Kuehling, Thomas Hellstrom, linux-media,
	Mika Kuoppala


On 7/21/20 9:45 AM, Christian König wrote:
> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>> wrote:
>>> Hi,
>>>
>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>> write this down once and for all.
>>>>
>>>> What I'm not sure about is whether the text should be more explicit in
>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>> workloads or workloads where userspace fencing is allowed.
>>> Although (in my humble opinion) it might be possible to completely 
>>> untangle
>>> kernel-introduced fences for resource management and dma-fences used 
>>> for
>>> completion- and dependency tracking and lift a lot of restrictions 
>>> for the
>>> dma-fences, including prohibiting infinite ones, I think this makes 
>>> sense
>>> describing the current state.
>> Yeah I think a future patch needs to type up how we want to make that
>> happen (for some cross driver consistency) and what needs to be
>> considered. Some of the necessary parts are already there (with like the
>> preemption fences amdkfd has as an example), but I think some clear docs
>> on what's required from both hw, drivers and userspace would be really
>> good.
>
> I'm currently writing that up, but probably still need a few days for 
> this.

Great! I put down some (very) initial thoughts a couple of weeks ago 
building on eviction fences for various hardware complexity levels here:

https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

/Thomas




* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  8:47           ` Thomas Hellström (Intel)
@ 2020-07-21  8:55             ` Christian König
  2020-07-21  9:16               ` Daniel Vetter
  2020-07-21  9:37               ` Thomas Hellström (Intel)
  2020-07-21 22:45             ` Dave Airlie
  1 sibling, 2 replies; 83+ messages in thread
From: Christian König @ 2020-07-21  8:55 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Daniel Vetter
  Cc: Daniel Vetter, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx,
	linaro-mm-sig, Steve Pronovost, Daniel Vetter, Jason Ekstrand,
	Jesse Natalie, Felix Kuehling, Thomas Hellstrom, linux-media,
	Mika Kuoppala

Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>
> On 7/21/20 9:45 AM, Christian König wrote:
>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>>> wrote:
>>>> Hi,
>>>>
>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>> write this down once and for all.
>>>>>
>>>>> What I'm not sure about is whether the text should be more 
>>>>> explicit in
>>>>> flat out mandating the amdkfd eviction fences for long running 
>>>>> compute
>>>>> workloads or workloads where userspace fencing is allowed.
>>>> Although (in my humble opinion) it might be possible to completely 
>>>> untangle
>>>> kernel-introduced fences for resource management and dma-fences 
>>>> used for
>>>> completion- and dependency tracking and lift a lot of restrictions 
>>>> for the
>>>> dma-fences, including prohibiting infinite ones, I think this makes 
>>>> sense
>>>> describing the current state.
>>> Yeah I think a future patch needs to type up how we want to make that
>>> happen (for some cross driver consistency) and what needs to be
>>> considered. Some of the necessary parts are already there (with like 
>>> the
>>> preemption fences amdkfd has as an example), but I think some clear 
>>> docs
>>> on what's required from both hw, drivers and userspace would be really
>>> good.
>>
>> I'm currently writing that up, but probably still need a few days for 
>> this.
>
> Great! I put down some (very) initial thoughts a couple of weeks ago 
> building on eviction fences for various hardware complexity levels here:
>
> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>

I don't think that this will ever be possible.

See, what Daniel describes in his text is that indefinite fences are a
bad idea for memory management, and I think that this is a fixed fact.

In other words the whole concept of submitting work to the kernel which 
depends on some user space interaction doesn't work and never will.

What can be done is that dma_fences work with hardware schedulers. E.g. 
what the KFD tries to do with its preemption fences.

But for this you need a better concept and description of what the 
hardware scheduler is supposed to do and how that interacts with 
dma_fence objects.

Christian.

>
> /Thomas
>
>



* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  8:55             ` Christian König
@ 2020-07-21  9:16               ` Daniel Vetter
  2020-07-21  9:24                 ` Daniel Vetter
  2020-07-21  9:37               ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-21  9:16 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala

On Tue, Jul 21, 2020 at 10:55 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> >
> > On 7/21/20 9:45 AM, Christian König wrote:
> >> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>> write this down once and for all.
> >>>>>
> >>>>> What I'm not sure about is whether the text should be more
> >>>>> explicit in
> >>>>> flat out mandating the amdkfd eviction fences for long running
> >>>>> compute
> >>>>> workloads or workloads where userspace fencing is allowed.
> >>>> Although (in my humble opinion) it might be possible to completely
> >>>> untangle
> >>>> kernel-introduced fences for resource management and dma-fences
> >>>> used for
> >>>> completion- and dependency tracking and lift a lot of restrictions
> >>>> for the
> >>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>> sense
> >>>> describing the current state.
> >>> Yeah I think a future patch needs to type up how we want to make that
> >>> happen (for some cross driver consistency) and what needs to be
> >>> considered. Some of the necessary parts are already there (with like
> >>> the
> >>> preemption fences amdkfd has as an example), but I think some clear
> >>> docs
> >>> on what's required from both hw, drivers and userspace would be really
> >>> good.
> >>
> >> I'm currently writing that up, but probably still need a few days for
> >> this.
> >
> > Great! I put down some (very) initial thoughts a couple of weeks ago
> > building on eviction fences for various hardware complexity levels here:
> >
> > https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >
>
> I don't think that this will ever be possible.
>
> See, what Daniel describes in his text is that indefinite fences are a
> bad idea for memory management, and I think that this is a fixed fact.
>
> In other words the whole concept of submitting work to the kernel which
> depends on some user space interaction doesn't work and never will.
>
> What can be done is that dma_fences work with hardware schedulers. E.g.
> what the KFD tries to do with its preemption fences.
>
> But for this you need a better concept and description of what the
> hardware scheduler is supposed to do and how that interacts with
> dma_fence objects.

Yeah I think trying to split dma_fence won't work, simply because of
inertia. Creating an entirely new thing for augmented userspace
controlled fencing, and then jotting down all the rules the
kernel/hw/userspace need to obey to not break dma_fence is what I had
in mind. And I guess that's also what Christian is working on. E.g.
just going through all the cases of how much your hw can preempt or
handle page faults on the gpu, and what that means in terms of
dma_fence_begin/end_signalling and other constraints would be really
good.
-Daniel

>
> Christian.
>
> >
> > /Thomas
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  9:16               ` Daniel Vetter
@ 2020-07-21  9:24                 ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-21  9:24 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala

On Tue, Jul 21, 2020 at 11:16 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 21, 2020 at 10:55 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> > >
> > > On 7/21/20 9:45 AM, Christian König wrote:
> > >> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> > >>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> > >>> wrote:
> > >>>> Hi,
> > >>>>
> > >>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> > >>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> > >>>>> write this down once and for all.
> > >>>>>
> > >>>>> What I'm not sure about is whether the text should be more
> > >>>>> explicit in
> > >>>>> flat out mandating the amdkfd eviction fences for long running
> > >>>>> compute
> > >>>>> workloads or workloads where userspace fencing is allowed.
> > >>>> Although (in my humble opinion) it might be possible to completely
> > >>>> untangle
> > >>>> kernel-introduced fences for resource management and dma-fences
> > >>>> used for
> > >>>> completion- and dependency tracking and lift a lot of restrictions
> > >>>> for the
> > >>>> dma-fences, including prohibiting infinite ones, I think this makes
> > >>>> sense
> > >>>> describing the current state.
> > >>> Yeah I think a future patch needs to type up how we want to make that
> > >>> happen (for some cross driver consistency) and what needs to be
> > >>> considered. Some of the necessary parts are already there (with like
> > >>> the
> > >>> preemption fences amdkfd has as an example), but I think some clear
> > >>> docs
> > >>> on what's required from both hw, drivers and userspace would be really
> > >>> good.
> > >>
> > >> I'm currently writing that up, but probably still need a few days for
> > >> this.
> > >
> > > Great! I put down some (very) initial thoughts a couple of weeks ago
> > > building on eviction fences for various hardware complexity levels here:
> > >
> > > https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > >
> >
> > I don't think that this will ever be possible.
> >
> > See that Daniel describes in his text is that indefinite fences are a
> > bad idea for memory management, and I think that this is a fixed fact.
> >
> > In other words the whole concept of submitting work to the kernel which
> > depends on some user space interaction doesn't work and never will.
> >
> > What can be done is that dma_fences work with hardware schedulers. E.g.
> > what the KFD tries to do with its preemption fences.
> >
> > But for this you need a better concept and description of what the
> > hardware scheduler is supposed to do and how that interacts with
> > dma_fence objects.
>
> Yeah I think trying to split dma_fence won't work, simply because of
> inertia. Creating an entirely new thing for augmented userspace
> controlled fencing, and then jotting down all the rules the
> kernel/hw/userspace need to obey to not break dma_fence is what I had
> in mind. And I guess that's also what Christian is working on. E.g.
> just going through all the cases of how much your hw can preempt or
> handle page faults on the gpu, and what that means in terms of
> dma_fence_begin/end_signalling and other constraints would be really
> good.

Or rephrased in terms of Thomas' doc: dma-fence will stay the memory
fence, and also the sync fence for current userspace and winsys.

Then we create a new thing, plus a complete protocol and driver rev of
the entire world. Running old stuff on a new stack is possible (we'd be
totally screwed otherwise, since it would become a system wide flag
day). The really hard part is that running new stuff on an old stack
(even if it's just something in userspace like the compositor) doesn't
work, because then you tie the new synchronization fences back into the
dma-fence memory fences, and game over.

So yeah around 5 years or so for anything that wants to use a winsys,
or at least that's what it usually takes us to do something like this
:-/ Entirely stand-alone compute workloads (irrespective of whether
it's cuda, cl, vk or whatever) don't have that problem ofc.
-Daniel

> -Daniel
>
> >
> > Christian.
> >
> > >
> > > /Thomas
> > >
> > >
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  8:55             ` Christian König
  2020-07-21  9:16               ` Daniel Vetter
@ 2020-07-21  9:37               ` Thomas Hellström (Intel)
  2020-07-21  9:50                 ` Daniel Vetter
  1 sibling, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21  9:37 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: Daniel Vetter, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx,
	linaro-mm-sig, Steve Pronovost, Daniel Vetter, Jason Ekstrand,
	Jesse Natalie, Felix Kuehling, Thomas Hellstrom, linux-media,
	Mika Kuoppala


On 7/21/20 10:55 AM, Christian König wrote:
> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>
>> On 7/21/20 9:45 AM, Christian König wrote:
>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel) 
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>> write this down once and for all.
>>>>>>
>>>>>> What I'm not sure about is whether the text should be more 
>>>>>> explicit in
>>>>>> flat out mandating the amdkfd eviction fences for long running 
>>>>>> compute
>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>> Although (in my humble opinion) it might be possible to completely 
>>>>> untangle
>>>>> kernel-introduced fences for resource management and dma-fences 
>>>>> used for
>>>>> completion- and dependency tracking and lift a lot of restrictions 
>>>>> for the
>>>>> dma-fences, including prohibiting infinite ones, I think this 
>>>>> makes sense
>>>>> describing the current state.
>>>> Yeah I think a future patch needs to type up how we want to make that
>>>> happen (for some cross driver consistency) and what needs to be
>>>> considered. Some of the necessary parts are already there (with 
>>>> like the
>>>> preemption fences amdkfd has as an example), but I think some clear 
>>>> docs
>>>> on what's required from both hw, drivers and userspace would be really
>>>> good.
>>>
>>> I'm currently writing that up, but probably still need a few days 
>>> for this.
>>
>> Great! I put down some (very) initial thoughts a couple of weeks ago 
>> building on eviction fences for various hardware complexity levels here:
>>
>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>
>
> I don't think that this will ever be possible.
>
> See that Daniel describes in his text is that indefinite fences are a 
> bad idea for memory management, and I think that this is a fixed fact.
>
> In other words the whole concept of submitting work to the kernel 
> which depends on some user space interaction doesn't work and never will.

Well the idea here is that memory management will *never* depend on 
indefinite fences: As soon as someone waits on a memory manager fence 
(be it eviction, shrinker or mmu notifier) it breaks out of any 
dma-fence dependencies and/or user-space interaction. The text tries to
describe what's required to be able to do that (save for non-preemptible 
gpus where someone submits a forever-running shader).

So while I think this is possible (until someone comes up with a case 
where it wouldn't work of course), I guess Daniel has a point in that it 
won't happen because of inertia and there might be better options.

/Thomas



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  9:37               ` Thomas Hellström (Intel)
@ 2020-07-21  9:50                 ` Daniel Vetter
  2020-07-21 10:47                   ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-21  9:50 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala

On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 7/21/20 10:55 AM, Christian König wrote:
> > Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
> >>
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>> write this down once and for all.
> >>>>>>
> >>>>>> What I'm not sure about is whether the text should be more
> >>>>>> explicit in
> >>>>>> flat out mandating the amdkfd eviction fences for long running
> >>>>>> compute
> >>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>> Although (in my humble opinion) it might be possible to completely
> >>>>> untangle
> >>>>> kernel-introduced fences for resource management and dma-fences
> >>>>> used for
> >>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>> for the
> >>>>> dma-fences, including prohibiting infinite ones, I think this
> >>>>> makes sense
> >>>>> describing the current state.
> >>>> Yeah I think a future patch needs to type up how we want to make that
> >>>> happen (for some cross driver consistency) and what needs to be
> >>>> considered. Some of the necessary parts are already there (with
> >>>> like the
> >>>> preemption fences amdkfd has as an example), but I think some clear
> >>>> docs
> >>>> on what's required from both hw, drivers and userspace would be really
> >>>> good.
> >>>
> >>> I'm currently writing that up, but probably still need a few days
> >>> for this.
> >>
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>
> >
> > I don't think that this will ever be possible.
> >
> > See that Daniel describes in his text is that indefinite fences are a
> > bad idea for memory management, and I think that this is a fixed fact.
> >
> > In other words the whole concept of submitting work to the kernel
> > which depends on some user space interaction doesn't work and never will.
>
> Well the idea here is that memory management will *never* depend on
> indefinite fences: As soon as someone waits on a memory manager fence
> (be it eviction, shrinker or mmu notifier) it breaks out of any
> dma-fence dependencies and /or user-space interaction. The text tries to
> describe what's required to be able to do that (save for non-preemptible
> gpus where someone submits a forever-running shader).

Yeah I think that part of your text is good to describe how to
untangle memory fences from synchronization fences given how much the
hw can do.

> So while I think this is possible (until someone comes up with a case
> where it wouldn't work of course), I guess Daniel has a point in that it
> won't happen because of inertia and there might be better options.

Yeah it's just I don't see much chance for splitting dma-fence itself.
That's also why I'm not positive on the "no hw preemption, only
scheduler" case: You still have a dma_fence for the batch itself,
which means still no userspace controlled synchronization or other
form of indefinite batches allowed. So not getting us any closer to
enabling the compute use cases people want. So minimally I think hw
needs to be able to preempt, and preempt fairly quickly (i.e. within
shaders if you have long running shaders as your use-case), or support
gpu page faults. And depending on how it all works, different parts of the
driver code end up in dma fence critical sections, with different
restrictions.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  9:50                 ` Daniel Vetter
@ 2020-07-21 10:47                   ` Thomas Hellström (Intel)
  2020-07-21 13:59                     ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21 10:47 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala


On 7/21/20 11:50 AM, Daniel Vetter wrote:
> On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 7/21/20 10:55 AM, Christian König wrote:
>>> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>> write this down once and for all.
>>>>>>>>
>>>>>>>> What I'm not sure about is whether the text should be more
>>>>>>>> explicit in
>>>>>>>> flat out mandating the amdkfd eviction fences for long running
>>>>>>>> compute
>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>> untangle
>>>>>>> kernel-introduced fences for resource management and dma-fences
>>>>>>> used for
>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>> for the
>>>>>>> dma-fences, including prohibiting infinite ones, I think this
>>>>>>> makes sense
>>>>>>> describing the current state.
>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>> considered. Some of the necessary parts are already there (with
>>>>>> like the
>>>>>> preemption fences amdkfd has as an example), but I think some clear
>>>>>> docs
>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>> good.
>>>>> I'm currently writing that up, but probably still need a few days
>>>>> for this.
>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>> building on eviction fences for various hardware complexity levels here:
>>>>
>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>
>>> I don't think that this will ever be possible.
>>>
>>> See that Daniel describes in his text is that indefinite fences are a
>>> bad idea for memory management, and I think that this is a fixed fact.
>>>
>>> In other words the whole concept of submitting work to the kernel
>>> which depends on some user space interaction doesn't work and never will.
>> Well the idea here is that memory management will *never* depend on
>> indefinite fences: As soon as someone waits on a memory manager fence
>> (be it eviction, shrinker or mmu notifier) it breaks out of any
>> dma-fence dependencies and /or user-space interaction. The text tries to
>> describe what's required to be able to do that (save for non-preemptible
>> gpus where someone submits a forever-running shader).
> Yeah I think that part of your text is good to describe how to
> untangle memory fences from synchronization fences given how much the
> hw can do.
>
>> So while I think this is possible (until someone comes up with a case
>> where it wouldn't work of course), I guess Daniel has a point in that it
>> won't happen because of inertia and there might be better options.
> Yeah it's just I don't see much chance for splitting dma-fence itself.
> That's also why I'm not positive on the "no hw preemption, only
> scheduler" case: You still have a dma_fence for the batch itself,
> which means still no userspace controlled synchronization or other
> form of indefinite batches allowed. So not getting us any closer to
> enabling the compute use cases people want.

Yes, we can't do magic. As soon as an indefinite batch makes it to such 
hardware we've lost. But since we can break out while the batch is stuck 
in the scheduler waiting, what I believe we *can* do with this approach 
is to avoid deadlocks due to locally unknown dependencies, which has 
some bearing on this documentation patch, and also to allow memory 
allocation in dma-fence (not memory-fence) critical sections, like gpu 
fault- and error handlers without resorting to using memory pools.

But again, I'm not saying we should actually implement this. Better to
consider it and reject it than not consider it at all.

/Thomas



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21 10:47                   ` Thomas Hellström (Intel)
@ 2020-07-21 13:59                     ` Christian König
  2020-07-21 17:46                       ` Thomas Hellström (Intel)
  2020-07-21 21:42                       ` Dave Airlie
  0 siblings, 2 replies; 83+ messages in thread
From: Christian König @ 2020-07-21 13:59 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Daniel Vetter
  Cc: DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala

Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
>
> On 7/21/20 11:50 AM, Daniel Vetter wrote:
>> On Tue, Jul 21, 2020 at 11:38 AM Thomas Hellström (Intel)
>> <thomas_os@shipmail.org> wrote:
>>>
>>> On 7/21/20 10:55 AM, Christian König wrote:
>>>> Am 21.07.20 um 10:47 schrieb Thomas Hellström (Intel):
>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>> write this down once and for all.
>>>>>>>>>
>>>>>>>>> What I'm not sure about is whether the text should be more
>>>>>>>>> explicit in
>>>>>>>>> flat out mandating the amdkfd eviction fences for long running
>>>>>>>>> compute
>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>> untangle
>>>>>>>> kernel-introduced fences for resource management and dma-fences
>>>>>>>> used for
>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>> for the
>>>>>>>> dma-fences, including prohibiting infinite ones, I think this
>>>>>>>> makes sense
>>>>>>>> describing the current state.
>>>>>>> Yeah I think a future patch needs to type up how we want to make 
>>>>>>> that
>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>> considered. Some of the necessary parts are already there (with
>>>>>>> like the
>>>>>>> preemption fences amdkfd has as an example), but I think some clear
>>>>>>> docs
>>>>>>> on what's required from both hw, drivers and userspace would be 
>>>>>>> really
>>>>>>> good.
>>>>>> I'm currently writing that up, but probably still need a few days
>>>>>> for this.
>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>> building on eviction fences for various hardware complexity levels 
>>>>> here:
>>>>>
>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>>
>>>>>
>>>> I don't think that this will ever be possible.
>>>>
>>>> See that Daniel describes in his text is that indefinite fences are a
>>>> bad idea for memory management, and I think that this is a fixed fact.
>>>>
>>>> In other words the whole concept of submitting work to the kernel
>>>> which depends on some user space interaction doesn't work and never 
>>>> will.
>>> Well the idea here is that memory management will *never* depend on
>>> indefinite fences: As soon as someone waits on a memory manager fence
>>> (be it eviction, shrinker or mmu notifier) it breaks out of any
>>> dma-fence dependencies and /or user-space interaction. The text 
>>> tries to
>>> describe what's required to be able to do that (save for 
>>> non-preemptible
>>> gpus where someone submits a forever-running shader).
>> Yeah I think that part of your text is good to describe how to
>> untangle memory fences from synchronization fences given how much the
>> hw can do.
>>
>>> So while I think this is possible (until someone comes up with a case
>>> where it wouldn't work of course), I guess Daniel has a point in 
>>> that it
>>> won't happen because of inertia and there might be better options.
>> Yeah it's just I don't see much chance for splitting dma-fence itself.

Well that's the whole idea with the timeline semaphores and waiting for 
a signal number to appear.

E.g. instead of doing the wait with the dma_fence we are separating that 
out into the timeline semaphore object.

This not only avoids the indefinite fence problem for the wait before
signal case in Vulkan, but also prevents userspace from submitting stuff
which can't be processed immediately.
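
As a rough illustration of the idea above, here is a userspace-only model
in C. It is a sketch under assumptions, not the actual kernel or Vulkan
driver code: every name in it (model_timeline, timeline_add_signaller,
try_submit, ...) is hypothetical. The assumption is that a waiter is held
back until a signaller for the point it waits on has actually been queued,
so nothing that could wait forever is handed to the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a timeline semaphore: tracks the highest point
 * for which a signalling operation has already been queued. */
struct model_timeline {
	uint64_t materialized;
};

/* Record that a signaller for @point has been queued. Points only ever
 * move forward on a timeline. */
static void timeline_add_signaller(struct model_timeline *tl, uint64_t point)
{
	assert(point >= tl->materialized);
	tl->materialized = point;
}

/* A point has "appeared" once a signaller for it has been queued. */
static bool timeline_point_appeared(const struct model_timeline *tl,
				    uint64_t point)
{
	return tl->materialized >= point;
}

/* Hypothetical job that userspace wants to submit. */
struct pending_job {
	const struct model_timeline *dep;
	uint64_t wait_point;
	bool submitted;
};

/* Submit the job only once its dependency has appeared. Until then it
 * stays in userspace and the kernel never sees it.
 * Returns true if the job has been submitted. */
static bool try_submit(struct pending_job *job)
{
	if (!job->submitted &&
	    timeline_point_appeared(job->dep, job->wait_point))
		job->submitted = true;	/* stand-in for the real submission */
	return job->submitted;
}
```

In this model a job waiting on point 2 stays unsubmitted while only point 1
has a signaller, and goes through as soon as a signaller for point 2 is
queued.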

>> That's also why I'm not positive on the "no hw preemption, only
>> scheduler" case: You still have a dma_fence for the batch itself,
>> which means still no userspace controlled synchronization or other
>> form of indefinite batches allowed. So not getting us any closer to
>> enabling the compute use cases people want.

What compute use case are you talking about? I'm only aware of the
wait before signal case from Vulkan, the page fault case and the KFD
preemption fence case.

>
> Yes, we can't do magic. As soon as an indefinite batch makes it to 
> such hardware we've lost. But since we can break out while the batch 
> is stuck in the scheduler waiting, what I believe we *can* do with 
> this approach is to avoid deadlocks due to locally unknown 
> dependencies, which has some bearing on this documentation patch, and 
> also to allow memory allocation in dma-fence (not memory-fence) 
> critical sections, like gpu fault- and error handlers without 
> resorting to using memory pools.

Avoiding deadlocks is only the tip of the iceberg here.

When you allow the kernel to depend on user space to proceed with some 
operation there are a lot more things which need consideration.

E.g. what happens when a userspace process which has submitted stuff to
the kernel is killed? Are the prepared commands sent to the hardware or
aborted as well? What do we do with other processes waiting for that stuff?

How do we do resource accounting? When processes need to block while
submitting stuff to the hardware which is not ready, we have a process we
can punish for blocking resources. But how is kernel memory used for a
submission accounted? How do we avoid denial of service attacks here where
somebody eats up all memory by doing submissions which can't finish?

> But again. I'm not saying we should actually implement this. Better to 
> consider it and reject it than not consider it at all.

Agreed.

It's the same thing as it turned out with Wait before Signal for Vulkan:
initially it looked simpler to do it in the kernel. But as far as I know
the solution in userspace now works so well that we don't really want
the pain of a kernel implementation any more.

Christian.

>
> /Thomas
>
>


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21 13:59                     ` Christian König
@ 2020-07-21 17:46                       ` Thomas Hellström (Intel)
  2020-07-21 18:18                         ` Daniel Vetter
  2020-07-21 21:42                       ` Dave Airlie
  1 sibling, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-21 17:46 UTC (permalink / raw)
  To: Christian König, Daniel Vetter
  Cc: DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala


On 2020-07-21 15:59, Christian König wrote:
> Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
...
>> Yes, we can't do magic. As soon as an indefinite batch makes it to 
>> such hardware we've lost. But since we can break out while the batch 
>> is stuck in the scheduler waiting, what I believe we *can* do with 
>> this approach is to avoid deadlocks due to locally unknown 
>> dependencies, which has some bearing on this documentation patch, and 
>> also to allow memory allocation in dma-fence (not memory-fence) 
>> critical sections, like gpu fault- and error handlers without 
>> resorting to using memory pools.
>
> Avoiding deadlocks is only the tip of the iceberg here.
>
> When you allow the kernel to depend on user space to proceed with some 
> operation there are a lot more things which need consideration.
>
> E.g. what happens when a userspace process which has submitted stuff
> to the kernel is killed? Are the prepared commands sent to the
> hardware or aborted as well? What do we do with other processes
> waiting for that stuff?
>
> How do we do resource accounting? When processes need to block while
> submitting stuff to the hardware which is not ready, we have a process
> we can punish for blocking resources. But how is kernel memory used
> for a submission accounted? How do we avoid denial of service attacks
> here where somebody eats up all memory by doing submissions which can't
> finish?
>
Hmm. Are these problems really unique to user-space controlled 
dependencies? Couldn't you hit the same or similar problems with 
mis-behaving shaders blocking timeline progress?

/Thomas




^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21 17:46                       ` Thomas Hellström (Intel)
@ 2020-07-21 18:18                         ` Daniel Vetter
  0 siblings, 0 replies; 83+ messages in thread
From: Daniel Vetter @ 2020-07-21 18:18 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, DRI Development, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	Daniel Vetter, Jason Ekstrand, Jesse Natalie, Felix Kuehling,
	Thomas Hellstrom, open list:DMA BUFFER SHARING FRAMEWORK,
	Mika Kuoppala

On Tue, Jul 21, 2020 at 7:46 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-21 15:59, Christian König wrote:
> > Am 21.07.20 um 12:47 schrieb Thomas Hellström (Intel):
> ...
> >> Yes, we can't do magic. As soon as an indefinite batch makes it to
> >> such hardware we've lost. But since we can break out while the batch
> >> is stuck in the scheduler waiting, what I believe we *can* do with
> >> this approach is to avoid deadlocks due to locally unknown
> >> dependencies, which has some bearing on this documentation patch, and
> >> also to allow memory allocation in dma-fence (not memory-fence)
> >> critical sections, like gpu fault- and error handlers without
> >> resorting to using memory pools.
> >
> > Avoiding deadlocks is only the tip of the iceberg here.
> >
> > When you allow the kernel to depend on user space to proceed with some
> > operation there are a lot more things which need consideration.
> >
> > E.g. what happens when a userspace process which has submitted stuff
> > to the kernel is killed? Are the prepared commands sent to the
> > hardware or aborted as well? What do we do with other processes
> > waiting for that stuff?
> >
> > How do we do resource accounting? When processes need to block while
> > submitting stuff to the hardware which is not ready, we have a process
> > we can punish for blocking resources. But how is kernel memory used
> > for a submission accounted? How do we avoid denial of service attacks
> > here where somebody eats up all memory by doing submissions which can't
> > finish?
> >
> Hmm. Are these problems really unique to user-space controlled
> dependencies? Couldn't you hit the same or similar problems with
> mis-behaving shaders blocking timeline progress?

We just kill them, which we can because stuff needs to complete in a
timely fashion, and without any further intervention - all
prerequisite dependencies must be and are known by the kernel.

But the long/endless running compute stuff with userspace sync
points and everything free-wheeling, including stuff like "hey I'll
submit this batch but the memory isn't even all allocated yet, so I'm
just going to hang it on this semaphore until that's done", is entirely
different. There, just shooting the batch kills the programming model,
and arbitrarily holding up a batch for another one to first get its
memory also breaks it, because userspace might have issued them with
dependencies in the other order.

So with that execution model you don't run batches, but just an entire
context. Up to userspace what it does with that, and like with cpu
threads just running a busy loop doing nothing is a perfectly legit
workload (from the kernel's pov at least). Nothing in the kernel ever
waits on such a context to do anything; if the kernel needs something
you just preempt (or if it's memory and you have gpu page fault
handling, rip out the page). Accounting is all done on a specific gpu
context too. And probably we need a somewhat consistent approach on
how we handle these gpu context things (definitely needed for cgroups
and all that).
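
The contrast between the two execution models above can be sketched as a toy userspace model. This is only an illustration under stated assumptions: the class names `Batch` and `Context`, the `timeout_ms` limit and the accounting field are hypothetical and do not correspond to any kernel API.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    """dma-fence style work item: the kernel may wait on its fence."""
    runtime_ms: int            # how long the (hypothetical) workload would run
    timeout_ms: int = 10_000   # hypothetical hang-check limit

    def run(self) -> str:
        # Because other parts of the kernel may wait on this batch, it has to
        # finish in bounded time. A batch that overruns is simply killed.
        return "completed" if self.runtime_ms <= self.timeout_ms else "killed"


@dataclass
class Context:
    """Free-running compute context: nothing in the kernel waits on it."""
    resident: bool = True
    gpu_time_ms: int = 0       # accounting is kept per context

    def run_for(self, ms: int) -> None:
        # A busy loop is a legitimate workload here; it only costs GPU time.
        if self.resident:
            self.gpu_time_ms += ms

    def preempt(self) -> None:
        # The kernel never waits for the workload to finish. When it needs
        # the hardware (or the memory), it takes it away.
        self.resident = False


if __name__ == "__main__":
    print(Batch(runtime_ms=5).run())
    print(Batch(runtime_ms=60_000).run())
    ctx = Context()
    ctx.run_for(60_000)
    ctx.preempt()
    ctx.run_for(60_000)  # no effect while preempted
    print(ctx.gpu_time_ms)
```

The point of the sketch is that only `Batch` needs a kill path, since only there can something else be stuck waiting for completion.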
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21 13:59                     ` Christian König
  2020-07-21 17:46                       ` Thomas Hellström (Intel)
@ 2020-07-21 21:42                       ` Dave Airlie
  1 sibling, 0 replies; 83+ messages in thread
From: Dave Airlie @ 2020-07-21 21:42 UTC (permalink / raw)
  To: Christian König
  Cc: Thomas Hellström (Intel),
	Daniel Vetter, Daniel Stone, linux-rdma,
	Intel Graphics Development, amd-gfx list,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	DRI Development, Jason Ekstrand, Jesse Natalie, Daniel Vetter,
	Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	open list:DMA BUFFER SHARING FRAMEWORK

>
> >> That's also why I'm not positive on the "no hw preemption, only
> >> scheduler" case: You still have a dma_fence for the batch itself,
> >> which means still no userspace controlled synchronization or other
> >> form of indefinite batches allowed. So not getting us any closer to
> >> enabling the compute use cases people want.
>
> What compute use case are you talking about? I'm only aware about the
> wait before signal case from Vulkan, the page fault case and the KFD
> preemption fence case.

So slight aside, but it does appear as if Intel's Level 0 API exposes
some of the same problems as Vulkan.

They have fences:
"A fence cannot be shared across processes."

They have events (userspace fences) like Vulkan but specify:
"Signaled from the host, and waited upon from within a device’s command list."

"There are no protections against events causing deadlocks, such as
circular waits scenarios.

These problems are left to the application to avoid."

https://spec.oneapi.com/level-zero/latest/core/PROG.html#synchronization-primitives

Dave.

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21  8:47           ` Thomas Hellström (Intel)
  2020-07-21  8:55             ` Christian König
@ 2020-07-21 22:45             ` Dave Airlie
  2020-07-22  6:45               ` Thomas Hellström (Intel)
  1 sibling, 1 reply; 83+ messages in thread
From: Dave Airlie @ 2020-07-21 22:45 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Christian König, Daniel Vetter, Daniel Stone, linux-rdma,
	Daniel Vetter, Intel Graphics Development, Maarten Lankhorst,
	DRI Development, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Steve Pronovost, amd-gfx mailing list, Jason Ekstrand,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom, Mika Kuoppala,
	Felix Kuehling, Linux Media Mailing List

On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 7/21/20 9:45 AM, Christian König wrote:
> > Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >> wrote:
> >>> Hi,
> >>>
> >>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>> write this down once and for all.
> >>>>
> >>>> What I'm not sure about is whether the text should be more explicit in
> >>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>> workloads or workloads where userspace fencing is allowed.
> >>> Although (in my humble opinion) it might be possible to completely
> >>> untangle
> >>> kernel-introduced fences for resource management and dma-fences used
> >>> for
> >>> completion- and dependency tracking and lift a lot of restrictions
> >>> for the
> >>> dma-fences, including prohibiting infinite ones, I think this makes
> >>> sense
> >>> describing the current state.
> >> Yeah I think a future patch needs to type up how we want to make that
> >> happen (for some cross driver consistency) and what needs to be
> >> considered. Some of the necessary parts are already there (with like the
> >> preemption fences amdkfd has as an example), but I think some clear docs
> >> on what's required from both hw, drivers and userspace would be really
> >> good.
> >
> > I'm currently writing that up, but probably still need a few days for
> > this.
>
> Great! I put down some (very) initial thoughts a couple of weeks ago
> building on eviction fences for various hardware complexity levels here:
>
> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt

We are seeing HW that has recoverable GPU page faults but only for
compute tasks, and scheduler without semaphores hw for graphics.

So a single driver may have to expose both models to userspace, which
also introduces the problem of how to interoperate between the two
models on one card.

Dave.

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-21 22:45             ` Dave Airlie
@ 2020-07-22  6:45               ` Thomas Hellström (Intel)
  2020-07-22  7:11                 ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22  6:45 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Christian König, Daniel Vetter, Daniel Stone, linux-rdma,
	Daniel Vetter, Intel Graphics Development, Maarten Lankhorst,
	DRI Development, moderated list:DMA BUFFER SHARING FRAMEWORK,
	Steve Pronovost, amd-gfx mailing list, Jason Ekstrand,
	Jesse Natalie, Daniel Vetter, Thomas Hellstrom, Mika Kuoppala,
	Felix Kuehling, Linux Media Mailing List


On 2020-07-22 00:45, Dave Airlie wrote:
> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 7/21/20 9:45 AM, Christian König wrote:
>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>> write this down once and for all.
>>>>>>
>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>> Although (in my humble opinion) it might be possible to completely
>>>>> untangle
>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>> for
>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>> for the
>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>> sense
>>>>> describing the current state.
>>>> Yeah I think a future patch needs to type up how we want to make that
>>>> happen (for some cross driver consistency) and what needs to be
>>>> considered. Some of the necessary parts are already there (with like the
>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>> on what's required from both hw, drivers and userspace would be really
>>>> good.
>>> I'm currently writing that up, but probably still need a few days for
>>> this.
>> Great! I put down some (very) initial thoughts a couple of weeks ago
>> building on eviction fences for various hardware complexity levels here:
>>
>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> We are seeing HW that has recoverable GPU page faults but only for
> compute tasks, and scheduler without semaphores hw for graphics.
>
> So a single driver may have to expose both models to userspace and
> also introduces the problem of how to interoperate between the two
> models on one card.
>
> Dave.

Hmm, yes. To begin with, it's important to note that this is not a
replacement for new programming models or APIs. This is something that
takes place internally in drivers to mitigate many of the restrictions
that are currently imposed on dma-fence and documented in this and
previous series. It's basically the driver-private narrow completions
Jason suggested in the lockdep patches discussions, implemented the same
way as eviction-fences.

The memory fence API would be local to helpers and middle-layers like 
TTM, and the corresponding drivers.  The only cross-driver-like 
visibility would be that the dma-buf move_notify() callback would not be 
allowed to wait on dma-fences or something that depends on a dma-fence.

So with that in mind, I don't foresee engines with different 
capabilities on the same card being a problem.

/Thomas



* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22  6:45               ` Thomas Hellström (Intel)
@ 2020-07-22  7:11                 ` Daniel Vetter
  2020-07-22  8:05                   ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-22  7:11 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List

On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 00:45, Dave Airlie wrote:
> > On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 7/21/20 9:45 AM, Christian König wrote:
> >>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>> write this down once and for all.
> >>>>>>
> >>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>> Although (in my humble opinion) it might be possible to completely
> >>>>> untangle
> >>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>> for
> >>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>> for the
> >>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>> sense
> >>>>> describing the current state.
> >>>> Yeah I think a future patch needs to type up how we want to make that
> >>>> happen (for some cross driver consistency) and what needs to be
> >>>> considered. Some of the necessary parts are already there (with like the
> >>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>> on what's required from both hw, drivers and userspace would be really
> >>>> good.
> >>> I'm currently writing that up, but probably still need a few days for
> >>> this.
> >> Great! I put down some (very) initial thoughts a couple of weeks ago
> >> building on eviction fences for various hardware complexity levels here:
> >>
> >> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> > We are seeing HW that has recoverable GPU page faults but only for
> > compute tasks, and scheduler without semaphores hw for graphics.
> >
> > So a single driver may have to expose both models to userspace and
> > also introduces the problem of how to interoperate between the two
> > models on one card.
> >
> > Dave.
>
> Hmm, yes to begin with it's important to note that this is not a
> replacement for new programming models or APIs. This is something that
> takes place internally in drivers to mitigate many of the restrictions
> that are currently imposed on dma-fence and documented in this and
> previous series. It's basically the driver-private narrow completions
> Jason suggested in the lockdep patches discussions implemented the same
> way as eviction-fences.
>
> The memory fence API would be local to helpers and middle-layers like
> TTM, and the corresponding drivers.  The only cross-driver-like
> visibility would be that the dma-buf move_notify() callback would not be
> allowed to wait on dma-fences or something that depends on a dma-fence.

Because we can't preempt (on some engines at least) we already have
the requirement that cross driver buffer management can get stuck on a
dma-fence. Not even taking into account the horrors we do with
userptr, which are cross driver no matter what. Limiting move_notify
to memory fences only doesn't work, since the pte clearing might need
to wait for a dma_fence first. Hence this becomes a full end-of-batch
fence, not just a limited kernel-internal memory fence.

That's kinda why I think the only reasonable option is to toss in the
towel and declare dma-fence to be the memory fence (and suck up all
the consequences of that decision as uapi, which is kinda where we
are), and construct something new & entirely free-wheeling for userspace
fencing. But only for engines that allow enough preempt/gpu page
faulting to make that possible. Free wheeling userspace fences/gpu
semaphores or whatever you want to call them (on windows I think it's
monitored fence) only work if you can preempt to decouple the memory
fences from your gpu command execution.

There's the in-between step of just decoupling the batchbuffer
submission prep for hw without any preempt (but a scheduler), but that
seems kinda pointless. Modern execbuf should be O(1) fastpath, with
all the allocation/mapping work pulled out ahead. vk exposes that
model directly to clients, GL drivers could use it internally too, so
I see zero value in spending lots of time engineering very tricky
kernel code just for old userspace. Much more reasonable to do that in
userspace, where we have real debuggers and no panics about security
bugs (or well, a lot less, webgl is still a thing, but at least
browsers realized you need to contain that completely).

Cheers, Daniel

> So with that in mind, I don't foresee engines with different
> capabilities on the same card being a problem.
>
> /Thomas
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22  7:11                 ` Daniel Vetter
@ 2020-07-22  8:05                   ` Thomas Hellström (Intel)
  2020-07-22  9:45                     ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22  8:05 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List


On 2020-07-22 09:11, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 00:45, Dave Airlie wrote:
>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>> write this down once and for all.
>>>>>>>>
>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>> untangle
>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>> for
>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>> for the
>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>> sense
>>>>>>> describing the current state.
>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>> good.
>>>>> I'm currently writing that up, but probably still need a few days for
>>>>> this.
>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>> building on eviction fences for various hardware complexity levels here:
>>>>
>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>> We are seeing HW that has recoverable GPU page faults but only for
>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>
>>> So a single driver may have to expose both models to userspace and
>>> also introduces the problem of how to interoperate between the two
>>> models on one card.
>>>
>>> Dave.
>> Hmm, yes to begin with it's important to note that this is not a
>> replacement for new programming models or APIs. This is something that
>> takes place internally in drivers to mitigate many of the restrictions
>> that are currently imposed on dma-fence and documented in this and
>> previous series. It's basically the driver-private narrow completions
>> Jason suggested in the lockdep patches discussions implemented the same
>> way as eviction-fences.
>>
>> The memory fence API would be local to helpers and middle-layers like
>> TTM, and the corresponding drivers.  The only cross-driver-like
>> visibility would be that the dma-buf move_notify() callback would not be
>> allowed to wait on dma-fences or something that depends on a dma-fence.
> Because we can't preempt (on some engines at least) we already have
> the requirement that cross driver buffer management can get stuck on a
> dma-fence. Not even taking into account the horrors we do with
> userptr, which are cross driver no matter what. Limiting move_notify
> to memory fences only doesn't work, since the pte clearing might need
> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> fence, not just a limited kernel-internal memory fence.

For non-preemptible hardware the memory fence typically *is* the 
end-of-batch fence. (Unless, as documented, there is a scheduler 
consuming sync-file dependencies in which case the memory fence wait 
needs to be able to break out of that). The key thing is not that we can 
break out of execution, but that we can break out of dependencies, since 
when we're executing, all dependencies (modulo semaphores) are already
fulfilled. That's what's eliminating the deadlocks.
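
The "break out of dependencies, not out of execution" idea above can be sketched with a small state machine. This is a minimal userspace model, assuming non-preemptible hardware behind a scheduler; `Job`, `State` and `memory_fence_wait` are hypothetical names for illustration, not an existing kernel interface.

```python
from enum import Enum, auto


class State(Enum):
    QUEUED = auto()     # in the scheduler, still waiting for dependencies
    RUNNING = auto()    # on the hardware, cannot be preempted
    DONE = auto()       # end-of-batch fence signalled
    UNQUEUED = auto()   # pulled back out of the scheduler


class Job:
    def __init__(self, deps):
        self.deps = set(deps)   # e.g. sync-file dependencies
        self.state = State.QUEUED

    def dep_signalled(self, dep):
        self.deps.discard(dep)
        # Only once every dependency is fulfilled does the job reach hardware.
        if self.state is State.QUEUED and not self.deps:
            self.state = State.RUNNING

    def finish(self):
        if self.state is State.RUNNING:
            self.state = State.DONE


def memory_fence_wait(job):
    """Return True if the job's memory can be taken right now.

    A queued job may still depend on things nobody locally knows about, so
    instead of waiting for it we pull it out of the scheduler. A running job
    has no unfulfilled dependencies left, so the caller only has to wait for
    the end of the batch (False means "not yet, wait for end-of-batch").
    """
    if job.state is State.QUEUED:
        job.state = State.UNQUEUED
        return True
    return job.state in (State.DONE, State.UNQUEUED)


if __name__ == "__main__":
    stuck = Job(deps={"sync_file_1"})
    print(memory_fence_wait(stuck), stuck.state)
```

In this model the wait never blocks on an unfulfilled dependency: it either breaks the job out of the queue or waits on a batch that is already executing.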

>
> That's kinda why I think only reasonable option is to toss in the
> towel and declare dma-fence to be the memory fence (and suck up all
> the consequences of that decision as uapi, which is kinda where we
> are), and construct something new&entirely free-wheeling for userspace
> fencing. But only for engines that allow enough preempt/gpu page
> faulting to make that possible. Free wheeling userspace fences/gpu
> semaphores or whatever you want to call them (on windows I think it's
> monitored fence) only work if you can preempt to decouple the memory
> fences from your gpu command execution.
>
> There's the in-between step of just decoupling the batchbuffer
> submission prep for hw without any preempt (but a scheduler), but that
> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> all the allocation/mapping work pulled out ahead. vk exposes that
> model directly to clients, GL drivers could use it internally too, so
> I see zero value in spending lots of time engineering very tricky
> kernel code just for old userspace. Much more reasonable to do that in
> userspace, where we have real debuggers and no panics about security
> bugs (or well, a lot less, webgl is still a thing, but at least
> browsers realized you need to container that completely).

Sure, it's definitely a big chunk of work. I think the big win would be 
allowing memory allocation in dma-fence critical sections. But I 
completely buy the above argument. I just wanted to point out that many 
of the dma-fence restrictions are IMHO fixable, should we need to do 
that for whatever reason.

/Thomas


>
> Cheers, Daniel
>
>> So with that in mind, I don't foresee engines with different
>> capabilities on the same card being a problem.
>>
>> /Thomas
>>
>>
>

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22  8:05                   ` Thomas Hellström (Intel)
@ 2020-07-22  9:45                     ` Daniel Vetter
  2020-07-22 10:31                       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-22  9:45 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List

On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 09:11, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 00:45, Dave Airlie wrote:
> >>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>> write this down once and for all.
> >>>>>>>>
> >>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>> untangle
> >>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>> for
> >>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>> for the
> >>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>> sense
> >>>>>>> describing the current state.
> >>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>> good.
> >>>>> I'm currently writing that up, but probably still need a few days for
> >>>>> this.
> >>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>> building on eviction fences for various hardware complexity levels here:
> >>>>
> >>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>> We are seeing HW that has recoverable GPU page faults but only for
> >>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>
> >>> So a single driver may have to expose both models to userspace and
> >>> also introduces the problem of how to interoperate between the two
> >>> models on one card.
> >>>
> >>> Dave.
> >> Hmm, yes to begin with it's important to note that this is not a
> >> replacement for new programming models or APIs. This is something that
> >> takes place internally in drivers to mitigate many of the restrictions
> >> that are currently imposed on dma-fence and documented in this and
> >> previous series. It's basically the driver-private narrow completions
> >> Jason suggested in the lockdep patches discussions implemented the same
> >> way as eviction-fences.
> >>
> >> The memory fence API would be local to helpers and middle-layers like
> >> TTM, and the corresponding drivers.  The only cross-driver-like
> >> visibility would be that the dma-buf move_notify() callback would not be
> >> allowed to wait on dma-fences or something that depends on a dma-fence.
> > Because we can't preempt (on some engines at least) we already have
> > the requirement that cross driver buffer management can get stuck on a
> > dma-fence. Not even taking into account the horrors we do with
> > userptr, which are cross driver no matter what. Limiting move_notify
> > to memory fences only doesn't work, since the pte clearing might need
> > to wait for a dma_fence first. Hence this becomes a full end-of-batch
> > fence, not just a limited kernel-internal memory fence.
>
> For non-preemptible hardware the memory fence typically *is* the
> end-of-batch fence. (Unless, as documented, there is a scheduler
> consuming sync-file dependencies in which case the memory fence wait
> needs to be able to break out of that). The key thing is not that we can
> break out of execution, but that we can break out of dependencies, since
> when we're executing, all dependencies (modulo semaphores) are already
> fulfilled. That's what's eliminating the deadlocks.
>
> > That's kinda why I think only reasonable option is to toss in the
> > towel and declare dma-fence to be the memory fence (and suck up all
> > the consequences of that decision as uapi, which is kinda where we
> > are), and construct something new&entirely free-wheeling for userspace
> > fencing. But only for engines that allow enough preempt/gpu page
> > faulting to make that possible. Free wheeling userspace fences/gpu
> > semaphores or whatever you want to call them (on windows I think it's
> > monitored fence) only work if you can preempt to decouple the memory
> > fences from your gpu command execution.
> >
> > There's the in-between step of just decoupling the batchbuffer
> > submission prep for hw without any preempt (but a scheduler), but that
> > seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> > all the allocation/mapping work pulled out ahead. vk exposes that
> > model directly to clients, GL drivers could use it internally too, so
> > I see zero value in spending lots of time engineering very tricky
> > kernel code just for old userspace. Much more reasonable to do that in
> > userspace, where we have real debuggers and no panics about security
> > bugs (or well, a lot less, webgl is still a thing, but at least
> > browsers realized you need to container that completely).
>
> Sure, it's definitely a big chunk of work. I think the big win would be
> allowing memory allocation in dma-fence critical sections. But I
> completely buy the above argument. I just wanted to point out that many
> of the dma-fence restrictions are IMHO fixable, should we need to do
> that for whatever reason.

I'm still not sure that's possible, without preemption at least. We
have 4 edges:
- The kernel has internal dependencies among memory fences. We want that
  to allow (mild) amounts of overcommit, since that simplifies life so
  much.
- Memory fences can block gpu ctx execution (by nature of the memory
  simply not being there yet due to our overcommit).
- gpu ctx have (if we allow this) userspace controlled semaphore
  dependencies. Of course userspace is expected to not create deadlocks,
  but that's only assuming the kernel doesn't inject additional
  dependencies. Compute folks really want that.
- gpu ctx can hold up memory allocations if all we have is
  end-of-batch fences. And end-of-batch fences are all we have without
  preempt, plus if we want backwards compat with the entire current
  winsys/compositor ecosystem we need them, which allows us to inject
  stuff dependent upon them pretty much anywhere.

Fundamentally that's not fixable without throwing one of the edges
(and the corresponding feature that enables) out, since no entity has
full visibility into what's going on. E.g. forcing userspace to tell
the kernel about all semaphores just brings us back to the
drm_timeline_syncobj design we have merged right now. And that's imo
no better.
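
The four edges can be put into a small "waits on" graph to show why one of them has to go. This is a toy sketch: the node names and the specific choice of which fence belongs to which context are assumptions made up for the example, picked so that each edge type from the list appears exactly once.

```python
from typing import Dict, List, Optional, Tuple

# Each pair reads "first waits on second". All names are hypothetical.
EDGES: List[Tuple[str, str]] = [
    ("mem_fence_B", "mem_fence_A"),  # edge 1: kernel-internal memory fence dependency
    ("ctx_1", "mem_fence_B"),        # edge 2: ctx_1 blocked until its memory is there
    ("ctx_2", "ctx_1"),              # edge 3: userspace semaphore signalled by ctx_1
    ("mem_fence_A", "ctx_2"),        # edge 4: mem_fence_A is ctx_2's end-of-batch fence
]


def find_cycle(edges) -> Optional[List[str]]:
    """Depth-first search; returns one dependency cycle or None."""
    graph: Dict[str, List[str]] = {}
    for waiter, target in edges:
        graph.setdefault(waiter, []).append(target)
        graph.setdefault(target, [])
    state: Dict[str, int] = {}   # absent = unvisited, 1 = on stack, 2 = done
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = 1
        stack.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == 1:
                # Back edge: everything from nxt to the top of the stack loops.
                return stack[stack.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        state[node] = 2
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print("all four edges:", find_cycle(EDGES))
    for dropped in EDGES:
        remaining = [e for e in EDGES if e != dropped]
        print("without", dropped, "->", find_cycle(remaining))
```

With all four edges present the graph contains a cycle that no single party (kernel or userspace) can see in full; dropping any one edge makes it acyclic in this example.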

That's kinda why I'm not seeing much benefits in a half-way state:
Tons of work, and still not what userspace wants. And for the full
deal that userspace wants we might as well not change anything with
dma-fences. For that we need a) ctx preempt and b) new, entirely
decoupled fences that never feed back into memory fences and c) are
controlled entirely by userspace. And c) is the really important thing
people want us to provide.

And once we're ok with dma_fence == memory fences, then enforcing the
strict and painful memory allocation limitations is actually what we
want.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22  9:45                     ` Daniel Vetter
@ 2020-07-22 10:31                       ` Thomas Hellström (Intel)
  2020-07-22 11:39                         ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 10:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List


On 2020-07-22 11:45, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 09:11, Daniel Vetter wrote:
>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 00:45, Dave Airlie wrote:
>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>>>> <thomas_os@shipmail.org> wrote:
>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>>> write this down once and for all.
>>>>>>>>>>
>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>>> untangle
>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>>>> for
>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>>> for the
>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>>>> sense
>>>>>>>>> describing the current state.
>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>>>> good.
>>>>>>> I'm currently writing that up, but probably still need a few days for
>>>>>>> this.
>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>>> building on eviction fences for various hardware complexity levels here:
>>>>>>
>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>> We are seeing HW that has recoverable GPU page faults but only for
>>>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>>>
>>>>> So a single driver may have to expose both models to userspace and
>>>>> also introduces the problem of how to interoperate between the two
>>>>> models on one card.
>>>>>
>>>>> Dave.
>>>> Hmm, yes to begin with it's important to note that this is not a
>>>> replacement for new programming models or APIs, This is something that
>>>> takes place internally in drivers to mitigate many of the restrictions
>>>> that are currently imposed on dma-fence and documented in this and
>>>> previous series. It's basically the driver-private narrow completions
>>>> Jason suggested in the lockdep patches discussions implemented the same
>>>> way as eviction-fences.
>>>>
>>>> The memory fence API would be local to helpers and middle-layers like
>>>> TTM, and the corresponding drivers.  The only cross-driver-like
>>>> visibility would be that the dma-buf move_notify() callback would not be
>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
>>> Because we can't preempt (on some engines at least) we already have
>>> the requirement that cross driver buffer management can get stuck on a
>>> dma-fence. Not even taking into account the horrors we do with
>>> userptr, which are cross driver no matter what. Limiting move_notify
>>> to memory fences only doesn't work, since the pte clearing might need
>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
>>> fence, not just a limited kernel-internal memory fence.
>> For non-preemptible hardware the memory fence typically *is* the
>> end-of-batch fence. (Unless, as documented, there is a scheduler
>> consuming sync-file dependencies in which case the memory fence wait
>> needs to be able to break out of that). The key thing is not that we can
>> break out of execution, but that we can break out of dependencies, since
>> when we're executing all dependencies (modulo semaphores) are already
>> fulfilled. That's what's eliminating the deadlocks.
>>
>>> That's kinda why I think only reasonable option is to toss in the
>>> towel and declare dma-fence to be the memory fence (and suck up all
>>> the consequences of that decision as uapi, which is kinda where we
>>> are), and construct something new&entirely free-wheeling for userspace
>>> fencing. But only for engines that allow enough preempt/gpu page
>>> faulting to make that possible. Free wheeling userspace fences/gpu
>>> semaphores or whatever you want to call them (on windows I think it's
>>> monitored fence) only work if you can preempt to decouple the memory
>>> fences from your gpu command execution.
>>>
>>> There's the in-between step of just decoupling the batchbuffer
>>> submission prep for hw without any preempt (but a scheduler), but that
>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
>>> all the allocation/mapping work pulled out ahead. vk exposes that
>>> model directly to clients, GL drivers could use it internally too, so
>>> I see zero value in spending lots of time engineering very tricky
>>> kernel code just for old userspace. Much more reasonable to do that in
>>> userspace, where we have real debuggers and no panics about security
>>> bugs (or well, a lot less, webgl is still a thing, but at least
>>> browsers realized you need to container that completely).
>> Sure, it's definitely a big chunk of work. I think the big win would be
>> allowing memory allocation in dma-fence critical sections. But I
>> completely buy the above argument. I just wanted to point out that many
>> of the dma-fence restrictions are IMHO fixable, should we need to do
>> that for whatever reason.
> I'm still not sure that's possible, without preemption at least. We
> have 4 edges:
> - Kernel has internal dependencies among memory fences. We want that to
> allow (mild) amounts of overcommit, since that simplifies life so
> much.
> - Memory fences can block gpu ctx execution (by nature of the memory
> simply not being there yet due to our overcommit)
> - gpu ctx have (if we allow this) userspace controlled semaphore
> dependencies. Of course userspace is expected to not create deadlocks,
> but that's only assuming the kernel doesn't inject additional
> dependencies. Compute folks really want that.
> - gpu ctx can hold up memory allocations if all we have is
> end-of-batch fences. And end-of-batch fences are all we have without
> preempt, plus if we want backwards compat with the entire current
> winsys/compositor ecosystem we need them, which allows us to inject
> stuff dependent upon them pretty much anywhere.
>
> Fundamentally that's not fixable without throwing one of the edges
> (and the corresponding feature that enables) out, since no entity has
> full visibility into what's going on. E.g. forcing userspace to tell
> the kernel about all semaphores just brings us back to the
> drm_timeline_syncobj design we have merged right now. And that's imo
> no better.

Indeed, HW waiting for semaphores without being able to preempt that 
wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.

>
> That's kinda why I'm not seeing much benefit in a half-way state:
> Tons of work, and still not what userspace wants. And for the full
> deal that userspace wants we might as well not change anything with
> dma-fences. For that we need a) ctx preempt and b) new entirely
> decoupled fences that never feed back into memory fences and c) are
> controlled entirely by userspace. And c) is the really important thing
> people want us to provide.
>
> And once we're ok with dma_fence == memory fences, then enforcing the
> strict and painful memory allocation limitations is actually what we
> want.

Let's hope you're right. My fear is that that might be pretty painful as 
well.

> Cheers, Daniel

/Thomas



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 10:31                       ` Thomas Hellström (Intel)
@ 2020-07-22 11:39                         ` Daniel Vetter
  2020-07-22 12:22                           ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-22 11:39 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List

On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 11:45, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 09:11, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 2020-07-22 00:45, Dave Airlie wrote:
> >>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>>> <thomas_os@shipmail.org> wrote:
> >>>>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>>>> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>>>> write this down once and for all.
> >>>>>>>>>>
> >>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>>>> untangle
> >>>>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>>>> for
> >>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>>>> for the
> >>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>>>> sense
> >>>>>>>>> describing the current state.
> >>>>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>>>> good.
> >>>>>>> I'm currently writing that up, but probably still need a few days for
> >>>>>>> this.
> >>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>>>> building on eviction fences for various hardware complexity levels here:
> >>>>>>
> >>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>>>> We are seeing HW that has recoverable GPU page faults but only for
> >>>>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>>>
> >>>>> So a single driver may have to expose both models to userspace and
> >>>>> also introduces the problem of how to interoperate between the two
> >>>>> models on one card.
> >>>>>
> >>>>> Dave.
> >>>> Hmm, yes to begin with it's important to note that this is not a
> >>>> replacement for new programming models or APIs, This is something that
> >>>> takes place internally in drivers to mitigate many of the restrictions
> >>>> that are currently imposed on dma-fence and documented in this and
> >>>> previous series. It's basically the driver-private narrow completions
> >>>> Jason suggested in the lockdep patches discussions implemented the same
> >>>> way as eviction-fences.
> >>>>
> >>>> The memory fence API would be local to helpers and middle-layers like
> >>>> TTM, and the corresponding drivers.  The only cross-driver-like
> >>>> visibility would be that the dma-buf move_notify() callback would not be
> >>>> allowed to wait on dma-fences or something that depends on a dma-fence.
> >>> Because we can't preempt (on some engines at least) we already have
> >>> the requirement that cross driver buffer management can get stuck on a
> >>> dma-fence. Not even taking into account the horrors we do with
> >>> userptr, which are cross driver no matter what. Limiting move_notify
> >>> to memory fences only doesn't work, since the pte clearing might need
> >>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> >>> fence, not just a limited kernel-internal memory fence.
> >> For non-preemptible hardware the memory fence typically *is* the
> >> end-of-batch fence. (Unless, as documented, there is a scheduler
> >> consuming sync-file dependencies in which case the memory fence wait
> >> needs to be able to break out of that). The key thing is not that we can
> >> break out of execution, but that we can break out of dependencies, since
> >> when we're executing all dependencies (modulo semaphores) are already
> >> fulfilled. That's what's eliminating the deadlocks.
> >>
> >>> That's kinda why I think only reasonable option is to toss in the
> >>> towel and declare dma-fence to be the memory fence (and suck up all
> >>> the consequences of that decision as uapi, which is kinda where we
> >>> are), and construct something new&entirely free-wheeling for userspace
> >>> fencing. But only for engines that allow enough preempt/gpu page
> >>> faulting to make that possible. Free wheeling userspace fences/gpu
> >>> semaphores or whatever you want to call them (on windows I think it's
> >>> monitored fence) only work if you can preempt to decouple the memory
> >>> fences from your gpu command execution.
> >>>
> >>> There's the in-between step of just decoupling the batchbuffer
> >>> submission prep for hw without any preempt (but a scheduler), but that
> >>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> >>> all the allocation/mapping work pulled out ahead. vk exposes that
> >>> model directly to clients, GL drivers could use it internally too, so
> >>> I see zero value in spending lots of time engineering very tricky
> >>> kernel code just for old userspace. Much more reasonable to do that in
> >>> userspace, where we have real debuggers and no panics about security
> >>> bugs (or well, a lot less, webgl is still a thing, but at least
> >>> browsers realized you need to container that completely).
> >> Sure, it's definitely a big chunk of work. I think the big win would be
> >> allowing memory allocation in dma-fence critical sections. But I
> >> completely buy the above argument. I just wanted to point out that many
> >> of the dma-fence restrictions are IMHO fixable, should we need to do
> >> that for whatever reason.
> > I'm still not sure that's possible, without preemption at least. We
> > have 4 edges:
> > - Kernel has internal dependencies among memory fences. We want that to
> > allow (mild) amounts of overcommit, since that simplifies life so
> > much.
> > - Memory fences can block gpu ctx execution (by nature of the memory
> > simply not being there yet due to our overcommit)
> > - gpu ctx have (if we allow this) userspace controlled semaphore
> > dependencies. Of course userspace is expected to not create deadlocks,
> > but that's only assuming the kernel doesn't inject additional
> > dependencies. Compute folks really want that.
> > - gpu ctx can hold up memory allocations if all we have is
> > end-of-batch fences. And end-of-batch fences are all we have without
> > preempt, plus if we want backwards compat with the entire current
> > winsys/compositor ecosystem we need them, which allows us to inject
> > stuff dependent upon them pretty much anywhere.
> >
> > Fundamentally that's not fixable without throwing one of the edges
> > (and the corresponding feature that enables) out, since no entity has
> > full visibility into what's going on. E.g. forcing userspace to tell
> > the kernel about all semaphores just brings us back to the
> > drm_timeline_syncobj design we have merged right now. And that's imo
> > no better.
>
> Indeed, HW waiting for semaphores without being able to preempt that
> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.

preempt is a necessary but not sufficient condition, you also must not
have end-of-batch memory fences. And i915 has semaphore support and
end-of-batch memory fences, e.g. one piece is:

commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 14:08:11 2020 +0100

    drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore

Sure it preempts, but that's not enough.

> > That's kinda why I'm not seeing much benefit in a half-way state:
> > Tons of work, and still not what userspace wants. And for the full
> > deal that userspace wants we might as well not change anything with
> > dma-fences. For that we need a) ctx preempt and b) new entirely
> > decoupled fences that never feed back into memory fences and c) are
> > controlled entirely by userspace. And c) is the really important thing
> > people want us to provide.
> >
> > And once we're ok with dma_fence == memory fences, then enforcing the
> > strict and painful memory allocation limitations is actually what we
> > want.
>
> Let's hope you're right. My fear is that that might be pretty painful as
> well.

Oh it's very painful too:
- We need a separate uapi flavour for gpu ctx with preempt instead of
end-of-batch dma-fence.
- Which needs to be implemented without breaking stuff badly - e.g. we
need to make sure we don't probe-wait on fences unnecessarily since
that forces random unwanted preempts.
- If we want this with winsys integration we need full userspace
revisions since all the dma_fence based sync sharing is out (implicit
sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
only go the other way round).

Utter pain, but I think it's better since it can be done
driver-by-driver, and even userspace usecase by usecase. Which means
we can experiment in areas where the 10+ years of uapi guarantee isn't
so painful, and learn, until we make the big jump where new
zero-interaction-with-memory-management fences become baked in forever
into compositor/winsys/modeset protocols. With the other approach of
splitting dma-fence we need to do all the splitting first, make sure
we get it right, and only then can we enable the use-case for real.

That's just not going to happen, at least not in upstream across all
drivers. Within a single driver in some vendor tree hacking stuff up
is totally fine ofc.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 11:39                         ` Daniel Vetter
@ 2020-07-22 12:22                           ` Thomas Hellström (Intel)
  2020-07-22 12:41                             ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 12:22 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List


On 2020-07-22 13:39, Daniel Vetter wrote:
> On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>>
>> On 2020-07-22 11:45, Daniel Vetter wrote:
>>> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 09:11, Daniel Vetter wrote:
>>>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
>>>>> <thomas_os@shipmail.org> wrote:
>>>>>> On 2020-07-22 00:45, Dave Airlie wrote:
>>>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
>>>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
>>>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
>>>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
>>>>>>>>>>>> write this down once and for all.
>>>>>>>>>>>>
>>>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
>>>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
>>>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
>>>>>>>>>>> Although (in my humble opinion) it might be possible to completely
>>>>>>>>>>> untangle
>>>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
>>>>>>>>>>> for
>>>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
>>>>>>>>>>> for the
>>>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
>>>>>>>>>>> sense
>>>>>>>>>>> describing the current state.
>>>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
>>>>>>>>>> happen (for some cross driver consistency) and what needs to be
>>>>>>>>>> considered. Some of the necessary parts are already there (with like the
>>>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
>>>>>>>>>> on what's required from both hw, drivers and userspace would be really
>>>>>>>>>> good.
>>>>>>>>> I'm currently writing that up, but probably still need a few days for
>>>>>>>>> this.
>>>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
>>>>>>>> building on eviction fences for various hardware complexity levels here:
>>>>>>>>
>>>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
>>>>>>> We are seeing HW that has recoverable GPU page faults but only for
>>>>>>> compute tasks, and scheduler without semaphores hw for graphics.
>>>>>>>
>>>>>>> So a single driver may have to expose both models to userspace and
>>>>>>> also introduces the problem of how to interoperate between the two
>>>>>>> models on one card.
>>>>>>>
>>>>>>> Dave.
>>>>>> Hmm, yes to begin with it's important to note that this is not a
>>>>>> replacement for new programming models or APIs, This is something that
>>>>>> takes place internally in drivers to mitigate many of the restrictions
>>>>>> that are currently imposed on dma-fence and documented in this and
>>>>>> previous series. It's basically the driver-private narrow completions
>>>>>> Jason suggested in the lockdep patches discussions implemented the same
>>>>>> way as eviction-fences.
>>>>>>
>>>>>> The memory fence API would be local to helpers and middle-layers like
>>>>>> TTM, and the corresponding drivers.  The only cross-driver-like
>>>>>> visibility would be that the dma-buf move_notify() callback would not be
>>>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
>>>>> Because we can't preempt (on some engines at least) we already have
>>>>> the requirement that cross driver buffer management can get stuck on a
>>>>> dma-fence. Not even taking into account the horrors we do with
>>>>> userptr, which are cross driver no matter what. Limiting move_notify
>>>>> to memory fences only doesn't work, since the pte clearing might need
>>>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
>>>>> fence, not just a limited kernel-internal memory fence.
>>>> For non-preemptible hardware the memory fence typically *is* the
>>>> end-of-batch fence. (Unless, as documented, there is a scheduler
>>>> consuming sync-file dependencies in which case the memory fence wait
>>>> needs to be able to break out of that). The key thing is not that we can
>>>> break out of execution, but that we can break out of dependencies, since
>>>> when we're executing all dependencies (modulo semaphores) are already
>>>> fulfilled. That's what's eliminating the deadlocks.
>>>>
>>>>> That's kinda why I think only reasonable option is to toss in the
>>>>> towel and declare dma-fence to be the memory fence (and suck up all
>>>>> the consequences of that decision as uapi, which is kinda where we
>>>>> are), and construct something new&entirely free-wheeling for userspace
>>>>> fencing. But only for engines that allow enough preempt/gpu page
>>>>> faulting to make that possible. Free wheeling userspace fences/gpu
>>>>> semaphores or whatever you want to call them (on windows I think it's
>>>>> monitored fence) only work if you can preempt to decouple the memory
>>>>> fences from your gpu command execution.
>>>>>
>>>>> There's the in-between step of just decoupling the batchbuffer
>>>>> submission prep for hw without any preempt (but a scheduler), but that
>>>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
>>>>> all the allocation/mapping work pulled out ahead. vk exposes that
>>>>> model directly to clients, GL drivers could use it internally too, so
>>>>> I see zero value in spending lots of time engineering very tricky
>>>>> kernel code just for old userspace. Much more reasonable to do that in
>>>>> userspace, where we have real debuggers and no panics about security
>>>>> bugs (or well, a lot less, webgl is still a thing, but at least
>>>>> browsers realized you need to container that completely).
>>>> Sure, it's definitely a big chunk of work. I think the big win would be
>>>> allowing memory allocation in dma-fence critical sections. But I
>>>> completely buy the above argument. I just wanted to point out that many
>>>> of the dma-fence restrictions are IMHO fixable, should we need to do
>>>> that for whatever reason.
>>> I'm still not sure that's possible, without preemption at least. We
>>> have 4 edges:
>>> - Kernel has internal dependencies among memory fences. We want that to
>>> allow (mild) amounts of overcommit, since that simplifies life so
>>> much.
>>> - Memory fences can block gpu ctx execution (by nature of the memory
>>> simply not being there yet due to our overcommit)
>>> - gpu ctx have (if we allow this) userspace controlled semaphore
>>> dependencies. Of course userspace is expected to not create deadlocks,
>>> but that's only assuming the kernel doesn't inject additional
>>> dependencies. Compute folks really want that.
>>> - gpu ctx can hold up memory allocations if all we have is
>>> end-of-batch fences. And end-of-batch fences are all we have without
>>> preempt, plus if we want backwards compat with the entire current
>>> winsys/compositor ecosystem we need them, which allows us to inject
>>> stuff dependent upon them pretty much anywhere.
>>>
>>> Fundamentally that's not fixable without throwing one of the edges
>>> (and the corresponding feature that enables) out, since no entity has
>>> full visibility into what's going on. E.g. forcing userspace to tell
>>> the kernel about all semaphores just brings us back to the
>>> drm_timeline_syncobj design we have merged right now. And that's imo
>>> no better.
>> Indeed, HW waiting for semaphores without being able to preempt that
>> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.
> preempt is a necessary but not sufficient condition, you also must not
> have end-of-batch memory fences. And i915 has semaphore support and
> end-of-batch memory fences, e.g. one piece is:
>
> commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Apr 7 14:08:11 2020 +0100
>
>      drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore
>
> Sure it preempts, but that's not enough.

Yes, i915 would fall in the "hardware with semaphores" category and
implement memory fences that are distinct from the end-of-batch fences.

>
>>> That's kinda why I'm not seeing much benefit in a half-way state:
>>> Tons of work, and still not what userspace wants. And for the full
>>> deal that userspace wants we might as well not change anything with
>>> dma-fences. For that we need a) ctx preempt and b) new entirely
>>> decoupled fences that never feed back into memory fences and c) are
>>> controlled entirely by userspace. And c) is the really important thing
>>> people want us to provide.
>>>
>>> And once we're ok with dma_fence == memory fences, then enforcing the
>>> strict and painful memory allocation limitations is actually what we
>>> want.
>> Let's hope you're right. My fear is that that might be pretty painful as
>> well.
> Oh it's very painful too:
> - We need a separate uapi flavour for gpu ctx with preempt instead of
> end-of-batch dma-fence.
> - Which needs to be implemented without breaking stuff badly - e.g. we
> need to make sure we don't probe-wait on fences unnecessarily since
> that forces random unwanted preempts.
> - If we want this with winsys integration we need full userspace
> revisions since all the dma_fence based sync sharing is out (implicit
> sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
> only go the other way round).
>
> Utter pain, but I think it's better since it can be done
> driver-by-driver, and even userspace usecase by usecase. Which means
> we can experiment in areas where the 10+ years of uapi guarantee isn't
> so painful, learn, until we do the big jump of new
> zero-interaction-with-memory-management fences become baked in forever
> into compositor/winsys/modeset protocols. With the other approach of
> splitting dma-fence we need to do all the splitting first, make sure
> we get it right, and only then can we enable the use-case for real.

Again, let me stress, I'm not advocating for splitting the dma-fence in
favour of the preempt ctx approach. My question is rather: Do we see the
need for fixing dma-fence as well, given that fixing all drivers to
adhere to the dma-fence restrictions might be just as painful? So far
the clear answer is no, it's not worth it, and I'm fine with that.

>
> That's just not going to happen, at least not in upstream across all
> drivers. Within a single driver in some vendor tree hacking stuff up
> is totally fine ofc.

Actually, due to the asynchronous restart, that's not really possible 
either. It's all or none.

> -Daniel

/Thomas



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 12:22                           ` Thomas Hellström (Intel)
@ 2020-07-22 12:41                             ` Daniel Vetter
  2020-07-22 13:12                               ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-22 12:41 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List

On Wed, Jul 22, 2020 at 2:22 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
>
>
> On 2020-07-22 13:39, Daniel Vetter wrote:
> > On Wed, Jul 22, 2020 at 12:31 PM Thomas Hellström (Intel)
> > <thomas_os@shipmail.org> wrote:
> >>
> >> On 2020-07-22 11:45, Daniel Vetter wrote:
> >>> On Wed, Jul 22, 2020 at 10:05 AM Thomas Hellström (Intel)
> >>> <thomas_os@shipmail.org> wrote:
> >>>> On 2020-07-22 09:11, Daniel Vetter wrote:
> >>>>> On Wed, Jul 22, 2020 at 8:45 AM Thomas Hellström (Intel)
> >>>>> <thomas_os@shipmail.org> wrote:
> >>>>>> On 2020-07-22 00:45, Dave Airlie wrote:
> >>>>>>> On Tue, 21 Jul 2020 at 18:47, Thomas Hellström (Intel)
> >>>>>>> <thomas_os@shipmail.org> wrote:
> >>>>>>>> On 7/21/20 9:45 AM, Christian König wrote:
> >>>>>>>>> Am 21.07.20 um 09:41 schrieb Daniel Vetter:
> >>>>>>>>>> On Mon, Jul 20, 2020 at 01:15:17PM +0200, Thomas Hellström (Intel)
> >>>>>>>>>> wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> On 7/9/20 2:33 PM, Daniel Vetter wrote:
> >>>>>>>>>>>> Comes up every few years, gets somewhat tedious to discuss, let's
> >>>>>>>>>>>> write this down once and for all.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What I'm not sure about is whether the text should be more explicit in
> >>>>>>>>>>>> flat out mandating the amdkfd eviction fences for long running compute
> >>>>>>>>>>>> workloads or workloads where userspace fencing is allowed.
> >>>>>>>>>>> Although (in my humble opinion) it might be possible to completely
> >>>>>>>>>>> untangle
> >>>>>>>>>>> kernel-introduced fences for resource management and dma-fences used
> >>>>>>>>>>> for
> >>>>>>>>>>> completion- and dependency tracking and lift a lot of restrictions
> >>>>>>>>>>> for the
> >>>>>>>>>>> dma-fences, including prohibiting infinite ones, I think this makes
> >>>>>>>>>>> sense
> >>>>>>>>>>> describing the current state.
> >>>>>>>>>> Yeah I think a future patch needs to type up how we want to make that
> >>>>>>>>>> happen (for some cross driver consistency) and what needs to be
> >>>>>>>>>> considered. Some of the necessary parts are already there (with like the
> >>>>>>>>>> preemption fences amdkfd has as an example), but I think some clear docs
> >>>>>>>>>> on what's required from both hw, drivers and userspace would be really
> >>>>>>>>>> good.
> >>>>>>>>> I'm currently writing that up, but probably still need a few days for
> >>>>>>>>> this.
> >>>>>>>> Great! I put down some (very) initial thoughts a couple of weeks ago
> >>>>>>>> building on eviction fences for various hardware complexity levels here:
> >>>>>>>>
> >>>>>>>> https://gitlab.freedesktop.org/thomash/docs/-/blob/master/Untangling%20dma-fence%20and%20memory%20allocation.odt
> >>>>>>> We are seeing HW that has recoverable GPU page faults but only for
> >>>>>>> compute tasks, and scheduler without semaphores hw for graphics.
> >>>>>>>
> >>>>>>> So a single driver may have to expose both models to userspace and
> >>>>>>> also introduces the problem of how to interoperate between the two
> >>>>>>> models on one card.
> >>>>>>>
> >>>>>>> Dave.
> >>>>>> Hmm, yes to begin with it's important to note that this is not a
> >>>>>> replacement for new programming models or APIs. This is something that
> >>>>>> takes place internally in drivers to mitigate many of the restrictions
> >>>>>> that are currently imposed on dma-fence and documented in this and
> >>>>>> previous series. It's basically the driver-private narrow completions
> >>>>>> Jason suggested in the lockdep patches discussions implemented the same
> >>>>>> way as eviction-fences.
> >>>>>>
> >>>>>> The memory fence API would be local to helpers and middle-layers like
> >>>>>> TTM, and the corresponding drivers.  The only cross-driver-like
> >>>>>> visibility would be that the dma-buf move_notify() callback would not be
> >>>>>> allowed to wait on dma-fences or something that depends on a dma-fence.
> >>>>> Because we can't preempt (on some engines at least) we already have
> >>>>> the requirement that cross driver buffer management can get stuck on a
> >>>>> dma-fence. Not even taking into account the horrors we do with
> >>>>> userptr, which are cross driver no matter what. Limiting move_notify
> >>>>> to memory fences only doesn't work, since the pte clearing might need
> >>>>> to wait for a dma_fence first. Hence this becomes a full end-of-batch
> >>>>> fence, not just a limited kernel-internal memory fence.
> >>>> For non-preemptible hardware the memory fence typically *is* the
> >>>> end-of-batch fence. (Unless, as documented, there is a scheduler
> >>>> consuming sync-file dependencies in which case the memory fence wait
> >>>> needs to be able to break out of that). The key thing is not that we can
> >>>> break out of execution, but that we can break out of dependencies, since
> >>>> when we're executing all dependencies (modulo semaphores) are already
> >>>> fulfilled. That's what's eliminating the deadlocks.
> >>>>
> >>>>> That's kinda why I think only reasonable option is to toss in the
> >>>>> towel and declare dma-fence to be the memory fence (and suck up all
> >>>>> the consequences of that decision as uapi, which is kinda where we
> >>>>> are), and construct something new&entirely free-wheeling for userspace
> >>>>> fencing. But only for engines that allow enough preempt/gpu page
> >>>>> faulting to make that possible. Free wheeling userspace fences/gpu
> >>>>> semaphores or whatever you want to call them (on windows I think it's
> >>>>> monitored fence) only work if you can preempt to decouple the memory
> >>>>> fences from your gpu command execution.
> >>>>>
> >>>>> There's the in-between step of just decoupling the batchbuffer
> >>>>> submission prep for hw without any preempt (but a scheduler), but that
> >>>>> seems kinda pointless. Modern execbuf should be O(1) fastpath, with
> >>>>> all the allocation/mapping work pulled out ahead. vk exposes that
> >>>>> model directly to clients, GL drivers could use it internally too, so
> >>>>> I see zero value in spending lots of time engineering very tricky
> >>>>> kernel code just for old userspace. Much more reasonable to do that in
> >>>>> userspace, where we have real debuggers and no panics about security
> >>>>> bugs (or well, a lot less, webgl is still a thing, but at least
> >>>>> browsers realized you need to container that completely).
> >>>> Sure, it's definitely a big chunk of work. I think the big win would be
> >>>> allowing memory allocation in dma-fence critical sections. But I
> >>>> completely buy the above argument. I just wanted to point out that many
> >>>> of the dma-fence restrictions are IMHO fixable, should we need to do
> >>>> that for whatever reason.
> >>> I'm still not sure that's possible, without preemption at least. We
> >>> have 4 edges:
> >>> - Kernel has internal dependencies among memory fences. We want that to
> >>> allow (mild) amounts of overcommit, since that simplifies life so
> >>> much.
> >>> - Memory fences can block gpu ctx execution (by nature of the memory
> >>> simply not being there yet due to our overcommit)
> >>> - gpu ctx have (if we allow this) userspace controlled semaphore
> >>> dependencies. Of course userspace is expected to not create deadlocks,
> >>> but that's only assuming the kernel doesn't inject additional
> >>> dependencies. Compute folks really want that.
> >>> - gpu ctx can hold up memory allocations if all we have is
> >>> end-of-batch fences. And end-of-batch fences are all we have without
> >>> preempt, plus if we want backwards compat with the entire current
> >>> winsys/compositor ecosystem we need them, which allows us to inject
> >>> stuff dependent upon them pretty much anywhere.
> >>>
> >>> Fundamentally that's not fixable without throwing one of the edges
> >>> (and the corresponding feature that enables) out, since no entity has
> >>> full visibility into what's going on. E.g. forcing userspace to tell
> >>> the kernel about all semaphores just brings us back to the
> >>> drm_timeline_syncobj design we have merged right now. And that's imo
> >>> no better.
> >> Indeed, HW waiting for semaphores without being able to preempt that
> >> wait is a no-go. The doc (perhaps naively) assumes nobody is doing that.
> > preempt is a necessary but not sufficient condition, you also must not
> > have end-of-batch memory fences. And i915 has semaphore support and
> > end-of-batch memory fences, e.g. one piece is:
> >
> > commit c4e8ba7390346a77ffe33ec3f210bc62e0b6c8c6
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Apr 7 14:08:11 2020 +0100
> >
> >      drm/i915/gt: Yield the timeslice if caught waiting on a user semaphore
> >
> > Sure it preempts, but that's not enough.
>
> Yes, i915 would fall in the "hardware with semaphores" category and
> implement memory fences different from the end-of-batch fences.
>
> >
> >>> That's kinda why I'm not seeing much benefits in a half-way state:
> >>> Tons of work, and still not what userspace wants. And for the full
> >>> deal that userspace wants we might as well not change anything with
> >>> dma-fences. For that we need a) ctx preempt and b) new entirely
> >>> decoupled fences that never feed back into a memory fences and c) are
> >>> controlled entirely by userspace. And c) is the really important thing
> >>> people want us to provide.
> >>>
> >>> And once we're ok with dma_fence == memory fences, then enforcing the
> >>> strict and painful memory allocation limitations is actually what we
> >>> want.
> >> Let's hope you're right. My fear is that that might be pretty painful as
> >> well.
> > Oh it's very painful too:
> > - We need a separate uapi flavour for gpu ctx with preempt instead of
> > end-of-batch dma-fence.
> > - Which needs to be implemented without breaking stuff badly - e.g. we
> > need to make sure we don't probe-wait on fences unnecessarily since
> > that forces random unwanted preempts.
> > - If we want this with winsys integration we need full userspace
> > revisions since all the dma_fence based sync sharing is out (implicit
> > sync on dma-buf, sync_file, drm_syncobj are all defunct since we can
> > only go the other way round).
> > Utter pain, but I think it's better since it can be done
> > driver-by-driver, and even userspace usecase by usecase. Which means
> > we can experiment in areas where the 10+ years of uapi guarantee isn't
> > so painful, learn, until we do the big jump of new
> > zero-interaction-with-memory-management fences become baked in forever
> > into compositor/winsys/modeset protocols.
> >   With the other approach of
> > splitting dma-fence we need to do all the splitting first, make sure
> > we get it right, and only then can we enable the use-case for real.
>
> Again, let me stress, I'm not advocating for splitting the dma-fence in
> favour of the preempt ctx approach. My question is rather: Do we see the
> need for fixing dma-fence as well, with the motivation that fixing all
> drivers to adhere to the dma-fence restrictions might be just as
> painful? So far the clear answer is no, it's not worth it, and I'm fine
> with that.

Ah I think I misunderstood which options you want to compare here. I'm
not sure how much pain fixing up "dma-fence as memory fence" really
is. That's kinda why I want a lot more testing on my annotation
patches, to figure that out. Not much feedback aside from amdgpu and
intel, and those two drivers pretty much need to sort out their memory
fence issues anyway (because of userptr and stuff like that).

The only other issues outside of these two drivers I'm aware of:
- various scheduler drivers doing allocations in the drm/scheduler
critical section. Since all arm-soc drivers have a mildly shoddy
memory model of "we just pin everything" they don't really have to
deal with this. So we might just declare arm as a platform broken and
not taint the dma-fence critical sections with fs_reclaim. Otoh we
need to fix this for drm/scheduler anyway, I think best option would
be to have a mempool for hw fences in the scheduler itself, and at
that point fixing the other drivers shouldn't be too onerous.

- vmwgfx doing a dma_resv in the atomic commit tail. Entirely
orthogonal to the entire memory fence discussion.
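
The mempool idea from the first item above can be sketched roughly as
follows. This is a userspace model only, not scheduler code: struct
hw_fence, struct hw_fence_pool and all hw_fence_pool_*() helpers are
hypothetical names made up for illustration, and a real implementation
would presumably build on the kernel's existing mempool facilities. The
point it tries to show is that all allocation happens up front, so the
path that runs inside the fence critical section only pops a
preallocated slot and never enters the allocator (and hence never
recurses into reclaim).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver's hardware fence object. */
struct hw_fence {
	unsigned long seqno;
	struct hw_fence *next_free;	/* free-list link, used only while pooled */
};

/* Hypothetical fixed-size reserve, filled before any critical section runs. */
struct hw_fence_pool {
	struct hw_fence *slots;		/* backing storage, allocated once */
	struct hw_fence *free_list;	/* singly linked list of unused slots */
	size_t nr_free;
};

/* May allocate, so call this at init time, outside any critical section. */
static int hw_fence_pool_init(struct hw_fence_pool *pool, size_t nr)
{
	pool->slots = calloc(nr, sizeof(*pool->slots));
	if (!pool->slots)
		return -1;
	pool->free_list = NULL;
	pool->nr_free = 0;
	for (size_t i = 0; i < nr; i++) {
		pool->slots[i].next_free = pool->free_list;
		pool->free_list = &pool->slots[i];
		pool->nr_free++;
	}
	return 0;
}

/* Usable inside the critical section: pops a slot, never allocates. */
static struct hw_fence *hw_fence_pool_get(struct hw_fence_pool *pool,
					  unsigned long seqno)
{
	struct hw_fence *f = pool->free_list;

	if (!f)
		return NULL;	/* reserve exhausted, caller has to back off */
	pool->free_list = f->next_free;
	pool->nr_free--;
	f->next_free = NULL;
	f->seqno = seqno;
	return f;
}

/* Returns a slot to the reserve once the fence is no longer needed. */
static void hw_fence_pool_put(struct hw_fence_pool *pool, struct hw_fence *f)
{
	f->next_free = pool->free_list;
	pool->free_list = f;
	pool->nr_free++;
}

/* Frees the backing storage; again only valid outside critical sections. */
static void hw_fence_pool_fini(struct hw_fence_pool *pool)
{
	free(pool->slots);
	pool->slots = NULL;
	pool->free_list = NULL;
	pool->nr_free = 0;
}
```

How large the reserve needs to be (one slot per in-flight job, plus some
headroom for resubmission after a reset) is a sizing question this
sketch does not try to answer.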

I'm pretty sure there's more bugs, I just haven't heard from them yet.
Also due to the opt-in nature of dma-fence we can limit the scope of
what we fix fairly naturally, just don't put them where no one cares
:-) Of course that also hides general locking issues in dma_fence
signalling code, but well *shrug*.
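
To make the opt-in point a bit more concrete, here is a tiny
single-threaded userspace model of what an annotated section buys you.
It is not the lockdep implementation: the model_*() functions and the
in_signalling_section flag are hypothetical, and the cookie convention
(true meaning "already inside, nothing to release") is an assumption
modelled on the nesting tracking described in patch 01. Code that never
opts in is simply never checked, which is exactly the scope limiting
(and the hiding of bugs) mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag standing in for lockdep's held-lock tracking. */
static bool in_signalling_section;

/*
 * Opens a signalling critical section. Returns a cookie that is true
 * when the caller was already inside a section (nested call), so that
 * the matching end call knows it must not close the outer section.
 */
static bool model_begin_signalling(void)
{
	if (in_signalling_section)
		return true;
	in_signalling_section = true;
	return false;
}

/* Closes the section, unless the cookie says this was a nested call. */
static void model_end_signalling(bool cookie)
{
	if (cookie)
		return;
	in_signalling_section = false;
}

/* The check itself: a blocking allocation is only legal outside. */
static bool model_may_block_alloc(void)
{
	return !in_signalling_section;
}
```

A helper called both from annotated and from unannotated paths can wrap
its own body in a begin/end pair without caring which case it is in,
since the nested call leaves the outer section intact.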

So thus far I think fixing up the various small bugs the annotations
turn up is the smallest problem we have here. Much, much smaller than
either of "split dma-fence in two" or "add entire new fence
model/uapi/winsys protocol set on top of dma-fence". I think a big
reason we didn't screw up a lot worse on this is the atomic framework,
which was designed very much with a) no allocations in the wrong spot
and b) no lock taking in the wrong spot in mind from the start. Some
of the early atomic prototypes were real horrors in that regard, but
with the helper framework we have now drivers have to go the extra
mile to screw this up. And there's a lot more atomic drivers than
render drivers nowadays merged in upstream.

> > That's just not going to happen, at least not in upstream across all
> > drivers. Within a single driver in some vendor tree hacking stuff up
> > is totally fine ofc.
>
> Actually, due to the asynchronous restart, that's not really possible
> either. It's all or none.
>
> > -Daniel
>
> /Thomas
>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 12:41                             ` Daniel Vetter
@ 2020-07-22 13:12                               ` Thomas Hellström (Intel)
  2020-07-22 14:07                                 ` Daniel Vetter
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 13:12 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List


On 2020-07-22 14:41, Daniel Vetter wrote:
>
> Ah I think I misunderstood which options you want to compare here. I'm
> not sure how much pain fixing up "dma-fence as memory fence" really
> is. That's kinda why I want a lot more testing on my annotation
> patches, to figure that out. Not much feedback aside from amdgpu and
> intel, and those two drivers pretty much need to sort out their memory
> fence issues anyway (because of userptr and stuff like that).
>
> The only other issues outside of these two drivers I'm aware of:
> - various scheduler drivers doing allocations in the drm/scheduler
> critical section. Since all arm-soc drivers have a mildly shoddy
> memory model of "we just pin everything" they don't really have to
> deal with this. So we might just declare arm as a platform broken and
> not taint the dma-fence critical sections with fs_reclaim. Otoh we
> need to fix this for drm/scheduler anyway, I think best option would
> be to have a mempool for hw fences in the scheduler itself, and at
> that point fixing the other drivers shouldn't be too onerous.
>
> - vmwgfx doing a dma_resv in the atomic commit tail. Entirely
> orthogonal to the entire memory fence discussion.

With vmwgfx there is another issue that is hit when the gpu signals an 
error. At that point the batch might be restarted with a new meta 
command buffer that needs to be allocated out of a dma pool in the
fence critical section. That's probably a bit nasty to fix, but not 
impossible.

>
> I'm pretty sure there's more bugs, I just haven't heard from them yet.
> Also due to the opt-in nature of dma-fence we can limit the scope of
> what we fix fairly naturally, just don't put them where no one cares
> :-) Of course that also hides general locking issues in dma_fence
> signalling code, but well *shrug*.
Hmm, yes. Another potential big problem would be drivers that want to 
use gpu page faults in the dma-fence critical sections with the 
batch-based programming model.

/Thomas



^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 13:12                               ` Thomas Hellström (Intel)
@ 2020-07-22 14:07                                 ` Daniel Vetter
  2020-07-22 14:23                                   ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Daniel Vetter @ 2020-07-22 14:07 UTC (permalink / raw)
  To: Thomas Hellström (Intel)
  Cc: Dave Airlie, Christian König, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Mika Kuoppala, Felix Kuehling,
	Linux Media Mailing List

On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
<thomas_os@shipmail.org> wrote:
> On 2020-07-22 14:41, Daniel Vetter wrote:
> > Ah I think I misunderstood which options you want to compare here. I'm
> > not sure how much pain fixing up "dma-fence as memory fence" really
> > is. That's kinda why I want a lot more testing on my annotation
> > patches, to figure that out. Not much feedback aside from amdgpu and
> > intel, and those two drivers pretty much need to sort out their memory
> > fence issues anyway (because of userptr and stuff like that).
> >
> > The only other issues outside of these two drivers I'm aware of:
> > - various scheduler drivers doing allocations in the drm/scheduler
> > critical section. Since all arm-soc drivers have a mildly shoddy
> > memory model of "we just pin everything" they don't really have to
> > deal with this. So we might just declare arm as a platform broken and
> > not taint the dma-fence critical sections with fs_reclaim. Otoh we
> > need to fix this for drm/scheduler anyway, I think best option would
> > be to have a mempool for hw fences in the scheduler itself, and at
> > that point fixing the other drivers shouldn't be too onerous.
> >
> > - vmwgfx doing a dma_resv in the atomic commit tail. Entirely
> > orthogonal to the entire memory fence discussion.
>
> With vmwgfx there is another issue that is hit when the gpu signals an
> error. At that point the batch might be restarted with a new meta
> command buffer that needs to be allocated out of a dma pool in the
> fence critical section. That's probably a bit nasty to fix, but not
> impossible.

Yeah reset is fun. From what I've seen this isn't any worse than the
hw allocation issue for drm/scheduler drivers, they just allocate
another hw fence with all that drags along. So the same mempool should
be sufficient.

The really nasty thing around reset is display interactions, because
you just can't take drm_modeset_lock. amdgpu fixed that now (at least
the modeset_lock side, not yet the memory allocations that brings
along). i915 has the same problem for gen2/3 (so really old stuff),
and we've solved that by breaking&restarting all i915 fence waits, but
that predates multi-gpu and won't work for shared fences ofc. But it's
so old and predates all multi-gpu laptops that I think wontfix is the
right take.

Other drm/scheduler drivers don't have that problem since they're all
render-only, so no display driver interaction.

> > I'm pretty sure there's more bugs, I just haven't heard from them yet.
> > Also due to the opt-in nature of dma-fence we can limit the scope of
> > what we fix fairly naturally, just don't put them where no one cares
> > :-) Of course that also hides general locking issues in dma_fence
> > signalling code, but well *shrug*.
> Hmm, yes. Another potential big problem would be drivers that want to
> use gpu page faults in the dma-fence critical sections with the
> batch-based programming model.

Yeah that's a massive can of worms. But luckily there's no such driver
merged in upstream, so hopefully we can think about all the
constraints and how to best annotate&enforce this before we land any
code and have big regrets.
-Daniel



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 14:07                                 ` Daniel Vetter
@ 2020-07-22 14:23                                   ` Christian König
  2020-07-22 14:30                                     ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 83+ messages in thread
From: Christian König @ 2020-07-22 14:23 UTC (permalink / raw)
  To: Daniel Vetter, Thomas Hellström (Intel)
  Cc: Felix Kuehling, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Linux Media Mailing List,
	Dave Airlie, Christian König, Mika Kuoppala

Am 22.07.20 um 16:07 schrieb Daniel Vetter:
> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>> I'm pretty sure there's more bugs, I just haven't heard from them yet.
>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>> what we fix fairly naturally, just don't put them where no one cares
>>> :-) Of course that also hides general locking issues in dma_fence
>>> signalling code, but well *shrug*.
>> Hmm, yes. Another potential big problem would be drivers that want to
>> use gpu page faults in the dma-fence critical sections with the
>> batch-based programming model.
> Yeah that's a massive can of worms. But luckily there's no such driver
> merged in upstream, so hopefully we can think about all the
> constraints and how to best annotate&enforce this before we land any
> code and have big regrets.

Do you want some bad news? I once made a prototype for that when Vega10
came out.

But we abandoned this approach for the batch based approach because
of the horrible performance.

KFD is going to see that, but this is only with user queues and no 
dma_fence involved whatsoever.

Christian.

> -Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 14:23                                   ` Christian König
@ 2020-07-22 14:30                                     ` Thomas Hellström (Intel)
  2020-07-22 14:35                                       ` Christian König
  0 siblings, 1 reply; 83+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-22 14:30 UTC (permalink / raw)
  To: christian.koenig, Daniel Vetter
  Cc: Felix Kuehling, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Linux Media Mailing List,
	Dave Airlie, Mika Kuoppala


On 2020-07-22 16:23, Christian König wrote:
> Am 22.07.20 um 16:07 schrieb Daniel Vetter:
>> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
>> <thomas_os@shipmail.org> wrote:
>>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>>> I'm pretty sure there's more bugs, I just haven't heard from them yet.
>>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>>> what we fix fairly naturally, just don't put them where no one cares
>>>> :-) Of course that also hides general locking issues in dma_fence
>>>> signalling code, but well *shrug*.
>>> Hmm, yes. Another potential big problem would be drivers that want to
>>> use gpu page faults in the dma-fence critical sections with the
>>> batch-based programming model.
>> Yeah that's a massive can of worms. But luckily there's no such driver
>> merged in upstream, so hopefully we can think about all the
>> constraints and how to best annotate&enforce this before we land any
>> code and have big regrets.
>
> Do you want some bad news? I once made a prototype for that when Vega10
> came out.
>
> But we abandoned this approach for the batch based approach
> because of the horrible performance.

In context of the previous discussion I'd consider the fact that it's 
not performant in the batch-based model good news :)

Thomas


>
> KFD is going to see that, but this is only with user queues and no 
> dma_fence involved whatsoever.
>
> Christian.
>
>> -Daniel
>>
>>
>>
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf.rst: Document why indefinite fences are a bad idea
  2020-07-22 14:30                                     ` Thomas Hellström (Intel)
@ 2020-07-22 14:35                                       ` Christian König
  0 siblings, 0 replies; 83+ messages in thread
From: Christian König @ 2020-07-22 14:35 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Daniel Vetter
  Cc: Felix Kuehling, Daniel Stone, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, DRI Development,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Steve Pronovost,
	amd-gfx mailing list, Jason Ekstrand, Jesse Natalie,
	Daniel Vetter, Thomas Hellstrom, Linux Media Mailing List,
	Dave Airlie, Mika Kuoppala

Am 22.07.20 um 16:30 schrieb Thomas Hellström (Intel):
>
> On 2020-07-22 16:23, Christian König wrote:
>> Am 22.07.20 um 16:07 schrieb Daniel Vetter:
>>> On Wed, Jul 22, 2020 at 3:12 PM Thomas Hellström (Intel)
>>> <thomas_os@shipmail.org> wrote:
>>>> On 2020-07-22 14:41, Daniel Vetter wrote:
>>>>> I'm pretty sure there's more bugs, I just haven't heard from them 
>>>>> yet.
>>>>> Also due to the opt-in nature of dma-fence we can limit the scope of
>>>>> what we fix fairly naturally, just don't put them where no one cares
>>>>> :-) Of course that also hides general locking issues in dma_fence
>>>>> signalling code, but well *shrug*.
>>>> Hmm, yes. Another potential big problem would be drivers that want to
>>>> use gpu page faults in the dma-fence critical sections with the
>>>> batch-based programming model.
>>> Yeah that's a massive can of worms. But luckily there's no such driver
>>> merged in upstream, so hopefully we can think about all the
>>> constraints and how to best annotate&enforce this before we land any
>>> code and have big regrets.
>>
>> Do you want some bad news? I once made a prototype for that when Vega10
>> came out.
>>
>> But we abandoned this approach for the batch based approach
>> because of the horrible performance.
>
> In context of the previous discussion I'd consider the fact that it's 
> not performant in the batch-based model good news :)

Well, the Vega10 had such horrible page fault performance because it
was the first generation which enabled it.

Later hardware versions are much better, but we just didn't push for 
this feature on them any more.

But yeah, now that you mention it, we did discuss this locking problem on
tons of team calls as well.

Our solution at that time was to just not allow waiting if we do any 
allocation in the page fault handler. But this is of course not 
practical for a production environment.

Christian.

>
> Thomas
>
>
>>
>> KFD is going to see that, but this is only with user queues and no 
>> dma_fence involved whatsoever.
>>
>> Christian.
>>
>>> -Daniel
>>>
>>>
>>>
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>


^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2020-07-22 14:35 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200707201229.472834-1-daniel.vetter@ffwll.ch>
2020-07-07 20:12 ` [PATCH 01/25] dma-fence: basic lockdep annotations Daniel Vetter
2020-07-08 14:57   ` Christian König
2020-07-08 15:12     ` Daniel Vetter
2020-07-08 15:19       ` Alex Deucher
2020-07-08 15:37         ` Daniel Vetter
2020-07-14 11:09           ` Daniel Vetter
2020-07-09  7:32       ` [Intel-gfx] " Daniel Stone
2020-07-09  7:52         ` Daniel Vetter
2020-07-13 16:26     ` Daniel Vetter
2020-07-13 16:39       ` Christian König
2020-07-13 20:31         ` Dave Airlie
2020-07-07 20:12 ` [PATCH 02/25] dma-fence: prime " Daniel Vetter
2020-07-09  8:09   ` Daniel Vetter
2020-07-10 12:43     ` Jason Gunthorpe
2020-07-10 12:48       ` Christian König
2020-07-10 12:54         ` Jason Gunthorpe
2020-07-10 13:01           ` Christian König
2020-07-10 13:48             ` Jason Gunthorpe
2020-07-10 14:02               ` Daniel Vetter
2020-07-10 14:23                 ` Jason Gunthorpe
2020-07-10 20:02                   ` Daniel Vetter
2020-07-07 20:12 ` [PATCH 03/25] dma-buf.rst: Document why idenfinite fences are a bad idea Daniel Vetter
2020-07-09  7:36   ` [Intel-gfx] " Daniel Stone
2020-07-09  8:04     ` Daniel Vetter
2020-07-09 12:11       ` Daniel Stone
2020-07-09 12:31         ` Daniel Vetter
2020-07-09 14:28           ` Christian König
2020-07-09 11:53   ` Christian König
2020-07-09 12:33   ` [PATCH 1/2] dma-buf.rst: Document why indefinite " Daniel Vetter
2020-07-10 12:30     ` Maarten Lankhorst
2020-07-14 17:46     ` Jason Ekstrand
2020-07-20 11:15     ` [Linaro-mm-sig] " Thomas Hellström (Intel)
2020-07-21  7:41       ` Daniel Vetter
2020-07-21  7:45         ` Christian König
2020-07-21  8:47           ` Thomas Hellström (Intel)
2020-07-21  8:55             ` Christian König
2020-07-21  9:16               ` Daniel Vetter
2020-07-21  9:24                 ` Daniel Vetter
2020-07-21  9:37               ` Thomas Hellström (Intel)
2020-07-21  9:50                 ` Daniel Vetter
2020-07-21 10:47                   ` Thomas Hellström (Intel)
2020-07-21 13:59                     ` Christian König
2020-07-21 17:46                       ` Thomas Hellström (Intel)
2020-07-21 18:18                         ` Daniel Vetter
2020-07-21 21:42                       ` Dave Airlie
2020-07-21 22:45             ` Dave Airlie
2020-07-22  6:45               ` Thomas Hellström (Intel)
2020-07-22  7:11                 ` Daniel Vetter
2020-07-22  8:05                   ` Thomas Hellström (Intel)
2020-07-22  9:45                     ` Daniel Vetter
2020-07-22 10:31                       ` Thomas Hellström (Intel)
2020-07-22 11:39                         ` Daniel Vetter
2020-07-22 12:22                           ` Thomas Hellström (Intel)
2020-07-22 12:41                             ` Daniel Vetter
2020-07-22 13:12                               ` Thomas Hellström (Intel)
2020-07-22 14:07                                 ` Daniel Vetter
2020-07-22 14:23                                   ` Christian König
2020-07-22 14:30                                     ` Thomas Hellström (Intel)
2020-07-22 14:35                                       ` Christian König
2020-07-07 20:12 ` [PATCH 04/25] drm/vkms: Annotate vblank timer Daniel Vetter
2020-07-12 22:27   ` Rodrigo Siqueira
2020-07-14  9:57     ` Melissa Wen
2020-07-14  9:59       ` Daniel Vetter
2020-07-14 14:55         ` Melissa Wen
2020-07-14 15:23           ` Daniel Vetter
2020-07-07 20:12 ` [PATCH 05/25] drm/vblank: Annotate with dma-fence signalling section Daniel Vetter
2020-07-07 20:12 ` [PATCH 06/25] drm/amdgpu: add dma-fence annotations to atomic commit path Daniel Vetter
2020-07-07 20:12 ` [PATCH 16/25] drm/atomic-helper: Add dma-fence annotations Daniel Vetter
2020-07-07 20:12 ` [PATCH 17/25] drm/scheduler: use dma-fence annotations in main thread Daniel Vetter
2020-07-07 20:12 ` [PATCH 18/25] drm/amdgpu: use dma-fence annotations in cs_submit() Daniel Vetter
2020-07-07 20:12 ` [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code Daniel Vetter
2020-07-14 10:49   ` Daniel Vetter
2020-07-14 11:40     ` Christian König
2020-07-14 14:31       ` Daniel Vetter
2020-07-15  9:17         ` Christian König
2020-07-15 11:53           ` Daniel Vetter
2020-07-07 20:12 ` [PATCH 20/25] drm/amdgpu: DC also loves to allocate stuff where it shouldn't Daniel Vetter
2020-07-14 11:12   ` Daniel Vetter
2020-07-07 20:12 ` [PATCH 21/25] drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail Daniel Vetter
2020-07-07 20:12 ` [PATCH 22/25] drm/scheduler: use dma-fence annotations in tdr work Daniel Vetter
2020-07-07 20:12 ` [PATCH 23/25] drm/amdgpu: use dma-fence annotations for gpu reset code Daniel Vetter
2020-07-07 20:12 ` [PATCH 24/25] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset" Daniel Vetter
2020-07-07 20:12 ` [PATCH 25/25] drm/amdgpu: gpu recovery does full modesets Daniel Vetter
