* [RFC 00/17] dma-fence lockdep annotations
From: Daniel Vetter @ 2020-05-12  8:59 UTC
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König

Hi all,

I've dragged my feet for years on this, hoping that cross-release lockdep
would do this for us, but, well, that never really happened, unfortunately.
So here we are.

I've Cc'ed quite a pile of people since this is about the cross-driver
contract around dma_fences, which are heavily used for dma_buf, and I'm
hearing more noises that rdma folks are looking into this, hence they're
also on cc.

There's a bunch of different parts to this RFC:

- The annotations themselves, in the 2nd patch after the prep patch to add
  a might_sleep annotation. The commit message has all the motivation for
  what kinds of deadlocks I want to catch, so best you just read it.

  Since lockdep doesn't understand cross-release natively we need to
  cobble something together using rwlocks and a few more tricks, but from
  the test rollout in a few places in drm/vkms, amdgpu & i915 I think what
  I have now seems to actually work. The downside is that we have to
  explicitly annotate all code involved in eventual dma_fence signalling
  (there's a usage sketch after this list).

- The second important part is locking down the current dma-fence cross-driver
  contract, using lockdep priming like we already do for dma_resv_lock.
  I've just started with my own take on what we probably need to make the
  current code work (-ish), but both amdgpu and i915 have issues with
  that. So this needs some careful discussions, and also some thought on
  how we land it all eventually to not break lockdep completely for
  everyone.

  The important patch for that is "dma-fence: prime lockdep annotations"
  plus of course the various annotations patches and driver hacks to
  highlight some of the issues caught.

  Note that depending upon what exactly we end up deciding we might need
  to improve the annotations for fs_reclaim_acquire/release - for
  dma_fence_wait in mmu notifiers we can only allow GFP_NOWAIT (afaiui),
  and currently fs_reclaim_acquire/release only has a lockdep class for
  __GFP_FS; maybe we'd need to add another one for __GFP_DIRECT_RECLAIM
  in general.

- Finally there are clearly some gaps in the current dma_fence driver
  interfaces: Amdgpu's hang recovery is essentially impossible to fix
  as-is - it needs to reset the display state and you can't get at modeset
  locks from tdr without deadlock potential. i915 has an internal trick
  (but it stops working once we involve real cross-driver fences) for this
  issue, but then for i915 modeset reset is only needed on very ancient
  gen2/3. Modern hw is a lot more reasonable.

  I'm kinda hoping that the annotations and priming for basic command
  submission and atomic modeset paths could be merged soonish, while the
  tdr side clearly needs a pile more work to get going. But since we
  have to explicitly annotate all code paths anyway we can hide bugs in
  e.g. tdr code by simply not yet annotating those functions.

  I'm trying to lay out at least one idea for solving the tdr issue in the
  patch titled "drm/scheduler: use dma-fence annotations in tdr work".
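
To give a feel for the driver-side interface (everything below except the
dma_fence_* functions is a placeholder name), a fence completion path
would be wrapped roughly like this:

	bool cookie;

	/* fence is published; from here until the signal we must not
	 * block on anything that might itself wait on this fence */
	cookie = dma_fence_begin_signalling();

	/* ... irq handler / worker / scheduler code that eventually
	 * completes the job ... */
	dma_fence_signal(fence);

	dma_fence_end_signalling(cookie);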

Finally, once we have some agreement on where we're going with all this,
we also need some documentation. Currently that's missing because I don't
want to re-edit the text all the time while we still figure out the
details of the exact cross-driver semantics.

My goal here is that with this we can lock down the cross-driver contract
for the last bit of the dma_buf/resv/fence story and make sure this stops
being such a wobbly thing where everyone just does whatever they feel
like.

Ideas, thoughts, reviews, and testing on other drivers (with the specific
annotations added for that driver) are very much welcome.

Cheers, Daniel

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>

Daniel Vetter (17):
  dma-fence: add might_sleep annotation to _wait()
  dma-fence: basic lockdep annotations
  dma-fence: prime lockdep annotations
  drm/vkms: Annotate vblank timer
  drm/vblank: Annotate with dma-fence signalling section
  drm/atomic-helper: Add dma-fence annotations
  drm/amdgpu: add dma-fence annotations to atomic commit path
  drm/scheduler: use dma-fence annotations in main thread
  drm/amdgpu: use dma-fence annotations in cs_submit()
  drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  drm/amdgpu: DC also loves to allocate stuff where it shouldn't
  drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
  drm/scheduler: use dma-fence annotations in tdr work
  drm/amdgpu: use dma-fence annotations for gpu reset code
  Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
  drm/amdgpu: gpu recovery does full modesets
  drm/i915: Annotate dma_fence_work

 drivers/dma-buf/dma-fence.c                   | 56 +++++++++++++++++++
 drivers/dma-buf/dma-resv.c                    |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  5 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 22 ++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c       |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/atom.c             |  2 +-
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++++-
 drivers/gpu/drm/amd/display/dc/core/dc.c      |  4 +-
 drivers/gpu/drm/drm_atomic_helper.c           | 16 ++++++
 drivers/gpu/drm/drm_vblank.c                  |  8 ++-
 drivers/gpu/drm/i915/i915_sw_fence_work.c     |  3 +
 drivers/gpu/drm/scheduler/sched_main.c        | 11 ++++
 drivers/gpu/drm/vkms/vkms_crtc.c              |  8 ++-
 include/linux/dma-fence.h                     | 13 +++++
 16 files changed, 160 insertions(+), 13 deletions(-)

-- 
2.26.2


* [RFC 01/17] dma-fence: add might_sleep annotation to _wait()
From: Daniel Vetter @ 2020-05-12  8:59 UTC
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

But only for non-zero timeout, to avoid false positives.

One question here is whether the might_sleep should be unconditional,
or only for real timeouts. I'm not sure, so I went with the more
defensive option. But in the interest of locking down the cross-driver
dma_fence rules we might want to be more aggressive.
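
As an illustration (hypothetical driver code, not from any tree), this is
the kind of caller the new check is meant to flag:

	/* atomic context, but waiting with a real timeout */
	spin_lock(&job->lock);
	dma_fence_wait_timeout(fence, false, msecs_to_jiffies(100));
	/* ^ might_sleep() splats here with CONFIG_DEBUG_ATOMIC_SLEEP */
	spin_unlock(&job->lock);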

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 052a41e2451c..6802125349fb 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (WARN_ON(timeout < 0))
 		return -EINVAL;
 
+	if (timeout > 0)
+		might_sleep();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
-- 
2.26.2



* [RFC 02/17] dma-fence: basic lockdep annotations
From: Daniel Vetter @ 2020-05-12  8:59 UTC
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
  this explicit annotation can be more liberally sprinkled around.
  With read locks lockdep isn't going to complain if the read-side
  isn't nested the same way under all circumstances, so ABBA deadlocks
  are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
  read lock mode lockdep does not catch read side hazards. And we
  _very_ much want read side hazards to be caught. For full details of
  this limitation see

  commit e91498589746065e3ae95d9a00b068e525eec34f
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Wed Aug 23 13:13:11 2017 +0200

      locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
  keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
  dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
  to call dma_fence_begin/end_signalling from soft/hardirq context.
  First attempt was using the hardirq locking context for the write
  side in lockdep, but this forces all normal spinlocks nested within
  dma_fence_begin/end_signalling to be spinlocks. That's bollocks.

  The approach now is to simply check in_atomic(), and for these cases
  entirely rely on the might_sleep() check in dma_fence_wait(). That
  will catch any wrong nesting against spinlocks from soft/hardirq
  contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch is:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.
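
With the annotations (same pseudo-code style as above), Thread A would
wrap its signalling path like this, which records the dma_fence_map -> A
dependency for lockdep, while Thread B's dma_fence_wait() under A records
A -> dma_fence_map and closes the cycle:

	cookie = dma_fence_begin_signalling();

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

	dma_fence_end_signalling(cookie);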

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and is still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindlessly to avoid false
positives.
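
Concretely, a sequence like the following (pseudo-code, B standing in for
any driver lock) is flagged immediately, without having to actually hit
the deadlock at runtime:

	cookie = dma_fence_begin_signalling();

	mutex_lock(B);
	dma_fence_wait();	/* read lock upgraded to write while B is held */
	mutex_unlock(B);

	dma_fence_end_signalling(cookie);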

v2: handle soft/hardirq ctx better against write side and don't forget
EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-fence.h   | 12 +++++++++
 2 files changed, 65 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (timeout > 0)
 		might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 02/17] dma-fence: basic lockdep annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	Chris Wilson, linaro-mm-sig, Daniel Vetter, Christian König,
	linux-media

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
  this explicit annotation can be more liberally sprinkled around.
  With read locks lockdep isn't going to complain if the read-side
  isn't nested the same way under all circumstances, so ABBA deadlocks
  are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
  read lock mode lockdep does not catch read side hazards. And we
  _very_ much want read side hazards to be caught. For full details of
  this limitation see

  commit e91498589746065e3ae95d9a00b068e525eec34f
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Wed Aug 23 13:13:11 2017 +0200

      locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
  keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
  dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
  to call dma_fence_begin/end_signalling from soft/hardirq context.
  First attempt was using the hardirq locking context for the write
  side in lockdep, but this forces all normal spinlocks nested within
  dma_fence_begin/end_signalling to be spinlocks. That bollocks.

  The approach now is to simple check in_atomic(), and for these cases
  entirely rely on the might_sleep() check in dma_fence_wait(). That
  will catch any wrong nesting against spinlocks from soft/hardirq
  contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindless to avoid false
positives.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-fence.h   | 12 +++++++++
 2 files changed, 65 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (timeout > 0)
 		might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [Intel-gfx] [RFC 02/17] dma-fence: basic lockdep annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	Chris Wilson, linaro-mm-sig, Daniel Vetter, Christian König,
	linux-media

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
  this explicit annotation can be more liberally sprinkled around.
  With read locks lockdep isn't going to complain if the read-side
  isn't nested the same way under all circumstances, so ABBA deadlocks
  are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
  read lock mode lockdep does not catch read side hazards. And we
  _very_ much want read side hazards to be caught. For full details of
  this limitation see

  commit e91498589746065e3ae95d9a00b068e525eec34f
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Wed Aug 23 13:13:11 2017 +0200

      locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
  keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
  dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
  to call dma_fence_begin/end_signalling from soft/hardirq context.
  First attempt was using the hardirq locking context for the write
  side in lockdep, but this forces all normal spinlocks nested within
  dma_fence_begin/end_signalling to be spinlocks. That bollocks.

  The approach now is to simple check in_atomic(), and for these cases
  entirely rely on the might_sleep() check in dma_fence_wait(). That
  will catch any wrong nesting against spinlocks from soft/hardirq
  contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindless to avoid false
positives.

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-fence.h   | 12 +++++++++
 2 files changed, 65 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (timeout > 0)
 		might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
-- 
2.26.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 02/17] dma-fence: basic lockdep annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, Maarten Lankhorst, LKML,
	amd-gfx, Chris Wilson, linaro-mm-sig, Daniel Vetter,
	Christian König, linux-media

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
  this explicit annotation can be more liberally sprinkled around.
  With read locks lockdep isn't going to complain if the read-side
  isn't nested the same way under all circumstances, so ABBA deadlocks
  are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
  read lock mode lockdep does not catch read side hazards. And we
  _very_ much want read side hazards to be caught. For full details of
  this limitation see

  commit e91498589746065e3ae95d9a00b068e525eec34f
  Author: Peter Zijlstra <peterz@infradead.org>
  Date:   Wed Aug 23 13:13:11 2017 +0200

      locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
  keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
  dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
  to call dma_fence_begin/end_signalling from soft/hardirq context.
  First attempt was using the hardirq locking context for the write
  side in lockdep, but this forces all normal spinlocks nested within
  dma_fence_begin/end_signalling to be spinlocks. That bollocks.

  The approach now is to simple check in_atomic(), and for these cases
  entirely rely on the might_sleep() check in dma_fence_wait(). That
  will catch any wrong nesting against spinlocks from soft/hardirq
  contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), and extends up
to and including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.
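
To make this a bit more concrete, here is a rough sketch of how such a
worker could be annotated (not part of this patch, all the foo_* names
are made up for illustration):

	struct foo_job {
		struct work_struct work;
		struct dma_fence *done_fence;
	};

	/* runs after the job and its fence have been published */
	static void foo_job_complete_work(struct work_struct *work)
	{
		struct foo_job *job = container_of(work, struct foo_job, work);
		bool cookie;

		/* everything below is on the critical path towards
		 * dma_fence_signal(), lockdep will now complain if any
		 * lock taken in here is held somewhere else around a
		 * dma_fence_wait()
		 */
		cookie = dma_fence_begin_signalling();

		foo_job_finish(job); /* driver specific completion work */
		dma_fence_signal(job->done_fence);

		dma_fence_end_signalling(cookie);
	}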

The main class of deadlocks this is supposed to catch are:

Thread A:

	mutex_lock(A);
	mutex_unlock(A);

	dma_fence_signal();

Thread B:

	mutex_lock(A);
	dma_fence_wait();
	mutex_unlock(A);

Thread B is blocked on thread A signalling the fence, but A never gets
around to that because it cannot acquire lock A, which thread B still
holds while waiting.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that if any
other lock is acquired between the dma_fence_begin_signalling() call
and the call to dma_fence_wait(), and is still held at that point, this
will result in an immediate lockdep complaint. The only other option
would be to not annotate such calls, defeating the point. Therefore
these annotations cannot be sprinkled over the code entirely mindlessly
to avoid false positives.
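
As an illustration of that rule (again not part of the patch, prev and
foo are made up): nesting a wait inside a signalling section is fine on
its own, but holding any other lock across the nested wait is exactly
what gets flagged:

	cookie = dma_fence_begin_signalling();

	/* ok: upgrade to the write side with nothing else held */
	dma_fence_wait(prev, false);

	mutex_lock(&foo->lock);
	/* not ok: foo->lock was taken after begin_signalling() and is
	 * still held here, lockdep complains immediately
	 */
	dma_fence_wait(prev, false);
	mutex_unlock(&foo->lock);

	dma_fence_end_signalling(cookie);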

v2: handle soft/hardirq ctx better against the write side and don't
forget EXPORT_SYMBOL, drivers can't use this otherwise.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
 include/linux/dma-fence.h   | 12 +++++++++
 2 files changed, 65 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 6802125349fb..d5c0fd2efc70 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
 }
 EXPORT_SYMBOL(dma_fence_context_alloc);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map	dma_fence_lockdep_map = {
+	.name = "dma_fence_map"
+};
+
+bool dma_fence_begin_signalling(void)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+EXPORT_SYMBOL(dma_fence_begin_signalling);
+
+void dma_fence_end_signalling(bool cookie)
+{
+	if (cookie)
+		return;
+
+	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(dma_fence_end_signalling);
+
+void __dma_fence_might_wait(void)
+{
+	bool tmp;
+
+	tmp = lock_is_held_type(&dma_fence_lockdep_map, 1);
+	if (tmp)
+		lock_release(&dma_fence_lockdep_map, _THIS_IP_);
+	lock_map_acquire(&dma_fence_lockdep_map);
+	lock_map_release(&dma_fence_lockdep_map);
+	if (tmp)
+		lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _THIS_IP_);
+}
+#endif
+
+
 /**
  * dma_fence_signal_locked - signal completion of a fence
  * @fence: the fence to signal
@@ -170,14 +216,19 @@ int dma_fence_signal(struct dma_fence *fence)
 {
 	unsigned long flags;
 	int ret;
+	bool tmp;
 
 	if (!fence)
 		return -EINVAL;
 
+	tmp = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(fence->lock, flags);
 	ret = dma_fence_signal_locked(fence);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	dma_fence_end_signalling(tmp);
+
 	return ret;
 }
 EXPORT_SYMBOL(dma_fence_signal);
@@ -211,6 +262,8 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
 	if (timeout > 0)
 		might_sleep();
 
+	__dma_fence_might_wait();
+
 	trace_dma_fence_wait_start(fence);
 	if (fence->ops->wait)
 		ret = fence->ops->wait(fence, intr, timeout);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3347c54f3a87..3f288f7db2ef 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -357,6 +357,18 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 	} while (1);
 }
 
+#ifdef CONFIG_LOCKDEP
+bool dma_fence_begin_signalling(void);
+void dma_fence_end_signalling(bool cookie);
+#else
+static inline bool dma_fence_begin_signalling(void)
+{
+	return true;
+}
+static inline void dma_fence_end_signalling(bool cookie) {}
+static inline void __dma_fence_might_wait(void) {}
+#endif
+
 int dma_fence_signal(struct dma_fence *fence);
 int dma_fence_signal_locked(struct dma_fence *fence);
 signed long dma_fence_default_wait(struct dma_fence *fence,
-- 
2.26.2

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 03/17] dma-fence: prime lockdep annotations
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

Two in one go:
- it is allowed to call dma_fence_wait() while holding a
  dma_resv_lock(). This is fundamental to how eviction works with ttm,
  so required.

- it is allowed to call dma_fence_wait() from memory reclaim contexts,
  specifically from shrinker callbacks (which i915 does), and from mmu
  notifier callbacks (which amdgpu does, and which i915 sometimes also
  does, and probably always should, but that's kinda a debate). Also
  for stuff like HMM we really need to be able to do this, or things
  get real dicey.

The consequence is that any critical path necessary to get to a
dma_fence_signal() for a fence must never a) call dma_resv_lock() nor
b) allocate memory with GFP_KERNEL. Also, by implication of
dma_resv_lock(), no userspace faulting is allowed. Those are some
supremely obnoxious limitations, which is why we need to sprinkle the
right annotations over all relevant paths.

The one big locking context we're leaving out here is mmu notifiers,
added in

commit 23b68395c7c78a764e8963fc15a7cfd318bf187f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon Aug 26 22:14:21 2019 +0200

    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end

that one covers a lot of other callsites, and it's also allowed to
wait on dma-fences from mmu notifiers. But there are no ready-made
functions exposed to prime this, so I've left it out for now.
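
In driver terms the primed rules boil down to something like this
(sketch only, obj/fence/foo are made up names):

	/* allowed: eviction code may wait for fences while holding the
	 * reservation lock
	 */
	dma_resv_lock(obj->resv, NULL);
	dma_fence_wait(fence, false);
	dma_resv_unlock(obj->resv);

	/* not allowed: code on the fence signalling path must not
	 * allocate with GFP_KERNEL (nor take dma_resv_lock()), since
	 * direct reclaim may itself wait for fences
	 */
	cookie = dma_fence_begin_signalling();
	foo = kmalloc(sizeof(*foo), GFP_KERNEL); /* lockdep complains */
	dma_fence_end_signalling(cookie);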

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/dma-buf/dma-resv.c | 1 +
 include/linux/dma-fence.h  | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 99c0a33c918d..392c336f0732 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -115,6 +115,7 @@ static int __init dma_resv_lockdep(void)
 	if (ret == -EDEADLK)
 		dma_resv_lock_slow(&obj, &ctx);
 	fs_reclaim_acquire(GFP_KERNEL);
+	__dma_fence_might_wait();
 	fs_reclaim_release(GFP_KERNEL);
 	ww_mutex_unlock(&obj.lock);
 	ww_acquire_fini(&ctx);
diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
index 3f288f7db2ef..09e23adb351d 100644
--- a/include/linux/dma-fence.h
+++ b/include/linux/dma-fence.h
@@ -360,6 +360,7 @@ dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep)
 #ifdef CONFIG_LOCKDEP
 bool dma_fence_begin_signalling(void);
 void dma_fence_end_signalling(bool cookie);
+void __dma_fence_might_wait(void);
 #else
 static inline bool dma_fence_begin_signalling(void)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 04/17] drm/vkms: Annotate vblank timer
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter, Rodrigo Siqueira,
	Haneen Mohammed, Daniel Vetter

This is needed to signal the fences from page flips, so annotate it
accordingly. We need to annotate the entire timer callback since if we
get stuck anywhere in there, the timer stops, and hence the fences stop
getting signalled. Just annotating the top part that does the vblank
handling isn't enough.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
---
 drivers/gpu/drm/vkms/vkms_crtc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c b/drivers/gpu/drm/vkms/vkms_crtc.c
index ac85e17428f8..a53a40848a72 100644
--- a/drivers/gpu/drm/vkms/vkms_crtc.c
+++ b/drivers/gpu/drm/vkms/vkms_crtc.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0+
 
+#include <linux/dma-fence.h>
+
 #include <drm/drm_atomic.h>
 #include <drm/drm_atomic_helper.h>
 #include <drm/drm_probe_helper.h>
@@ -14,7 +16,9 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 	struct drm_crtc *crtc = &output->crtc;
 	struct vkms_crtc_state *state;
 	u64 ret_overrun;
-	bool ret;
+	bool ret, fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	ret_overrun = hrtimer_forward_now(&output->vblank_hrtimer,
 					  output->period_ns);
@@ -49,6 +53,8 @@ static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
 			DRM_DEBUG_DRIVER("Composer worker already queued\n");
 	}
 
+	dma_fence_end_signalling(fence_cookie);
+
 	return HRTIMER_RESTART;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 05/17] drm/vblank: Annotate with dma-fence signalling section
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

This is rather overkill since currently all drivers call this from
hardirq (or at least timers). But maybe in the future we're going to
have threaded irq handlers and whatnot; it doesn't hurt to be prepared.
Plus this is an easy start for sprinkling these fence annotations into
shared code.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_vblank.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
index 758bf74e1cab..125ef0c0f9a1 100644
--- a/drivers/gpu/drm/drm_vblank.c
+++ b/drivers/gpu/drm/drm_vblank.c
@@ -24,6 +24,7 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/dma-fence.h>
 #include <linux/export.h>
 #include <linux/moduleparam.h>
 
@@ -1895,7 +1896,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[pipe];
 	unsigned long irqflags;
-	bool disable_irq;
+	bool disable_irq, fence_cookie;
 
 	if (WARN_ON_ONCE(!dev->num_crtcs))
 		return false;
@@ -1903,6 +1904,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (WARN_ON(pipe >= dev->num_crtcs))
 		return false;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	spin_lock_irqsave(&dev->event_lock, irqflags);
 
 	/* Need timestamp lock to prevent concurrent execution with
@@ -1915,6 +1918,7 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (!vblank->enabled) {
 		spin_unlock(&dev->vblank_time_lock);
 		spin_unlock_irqrestore(&dev->event_lock, irqflags);
+		dma_fence_end_signalling(fence_cookie);
 		return false;
 	}
 
@@ -1940,6 +1944,8 @@ bool drm_handle_vblank(struct drm_device *dev, unsigned int pipe)
 	if (disable_irq)
 		vblank_disable_fn(&vblank->disable_timer);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	return true;
 }
 EXPORT_SYMBOL(drm_handle_vblank);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 06/17] drm/atomic-helper: Add dma-fence annotations
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

This is a bit disappointing since we need to split the annotations
over all the different parts.

I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so a bad flag day. And also a
bit of a leaky abstraction.
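
For reference, the rejected alternative would have been roughly this
flag-day change to the hook signature (sketch only, not part of this
patch), plus touching all 13 implementations and their callers in one
go:

	void (*atomic_commit_tail)(struct drm_atomic_state *state,
				   bool fence_cookie);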

Hence just do it function-by-function.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 8ac3aa068261..0a541368246e 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
 void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
 void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
 	ktime_t start;
 	s64 commit_time_ms;
 	unsigned int i, new_self_refresh_mask = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	funcs = dev->mode_config.helper_private;
 
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
 		if (new_crtc_state->self_refresh_active)
 			new_self_refresh_mask |= BIT(i);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (funcs && funcs->atomic_commit_tail)
 		funcs->atomic_commit_tail(old_state);
 	else
@@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 			     bool nonblock)
 {
 	int ret;
+	bool fence_cookie;
 
 	if (state->async_update) {
 		ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	if (ret)
 		return ret;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	if (!nonblock) {
 		ret = drm_atomic_helper_wait_for_fences(dev, state, true);
 		if (ret)
@@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 */
 
 	drm_atomic_state_get(state);
+	dma_fence_end_signalling(fence_cookie);
 	if (nonblock)
 		queue_work(system_unbound_wq, &state->commit_work);
 	else
@@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	return 0;
 
 err:
+	dma_fence_end_signalling(fence_cookie);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 06/17] drm/atomic-helper: Add dma-fence annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	Chris Wilson, linaro-mm-sig, Daniel Vetter, Christian König,
	linux-media

This is a bit disappointing since we need to split the annotations
over all the different parts.

I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so bad flag day. And also a
bit leaky abstraction.

Hence just do it function-by-function.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 8ac3aa068261..0a541368246e 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
 void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
 void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
 	ktime_t start;
 	s64 commit_time_ms;
 	unsigned int i, new_self_refresh_mask = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	funcs = dev->mode_config.helper_private;
 
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
 		if (new_crtc_state->self_refresh_active)
 			new_self_refresh_mask |= BIT(i);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (funcs && funcs->atomic_commit_tail)
 		funcs->atomic_commit_tail(old_state);
 	else
@@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 			     bool nonblock)
 {
 	int ret;
+	bool fence_cookie;
 
 	if (state->async_update) {
 		ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	if (ret)
 		return ret;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	if (!nonblock) {
 		ret = drm_atomic_helper_wait_for_fences(dev, state, true);
 		if (ret)
@@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 */
 
 	drm_atomic_state_get(state);
+	dma_fence_end_signalling(fence_cookie);
 	if (nonblock)
 		queue_work(system_unbound_wq, &state->commit_work);
 	else
@@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	return 0;
 
 err:
+	dma_fence_end_signalling(fence_cookie);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [Intel-gfx] [RFC 06/17] drm/atomic-helper: Add dma-fence annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	Chris Wilson, linaro-mm-sig, Daniel Vetter, Christian König,
	linux-media

This is a bit disappointing since we need to split the annotations
over all the different parts.

I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so bad flag day. And also a
bit leaky abstraction.

Hence just do it function-by-function.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 8ac3aa068261..0a541368246e 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
 void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
 void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
 	ktime_t start;
 	s64 commit_time_ms;
 	unsigned int i, new_self_refresh_mask = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	funcs = dev->mode_config.helper_private;
 
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
 		if (new_crtc_state->self_refresh_active)
 			new_self_refresh_mask |= BIT(i);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (funcs && funcs->atomic_commit_tail)
 		funcs->atomic_commit_tail(old_state);
 	else
@@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 			     bool nonblock)
 {
 	int ret;
+	bool fence_cookie;
 
 	if (state->async_update) {
 		ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	if (ret)
 		return ret;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	if (!nonblock) {
 		ret = drm_atomic_helper_wait_for_fences(dev, state, true);
 		if (ret)
@@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 */
 
 	drm_atomic_state_get(state);
+	dma_fence_end_signalling(fence_cookie);
 	if (nonblock)
 		queue_work(system_unbound_wq, &state->commit_work);
 	else
@@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	return 0;
 
 err:
+	dma_fence_end_signalling(fence_cookie);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
-- 
2.26.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 06/17] drm/atomic-helper: Add dma-fence annotations
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, Maarten Lankhorst, LKML,
	amd-gfx, Chris Wilson, linaro-mm-sig, Daniel Vetter,
	Christian König, linux-media

This is a bit disappointing since we need to split the annotations
over all the different parts.

I was considering just leaking the critical section into the
->atomic_commit_tail callback of each driver. But that would mean we
need to pass the fence_cookie into each driver (there's a total of 13
implementations of this hook right now), so bad flag day. And also a
bit leaky abstraction.

Hence just do it function-by-function.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 8ac3aa068261..0a541368246e 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1549,6 +1549,7 @@ EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
 void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1560,6 +1561,8 @@ void drm_atomic_helper_commit_tail(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1579,6 +1582,7 @@ EXPORT_SYMBOL(drm_atomic_helper_commit_tail);
 void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 {
 	struct drm_device *dev = old_state->dev;
+	bool fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_commit_modeset_disables(dev, old_state);
 
@@ -1591,6 +1595,8 @@ void drm_atomic_helper_commit_tail_rpm(struct drm_atomic_state *old_state)
 
 	drm_atomic_helper_commit_hw_done(old_state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	drm_atomic_helper_wait_for_vblanks(dev, old_state);
 
 	drm_atomic_helper_cleanup_planes(dev, old_state);
@@ -1606,6 +1612,9 @@ static void commit_tail(struct drm_atomic_state *old_state)
 	ktime_t start;
 	s64 commit_time_ms;
 	unsigned int i, new_self_refresh_mask = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	funcs = dev->mode_config.helper_private;
 
@@ -1634,6 +1643,8 @@ static void commit_tail(struct drm_atomic_state *old_state)
 		if (new_crtc_state->self_refresh_active)
 			new_self_refresh_mask |= BIT(i);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (funcs && funcs->atomic_commit_tail)
 		funcs->atomic_commit_tail(old_state);
 	else
@@ -1789,6 +1800,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 			     bool nonblock)
 {
 	int ret;
+	bool fence_cookie;
 
 	if (state->async_update) {
 		ret = drm_atomic_helper_prepare_planes(dev, state);
@@ -1811,6 +1823,8 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	if (ret)
 		return ret;
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	if (!nonblock) {
 		ret = drm_atomic_helper_wait_for_fences(dev, state, true);
 		if (ret)
@@ -1848,6 +1862,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 */
 
 	drm_atomic_state_get(state);
+	dma_fence_end_signalling(fence_cookie);
 	if (nonblock)
 		queue_work(system_unbound_wq, &state->commit_work);
 	else
@@ -1856,6 +1871,7 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	return 0;
 
 err:
+	dma_fence_end_signalling(fence_cookie);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 07/17] drm/amdgpu: add dma-fence annotations to atomic commit path
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

I need a canary in a ttm-based atomic driver to make sure the
dma_fence_begin/end_signalling annotations actually work.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index ea0e039a667a..4469a8c96b08 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -57,6 +57,7 @@
 
 #include "ivsrcid/ivsrcid_vislands30.h"
 
+#include <linux/module.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
 #include <linux/version.h>
@@ -7109,6 +7110,9 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
 	struct drm_connector_state *old_con_state, *new_con_state;
 	struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state;
 	int crtc_disable_count = 0;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	drm_atomic_helper_update_legacy_modeset_state(dev, state);
 
@@ -7389,6 +7393,8 @@ static void amdgpu_dm_atomic_commit_tail(struct drm_atomic_state *state)
 	/* Signal HW programming completion */
 	drm_atomic_helper_commit_hw_done(state);
 
+	dma_fence_end_signalling(fence_cookie);
+
 	if (wait_for_vblank)
 		drm_atomic_helper_wait_for_flip_done(dev, state);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 08/17] drm/scheduler: use dma-fence annotations in main thread
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

If the scheduler rt thread gets stuck on a mutex that we're holding
while waiting for gpu workloads to complete, we have a problem.

Add dma-fence annotations so that lockdep can check this for us.

I've tried to review this quite carefully, and I think the annotation
is in the right spot. But I'm obviously no expert on the drm scheduler.
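
For reference, the kind of deadlock this is meant to catch looks
roughly like the following (a hypothetical sketch, not lifted from any
specific driver):

	/* some thread: holds a mutex while waiting for a GPU job */
	mutex_lock(&lock);
	dma_fence_wait(fence, false);
	mutex_unlock(&lock);

	/* scheduler main thread: needs the same mutex somewhere on the
	 * path that eventually signals that fence, e.g. in a driver's
	 * run_job callback -> deadlock
	 */
	mutex_lock(&lock);
	/* emit the job to the hardware ring */
	dma_fence_signal(fence);
	mutex_unlock(&lock);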

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 2f319102ae9f..06a736e506ad 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -763,9 +763,12 @@ static int drm_sched_main(void *param)
 	struct sched_param sparam = {.sched_priority = 1};
 	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
 	int r;
+	bool fence_cookie;
 
 	sched_setscheduler(current, SCHED_FIFO, &sparam);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	while (!kthread_should_stop()) {
 		struct drm_sched_entity *entity = NULL;
 		struct drm_sched_fence *s_fence;
@@ -823,6 +826,9 @@ static int drm_sched_main(void *param)
 
 		wake_up(&sched->job_scheduled);
 	}
+
+	dma_fence_end_signalling(fence_cookie);
+
 	return 0;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 09/17] drm/amdgpu: use dma-fence annotations in cs_submit()
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

This is a bit tricky: since ->notifier_lock is held while calling
dma_fence_wait, we must ensure that the read side (i.e.
dma_fence_begin_signalling) is within the same critical section. If we
mix this up, lockdep complains, and that's again why we want to have
these annotations.

A nice side effect of this is that, because of the fs_reclaim priming
for dma_fence_enable, lockdep now automatically checks for us that
nothing in here allocates memory, without even running any userptr
workloads.
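
The resulting shape of amdgpu_cs_submit() is roughly as follows (just a
sketch of the structure, see the diff below for the real thing):

	mutex_lock(&p->adev->notifier_lock);
	fence_cookie = dma_fence_begin_signalling();

	/* userptr validity check, fence emission, fence publishing */

	dma_fence_end_signalling(fence_cookie);
	mutex_unlock(&p->adev->notifier_lock);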

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 7653f62b1b2d..6db3f3c629b0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1213,6 +1213,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	struct amdgpu_job *job;
 	uint64_t seq;
 	int r;
+	bool fence_cookie;
 
 	job = p->job;
 	p->job = NULL;
@@ -1227,6 +1228,8 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	 */
 	mutex_lock(&p->adev->notifier_lock);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
 	 */
@@ -1264,12 +1267,14 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
+	dma_fence_end_signalling(fence_cookie);
 	mutex_unlock(&p->adev->notifier_lock);
 
 	return 0;
 
 error_abort:
 	drm_sched_job_cleanup(&job->base);
+	dma_fence_end_signalling(fence_cookie);
 	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 10/17] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

My dma-fence lockdep annotations caught an inversion because we
allocate memory where we really shouldn't:

	kmem_cache_alloc+0x2b/0x6d0
	amdgpu_fence_emit+0x30/0x330 [amdgpu]
	amdgpu_ib_schedule+0x306/0x550 [amdgpu]
	amdgpu_job_run+0x10f/0x260 [amdgpu]
	drm_sched_main+0x1b9/0x490 [gpu_sched]
	kthread+0x12e/0x150

The trouble right now is that lockdep only validates against GFP_FS,
which would be good enough for shrinkers. But for mmu_notifiers we
actually need to rule out everything that isn't GFP_ATOMIC, since they
can be called from any page laundering path, even if GFP_NOFS or
GFP_NOIO are set.

I guess we should improve the lockdep annotations for
fs_reclaim_acquire/release.

Of course the real fix is to properly preallocate this fence and stuff
it into the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat
out of the way for now.
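
The preallocation idea would look something like this (purely an
illustration; the job member shown here is made up, not existing amdgpu
API):

	/* at job creation time, where GFP_KERNEL is fine */
	job->prealloc_fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);

	/* later, on the scheduler/run_job path, consume it instead of
	 * allocating inside the dma-fence signalling critical section */
	fence = job->prealloc_fence;
	job->prealloc_fence = NULL;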

v2: Two more allocations in scheduler paths.

First one:

	__kmalloc+0x58/0x720
	amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
	drm_sched_main+0xf9/0x490 [gpu_sched]

Second one:

	kmem_cache_alloc+0x2b/0x6d0
	amdgpu_sync_fence+0x7e/0x110 [amdgpu]
	amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
	amdgpu_job_dependency+0xf9/0x120 [amdgpu]
	drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
	drm_sched_main+0xf9/0x490 [gpu_sched]

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d878fe7fee51..055b47241bb1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
 	uint32_t seq;
 	int r;
 
-	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+	fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
 	if (fence == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index fe92dcd94d4a..fdcd6659f5ad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -208,7 +208,7 @@ static int amdgpu_vmid_grab_idle(struct amdgpu_vm *vm,
 	if (ring->vmid_wait && !dma_fence_is_signaled(ring->vmid_wait))
 		return amdgpu_sync_fence(sync, ring->vmid_wait, false);
 
-	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_KERNEL);
+	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);
 	if (!fences)
 		return -ENOMEM;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index b87ca171986a..330476cc0c86 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -168,7 +168,7 @@ int amdgpu_sync_fence(struct amdgpu_sync *sync, struct dma_fence *f,
 	if (amdgpu_sync_add_later(sync, f, explicit))
 		return 0;
 
-	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_KERNEL);
+	e = kmem_cache_alloc(amdgpu_sync_slab, GFP_ATOMIC);
 	if (!e)
 		return -ENOMEM;
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 11/17] drm/amdgpu: DC also loves to allocate stuff where it shouldn't
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

Not going to bother with a complete & pretty commit message, just the
offending backtrace:

        kvmalloc_node+0x47/0x80
        dc_create_state+0x1f/0x60 [amdgpu]
        dc_commit_state+0xcb/0x9b0 [amdgpu]
        amdgpu_dm_atomic_commit_tail+0xd31/0x2010 [amdgpu]
        commit_tail+0xa4/0x140 [drm_kms_helper]
        drm_atomic_helper_commit+0x152/0x180 [drm_kms_helper]
        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
        drm_client_modeset_commit_locked+0x55/0x190 [drm]
        drm_client_modeset_commit+0x24/0x40 [drm]

v2: Found more in DC code, I'm just going to pile them all up.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/atom.c                 | 2 +-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c          | 4 +++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/atom.c b/drivers/gpu/drm/amd/amdgpu/atom.c
index 4cfc786699c7..1b0c674fab25 100644
--- a/drivers/gpu/drm/amd/amdgpu/atom.c
+++ b/drivers/gpu/drm/amd/amdgpu/atom.c
@@ -1226,7 +1226,7 @@ static int amdgpu_atom_execute_table_locked(struct atom_context *ctx, int index,
 	ectx.abort = false;
 	ectx.last_jump = 0;
 	if (ws)
-		ectx.ws = kcalloc(4, ws, GFP_KERNEL);
+		ectx.ws = kcalloc(4, ws, GFP_ATOMIC);
 	else
 		ectx.ws = NULL;
 
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 4469a8c96b08..9bfaa4cad483 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6622,7 +6622,7 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		struct dc_stream_update stream_update;
 	} *bundle;
 
-	bundle = kzalloc(sizeof(*bundle), GFP_KERNEL);
+	bundle = kzalloc(sizeof(*bundle), GFP_ATOMIC);
 
 	if (!bundle) {
 		dm_error("Failed to allocate update bundle\n");
diff --git a/drivers/gpu/drm/amd/display/dc/core/dc.c b/drivers/gpu/drm/amd/display/dc/core/dc.c
index 401d1c66a411..a37a32442a5a 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
@@ -1416,8 +1416,10 @@ bool dc_post_update_surfaces_to_stream(struct dc *dc)
 
 struct dc_state *dc_create_state(struct dc *dc)
 {
+	/* No, you really can't allocate random crap here this late in
+	 * atomic_commit_tail. */
 	struct dc_state *context = kvzalloc(sizeof(struct dc_state),
-					    GFP_KERNEL);
+					    GFP_ATOMIC);
 
 	if (!context)
 		return NULL;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 12/17] drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
  2020-05-12  8:59 ` Daniel Vetter
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

Trying to grab dma_resv_lock in commit_tail, before we've run all the
code that leads to the eventual signalling of the vblank event (which
can be a dma_fence), is deadlock-prone. Don't do that.

Here the solution is easy, because grabbing the lock just to read
something races anyway. We don't need to bother; READ_ONCE is
equivalent and avoids the locking issue.

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 9bfaa4cad483..28e1af9f823c 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6699,7 +6699,11 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		 * explicitly on fences instead
 		 * and in general should be called for
 		 * blocking commit to as per framework helpers
+		 *
+		 * Yes, this deadlocks, since you're calling dma_resv_lock in a
+		 * path that leads to a dma_fence_signal(). Don't do that.
 		 */
+#if 0
 		r = amdgpu_bo_reserve(abo, true);
 		if (unlikely(r != 0))
 			DRM_ERROR("failed to reserve buffer before flip\n");
@@ -6709,6 +6713,12 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
 		tmz_surface = amdgpu_bo_encrypted(abo);
 
 		amdgpu_bo_unreserve(abo);
+#endif
+		/*
+		 * this races anyway, so READ_ONCE isn't any better or worse
+		 * than the stuff above. Except the stuff above can deadlock.
+		 */
+		tiling_flags = READ_ONCE(abo->tiling_flags);
 
 		fill_dc_plane_info_and_addr(
 			dm->adev, new_plane_state, tiling_flags,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 13/17] drm/scheduler: use dma-fence annotations in tdr work
  2020-05-12  8:59 ` Daniel Vetter
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

In the face of unprivileged userspace being able to submit bogus gpu
workloads, the kernel needs gpu timeout and reset (tdr) to guarantee
that dma_fences actually complete. Annotate this worker to make sure
we don't have any accidental locking inversions or other problems
lurking.

Originally this was part of the overall scheduler annotation patch.
But amdgpu has some glorious inversions here:

- grabs console_lock
- does a full modeset, which grabs all kinds of locks
  (drm_modeset_lock, dma_resv_lock), which can deadlock when
  dma_fence_wait() is called while they are held.
- almost minor at that point, but the modeset code also allocates
  memory

These all look like they'll be very hard to fix properly, since the
hardware seems to require a full display reset as part of any gpu
recovery.

Hence split out as a separate patch.

Since amdgpu isn't the only hardware driver that needs to reset the
display (at least gen2/3 on intel have the same problem), we need a
generic solution for this. There are two tricks we could steal from
drm/i915 and lift to dma-fence:

- The big whack, aka force-complete all fences. i915 does this for all
  pending jobs if the reset is somehow stuck. Trouble is we'd need to
  do this for all fences in the entire system, and just the
  book-keeping for that will be fun. Plus lots of drivers use fences
  for all kinds of internal stuff like memory management, so
  unconditionally resetting all of them doesn't work.

  I'm also hoping that with these fence annotations we could enlist
  lockdep in finding the last offenders causing deadlocks, and we
  could remove this get-out-of-jail trick.

- The more feasible approach (across drivers at least as part of the
  dma_fence contract) is what drm/i915 does for gen2/3: When we need
  to reset the display we wake up all dma_fence_wait_interruptible
  calls, or at least i915's internal equivalent of those.

  Relying on ioctl restart we force all other threads to release their
  locks, which means the tdr thread is guaranteed to be able to get
  them. I think we could implement this at the dma_fence level,
  including proper lockdep annotations.

  dma_fence_begin_tdr():
  - must be nested within a dma_fence_begin/end_signalling section
  - will wake up all interruptible (but not the non-interruptible)
    dma_fence_wait() calls and force them to complete with a
    -ERESTARTSYS errno code. All new interruptible calls to
    dma_fence_wait() will immediately fail with the same error code.

  dma_fence_end_tdr():
  - this will convert dma_fence_wait() calls back to normal.

  Of course interrupting dma_fence_wait is only ok if the caller
  specified that, which means we need to split the annotations into
  interruptible and non-interruptible versions. If we then make sure
  that we only use interruptible dma_fence_wait() calls while holding
  drm_modeset_lock we can grab them in tdr code, and allow display
  resets. Doing the same for dma_resv_lock might be a lot harder, so
  buffer updates must be avoided.

  What's worse, we're not going to be able to make the dma_fence_wait
  calls in mmu-notifiers interruptible; that doesn't work. So
  allocating memory still won't be allowed, even in tdr sections. Plus
  obviously we can use this trick only in tdr, since it is rather
  intrusive. A rough usage sketch for such an API is shown below.

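For illustration only, a minimal sketch of how a driver's timeout handler
might use such an API. dma_fence_begin_tdr()/dma_fence_end_tdr() are the
proposal above and do not exist; example_device, edev->drm and
example_reset_display() are made up:

	static void example_timedout_job(struct example_device *edev)
	{
		bool fence_cookie;

		/* tdr as a whole is a fence-signalling critical section */
		fence_cookie = dma_fence_begin_signalling();

		/* proposed helper: kicks interruptible waiters out with
		 * -ERESTARTSYS so they drop drm_modeset_lock for us */
		dma_fence_begin_tdr();

		drm_modeset_lock_all(edev->drm);
		example_reset_display(edev);
		drm_modeset_unlock_all(edev->drm);

		dma_fence_end_tdr();

		/* resubmit or force-complete pending jobs, then signal */
		dma_fence_end_signalling(fence_cookie);
	}
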
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 06a736e506ad..e34a44376e87 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -279,9 +279,12 @@ static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_job *job;
+	bool fence_cookie;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
+	fence_cookie = dma_fence_begin_signalling();
+
 	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
@@ -313,6 +316,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	spin_lock(&sched->job_list_lock);
 	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
+
+	dma_fence_end_signalling(fence_cookie);
 }
 
  /**
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 14/17] drm/amdgpu: use dma-fence annotations for gpu reset code
  2020-05-12  8:59 ` Daniel Vetter
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König

To improve coverage, also annotate the gpu reset code itself, since
that's called from places other than drm/scheduler (which is already
annotated). Annotations nest, so this doesn't break anything, and
allows easier testing. A rough sketch of the nesting is included below.

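For illustration, a rough sketch (not part of this patch) of how the nested
annotations compose when the gpu reset code is reached from the already
annotated drm/scheduler tdr worker:

	/* opened in drm_sched_job_timedout() */
	bool outer = dma_fence_begin_signalling();

	/* ... the scheduler calls back into the driver, which enters
	 * amdgpu_device_gpu_recover() and opens its own section ... */
	bool inner = dma_fence_begin_signalling();

	/* reset work; the lockdep checks apply to everything in here */

	dma_fence_end_signalling(inner);	/* ends the nested section */
	dma_fence_end_signalling(outer);	/* ends the outer section */
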
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b038ddbb2ece..5560d045b2e0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4140,6 +4140,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	bool use_baco =
 		(amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) ?
 		true : false;
+	bool fence_cookie;
+
+	fence_cookie = dma_fence_begin_signalling();
 
 	/*
 	 * Flush RAM to disk so that after reboot
@@ -4168,6 +4171,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 			  job ? job->base.id : -1, hive->hive_id);
 		mutex_unlock(&hive->hive_lock);
+		dma_fence_end_signalling(fence_cookie);
 		return 0;
 	}
 
@@ -4178,8 +4182,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 */
 	INIT_LIST_HEAD(&device_list);
 	if (adev->gmc.xgmi.num_physical_nodes > 1) {
-		if (!hive)
+		if (!hive) {
+			dma_fence_end_signalling(fence_cookie);
 			return -ENODEV;
+		}
 		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
 			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 		device_list_handle = &hive->device_list;
@@ -4194,6 +4200,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
 				  job ? job->base.id : -1);
 			mutex_unlock(&hive->hive_lock);
+			dma_fence_end_signalling(fence_cookie);
 			return 0;
 		}
 
@@ -4319,6 +4326,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+	dma_fence_end_signalling(fence_cookie);
 	return r;
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 15/17] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
  2020-05-12  8:59 ` Daniel Vetter
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

This is one from the department of "maybe play lottery if you hit
this, karma compensation might work". Or at least lockdep ftw!

This reverts commit 565d1941557756a584ac357d945bc374d5fcd1d0.

It's not quite as low-risk as the commit message claims, because this
grabs console_lock from the gpu reset path. console_lock might be held
by another thread while that thread allocates memory, and that
allocation might never complete because its dma_fence_wait() is stuck
waiting on our gpu reset:

[  136.763714] ======================================================
[  136.763714] WARNING: possible circular locking dependency detected
[  136.763715] 5.7.0-rc3+ #346 Tainted: G        W
[  136.763716] ------------------------------------------------------
[  136.763716] kworker/2:3/682 is trying to acquire lock:
[  136.763716] ffffffff8226f140 (console_lock){+.+.}-{0:0}, at: drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763723]
               but task is already holding lock:
[  136.763724] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763726]
               which lock already depends on the new lock.

[  136.763726]
               the existing dependency chain (in reverse order) is:
[  136.763727]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  136.763730]        __dma_fence_might_wait+0x41/0xb0
[  136.763732]        dma_resv_lockdep+0x171/0x202
[  136.763734]        do_one_initcall+0x5d/0x2f0
[  136.763736]        kernel_init_freeable+0x20d/0x26d
[  136.763738]        kernel_init+0xa/0xfb
[  136.763740]        ret_from_fork+0x27/0x50
[  136.763740]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  136.763743]        fs_reclaim_acquire.part.0+0x25/0x30
[  136.763745]        kmem_cache_alloc_trace+0x2e/0x6e0
[  136.763747]        device_create_groups_vargs+0x52/0xf0
[  136.763747]        device_create+0x49/0x60
[  136.763749]        fb_console_init+0x25/0x145
[  136.763750]        fbmem_init+0xcc/0xe2
[  136.763750]        do_one_initcall+0x5d/0x2f0
[  136.763751]        kernel_init_freeable+0x20d/0x26d
[  136.763752]        kernel_init+0xa/0xfb
[  136.763753]        ret_from_fork+0x27/0x50
[  136.763753]
               -> #0 (console_lock){+.+.}-{0:0}:
[  136.763755]        __lock_acquire+0x1241/0x23f0
[  136.763756]        lock_acquire+0xad/0x370
[  136.763757]        console_lock+0x47/0x70
[  136.763761]        drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763809]        amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.763850]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.763851]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.763852]        process_one_work+0x23c/0x580
[  136.763853]        worker_thread+0x50/0x3b0
[  136.763854]        kthread+0x12e/0x150
[  136.763855]        ret_from_fork+0x27/0x50
[  136.763855]
               other info that might help us debug this:

[  136.763856] Chain exists of:
                 console_lock --> fs_reclaim --> dma_fence_map

[  136.763857]  Possible unsafe locking scenario:

[  136.763857]        CPU0                    CPU1
[  136.763857]        ----                    ----
[  136.763857]   lock(dma_fence_map);
[  136.763858]                                lock(fs_reclaim);
[  136.763858]                                lock(dma_fence_map);
[  136.763858]   lock(console_lock);
[  136.763859]
                *** DEADLOCK ***

[  136.763860] 4 locks held by kworker/2:3/682:
[  136.763860]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763862]  #1: ffffc90000cafe58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763863]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763865]  #3: ffff8887ab621748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x5ab/0xe7b [amdgpu]
[  136.763914]
               stack backtrace:
[  136.763915] CPU: 2 PID: 682 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
[  136.763916] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[  136.763918] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  136.763919] Call Trace:
[  136.763922]  dump_stack+0x8f/0xd0
[  136.763924]  check_noncircular+0x162/0x180
[  136.763926]  __lock_acquire+0x1241/0x23f0
[  136.763927]  lock_acquire+0xad/0x370
[  136.763932]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763933]  ? mark_held_locks+0x2d/0x80
[  136.763934]  ? _raw_spin_unlock_irqrestore+0x46/0x60
[  136.763936]  console_lock+0x47/0x70
[  136.763940]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763944]  drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763993]  amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.764036]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.764038]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.764040]  process_one_work+0x23c/0x580
[  136.764041]  worker_thread+0x50/0x3b0
[  136.764042]  ? process_one_work+0x580/0x580
[  136.764044]  kthread+0x12e/0x150
[  136.764045]  ? kthread_create_worker_on_cpu+0x70/0x70
[  136.764046]  ret_from_fork+0x27/0x50

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5560d045b2e0..3584e29323c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4047,8 +4047,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 				if (r)
 					goto out;
 
-				amdgpu_fbdev_set_suspend(tmp_adev, 0);
-
 				/* must succeed. */
 				amdgpu_ras_resume(tmp_adev);
 
@@ -4217,8 +4215,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		 */
 		amdgpu_unregister_gpu_instance(tmp_adev);
 
-		amdgpu_fbdev_set_suspend(tmp_adev, 1);
-
 		/* disable ras on ALL IPs */
 		if (!(in_ras_intr && !use_baco) &&
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 15/17] Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
@ 2020-05-12  8:59   ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	Chris Wilson, linaro-mm-sig, Daniel Vetter, Christian König,
	linux-media

This is one from the department of "maybe play lottery if you hit
this, karma compensation might work". Or at least lockdep ftw!

This reverts commit 565d1941557756a584ac357d945bc374d5fcd1d0.

It's not quite as low-risk as the commit message claims, because this
grabs console_lock from the gpu reset path. console_lock might be held
by another thread while that thread allocates memory, and that
allocation might never complete because its dma_fence_wait() is stuck
waiting on our gpu reset:

[  136.763714] ======================================================
[  136.763714] WARNING: possible circular locking dependency detected
[  136.763715] 5.7.0-rc3+ #346 Tainted: G        W
[  136.763716] ------------------------------------------------------
[  136.763716] kworker/2:3/682 is trying to acquire lock:
[  136.763716] ffffffff8226f140 (console_lock){+.+.}-{0:0}, at: drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763723]
               but task is already holding lock:
[  136.763724] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763726]
               which lock already depends on the new lock.

[  136.763726]
               the existing dependency chain (in reverse order) is:
[  136.763727]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  136.763730]        __dma_fence_might_wait+0x41/0xb0
[  136.763732]        dma_resv_lockdep+0x171/0x202
[  136.763734]        do_one_initcall+0x5d/0x2f0
[  136.763736]        kernel_init_freeable+0x20d/0x26d
[  136.763738]        kernel_init+0xa/0xfb
[  136.763740]        ret_from_fork+0x27/0x50
[  136.763740]
               -> #1 (fs_reclaim){+.+.}-{0:0}:
[  136.763743]        fs_reclaim_acquire.part.0+0x25/0x30
[  136.763745]        kmem_cache_alloc_trace+0x2e/0x6e0
[  136.763747]        device_create_groups_vargs+0x52/0xf0
[  136.763747]        device_create+0x49/0x60
[  136.763749]        fb_console_init+0x25/0x145
[  136.763750]        fbmem_init+0xcc/0xe2
[  136.763750]        do_one_initcall+0x5d/0x2f0
[  136.763751]        kernel_init_freeable+0x20d/0x26d
[  136.763752]        kernel_init+0xa/0xfb
[  136.763753]        ret_from_fork+0x27/0x50
[  136.763753]
               -> #0 (console_lock){+.+.}-{0:0}:
[  136.763755]        __lock_acquire+0x1241/0x23f0
[  136.763756]        lock_acquire+0xad/0x370
[  136.763757]        console_lock+0x47/0x70
[  136.763761]        drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763809]        amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.763850]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.763851]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.763852]        process_one_work+0x23c/0x580
[  136.763853]        worker_thread+0x50/0x3b0
[  136.763854]        kthread+0x12e/0x150
[  136.763855]        ret_from_fork+0x27/0x50
[  136.763855]
               other info that might help us debug this:

[  136.763856] Chain exists of:
                 console_lock --> fs_reclaim --> dma_fence_map

[  136.763857]  Possible unsafe locking scenario:

[  136.763857]        CPU0                    CPU1
[  136.763857]        ----                    ----
[  136.763857]   lock(dma_fence_map);
[  136.763858]                                lock(fs_reclaim);
[  136.763858]                                lock(dma_fence_map);
[  136.763858]   lock(console_lock);
[  136.763859]
                *** DEADLOCK ***

[  136.763860] 4 locks held by kworker/2:3/682:
[  136.763860]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763862]  #1: ffffc90000cafe58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  136.763863]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  136.763865]  #3: ffff8887ab621748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x5ab/0xe7b [amdgpu]
[  136.763914]
               stack backtrace:
[  136.763915] CPU: 2 PID: 682 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
[  136.763916] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[  136.763918] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  136.763919] Call Trace:
[  136.763922]  dump_stack+0x8f/0xd0
[  136.763924]  check_noncircular+0x162/0x180
[  136.763926]  __lock_acquire+0x1241/0x23f0
[  136.763927]  lock_acquire+0xad/0x370
[  136.763932]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763933]  ? mark_held_locks+0x2d/0x80
[  136.763934]  ? _raw_spin_unlock_irqrestore+0x46/0x60
[  136.763936]  console_lock+0x47/0x70
[  136.763940]  ? drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763944]  drm_fb_helper_set_suspend_unlocked+0x7b/0xa0 [drm_kms_helper]
[  136.763993]  amdgpu_device_gpu_recover.cold+0x21e/0xe7b [amdgpu]
[  136.764036]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  136.764038]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  136.764040]  process_one_work+0x23c/0x580
[  136.764041]  worker_thread+0x50/0x3b0
[  136.764042]  ? process_one_work+0x580/0x580
[  136.764044]  kthread+0x12e/0x150
[  136.764045]  ? kthread_create_worker_on_cpu+0x70/0x70
[  136.764046]  ret_from_fork+0x27/0x50
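
For orientation, here is a rough sketch of the two call paths lockdep
combined above. This is illustrative pseudocode only, not part of the
patch; the sketch function name is made up, while the helpers
(dma_fence_begin/end_signalling, amdgpu_fbdev_set_suspend,
console_lock) are the ones visible in the splat and in this series.

	/* Path A: gpu reset, annotated as a fence-signalling critical
	 * section, so lockdep treats it as holding dma_fence_map.
	 */
	static void amdgpu_reset_path_sketch(struct amdgpu_device *adev)
	{
		bool cookie = dma_fence_begin_signalling();

		/* The reverted commit added this call; it ends up in
		 * drm_fb_helper_set_suspend_unlocked() and takes
		 * console_lock, recording dma_fence_map -> console_lock.
		 */
		amdgpu_fbdev_set_suspend(adev, 1);

		/* ... actual reset work, must eventually signal fences ... */

		dma_fence_end_signalling(cookie);
	}

	/* Path B (pre-existing): fbcon setup allocates memory while
	 * holding console_lock (console_lock -> fs_reclaim), and reclaim
	 * may wait on fences (fs_reclaim -> dma_fence_map, primed by
	 * dma_resv_lockdep()). Together with path A this closes the cycle.
	 */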

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5560d045b2e0..3584e29323c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4047,8 +4047,6 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 				if (r)
 					goto out;
 
-				amdgpu_fbdev_set_suspend(tmp_adev, 0);
-
 				/* must succeed. */
 				amdgpu_ras_resume(tmp_adev);
 
@@ -4217,8 +4215,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		 */
 		amdgpu_unregister_gpu_instance(tmp_adev);
 
-		amdgpu_fbdev_set_suspend(tmp_adev, 1);
-
 		/* disable ras on ALL IPs */
 		if (!(in_ras_intr && !use_baco) &&
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
-- 
2.26.2

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

...

I think it's time to stop this little exercise.

The lockdep splat, for the record:

[  132.583381] ======================================================
[  132.584091] WARNING: possible circular locking dependency detected
[  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
[  132.585461] ------------------------------------------------------
[  132.586184] kworker/2:3/865 is trying to acquire lock:
[  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.587569]
               but task is already holding lock:
[  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  132.589803]
               which lock already depends on the new lock.

[  132.592009]
               the existing dependency chain (in reverse order) is:
[  132.593507]
               -> #2 (dma_fence_map){++++}-{0:0}:
[  132.595019]        dma_fence_begin_signalling+0x50/0x60
[  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
[  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
[  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
[  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
[  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
[  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
[  132.600539]        fbcon_init+0x2e8/0x660
[  132.601344]        visual_init+0xce/0x130
[  132.602156]        do_bind_con_driver+0x1bc/0x2b0
[  132.602970]        do_take_over_console+0x115/0x180
[  132.603763]        do_fbcon_takeover+0x58/0xb0
[  132.604564]        register_framebuffer+0x1ee/0x300
[  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
[  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
[  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
[  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[  132.609511]        local_pci_probe+0x42/0x80
[  132.610324]        pci_device_probe+0x104/0x1a0
[  132.611130]        really_probe+0x147/0x3c0
[  132.611939]        driver_probe_device+0xb6/0x100
[  132.612766]        device_driver_attach+0x53/0x60
[  132.613593]        __driver_attach+0x8c/0x150
[  132.614419]        bus_for_each_dev+0x7b/0xc0
[  132.615249]        bus_add_driver+0x14c/0x1f0
[  132.616071]        driver_register+0x6c/0xc0
[  132.616902]        do_one_initcall+0x5d/0x2f0
[  132.617731]        do_init_module+0x5c/0x230
[  132.618560]        load_module+0x2981/0x2bc0
[  132.619391]        __do_sys_finit_module+0xaa/0x110
[  132.620228]        do_syscall_64+0x5a/0x250
[  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  132.621903]
               -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
[  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
[  132.624448]        ww_mutex_lock+0x43/0xb0
[  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
[  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
[  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
[  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
[  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
[  132.629804]        local_pci_probe+0x42/0x80
[  132.630690]        pci_device_probe+0x104/0x1a0
[  132.631583]        really_probe+0x147/0x3c0
[  132.632479]        driver_probe_device+0xb6/0x100
[  132.633379]        device_driver_attach+0x53/0x60
[  132.634275]        __driver_attach+0x8c/0x150
[  132.635170]        bus_for_each_dev+0x7b/0xc0
[  132.636069]        bus_add_driver+0x14c/0x1f0
[  132.636974]        driver_register+0x6c/0xc0
[  132.637870]        do_one_initcall+0x5d/0x2f0
[  132.638765]        do_init_module+0x5c/0x230
[  132.639654]        load_module+0x2981/0x2bc0
[  132.640522]        __do_sys_finit_module+0xaa/0x110
[  132.641372]        do_syscall_64+0x5a/0x250
[  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
[  132.643022]
               -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
[  132.644643]        __lock_acquire+0x1241/0x23f0
[  132.645469]        lock_acquire+0xad/0x370
[  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
[  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
[  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  132.652594]        process_one_work+0x23c/0x580
[  132.653402]        worker_thread+0x50/0x3b0
[  132.654139]        kthread+0x12e/0x150
[  132.654868]        ret_from_fork+0x27/0x50
[  132.655598]
               other info that might help us debug this:

[  132.657739] Chain exists of:
                 crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map

[  132.659877]  Possible unsafe locking scenario:

[  132.661416]        CPU0                    CPU1
[  132.662126]        ----                    ----
[  132.662847]   lock(dma_fence_map);
[  132.663574]                                lock(crtc_ww_class_mutex);
[  132.664319]                                lock(dma_fence_map);
[  132.665063]   lock(crtc_ww_class_acquire);
[  132.665799]
                *** DEADLOCK ***

[  132.667965] 4 locks held by kworker/2:3/865:
[  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
[  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
[  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
[  132.671902]
               stack backtrace:
[  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
[  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
[  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  132.676046] Call Trace:
[  132.676897]  dump_stack+0x8f/0xd0
[  132.677748]  check_noncircular+0x162/0x180
[  132.678604]  ? stack_trace_save+0x4b/0x70
[  132.679459]  __lock_acquire+0x1241/0x23f0
[  132.680311]  lock_acquire+0xad/0x370
[  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.682021]  ? cpumask_next+0x16/0x20
[  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
[  132.683737]  ? __module_address+0x28/0xf0
[  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
[  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
[  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
[  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
[  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
[  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
[  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
[  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
[  132.693597]  process_one_work+0x23c/0x580
[  132.694487]  worker_thread+0x50/0x3b0
[  132.695373]  ? process_one_work+0x580/0x580
[  132.696264]  kthread+0x12e/0x150
[  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
[  132.698057]  ret_from_fork+0x27/0x50
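
Again purely for illustration (not part of the patch): the shape of the
inversion, in a made-up sketch function; the helper names are taken
from the splat above, the details are heavily simplified.

	/* Normal atomic commit: modeset locks first, then the
	 * fence-signalling critical section, i.e.
	 * crtc_ww_class -> dma_fence_map.
	 */
	static void atomic_commit_sketch(struct drm_atomic_state *state)
	{
		bool cookie;

		/* crtc_ww_class_acquire / crtc_ww_class_mutex are already
		 * held by the caller, e.g. drm_client_modeset_commit_locked().
		 */
		cookie = dma_fence_begin_signalling();
		/* ... program hw, schedule flips, publish fences ... */
		dma_fence_end_signalling(cookie);
	}

	/* Gpu recovery runs the other way around: it is already inside
	 * the fence-signalling section (it has to guarantee the timed-out
	 * fences get signalled), and then dm_suspend() ->
	 * drm_atomic_helper_suspend() acquires the modeset locks, i.e.
	 * dma_fence_map -> crtc_ww_class. Opposite order, hence the splat.
	 */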

Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3584e29323c0..b3b84a0d3baf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
 		/* displays are handled separately */
 		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
 			/* XXX handle errors */
+
+			/*
+			 * This is dm_suspend, which grabs modeset locks, and
+			 * that's a pretty good inversion against the dma_fence_signal
+			 * guarantee which gpu recovery is supposed to provide.
+			 *
+			 * Don't ask me how to fix this.
+			 */
 			r = adev->ip_blocks[i].version->funcs->suspend(adev);
 			/* XXX handle errors */
 			if (r) {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* [RFC 17/17] drm/i915: Annotate dma_fence_work
  2020-05-12  8:59 ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  8:59   ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  8:59 UTC (permalink / raw)
  To: DRI Development
  Cc: LKML, Daniel Vetter, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

i915 does tons of allocations from this worker, which lockdep catches.

Also, generic infrastructure like this, with big potential impact on how
dma_fence and other cross-driver contracts work, really should be
reviewed on dri-devel. Implementing custom wheels for everything within
the driver is a classic case of the "platform problem" [1], which in
upstream we really shouldn't have.

Since, unlike for amdgpu, there's no quick way to solve these splats
(dma_fence_work is used a bunch in basic buffer management and command
submission), I'm giving up at this point here. Annotating the i915
scheduler and gpu reset code would be interesting, but since lockdep
is one-shot we can't see what surprises would lurk there.

1: https://lwn.net/Articles/443531/
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-rdma@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_sw_fence_work.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.c b/drivers/gpu/drm/i915/i915_sw_fence_work.c
index a3a81bb8f2c3..5b74acadaef5 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence_work.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence_work.c
@@ -17,12 +17,15 @@ static void fence_work(struct work_struct *work)
 {
 	struct dma_fence_work *f = container_of(work, typeof(*f), work);
 	int err;
+	bool fence_cookie;
 
+	fence_cookie = dma_fence_begin_signalling();
 	err = f->ops->work(f);
 	if (err)
 		dma_fence_set_error(&f->dma, err);
 
 	fence_complete(f);
+	dma_fence_end_signalling(fence_cookie);
 	dma_fence_put(&f->dma);
 }
 
-- 
2.26.2
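
As a concrete illustration of what lockdep catches here: a work
callback that allocates with GFP_KERNEL inside the signalling section.
This is a hypothetical sketch (bad_fence_work is a made-up name), not
actual i915 code, reusing the helpers from the file above:

	static void bad_fence_work(struct work_struct *work)
	{
		struct dma_fence_work *f = container_of(work, typeof(*f), work);
		bool fence_cookie = dma_fence_begin_signalling();
		void *buf;

		/*
		 * Direct reclaim can itself end up waiting on dma_fences, so
		 * allocating inside the signalling critical section gives
		 * lockdep a cycle between dma_fence_map and fs_reclaim.
		 */
		buf = kmalloc(4096, GFP_KERNEL);
		kfree(buf);

		fence_complete(f);
		dma_fence_end_signalling(fence_cookie);
		dma_fence_put(&f->dma);
	}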


^ permalink raw reply related	[flat|nested] 203+ messages in thread

* Re: [RFC 01/17] dma-fence: add might_sleep annotation to _wait()
  2020-05-12  8:59   ` Daniel Vetter
  (?)
@ 2020-05-12  9:03     ` Chris Wilson
  -1 siblings, 0 replies; 203+ messages in thread
From: Chris Wilson @ 2020-05-12  9:03 UTC (permalink / raw)
  To: DRI Development, Daniel Vetter
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	linaro-mm-sig, Daniel Vetter, Christian König, linux-media

Quoting Daniel Vetter (2020-05-12 09:59:28)
> But only for non-zero timeout, to avoid false positives.
> 
> One question here is whether the might_sleep should be unconditional,
> or only for real timeouts. I'm not sure, so went with the more
> defensive option. But in the interest of locking down the cross-driver
> dma_fence rules we might want to be more aggressive.

You can argue for enforcing might_sleep() unconditionally, since
internal queries should be using dma_fence_is_signaled() and not
dma_fence_wait_timeout(0).

> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/dma-buf/dma-fence.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 052a41e2451c..6802125349fb 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>         if (WARN_ON(timeout < 0))
>                 return -EINVAL;
>  
> +       if (timeout > 0)
> +               might_sleep();

might_sleep_if(timeout > 0);
-Chris
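
Applied to the hunk above that is just the following (illustrative
rewrite, behaviour is unchanged since might_sleep_if() expands to the
open-coded check):

-	if (timeout > 0)
-		might_sleep();
+	might_sleep_if(timeout > 0);
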
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
  2020-05-12  8:59   ` Daniel Vetter
  (?)
@ 2020-05-12  9:04     ` Chris Wilson
  -1 siblings, 0 replies; 203+ messages in thread
From: Chris Wilson @ 2020-05-12  9:04 UTC (permalink / raw)
  To: DRI Development, Daniel Vetter
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, amd-gfx,
	linaro-mm-sig, Daniel Vetter, Christian König, linux-media

Quoting Daniel Vetter (2020-05-12 09:59:29)
> Design is similar to the lockdep annotations for workers, but with
> some twists:
> 
> - We use a read-lock for the execution/worker/completion side, so that
>   this explicit annotation can be more liberally sprinkled around.
>   With read locks lockdep isn't going to complain if the read-side
>   isn't nested the same way under all circumstances, so ABBA deadlocks
>   are ok. Which they are, since this is an annotation only.
> 
> - We're using non-recursive lockdep read lock mode, since in recursive
>   read lock mode lockdep does not catch read side hazards. And we
>   _very_ much want read side hazards to be caught. For full details of
>   this limitation see
> 
>   commit e91498589746065e3ae95d9a00b068e525eec34f
>   Author: Peter Zijlstra <peterz@infradead.org>
>   Date:   Wed Aug 23 13:13:11 2017 +0200
> 
>       locking/lockdep/selftests: Add mixed read-write ABBA tests
> 
> - To allow nesting of the read-side explicit annotations we explicitly
>   keep track of the nesting. lock_is_held() allows us to do that.
> 
> - The wait-side annotation is a write lock, and entirely done within
>   dma_fence_wait() for everyone by default.
> 
> - To be able to freely annotate helper functions I want to make it ok
>   to call dma_fence_begin/end_signalling from soft/hardirq context.
>   First attempt was using the hardirq locking context for the write
>   side in lockdep, but this forces all normal spinlocks nested within
>   dma_fence_begin/end_signalling to be spinlocks. That bollocks.
> 
>   The approach now is to simple check in_atomic(), and for these cases
>   entirely rely on the might_sleep() check in dma_fence_wait(). That
>   will catch any wrong nesting against spinlocks from soft/hardirq
>   contexts.
> 
> The idea here is that every code path that's critical for eventually
> signalling a dma_fence should be annotated with
> dma_fence_begin/end_signalling. The annotation ideally starts right
> after a dma_fence is published (added to a dma_resv, exposed as a
> sync_file fd, attached to a drm_syncobj fd, or anything else that
> makes the dma_fence visible to other kernel threads), up to and
> including the dma_fence_wait(). Examples are irq handlers, the
> scheduler rt threads, the tail of execbuf (after the corresponding
> fences are visible), any workers that end up signalling dma_fences and
> really anything else. Not annotated should be code paths that only
> complete fences opportunistically as the gpu progresses, like e.g.
> shrinker/eviction code.
> 
> The main class of deadlocks this is supposed to catch are:
> 
> Thread A:
> 
>         mutex_lock(A);
>         mutex_unlock(A);
> 
>         dma_fence_signal();
> 
> Thread B:
> 
>         mutex_lock(A);
>         dma_fence_wait();
>         mutex_unlock(A);
> 
> Thread B is blocked on A signalling the fence, but A never gets around
> to that because it cannot acquire the lock A.
> 
> Note that dma_fence_wait() is allowed to be nested within
> dma_fence_begin/end_signalling sections. To allow this to happen the
> read lock needs to be upgraded to a write lock, which means that any
> other lock is acquired between the dma_fence_begin_signalling() call and
> the call to dma_fence_wait(), and still held, this will result in an
> immediate lockdep complaint. The only other option would be to not
> annotate such calls, defeating the point. Therefore these annotations
> cannot be sprinkled over the code entirely mindless to avoid false
> positives.
> 
> v2: handle soft/hardirq ctx better against write side and dont forget
> EXPORT_SYMBOL, drivers can't use this otherwise.
> 
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
>  include/linux/dma-fence.h   | 12 +++++++++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 6802125349fb..d5c0fd2efc70 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
>  }
>  EXPORT_SYMBOL(dma_fence_context_alloc);
>  
> +#ifdef CONFIG_LOCKDEP
> +struct lockdep_map     dma_fence_lockdep_map = {
> +       .name = "dma_fence_map"
> +};

Not another false global sharing lockmap.
-Chris
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 01/17] dma-fence: add might_sleep annotation to _wait()
  2020-05-12  8:59   ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12  9:08     ` Christian König
  -1 siblings, 0 replies; 203+ messages in thread
From: Christian König @ 2020-05-12  9:08 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: LKML, linux-media, linaro-mm-sig, linux-rdma, amd-gfx, intel-gfx,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

On 12.05.20 at 10:59, Daniel Vetter wrote:
> But only for non-zero timeout, to avoid false positives.
>
> One question here is whether the might_sleep should be unconditional,
> or only for real timeouts. I'm not sure, so went with the more
> defensive option. But in the interest of locking down the cross-driver
> dma_fence rules we might want to be more aggressive.
>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>   drivers/dma-buf/dma-fence.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 052a41e2451c..6802125349fb 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -208,6 +208,9 @@ dma_fence_wait_timeout(struct dma_fence *fence, bool intr, signed long timeout)
>   	if (WARN_ON(timeout < 0))
>   		return -EINVAL;
>   
> +	if (timeout > 0)
> +		might_sleep();
> +

I would rather like to see might_sleep() called here all the time even 
with timeout==0.

IIRC I removed the code in TTM that abused this in atomic context quite
a while ago, but it could be that some has leaked back in, or that it
is called in atomic context elsewhere as well.
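
What the unconditional variant would additionally flag is roughly this
kind of caller (hypothetical sketch, fence and lock are placeholders,
not taken from any driver):

	spin_lock(&lock);
	/*
	 * timeout == 0 is only a poll, but an unconditional might_sleep()
	 * would still warn here because we are in atomic context.
	 */
	dma_fence_wait_timeout(fence, false, 0);
	spin_unlock(&lock);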

Christian.

>   	trace_dma_fence_wait_start(fence);
>   	if (fence->ops->wait)
>   		ret = fence->ops->wait(fence, intr, timeout);


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
  2020-05-12  9:04     ` [Intel-gfx] " Chris Wilson
  (?)
  (?)
@ 2020-05-12  9:08       ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  9:08 UTC (permalink / raw)
  To: Chris Wilson
  Cc: DRI Development, Daniel Vetter, LKML, linux-media, linaro-mm-sig,
	linux-rdma, amd-gfx, intel-gfx, Maarten Lankhorst,
	Christian König, Daniel Vetter

On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2020-05-12 09:59:29)
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> > 
> > - We use a read-lock for the execution/worker/completion side, so that
> >   this explicit annotation can be more liberally sprinkled around.
> >   With read locks lockdep isn't going to complain if the read-side
> >   isn't nested the same way under all circumstances, so ABBA deadlocks
> >   are ok. Which they are, since this is an annotation only.
> > 
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >   read lock mode lockdep does not catch read side hazards. And we
> >   _very_ much want read side hazards to be caught. For full details of
> >   this limitation see
> > 
> >   commit e91498589746065e3ae95d9a00b068e525eec34f
> >   Author: Peter Zijlstra <peterz@infradead.org>
> >   Date:   Wed Aug 23 13:13:11 2017 +0200
> > 
> >       locking/lockdep/selftests: Add mixed read-write ABBA tests
> > 
> > - To allow nesting of the read-side explicit annotations we explicitly
> >   keep track of the nesting. lock_is_held() allows us to do that.
> > 
> > - The wait-side annotation is a write lock, and entirely done within
> >   dma_fence_wait() for everyone by default.
> > 
> > - To be able to freely annotate helper functions I want to make it ok
> >   to call dma_fence_begin/end_signalling from soft/hardirq context.
> >   First attempt was using the hardirq locking context for the write
> >   side in lockdep, but this forces all normal spinlocks nested within
> >   dma_fence_begin/end_signalling to be spinlocks. That bollocks.
> > 
> >   The approach now is to simple check in_atomic(), and for these cases
> >   entirely rely on the might_sleep() check in dma_fence_wait(). That
> >   will catch any wrong nesting against spinlocks from soft/hardirq
> >   contexts.
> > 
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> > 
> > The main class of deadlocks this is supposed to catch are:
> > 
> > Thread A:
> > 
> >         mutex_lock(A);
> >         mutex_unlock(A);
> > 
> >         dma_fence_signal();
> > 
> > Thread B:
> > 
> >         mutex_lock(A);
> >         dma_fence_wait();
> >         mutex_unlock(A);
> > 
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> > 
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindless to avoid false
> > positives.
> > 
> > v2: handle soft/hardirq ctx better against write side and dont forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> > 
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-fence.h   | 12 +++++++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 6802125349fb..d5c0fd2efc70 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> >  }
> >  EXPORT_SYMBOL(dma_fence_context_alloc);
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map     dma_fence_lockdep_map = {
> > +       .name = "dma_fence_map"
> > +};
> 
> Not another false global sharing lockmap.

It's a global contract, so it needs a global lockdep map. And yes, a big
part of the motivation here is that i915-gem has a tremendous urge to
just redefine all these global locks to fit some local interpretation of
what's going on.

That doesn't make the resulting real & existing deadlocks go away.
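
To make that concrete, the kind of cross-driver cycle only a shared
map can close looks roughly like this (sketch with hypothetical
drivers; resv and fence are placeholders):

  driver A, signalling path:

	cookie = dma_fence_begin_signalling();
	dma_resv_lock(resv, NULL);	/* dma_fence_map -> dma_resv */
	...
	dma_resv_unlock(resv);
	dma_fence_signal(fence);
	dma_fence_end_signalling(cookie);

  driver B, a completely different module:

	dma_resv_lock(resv, NULL);
	dma_fence_wait(fence);		/* dma_resv -> dma_fence_map */
	dma_resv_unlock(resv);

With per-driver maps lockdep could never connect the two halves; with
the single dma_fence_map it closes the cycle no matter which drivers
the paths live in.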
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
@ 2020-05-12  9:08       ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  9:08 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, DRI Development,
	linaro-mm-sig, amd-gfx, Daniel Vetter, Christian König,
	linux-media

On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2020-05-12 09:59:29)
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> > 
> > - We use a read-lock for the execution/worker/completion side, so that
> >   this explicit annotation can be more liberally sprinkled around.
> >   With read locks lockdep isn't going to complain if the read-side
> >   isn't nested the same way under all circumstances, so ABBA deadlocks
> >   are ok. Which they are, since this is an annotation only.
> > 
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >   read lock mode lockdep does not catch read side hazards. And we
> >   _very_ much want read side hazards to be caught. For full details of
> >   this limitation see
> > 
> >   commit e91498589746065e3ae95d9a00b068e525eec34f
> >   Author: Peter Zijlstra <peterz@infradead.org>
> >   Date:   Wed Aug 23 13:13:11 2017 +0200
> > 
> >       locking/lockdep/selftests: Add mixed read-write ABBA tests
> > 
> > - To allow nesting of the read-side explicit annotations we explicitly
> >   keep track of the nesting. lock_is_held() allows us to do that.
> > 
> > - The wait-side annotation is a write lock, and entirely done within
> >   dma_fence_wait() for everyone by default.
> > 
> > - To be able to freely annotate helper functions I want to make it ok
> >   to call dma_fence_begin/end_signalling from soft/hardirq context.
> >   First attempt was using the hardirq locking context for the write
> >   side in lockdep, but this forces all normal spinlocks nested within
> >   dma_fence_begin/end_signalling to be spinlocks. That bollocks.
> > 
> >   The approach now is to simple check in_atomic(), and for these cases
> >   entirely rely on the might_sleep() check in dma_fence_wait(). That
> >   will catch any wrong nesting against spinlocks from soft/hardirq
> >   contexts.
> > 
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> > 
> > The main class of deadlocks this is supposed to catch are:
> > 
> > Thread A:
> > 
> >         mutex_lock(A);
> >         mutex_unlock(A);
> > 
> >         dma_fence_signal();
> > 
> > Thread B:
> > 
> >         mutex_lock(A);
> >         dma_fence_wait();
> >         mutex_unlock(A);
> > 
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> > 
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindless to avoid false
> > positives.
> > 
> > v2: handle soft/hardirq ctx better against write side and dont forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> > 
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-fence.h   | 12 +++++++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 6802125349fb..d5c0fd2efc70 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> >  }
> >  EXPORT_SYMBOL(dma_fence_context_alloc);
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map     dma_fence_lockdep_map = {
> > +       .name = "dma_fence_map"
> > +};
> 
> Not another false global sharing lockmap.

It's a global contract, it needs a global lockdep map. And yes a big
reason for the motivation here is that i915-gem has a tremendous urge to
just redefine all these global locks to fit to some local interpretation
of what's going on.

That doesn't make the resulting real&existing deadlocks go away.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [Intel-gfx] [RFC 02/17] dma-fence: basic lockdep annotations
@ 2020-05-12  9:08       ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12  9:08 UTC (permalink / raw)
  To: Chris Wilson
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, DRI Development,
	linaro-mm-sig, amd-gfx, Daniel Vetter, Christian König,
	linux-media

On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2020-05-12 09:59:29)
> > Design is similar to the lockdep annotations for workers, but with
> > some twists:
> > 
> > - We use a read-lock for the execution/worker/completion side, so that
> >   this explicit annotation can be more liberally sprinkled around.
> >   With read locks lockdep isn't going to complain if the read-side
> >   isn't nested the same way under all circumstances, so ABBA deadlocks
> >   are ok. Which they are, since this is an annotation only.
> > 
> > - We're using non-recursive lockdep read lock mode, since in recursive
> >   read lock mode lockdep does not catch read side hazards. And we
> >   _very_ much want read side hazards to be caught. For full details of
> >   this limitation see
> > 
> >   commit e91498589746065e3ae95d9a00b068e525eec34f
> >   Author: Peter Zijlstra <peterz@infradead.org>
> >   Date:   Wed Aug 23 13:13:11 2017 +0200
> > 
> >       locking/lockdep/selftests: Add mixed read-write ABBA tests
> > 
> > - To allow nesting of the read-side explicit annotations we explicitly
> >   keep track of the nesting. lock_is_held() allows us to do that.
> > 
> > - The wait-side annotation is a write lock, and entirely done within
> >   dma_fence_wait() for everyone by default.
> > 
> > - To be able to freely annotate helper functions I want to make it ok
> >   to call dma_fence_begin/end_signalling from soft/hardirq context.
> >   First attempt was using the hardirq locking context for the write
> >   side in lockdep, but this forces all normal spinlocks nested within
> >   dma_fence_begin/end_signalling to be spinlocks. That bollocks.
> > 
> >   The approach now is to simple check in_atomic(), and for these cases
> >   entirely rely on the might_sleep() check in dma_fence_wait(). That
> >   will catch any wrong nesting against spinlocks from soft/hardirq
> >   contexts.
> > 
> > The idea here is that every code path that's critical for eventually
> > signalling a dma_fence should be annotated with
> > dma_fence_begin/end_signalling. The annotation ideally starts right
> > after a dma_fence is published (added to a dma_resv, exposed as a
> > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > makes the dma_fence visible to other kernel threads), up to and
> > including the dma_fence_wait(). Examples are irq handlers, the
> > scheduler rt threads, the tail of execbuf (after the corresponding
> > fences are visible), any workers that end up signalling dma_fences and
> > really anything else. Not annotated should be code paths that only
> > complete fences opportunistically as the gpu progresses, like e.g.
> > shrinker/eviction code.
> > 
> > The main class of deadlocks this is supposed to catch are:
> > 
> > Thread A:
> > 
> >         mutex_lock(A);
> >         mutex_unlock(A);
> > 
> >         dma_fence_signal();
> > 
> > Thread B:
> > 
> >         mutex_lock(A);
> >         dma_fence_wait();
> >         mutex_unlock(A);
> > 
> > Thread B is blocked on A signalling the fence, but A never gets around
> > to that because it cannot acquire the lock A.
> > 
> > Note that dma_fence_wait() is allowed to be nested within
> > dma_fence_begin/end_signalling sections. To allow this to happen the
> > read lock needs to be upgraded to a write lock, which means that any
> > other lock is acquired between the dma_fence_begin_signalling() call and
> > the call to dma_fence_wait(), and still held, this will result in an
> > immediate lockdep complaint. The only other option would be to not
> > annotate such calls, defeating the point. Therefore these annotations
> > cannot be sprinkled over the code entirely mindless to avoid false
> > positives.
> > 
> > v2: handle soft/hardirq ctx better against write side and dont forget
> > EXPORT_SYMBOL, drivers can't use this otherwise.
> > 
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-fence.h   | 12 +++++++++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 6802125349fb..d5c0fd2efc70 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> >  }
> >  EXPORT_SYMBOL(dma_fence_context_alloc);
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map     dma_fence_lockdep_map = {
> > +       .name = "dma_fence_map"
> > +};
> 
> Not another false global sharing lockmap.

It's a global contract, it needs a global lockdep map. And yes, a big
reason for the motivation here is that i915-gem has a tremendous urge to
just redefine all these global locks to fit some local interpretation of
what's going on.

That doesn't make the resulting real&existing deadlocks go away.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
  2020-05-12  9:08       ` Daniel Vetter
  (?)
@ 2020-05-12  9:19         ` Chris Wilson
  -1 siblings, 0 replies; 203+ messages in thread
From: Chris Wilson @ 2020-05-12  9:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: linux-rdma, Daniel Vetter, intel-gfx, LKML, DRI Development,
	linaro-mm-sig, amd-gfx, Daniel Vetter, Christian König,
	linux-media

Quoting Daniel Vetter (2020-05-12 10:08:47)
> On Tue, May 12, 2020 at 10:04:22AM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2020-05-12 09:59:29)
> > > Design is similar to the lockdep annotations for workers, but with
> > > some twists:
> > > 
> > > - We use a read-lock for the execution/worker/completion side, so that
> > >   this explicit annotation can be more liberally sprinkled around.
> > >   With read locks lockdep isn't going to complain if the read-side
> > >   isn't nested the same way under all circumstances, so ABBA deadlocks
> > >   are ok. Which they are, since this is an annotation only.
> > > 
> > > - We're using non-recursive lockdep read lock mode, since in recursive
> > >   read lock mode lockdep does not catch read side hazards. And we
> > >   _very_ much want read side hazards to be caught. For full details of
> > >   this limitation see
> > > 
> > >   commit e91498589746065e3ae95d9a00b068e525eec34f
> > >   Author: Peter Zijlstra <peterz@infradead.org>
> > >   Date:   Wed Aug 23 13:13:11 2017 +0200
> > > 
> > >       locking/lockdep/selftests: Add mixed read-write ABBA tests
> > > 
> > > - To allow nesting of the read-side explicit annotations we explicitly
> > >   keep track of the nesting. lock_is_held() allows us to do that.
> > > 
> > > - The wait-side annotation is a write lock, and entirely done within
> > >   dma_fence_wait() for everyone by default.
> > > 
> > > - To be able to freely annotate helper functions I want to make it ok
> > >   to call dma_fence_begin/end_signalling from soft/hardirq context.
> > >   First attempt was using the hardirq locking context for the write
> > >   side in lockdep, but this forces all normal spinlocks nested within
> > >   dma_fence_begin/end_signalling to be irq-safe spinlocks. That's bollocks.
> > > 
> > >   The approach now is to simply check in_atomic(), and for these cases
> > >   entirely rely on the might_sleep() check in dma_fence_wait(). That
> > >   will catch any wrong nesting against spinlocks from soft/hardirq
> > >   contexts.
> > > 
> > > The idea here is that every code path that's critical for eventually
> > > signalling a dma_fence should be annotated with
> > > dma_fence_begin/end_signalling. The annotation ideally starts right
> > > after a dma_fence is published (added to a dma_resv, exposed as a
> > > sync_file fd, attached to a drm_syncobj fd, or anything else that
> > > makes the dma_fence visible to other kernel threads), up to and
> > > including the dma_fence_wait(). Examples are irq handlers, the
> > > scheduler rt threads, the tail of execbuf (after the corresponding
> > > fences are visible), any workers that end up signalling dma_fences and
> > > really anything else. Not annotated should be code paths that only
> > > complete fences opportunistically as the gpu progresses, like e.g.
> > > shrinker/eviction code.
> > > 
> > > The main class of deadlocks this is supposed to catch are:
> > > 
> > > Thread A:
> > > 
> > >         mutex_lock(A);
> > >         mutex_unlock(A);
> > > 
> > >         dma_fence_signal();
> > > 
> > > Thread B:
> > > 
> > >         mutex_lock(A);
> > >         dma_fence_wait();
> > >         mutex_unlock(A);
> > > 
> > > Thread B is blocked on A signalling the fence, but A never gets around
> > > to that because it cannot acquire the lock A.
> > > 
> > > Note that dma_fence_wait() is allowed to be nested within
> > > dma_fence_begin/end_signalling sections. To allow this to happen the
> > > read lock needs to be upgraded to a write lock, which means that if any
> > > other lock is acquired between the dma_fence_begin_signalling() call and
> > > the call to dma_fence_wait(), and is still held, this will result in an
> > > immediate lockdep complaint. The only other option would be to not
> > > annotate such calls, defeating the point. Therefore these annotations
> > > cannot be sprinkled over the code entirely mindlessly to avoid false
> > > positives.
> > > 
> > > v2: handle soft/hardirq ctx better against write side and don't forget
> > > EXPORT_SYMBOL, drivers can't use this otherwise.
> > > 
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  drivers/dma-buf/dma-fence.c | 53 +++++++++++++++++++++++++++++++++++++
> > >  include/linux/dma-fence.h   | 12 +++++++++
> > >  2 files changed, 65 insertions(+)
> > > 
> > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > index 6802125349fb..d5c0fd2efc70 100644
> > > --- a/drivers/dma-buf/dma-fence.c
> > > +++ b/drivers/dma-buf/dma-fence.c
> > > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> > >  }
> > >  EXPORT_SYMBOL(dma_fence_context_alloc);
> > >  
> > > +#ifdef CONFIG_LOCKDEP
> > > +struct lockdep_map     dma_fence_lockdep_map = {
> > > +       .name = "dma_fence_map"
> > > +};
> > 
> > Not another false global sharing lockmap.
> 
> It's a global contract, it needs a global lockdep map. And yes, a big
> reason for the motivation here is that i915-gem has a tremendous urge to
> just redefine all these global locks to fit some local interpretation of
> what's going on.

No, you can build the global contract out of the actual contracts
between fence drivers. If you introduce a struct lockdep_map *map into
the fence_ops (so the fence_ops can remain const), you gain correctness
at the cost of having to run through all possible interactions once.
You can also then use ops->lockmap ?: &global_fence_lockmap for
piecemeal conversion of drivers that do not already use lockmaps for
contract enforcement of their fence waits.
-Chris
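
Sketching what that suggestion could look like; ops->lockmap and global_fence_lockmap are the names used in the paragraph above (the latter corresponding to the global dma_fence_map from the patch), the rest is illustrative:

  /* hypothetical new member in struct dma_fence_ops:
   *
   *        struct lockdep_map *lockmap;
   */

  static struct lockdep_map *dma_fence_lockmap(struct dma_fence *fence)
  {
          /* drivers without their own map fall back to the global contract */
          return fence->ops->lockmap ?: &global_fence_lockmap;
  }
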
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for dma-fence lockdep annotations
  2020-05-12  8:59 ` Daniel Vetter
                   ` (19 preceding siblings ...)
  (?)
@ 2020-05-12  9:31 ` Patchwork
  -1 siblings, 0 replies; 203+ messages in thread
From: Patchwork @ 2020-05-12  9:31 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

== Series Details ==

Series: dma-fence lockdep annotations
URL   : https://patchwork.freedesktop.org/series/77178/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
761b03bd55bc dma-fence: add might_sleep annotation to _wait()
-:39: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 9 lines checked
cbdd4bcec7e6 dma-fence: basic lockdep annotations
-:23: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit e91498589746 ("locking/lockdep/selftests: Add mixed read-write ABBA tests")'
#23: 
  commit e91498589746065e3ae95d9a00b068e525eec34f

-:120: ERROR:IN_ATOMIC: do not use in_atomic in drivers
#120: FILE: drivers/dma-buf/dma-fence.c:125:
+	if (in_atomic())

-:153: CHECK:LINE_SPACING: Please don't use multiple blank lines
#153: FILE: drivers/dma-buf/dma-fence.c:158:
+
+

-:202: CHECK:LINE_SPACING: Please use a blank line after function/struct/union/enum declarations
#202: FILE: include/linux/dma-fence.h:368:
+}
+static inline void dma_fence_end_signalling(bool cookie) {}

-:208: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 2 errors, 1 warnings, 2 checks, 97 lines checked
30c15f819d26 dma-fence: prime lockdep annotations
-:31: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end")'
#31: 
commit 23b68395c7c78a764e8963fc15a7cfd318bf187f

-:74: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 1 errors, 1 warnings, 0 checks, 14 lines checked
2bad1b068255 drm/vkms: Annotate vblank timer
-:59: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 25 lines checked
fabebff847de drm/vblank: Annotate with dma-fence signalling section
-:71: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 38 lines checked
dc212ca24327 drm/atomic-helper: Add dma-fence annotations
-:119: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 76 lines checked
277acf9f1d01 drm/amdgpu: add dma-fence annotations to atomic commit path
-:52: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 24 lines checked
3b8d20ef37ac drm/scheduler: use dma-fence annotations in main thread
-:53: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 21 lines checked
dc5507ddac48 drm/amdgpu: use dma-fence annotations in cs_submit()
-:65: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 29 lines checked
6b777e9d4248 drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
-:82: WARNING:ALLOC_ARRAY_ARGS: kmalloc_array uses number as first arg, sizeof is generally wrong
#82: FILE: drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c:211:
+	fences = kmalloc_array(sizeof(void *), id_mgr->num_ids, GFP_ATOMIC);

-:98: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 2 warnings, 0 checks, 24 lines checked
04c66458629c drm/amdgpu: DC also loves to allocate stuff where it shouldn't
-:70: WARNING:BLOCK_COMMENT_STYLE: Block comments use a trailing */ on a separate line
#70: FILE: drivers/gpu/drm/amd/display/dc/core/dc.c:1420:
+	 * atomic_commit_tail. */

-:76: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 2 warnings, 0 checks, 27 lines checked
786ff5f97ade drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
-:39: WARNING:IF_0: Consider removing the code enclosed by this #if 0 and its #endif
#39: FILE: drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:6706:
+#if 0

-:55: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 2 warnings, 0 checks, 23 lines checked
7679cc148b5a drm/scheduler: use dma-fence annotations in tdr work
-:28: WARNING:TYPO_SPELLING: 'seperate' may be misspelled - perhaps 'separate'?
#28: 
Hence split out as a seperate patch.

-:114: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 2 warnings, 0 checks, 20 lines checked
e09083945d14 drm/amdgpu: use dma-fence annotations for gpu reset code
c02a4f686bf5 Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
-:145: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 16 lines checked
a13323043876 drm/amdgpu: gpu recovery does full modesets
-:186: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 14 lines checked
1ef18223426e drm/i915: Annotate dma_fence_work
-:53: WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal patch author 'Daniel Vetter <daniel.vetter@ffwll.ch>'

total: 0 errors, 1 warnings, 0 checks, 15 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for dma-fence lockdep annotations
  2020-05-12  8:59 ` Daniel Vetter
                   ` (20 preceding siblings ...)
  (?)
@ 2020-05-12  9:38 ` Patchwork
  -1 siblings, 0 replies; 203+ messages in thread
From: Patchwork @ 2020-05-12  9:38 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

== Series Details ==

Series: dma-fence lockdep annotations
URL   : https://patchwork.freedesktop.org/series/77178/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Sparse version: v0.6.0
Commit: dma-fence: add might_sleep annotation to _wait()
Okay!

Commit: dma-fence: basic lockdep annotations
Okay!

Commit: dma-fence: prime lockdep annotations
Okay!

Commit: drm/vkms: Annotate vblank timer
Okay!

Commit: drm/vblank: Annotate with dma-fence signalling section
Okay!

Commit: drm/atomic-helper: Add dma-fence annotations
Okay!

Commit: drm/amdgpu: add dma-fence annotations to atomic commit path
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1155:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1160:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1160:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1160:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1162:36: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1229:25: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1229:25: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1229:25: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1235:37: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1261:41: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1263:39: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1267:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1156:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1161:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1161:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1161:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1163:36: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1230:25: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1230:25: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1230:25: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1236:37: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1262:41: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1264:39: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1271:17: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:2521:24: warning: symbol 'dm_atomic_get_new_state' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:2539:24: warning: symbol 'dm_atomic_get_old_state' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5337:6: warning: symbol 'dm_drm_plane_destroy_state' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:750:6: warning: symbol 'amdgpu_dm_audio_eld_notify' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:805:25: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:809:23: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1269:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1272:17: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:2522:24: warning: symbol 'dm_atomic_get_new_state' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:2540:24: warning: symbol 'dm_atomic_get_old_state' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:5338:6: warning: symbol 'dm_drm_plane_destroy_state' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:751:6: warning: symbol 'amdgpu_dm_audio_eld_notify' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:806:25: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:813:30: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:816:28: warning: cast to restricted __le32
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8702:6: warning: symbol 'amdgpu_dm_psr_enable' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8706:50: warning: missing braces around initializer
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:811:23: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:814:30: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:817:28: warning: cast to restricted __le32
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8708:6: warning: symbol 'amdgpu_dm_psr_enable' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8712:50: warning: missing braces around initializer

Commit: drm/scheduler: use dma-fence annotations in main thread
Okay!

Commit: drm/amdgpu: use dma-fence annotations in cs_submit()
Okay!

Commit: drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
Okay!

Commit: drm/amdgpu: DC also loves to allocate stuff where it shouldn't
Okay!

Commit: drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8708:6: warning: symbol 'amdgpu_dm_psr_enable' was not declared. Should it be static?
-drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8712:50: warning: missing braces around initializer
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8718:6: warning: symbol 'amdgpu_dm_psr_enable' was not declared. Should it be static?
+drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:8722:50: warning: missing braces around initializer

Commit: drm/scheduler: use dma-fence annotations in tdr work
Okay!

Commit: drm/amdgpu: use dma-fence annotations for gpu reset code
Okay!

Commit: Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
Okay!

Commit: drm/amdgpu: gpu recovery does full modesets
Okay!

Commit: drm/i915: Annotate dma_fence_work
Okay!

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for dma-fence lockdep annotations
  2020-05-12  8:59 ` Daniel Vetter
                   ` (21 preceding siblings ...)
  (?)
@ 2020-05-12  9:45 ` Patchwork
  -1 siblings, 0 replies; 203+ messages in thread
From: Patchwork @ 2020-05-12  9:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

== Series Details ==

Series: dma-fence lockdep annotations
URL   : https://patchwork.freedesktop.org/series/77178/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_8467 -> Patchwork_17631
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_17631 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_17631, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/index.html

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_17631:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_close_race@basic-process:
    - fi-ivb-3770:        [PASS][1] -> [DMESG-WARN][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-ivb-3770/igt@gem_close_race@basic-process.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-ivb-3770/igt@gem_close_race@basic-process.html
    - fi-byt-j1900:       [PASS][3] -> [DMESG-WARN][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-byt-j1900/igt@gem_close_race@basic-process.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-byt-j1900/igt@gem_close_race@basic-process.html
    - fi-hsw-4770:        [PASS][5] -> [DMESG-WARN][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-hsw-4770/igt@gem_close_race@basic-process.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-hsw-4770/igt@gem_close_race@basic-process.html
    - fi-byt-n2820:       [PASS][7] -> [DMESG-WARN][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-byt-n2820/igt@gem_close_race@basic-process.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-byt-n2820/igt@gem_close_race@basic-process.html

  * igt@gem_tiled_blits@basic:
    - fi-bwr-2160:        [PASS][9] -> [DMESG-WARN][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-bwr-2160/igt@gem_tiled_blits@basic.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bwr-2160/igt@gem_tiled_blits@basic.html

  * igt@runner@aborted:
    - fi-cfl-8700k:       NOTRUN -> [FAIL][11]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-8700k/igt@runner@aborted.html
    - fi-tgl-y:           NOTRUN -> [FAIL][12]
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-y/igt@runner@aborted.html
    - fi-cfl-8109u:       NOTRUN -> [FAIL][13]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-8109u/igt@runner@aborted.html
    - fi-icl-u2:          NOTRUN -> [FAIL][14]
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-u2/igt@runner@aborted.html
    - fi-bdw-5557u:       NOTRUN -> [FAIL][15]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bdw-5557u/igt@runner@aborted.html
    - fi-bwr-2160:        NOTRUN -> [FAIL][16]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bwr-2160/igt@runner@aborted.html
    - fi-byt-n2820:       NOTRUN -> [FAIL][17]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-byt-n2820/igt@runner@aborted.html
    - fi-icl-guc:         NOTRUN -> [FAIL][18]
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-guc/igt@runner@aborted.html
    - fi-hsw-4770:        NOTRUN -> [FAIL][19]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-hsw-4770/igt@runner@aborted.html
    - fi-whl-u:           NOTRUN -> [FAIL][20]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-whl-u/igt@runner@aborted.html
    - fi-cml-u2:          NOTRUN -> [FAIL][21]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cml-u2/igt@runner@aborted.html
    - fi-ivb-3770:        NOTRUN -> [FAIL][22]
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-ivb-3770/igt@runner@aborted.html
    - fi-bxt-dsi:         NOTRUN -> [FAIL][23]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bxt-dsi/igt@runner@aborted.html
    - fi-byt-j1900:       NOTRUN -> [FAIL][24]
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-byt-j1900/igt@runner@aborted.html
    - fi-cml-s:           NOTRUN -> [FAIL][25]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cml-s/igt@runner@aborted.html
    - fi-cfl-guc:         NOTRUN -> [FAIL][26]
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-guc/igt@runner@aborted.html
    - fi-icl-y:           NOTRUN -> [FAIL][27]
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-y/igt@runner@aborted.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@gem_busy@busy@all}:
    - fi-kbl-x1275:       [PASS][28] -> [DMESG-WARN][29]
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-x1275/igt@gem_busy@busy@all.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-x1275/igt@gem_busy@busy@all.html
    - fi-cfl-8700k:       [PASS][30] -> [DMESG-WARN][31]
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-cfl-8700k/igt@gem_busy@busy@all.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-8700k/igt@gem_busy@busy@all.html
    - fi-tgl-y:           NOTRUN -> [DMESG-WARN][32]
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-y/igt@gem_busy@busy@all.html
    - fi-skl-6600u:       [PASS][33] -> [DMESG-WARN][34]
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-skl-6600u/igt@gem_busy@busy@all.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-skl-6600u/igt@gem_busy@busy@all.html
    - fi-cfl-8109u:       [PASS][35] -> [DMESG-WARN][36]
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-cfl-8109u/igt@gem_busy@busy@all.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-8109u/igt@gem_busy@busy@all.html
    - fi-icl-u2:          [PASS][37] -> [DMESG-WARN][38]
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-icl-u2/igt@gem_busy@busy@all.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-u2/igt@gem_busy@busy@all.html
    - {fi-tgl-dsi}:       [PASS][39] -> [DMESG-WARN][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-tgl-dsi/igt@gem_busy@busy@all.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-dsi/igt@gem_busy@busy@all.html
    - fi-glk-dsi:         [PASS][41] -> [DMESG-WARN][42]
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-glk-dsi/igt@gem_busy@busy@all.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-glk-dsi/igt@gem_busy@busy@all.html
    - fi-kbl-8809g:       [PASS][43] -> [DMESG-WARN][44]
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-8809g/igt@gem_busy@busy@all.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-8809g/igt@gem_busy@busy@all.html
    - fi-skl-lmem:        [PASS][45] -> [DMESG-WARN][46]
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-skl-lmem/igt@gem_busy@busy@all.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-skl-lmem/igt@gem_busy@busy@all.html
    - fi-kbl-r:           [PASS][47] -> [DMESG-WARN][48]
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-r/igt@gem_busy@busy@all.html
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-r/igt@gem_busy@busy@all.html
    - fi-bdw-5557u:       [PASS][49] -> [DMESG-WARN][50]
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-bdw-5557u/igt@gem_busy@busy@all.html
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bdw-5557u/igt@gem_busy@busy@all.html
    - fi-icl-guc:         [PASS][51] -> [DMESG-WARN][52]
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-icl-guc/igt@gem_busy@busy@all.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-guc/igt@gem_busy@busy@all.html
    - fi-kbl-soraka:      [PASS][53] -> [DMESG-WARN][54]
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-soraka/igt@gem_busy@busy@all.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-soraka/igt@gem_busy@busy@all.html
    - {fi-ehl-1}:         [PASS][55] -> [DMESG-WARN][56]
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-ehl-1/igt@gem_busy@busy@all.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-ehl-1/igt@gem_busy@busy@all.html
    - fi-kbl-7500u:       [PASS][57] -> [DMESG-WARN][58]
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-7500u/igt@gem_busy@busy@all.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-7500u/igt@gem_busy@busy@all.html
    - fi-kbl-guc:         [PASS][59] -> [DMESG-WARN][60]
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-kbl-guc/igt@gem_busy@busy@all.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-kbl-guc/igt@gem_busy@busy@all.html
    - fi-whl-u:           [PASS][61] -> [DMESG-WARN][62]
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-whl-u/igt@gem_busy@busy@all.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-whl-u/igt@gem_busy@busy@all.html
    - fi-cml-u2:          [PASS][63] -> [DMESG-WARN][64]
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-cml-u2/igt@gem_busy@busy@all.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cml-u2/igt@gem_busy@busy@all.html
    - fi-bxt-dsi:         [PASS][65] -> [DMESG-WARN][66]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-bxt-dsi/igt@gem_busy@busy@all.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-bxt-dsi/igt@gem_busy@busy@all.html
    - {fi-tgl-u}:         [PASS][67] -> [DMESG-WARN][68]
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-tgl-u/igt@gem_busy@busy@all.html
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-u/igt@gem_busy@busy@all.html
    - fi-cml-s:           [PASS][69] -> [DMESG-WARN][70]
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-cml-s/igt@gem_busy@busy@all.html
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cml-s/igt@gem_busy@busy@all.html
    - fi-cfl-guc:         [PASS][71] -> [DMESG-WARN][72]
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-cfl-guc/igt@gem_busy@busy@all.html
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-cfl-guc/igt@gem_busy@busy@all.html
    - fi-icl-y:           [PASS][73] -> [DMESG-WARN][74]
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-icl-y/igt@gem_busy@busy@all.html
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-icl-y/igt@gem_busy@busy@all.html
    - fi-skl-guc:         [PASS][75] -> [DMESG-WARN][76]
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-skl-guc/igt@gem_busy@busy@all.html
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-skl-guc/igt@gem_busy@busy@all.html
    - fi-skl-6700k2:      [PASS][77] -> [DMESG-WARN][78]
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-skl-6700k2/igt@gem_busy@busy@all.html
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-skl-6700k2/igt@gem_busy@busy@all.html

  * igt@runner@aborted:
    - {fi-tgl-dsi}:       NOTRUN -> [FAIL][79]
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-dsi/igt@runner@aborted.html
    - {fi-ehl-1}:         NOTRUN -> [FAIL][80]
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-ehl-1/igt@runner@aborted.html
    - {fi-tgl-u}:         NOTRUN -> [FAIL][81]
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-tgl-u/igt@runner@aborted.html

  
Known issues
------------

  Here are the changes found in Patchwork_17631 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@kms_busy@basic@flip:
    - fi-snb-2600:        [PASS][82] -> [DMESG-WARN][83] ([i915#478])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-snb-2600/igt@kms_busy@basic@flip.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-snb-2600/igt@kms_busy@basic@flip.html
    - fi-snb-2520m:       [PASS][84] -> [DMESG-WARN][85] ([i915#478])
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-snb-2520m/igt@kms_busy@basic@flip.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-snb-2520m/igt@kms_busy@basic@flip.html

  * igt@kms_frontbuffer_tracking@basic:
    - fi-ilk-650:         [PASS][86] -> [DMESG-WARN][87] ([i915#478])
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8467/fi-ilk-650/igt@kms_frontbuffer_tracking@basic.html
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/fi-ilk-650/igt@kms_frontbuffer_tracking@basic.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#478]: https://gitlab.freedesktop.org/drm/intel/issues/478


Participating hosts (49 -> 44)
------------------------------

  Additional (1): fi-tgl-y 
  Missing    (6): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * CI: CI-20190529 -> None
  * Linux: CI_DRM_8467 -> Patchwork_17631

  CI-20190529: 20190529
  CI_DRM_8467: 1a0f0c378117fc90f421a692698ad85963ecdb3a @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_5646: 5a5a3162a7638b3ae38b6dc2545622c204d1b97c @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_17631: 1ef18223426e3805f05163795d2b054ca4294003 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

1ef18223426e drm/i915: Annotate dma_fence_work
a13323043876 drm/amdgpu: gpu recovery does full modesets
c02a4f686bf5 Revert "drm/amdgpu: add fbdev suspend/resume on gpu reset"
e09083945d14 drm/amdgpu: use dma-fence annotations for gpu reset code
7679cc148b5a drm/scheduler: use dma-fence annotations in tdr work
786ff5f97ade drm/amdgpu/dc: Stop dma_resv_lock inversion in commit_tail
04c66458629c drm/amdgpu: DC also loves to allocate stuff where it shouldn't
6b777e9d4248 drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
dc5507ddac48 drm/amdgpu: use dma-fence annotations in cs_submit()
3b8d20ef37ac drm/scheduler: use dma-fence annotations in main thread
277acf9f1d01 drm/amdgpu: add dma-fence annotations to atomic commit path
dc212ca24327 drm/atomic-helper: Add dma-fence annotations
fabebff847de drm/vblank: Annotate with dma-fence signalling section
2bad1b068255 drm/vkms: Annotate vblank timer
30c15f819d26 dma-fence: prime lockdep annotations
cbdd4bcec7e6 dma-fence: basic lockdep annotations
761b03bd55bc dma-fence: add might_sleep annotation to _wait()

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_17631/index.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
  2020-05-12  8:59   ` Daniel Vetter
  (?)
@ 2020-05-12 12:09     ` Jason Gunthorpe
  -1 siblings, 0 replies; 203+ messages in thread
From: Jason Gunthorpe @ 2020-05-12 12:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, LKML, linux-media, linaro-mm-sig, linux-rdma,
	amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

On Tue, May 12, 2020 at 10:59:29AM +0200, Daniel Vetter wrote:
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 6802125349fb..d5c0fd2efc70 100644
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
>  }
>  EXPORT_SYMBOL(dma_fence_context_alloc);
>  
> +#ifdef CONFIG_LOCKDEP
> +struct lockdep_map	dma_fence_lockdep_map = {
> +	.name = "dma_fence_map"
> +};
> +
> +bool dma_fence_begin_signalling(void)
> +{

Why is this global? I would have expected it to be connected to a
single fence?

It would also be a lot nicer if this was some general lockdep feature,
not tied to dmabuf. This exact problem also strikes anyone using
completions, for instance, and the same solution should be
applicable??
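
For example, the completion flavour of the same deadlock is the classic
pattern plain lockdep can't see today, because nothing is formally held
across the wait (hypothetical sketch, all names made up):

#include <linux/mutex.h>
#include <linux/completion.h>

struct foo_dev {			/* made-up example structure */
	struct mutex lock;
	struct completion done;
};

static void foo_waiter(struct foo_dev *d)
{
	mutex_lock(&d->lock);
	wait_for_completion(&d->done);	/* never returns ... */
	mutex_unlock(&d->lock);
}

static void foo_signaller(struct foo_dev *d)
{
	mutex_lock(&d->lock);		/* ... because this blocks behind the waiter */
	complete(&d->done);
	mutex_unlock(&d->lock);
}

Annotating the completion the same way - acquire a map around the wait,
hold it across the signalling side - would catch that cycle too.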

Jason

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12  8:59   ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12 12:54     ` Alex Deucher
  -1 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 12:54 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> ...
>
> I think it's time to stop this little exercise.
>
> The lockdep splat, for the record:
>
> [  132.583381] ======================================================
> [  132.584091] WARNING: possible circular locking dependency detected
> [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> [  132.585461] ------------------------------------------------------
> [  132.586184] kworker/2:3/865 is trying to acquire lock:
> [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> [  132.587569]
>                but task is already holding lock:
> [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> [  132.589803]
>                which lock already depends on the new lock.
>
> [  132.592009]
>                the existing dependency chain (in reverse order) is:
> [  132.593507]
>                -> #2 (dma_fence_map){++++}-{0:0}:
> [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> [  132.600539]        fbcon_init+0x2e8/0x660
> [  132.601344]        visual_init+0xce/0x130
> [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> [  132.602970]        do_take_over_console+0x115/0x180
> [  132.603763]        do_fbcon_takeover+0x58/0xb0
> [  132.604564]        register_framebuffer+0x1ee/0x300
> [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> [  132.609511]        local_pci_probe+0x42/0x80
> [  132.610324]        pci_device_probe+0x104/0x1a0
> [  132.611130]        really_probe+0x147/0x3c0
> [  132.611939]        driver_probe_device+0xb6/0x100
> [  132.612766]        device_driver_attach+0x53/0x60
> [  132.613593]        __driver_attach+0x8c/0x150
> [  132.614419]        bus_for_each_dev+0x7b/0xc0
> [  132.615249]        bus_add_driver+0x14c/0x1f0
> [  132.616071]        driver_register+0x6c/0xc0
> [  132.616902]        do_one_initcall+0x5d/0x2f0
> [  132.617731]        do_init_module+0x5c/0x230
> [  132.618560]        load_module+0x2981/0x2bc0
> [  132.619391]        __do_sys_finit_module+0xaa/0x110
> [  132.620228]        do_syscall_64+0x5a/0x250
> [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [  132.621903]
>                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> [  132.624448]        ww_mutex_lock+0x43/0xb0
> [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> [  132.629804]        local_pci_probe+0x42/0x80
> [  132.630690]        pci_device_probe+0x104/0x1a0
> [  132.631583]        really_probe+0x147/0x3c0
> [  132.632479]        driver_probe_device+0xb6/0x100
> [  132.633379]        device_driver_attach+0x53/0x60
> [  132.634275]        __driver_attach+0x8c/0x150
> [  132.635170]        bus_for_each_dev+0x7b/0xc0
> [  132.636069]        bus_add_driver+0x14c/0x1f0
> [  132.636974]        driver_register+0x6c/0xc0
> [  132.637870]        do_one_initcall+0x5d/0x2f0
> [  132.638765]        do_init_module+0x5c/0x230
> [  132.639654]        load_module+0x2981/0x2bc0
> [  132.640522]        __do_sys_finit_module+0xaa/0x110
> [  132.641372]        do_syscall_64+0x5a/0x250
> [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> [  132.643022]
>                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> [  132.644643]        __lock_acquire+0x1241/0x23f0
> [  132.645469]        lock_acquire+0xad/0x370
> [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> [  132.652594]        process_one_work+0x23c/0x580
> [  132.653402]        worker_thread+0x50/0x3b0
> [  132.654139]        kthread+0x12e/0x150
> [  132.654868]        ret_from_fork+0x27/0x50
> [  132.655598]
>                other info that might help us debug this:
>
> [  132.657739] Chain exists of:
>                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
>
> [  132.659877]  Possible unsafe locking scenario:
>
> [  132.661416]        CPU0                    CPU1
> [  132.662126]        ----                    ----
> [  132.662847]   lock(dma_fence_map);
> [  132.663574]                                lock(crtc_ww_class_mutex);
> [  132.664319]                                lock(dma_fence_map);
> [  132.665063]   lock(crtc_ww_class_acquire);
> [  132.665799]
>                 *** DEADLOCK ***
>
> [  132.667965] 4 locks held by kworker/2:3/865:
> [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> [  132.671902]
>                stack backtrace:
> [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> [  132.676046] Call Trace:
> [  132.676897]  dump_stack+0x8f/0xd0
> [  132.677748]  check_noncircular+0x162/0x180
> [  132.678604]  ? stack_trace_save+0x4b/0x70
> [  132.679459]  __lock_acquire+0x1241/0x23f0
> [  132.680311]  lock_acquire+0xad/0x370
> [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> [  132.682021]  ? cpumask_next+0x16/0x20
> [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> [  132.683737]  ? __module_address+0x28/0xf0
> [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> [  132.693597]  process_one_work+0x23c/0x580
> [  132.694487]  worker_thread+0x50/0x3b0
> [  132.695373]  ? process_one_work+0x580/0x580
> [  132.696264]  kthread+0x12e/0x150
> [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> [  132.698057]  ret_from_fork+0x27/0x50
>
> Cc: linux-media@vger.kernel.org
> Cc: linaro-mm-sig@lists.linaro.org
> Cc: linux-rdma@vger.kernel.org
> Cc: amd-gfx@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3584e29323c0..b3b84a0d3baf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
>                 /* displays are handled separately */
>                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
>                         /* XXX handle errors */
> +
> +                       /*
> +                        * This is dm_suspend, which calls modeset locks, and
> +                        * that's a pretty good inversion against dma_fence_signal
> +                        * which gpu recovery is supposed to guarantee.
> +                        *
> +                        * Don't ask me how to fix this.
> +                        */

We actually have a fix for this.  Will be out shortly.

Alex

>                         r = adev->ip_blocks[i].version->funcs->suspend(adev);
>                         /* XXX handle errors */
>                         if (r) {
> --
> 2.26.2
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 02/17] dma-fence: basic lockdep annotations
  2020-05-12 12:09     ` Jason Gunthorpe
  (?)
  (?)
@ 2020-05-12 12:57       ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12 12:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, DRI Development, LKML, linux-media, linaro-mm-sig,
	linux-rdma, amd-gfx, intel-gfx, Chris Wilson, Maarten Lankhorst,
	Christian König, Daniel Vetter

On Tue, May 12, 2020 at 09:09:52AM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2020 at 10:59:29AM +0200, Daniel Vetter wrote:
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 6802125349fb..d5c0fd2efc70 100644
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -110,6 +110,52 @@ u64 dma_fence_context_alloc(unsigned num)
> >  }
> >  EXPORT_SYMBOL(dma_fence_context_alloc);
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +struct lockdep_map	dma_fence_lockdep_map = {
> > +	.name = "dma_fence_map"
> > +};
> > +
> > +bool dma_fence_begin_signalling(void)
> > +{
> 
> Why is this global? I would have expected it to be connected to a
> single fence?

It's the same rules for all fences, since they can be shared across
drivers in various ways. Lockdep usually achieves that with a static
variable hidden in the macro, but that doesn't work if you have lots of
different ways from different drivers to create a dma_fence. Hence the
unique global one that we explicitly allocate.

We have similar stuff for the global dma_resv_lock ww_mutex class, just
there it's a bit more standardized and hidden behind a neat macro. But
really lockdep needs global lockdep_maps or it doesn't work.
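
For illustration, the core mechanics boil down to something like the
sketch below (simplified, with the cookie/nesting handling of the actual
patch elided, so treat it as an outline rather than the real
implementation):

#include <linux/lockdep.h>

/* one global map shared by all fences, matching the hunk quoted above */
struct lockdep_map dma_fence_lockdep_map = {
	.name = "dma_fence_map"
};

/* wraps any code that will eventually signal a dma_fence */
bool dma_fence_begin_signalling(void)
{
	/* read-acquire: many signalling sections can run concurrently */
	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
	return false;	/* real code uses the cookie for nesting, elided here */
}

void dma_fence_end_signalling(bool cookie)
{
	lock_release(&dma_fence_lockdep_map, _RET_IP_);
}

/* ... while dma_fence_wait() pretends to take the very same map around
 * the actual wait, roughly:
 */
static void fence_wait_annotation(void)	/* illustrative name only */
{
	lock_map_acquire(&dma_fence_lockdep_map);
	lock_map_release(&dma_fence_lockdep_map);
}

Everything running between begin and end then has its lock dependencies
recorded against dma_fence_map, and a dma_fence_wait() done while holding
any of those locks closes exactly the kind of cycle lockdep flags.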

> It would also be a lot nicer if this was some general lockdep feature,
> not tied to dmabuf. This exact problem also strikes anyone using
> completions, for instance, and the same solution should be
> applicable??

There was:

https://lwn.net/Articles/709849/

It even got merged, and seems to have worked. Unfortunately (and I'm not
entirely clear on the reasons) it was thrown out again, so we can't use
it. That means wait_event/wake_up dependencies need to be manually
annotated, like e.g. flush_work() already is. flush_work is more or less
where I've stolen this idea from, with some adjustments and tricks on top
to make it work for dma_fence users.
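
For reference, the flush_work() pattern this borrows from looks roughly
like the following (paraphrased and heavily simplified from
kernel/workqueue.c; the helper names here are illustrative only, and
work->lockdep_map only exists with CONFIG_LOCKDEP):

#include <linux/workqueue.h>
#include <linux/lockdep.h>

/* what the worker does around executing each work item */
static void run_work_annotated(struct work_struct *work)
{
	lock_map_acquire(&work->lockdep_map);
	work->func(work);
	lock_map_release(&work->lockdep_map);
}

/* what flush_work() does before it actually blocks on the work item */
static void flush_work_annotated(struct work_struct *work)
{
	lock_map_acquire(&work->lockdep_map);
	lock_map_release(&work->lockdep_map);
	/* ... then really wait for the work item to finish */
}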

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 12:54     ` Alex Deucher
  (?)
  (?)
@ 2020-05-12 12:58       ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12 12:58 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Daniel Vetter, DRI Development, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, LKML,
	amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > ...
> >
> > I think it's time to stop this little exercise.
> >
> > The lockdep splat, for the record:
> >
> > [  132.583381] ======================================================
> > [  132.584091] WARNING: possible circular locking dependency detected
> > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > [  132.585461] ------------------------------------------------------
> > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > [  132.587569]
> >                but task is already holding lock:
> > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > [  132.589803]
> >                which lock already depends on the new lock.
> >
> > [  132.592009]
> >                the existing dependency chain (in reverse order) is:
> > [  132.593507]
> >                -> #2 (dma_fence_map){++++}-{0:0}:
> > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > [  132.600539]        fbcon_init+0x2e8/0x660
> > [  132.601344]        visual_init+0xce/0x130
> > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > [  132.602970]        do_take_over_console+0x115/0x180
> > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > [  132.604564]        register_framebuffer+0x1ee/0x300
> > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > [  132.609511]        local_pci_probe+0x42/0x80
> > [  132.610324]        pci_device_probe+0x104/0x1a0
> > [  132.611130]        really_probe+0x147/0x3c0
> > [  132.611939]        driver_probe_device+0xb6/0x100
> > [  132.612766]        device_driver_attach+0x53/0x60
> > [  132.613593]        __driver_attach+0x8c/0x150
> > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > [  132.616071]        driver_register+0x6c/0xc0
> > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > [  132.617731]        do_init_module+0x5c/0x230
> > [  132.618560]        load_module+0x2981/0x2bc0
> > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > [  132.620228]        do_syscall_64+0x5a/0x250
> > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [  132.621903]
> >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > [  132.629804]        local_pci_probe+0x42/0x80
> > [  132.630690]        pci_device_probe+0x104/0x1a0
> > [  132.631583]        really_probe+0x147/0x3c0
> > [  132.632479]        driver_probe_device+0xb6/0x100
> > [  132.633379]        device_driver_attach+0x53/0x60
> > [  132.634275]        __driver_attach+0x8c/0x150
> > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > [  132.636974]        driver_register+0x6c/0xc0
> > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > [  132.638765]        do_init_module+0x5c/0x230
> > [  132.639654]        load_module+0x2981/0x2bc0
> > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > [  132.641372]        do_syscall_64+0x5a/0x250
> > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > [  132.643022]
> >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > [  132.645469]        lock_acquire+0xad/0x370
> > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > [  132.652594]        process_one_work+0x23c/0x580
> > [  132.653402]        worker_thread+0x50/0x3b0
> > [  132.654139]        kthread+0x12e/0x150
> > [  132.654868]        ret_from_fork+0x27/0x50
> > [  132.655598]
> >                other info that might help us debug this:
> >
> > [  132.657739] Chain exists of:
> >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> >
> > [  132.659877]  Possible unsafe locking scenario:
> >
> > [  132.661416]        CPU0                    CPU1
> > [  132.662126]        ----                    ----
> > [  132.662847]   lock(dma_fence_map);
> > [  132.663574]                                lock(crtc_ww_class_mutex);
> > [  132.664319]                                lock(dma_fence_map);
> > [  132.665063]   lock(crtc_ww_class_acquire);
> > [  132.665799]
> >                 *** DEADLOCK ***
> >
> > [  132.667965] 4 locks held by kworker/2:3/865:
> > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > [  132.671902]
> >                stack backtrace:
> > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > [  132.676046] Call Trace:
> > [  132.676897]  dump_stack+0x8f/0xd0
> > [  132.677748]  check_noncircular+0x162/0x180
> > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > [  132.680311]  lock_acquire+0xad/0x370
> > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > [  132.682021]  ? cpumask_next+0x16/0x20
> > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > [  132.683737]  ? __module_address+0x28/0xf0
> > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > [  132.693597]  process_one_work+0x23c/0x580
> > [  132.694487]  worker_thread+0x50/0x3b0
> > [  132.695373]  ? process_one_work+0x580/0x580
> > [  132.696264]  kthread+0x12e/0x150
> > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [  132.698057]  ret_from_fork+0x27/0x50
> >
> > Cc: linux-media@vger.kernel.org
> > Cc: linaro-mm-sig@lists.linaro.org
> > Cc: linux-rdma@vger.kernel.org
> > Cc: amd-gfx@lists.freedesktop.org
> > Cc: intel-gfx@lists.freedesktop.org
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 3584e29323c0..b3b84a0d3baf 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> >                 /* displays are handled separately */
> >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> >                         /* XXX handle errors */
> > +
> > +                       /*
> > +                        * This is dm_suspend, which takes modeset locks, and
> > +                        * that's a pretty good inversion against dma_fence_signal,
> > +                        * which gpu recovery is supposed to guarantee.
> > +                        *
> > +                        * Don't ask me how to fix this.
> > +                        */
> 
> We actually have a fix for this.  Will be out shortly.

Spoilers? Solid way is to sidestep the entire thing by avoiding resetting
the display block entirely. Fixing the locking while still resetting the
display is going to be really hard otoh ...
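
For reference, the inversion boils down to these two paths (condensed
from the backtraces above; the two wrapper functions and their bodies
are editorial shorthand, not actual driver code):

#include <drm/drm_atomic_helper.h>
#include <linux/dma-fence.h>

/* path 1, recorded during fbcon setup / atomic commits: modeset locks
 * (crtc_ww_class) are held when the signalling section is entered */
static void commit_path(struct drm_device *dev)
{
	bool cookie;

	/* crtc_ww_class_acquire / crtc_ww_class_mutex held by the caller */
	cookie = dma_fence_begin_signalling();	/* crtc_ww_class -> dma_fence_map */
	/* ... program the hw, publish out-fences ... */
	dma_fence_end_signalling(cookie);
}

/* path 2, gpu recovery: drm_sched_job_timedout() is annotated as a
 * signalling section, and dm_suspend() then grabs modeset locks */
static void tdr_path(struct drm_device *dev)
{
	bool cookie = dma_fence_begin_signalling();	/* dma_fence_map held */

	drm_atomic_helper_suspend(dev);	/* dma_fence_map -> crtc_ww_class: inversion */

	dma_fence_end_signalling(cookie);
}
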
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 12:58       ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12 13:12         ` Alex Deucher
  -1 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 13:12 UTC (permalink / raw)
  To: Alex Deucher, DRI Development, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, LKML,
	amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media
  Cc: Daniel Vetter

On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > ...
> > >
> > > I think it's time to stop this little exercise.
> > >
> > > The lockdep splat, for the record:
> > >
> > > [  132.583381] ======================================================
> > > [  132.584091] WARNING: possible circular locking dependency detected
> > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > [  132.585461] ------------------------------------------------------
> > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.587569]
> > >                but task is already holding lock:
> > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > [  132.589803]
> > >                which lock already depends on the new lock.
> > >
> > > [  132.592009]
> > >                the existing dependency chain (in reverse order) is:
> > > [  132.593507]
> > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > [  132.601344]        visual_init+0xce/0x130
> > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > [  132.602970]        do_take_over_console+0x115/0x180
> > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > [  132.609511]        local_pci_probe+0x42/0x80
> > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > [  132.611130]        really_probe+0x147/0x3c0
> > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > [  132.612766]        device_driver_attach+0x53/0x60
> > > [  132.613593]        __driver_attach+0x8c/0x150
> > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > [  132.616071]        driver_register+0x6c/0xc0
> > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > [  132.617731]        do_init_module+0x5c/0x230
> > > [  132.618560]        load_module+0x2981/0x2bc0
> > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [  132.621903]
> > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > [  132.629804]        local_pci_probe+0x42/0x80
> > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > [  132.631583]        really_probe+0x147/0x3c0
> > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > [  132.633379]        device_driver_attach+0x53/0x60
> > > [  132.634275]        __driver_attach+0x8c/0x150
> > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > [  132.636974]        driver_register+0x6c/0xc0
> > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > [  132.638765]        do_init_module+0x5c/0x230
> > > [  132.639654]        load_module+0x2981/0x2bc0
> > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [  132.643022]
> > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > [  132.645469]        lock_acquire+0xad/0x370
> > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > [  132.652594]        process_one_work+0x23c/0x580
> > > [  132.653402]        worker_thread+0x50/0x3b0
> > > [  132.654139]        kthread+0x12e/0x150
> > > [  132.654868]        ret_from_fork+0x27/0x50
> > > [  132.655598]
> > >                other info that might help us debug this:
> > >
> > > [  132.657739] Chain exists of:
> > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > >
> > > [  132.659877]  Possible unsafe locking scenario:
> > >
> > > [  132.661416]        CPU0                    CPU1
> > > [  132.662126]        ----                    ----
> > > [  132.662847]   lock(dma_fence_map);
> > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > [  132.664319]                                lock(dma_fence_map);
> > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > [  132.665799]
> > >                 *** DEADLOCK ***
> > >
> > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > [  132.671902]
> > >                stack backtrace:
> > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > [  132.676046] Call Trace:
> > > [  132.676897]  dump_stack+0x8f/0xd0
> > > [  132.677748]  check_noncircular+0x162/0x180
> > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > [  132.680311]  lock_acquire+0xad/0x370
> > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > [  132.683737]  ? __module_address+0x28/0xf0
> > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > [  132.693597]  process_one_work+0x23c/0x580
> > > [  132.694487]  worker_thread+0x50/0x3b0
> > > [  132.695373]  ? process_one_work+0x580/0x580
> > > [  132.696264]  kthread+0x12e/0x150
> > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [  132.698057]  ret_from_fork+0x27/0x50
> > >
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 3584e29323c0..b3b84a0d3baf 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > >                 /* displays are handled separately */
> > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > >                         /* XXX handle errors */
> > > +
> > > +                       /*
> > > +                        * This is dm_suspend, which takes modeset locks, and
> > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > +                        * which gpu recovery is supposed to guarantee.
> > > +                        *
> > > +                        * Don't ask me how to fix this.
> > > +                        */
> >
> > We actually have a fix for this.  Will be out shortly.
>
> Spoilers? Solid way is to sidestep the entire thing by avoiding resetting
> the display block entirely. Fixing the locking while still resetting the
> display is going to be really hard otoh ...

There's no way to avoid that.  On dGPUs at least a full asic reset is
a full asic reset.  The fix mostly just skips the modeset and does the
minimum amount of work necessary to get the display block into a good
state for the reset.
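
A rough sketch of that idea, purely for illustration (the in_gpu_reset
check and the two helpers here are assumed names, not the actual amdgpu
patch):

static int dm_suspend_sketch(struct amdgpu_device *adev)
{
	if (adev->in_gpu_reset) {
		/* gpu reset path: quiesce the display hw directly,
		 * without drm_atomic_helper_suspend() and hence without
		 * taking any modeset locks */
		return dm_quiesce_display_for_reset(adev);
	}

	/* regular suspend path: full atomic suspend, which does take
	 * the modeset locks */
	return dm_full_suspend(adev);
}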

Alex

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
@ 2020-05-12 13:12         ` Alex Deucher
  0 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 13:12 UTC (permalink / raw)
  To: Alex Deucher, DRI Development, linux-rdma,
	Intel Graphics Development, Maarten Lankhorst, LKML,
	amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media
  Cc: Daniel Vetter

On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > ...
> > >
> > > I think it's time to stop this little exercise.
> > >
> > > The lockdep splat, for the record:
> > >
> > > [  132.583381] ======================================================
> > > [  132.584091] WARNING: possible circular locking dependency detected
> > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > [  132.585461] ------------------------------------------------------
> > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.587569]
> > >                but task is already holding lock:
> > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > [  132.589803]
> > >                which lock already depends on the new lock.
> > >
> > > [  132.592009]
> > >                the existing dependency chain (in reverse order) is:
> > > [  132.593507]
> > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > [  132.601344]        visual_init+0xce/0x130
> > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > [  132.602970]        do_take_over_console+0x115/0x180
> > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > [  132.609511]        local_pci_probe+0x42/0x80
> > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > [  132.611130]        really_probe+0x147/0x3c0
> > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > [  132.612766]        device_driver_attach+0x53/0x60
> > > [  132.613593]        __driver_attach+0x8c/0x150
> > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > [  132.616071]        driver_register+0x6c/0xc0
> > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > [  132.617731]        do_init_module+0x5c/0x230
> > > [  132.618560]        load_module+0x2981/0x2bc0
> > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [  132.621903]
> > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > [  132.629804]        local_pci_probe+0x42/0x80
> > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > [  132.631583]        really_probe+0x147/0x3c0
> > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > [  132.633379]        device_driver_attach+0x53/0x60
> > > [  132.634275]        __driver_attach+0x8c/0x150
> > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > [  132.636974]        driver_register+0x6c/0xc0
> > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > [  132.638765]        do_init_module+0x5c/0x230
> > > [  132.639654]        load_module+0x2981/0x2bc0
> > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > [  132.643022]
> > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > [  132.645469]        lock_acquire+0xad/0x370
> > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > [  132.652594]        process_one_work+0x23c/0x580
> > > [  132.653402]        worker_thread+0x50/0x3b0
> > > [  132.654139]        kthread+0x12e/0x150
> > > [  132.654868]        ret_from_fork+0x27/0x50
> > > [  132.655598]
> > >                other info that might help us debug this:
> > >
> > > [  132.657739] Chain exists of:
> > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > >
> > > [  132.659877]  Possible unsafe locking scenario:
> > >
> > > [  132.661416]        CPU0                    CPU1
> > > [  132.662126]        ----                    ----
> > > [  132.662847]   lock(dma_fence_map);
> > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > [  132.664319]                                lock(dma_fence_map);
> > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > [  132.665799]
> > >                 *** DEADLOCK ***
> > >
> > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > [  132.671902]
> > >                stack backtrace:
> > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > [  132.676046] Call Trace:
> > > [  132.676897]  dump_stack+0x8f/0xd0
> > > [  132.677748]  check_noncircular+0x162/0x180
> > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > [  132.680311]  lock_acquire+0xad/0x370
> > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > [  132.683737]  ? __module_address+0x28/0xf0
> > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > [  132.693597]  process_one_work+0x23c/0x580
> > > [  132.694487]  worker_thread+0x50/0x3b0
> > > [  132.695373]  ? process_one_work+0x580/0x580
> > > [  132.696264]  kthread+0x12e/0x150
> > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > [  132.698057]  ret_from_fork+0x27/0x50
> > >
> > > Cc: linux-media@vger.kernel.org
> > > Cc: linaro-mm-sig@lists.linaro.org
> > > Cc: linux-rdma@vger.kernel.org
> > > Cc: amd-gfx@lists.freedesktop.org
> > > Cc: intel-gfx@lists.freedesktop.org
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 3584e29323c0..b3b84a0d3baf 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > >                 /* displays are handled separately */
> > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > >                         /* XXX handle errors */
> > > +
> > > +                       /*
> > > +                        * This is dm_suspend, which takes modeset locks, and
> > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > +                        * which gpu recovery is supposed to guarantee.
> > > +                        *
> > > +                        * Don't ask me how to fix this.
> > > +                        */
> >
> > We actually have a fix for this.  Will be out shortly.
>
> Spoilers? The solid way is to sidestep the entire thing by not resetting
> the display block at all. Fixing the locking while still resetting the
> display is going to be really hard otoh ...

There's no way to avoid resetting the display.  On dGPUs at least a full
asic reset is a full asic reset.  The fix mostly just skips the modeset
and does the minimum amount necessary to get the display block into a
good state for the reset.

Alex

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread
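
The #0 step in the chain above exists because the scheduler's timeout work
is already inside the fence-signalling critical section when it calls back
into the driver. Roughly the shape of that annotation, as a sketch rather
than the actual patch from this series (job lookup and error handling
omitted, sketch_job_timedout() is an invented name):

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/*
 * Sketch only: the timeout handler wraps the driver callback in the
 * signalling section, so every lock taken by ->timedout_job() (for
 * amdgpu: gpu recovery, and from there dm_suspend() and its modeset
 * locks) is recorded against dma_fence_map.
 */
static void sketch_job_timedout(struct drm_gpu_scheduler *sched,
                                struct drm_sched_job *bad_job)
{
        bool cookie = dma_fence_begin_signalling();

        sched->ops->timedout_job(bad_job);

        dma_fence_end_signalling(cookie);
}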

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 13:12         ` Alex Deucher
  (?)
  (?)
@ 2020-05-12 13:17           ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12 13:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > ...
> > > >
> > > > I think it's time to stop this little exercise.
> > > >
> > > > The lockdep splat, for the record:
> > > >
> > > > [  132.583381] ======================================================
> > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > [  132.585461] ------------------------------------------------------
> > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.587569]
> > > >                but task is already holding lock:
> > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > [  132.589803]
> > > >                which lock already depends on the new lock.
> > > >
> > > > [  132.592009]
> > > >                the existing dependency chain (in reverse order) is:
> > > > [  132.593507]
> > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > [  132.601344]        visual_init+0xce/0x130
> > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > [  132.621903]
> > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > [  132.643022]
> > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > [  132.654139]        kthread+0x12e/0x150
> > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > [  132.655598]
> > > >                other info that might help us debug this:
> > > >
> > > > [  132.657739] Chain exists of:
> > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > >
> > > > [  132.659877]  Possible unsafe locking scenario:
> > > >
> > > > [  132.661416]        CPU0                    CPU1
> > > > [  132.662126]        ----                    ----
> > > > [  132.662847]   lock(dma_fence_map);
> > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > [  132.664319]                                lock(dma_fence_map);
> > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > [  132.665799]
> > > >                 *** DEADLOCK ***
> > > >
> > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > [  132.671902]
> > > >                stack backtrace:
> > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > [  132.676046] Call Trace:
> > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > [  132.696264]  kthread+0x12e/0x150
> > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > >
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Cc: linux-rdma@vger.kernel.org
> > > > Cc: amd-gfx@lists.freedesktop.org
> > > > Cc: intel-gfx@lists.freedesktop.org
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > >                 /* displays are handled separately */
> > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > >                         /* XXX handle errors */
> > > > +
> > > > +                       /*
> > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > +                        * which gpu recovery is supposed to guarantee.
> > > > +                        *
> > > > +                        * Don't ask me how to fix this.
> > > > +                        */
> > >
> > > We actually have a fix for this.  Will be out shortly.
> >
> > Spoilers? The solid way is to sidestep the entire thing by not resetting
> > the display block at all. Fixing the locking while still resetting the
> > display is going to be really hard otoh ...
>
> There's no way to avoid resetting the display.  On dGPUs at least a full
> asic reset is a full asic reset.  The fix mostly just skips the modeset
> and does the minimum amount necessary to get the display block into a
> good state for the reset.

But how do you restore the display afterwards? "[RFC 13/17]
drm/scheduler: use dma-fence annotations in tdr work" earlier in the
series has some ideas from me for at least some of the tdr problems
that show up when the display gets reset along with everything else.
Whacking the display while a modeset/flip/whatever is ongoing
concurrently doesn't sound like a good idea, so I'm not sure how you
can do that without taking the drm_modeset_locks. And once you do
that, it's deadlock time.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread
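
To make the restore problem concrete: a reset that takes the display down
would want to bracket the asic reset roughly as below, and both helpers
take modeset locks internally, which is exactly what cannot happen from
inside the fence-signalling section that tdr runs in. This is only a
sketch under that assumption, not the amdgpu fix Alex refers to;
sketch_reset_with_display() is an invented name.

#include <linux/err.h>
#include <drm/drm_atomic_helper.h>

/*
 * Sketch: suspend the display, reset the asic, restore the display.
 * drm_atomic_helper_suspend()/resume() acquire crtc_ww_class internally,
 * so calling this from the tdr path (which holds dma_fence_map) recreates
 * the inversion from the splat.
 */
static int sketch_reset_with_display(struct drm_device *dev)
{
        struct drm_atomic_state *state;

        state = drm_atomic_helper_suspend(dev); /* modeset locks inside */
        if (IS_ERR(state))
                return PTR_ERR(state);

        /* ... full asic reset goes here ... */

        return drm_atomic_helper_resume(dev, state);    /* modeset locks again */
}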

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
@ 2020-05-12 13:17           ` Daniel Vetter
  0 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12 13:17 UTC (permalink / raw)
  To: Alex Deucher
  Cc: linux-rdma, Intel Graphics Development, LKML, amd-gfx list,
	Chris Wilson, moderated list:DMA BUFFER SHARING FRAMEWORK,
	DRI Development, Daniel Vetter, Christian König,
	linux-media

On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > >
> > > > ...
> > > >
> > > > I think it's time to stop this little exercise.
> > > >
> > > > The lockdep splat, for the record:
> > > >
> > > > [  132.583381] ======================================================
> > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > [  132.585461] ------------------------------------------------------
> > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.587569]
> > > >                but task is already holding lock:
> > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > [  132.589803]
> > > >                which lock already depends on the new lock.
> > > >
> > > > [  132.592009]
> > > >                the existing dependency chain (in reverse order) is:
> > > > [  132.593507]
> > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > [  132.601344]        visual_init+0xce/0x130
> > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > [  132.621903]
> > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > [  132.643022]
> > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > [  132.654139]        kthread+0x12e/0x150
> > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > [  132.655598]
> > > >                other info that might help us debug this:
> > > >
> > > > [  132.657739] Chain exists of:
> > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > >
> > > > [  132.659877]  Possible unsafe locking scenario:
> > > >
> > > > [  132.661416]        CPU0                    CPU1
> > > > [  132.662126]        ----                    ----
> > > > [  132.662847]   lock(dma_fence_map);
> > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > [  132.664319]                                lock(dma_fence_map);
> > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > [  132.665799]
> > > >                 *** DEADLOCK ***
> > > >
> > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > [  132.671902]
> > > >                stack backtrace:
> > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > [  132.676046] Call Trace:
> > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > [  132.696264]  kthread+0x12e/0x150
> > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > >
> > > > Cc: linux-media@vger.kernel.org
> > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > Cc: linux-rdma@vger.kernel.org
> > > > Cc: amd-gfx@lists.freedesktop.org
> > > > Cc: intel-gfx@lists.freedesktop.org
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > Cc: Christian König <christian.koenig@amd.com>
> > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > >                 /* displays are handled separately */
> > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > >                         /* XXX handle errors */
> > > > +
> > > > +                       /*
> > > > +                        * This is dm_suspend, which calls modeset locks, and
> > > > +                        * that a pretty good inversion against dma_fence_signal
> > > > +                        * which gpu recovery is supposed to guarantee.
> > > > +                        *
> > > > +                        * Dont ask me how to fix this.
> > > > +                        */
> > >
> > > We actually have a fix for this.  Will be out shortly.
> >
> > Spoilers? Solid way is to sidesteck the entire thing by avoiding to reset
> > the display block entirely. Fixing the locking while still resetting the
> > display is going to be really hard otoh ...
>
> There's no way to avoid that.  On dGPUs at least a full asic reset is
> a full asic reset.  Mostly just skips the modeset and does the minimum
> amount necessary to get the display block into a good state for reset.

But how do you restore the display afterwards? "[RFC 13/17]
drm/scheduler: use dma-fence annotations in tdr work" earlier in the
series has some ideas from me for at least some of the problems for
tdr when the display gets reset along with the GPU. Whacking the
display while a modeset/flip/whatever is ongoing concurrently doesn't
sound like a good idea, so I'm not sure how you can do that without
taking the drm_modeset_locks. And once you do that, it's deadlock
time.
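
To make the inversion concrete, here's a minimal sketch of the two
paths involved. Purely illustrative: the example_* functions are made
up, only dma_fence_begin/end_signalling() (from this series) and
drm_atomic_helper_suspend()/resume() are real interfaces.

#include <linux/dma-fence.h>
#include <drm/drm_atomic_helper.h>
#include <drm/drm_device.h>

/* Path 1: atomic commit. The caller already holds the crtc_ww_class
 * modeset locks, then we enter the fence signalling critical section
 * and eventually dma_fence_signal() the commit out-fences.
 * Ordering: crtc_ww_class -> dma_fence_map. */
static void example_commit_tail(struct drm_device *dev)
{
        bool cookie = dma_fence_begin_signalling();

        /* ... program the hw, signal the out-fences ... */

        dma_fence_end_signalling(cookie);
}

/* Path 2: tdr / gpu recovery. We must guarantee that the hung job's
 * fences get signalled, so we're inside the signalling section, and
 * then suspend the display, which acquires crtc_ww_class.
 * Ordering: dma_fence_map -> crtc_ww_class, i.e. the inversion that
 * lockdep flags above. */
static void example_tdr_work(struct drm_device *dev)
{
        struct drm_atomic_state *state;
        bool cookie = dma_fence_begin_signalling();

        state = drm_atomic_helper_suspend(dev);

        /* ... asic reset, resubmit jobs, signal their fences ... */

        drm_atomic_helper_resume(dev, state);
        dma_fence_end_signalling(cookie);
}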
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 13:17           ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12 13:29             ` Alex Deucher
  -1 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 13:29 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 9:17 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > I think it's time to stop this little exercise.
> > > > >
> > > > > The lockdep splat, for the record:
> > > > >
> > > > > [  132.583381] ======================================================
> > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > [  132.585461] ------------------------------------------------------
> > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.587569]
> > > > >                but task is already holding lock:
> > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.589803]
> > > > >                which lock already depends on the new lock.
> > > > >
> > > > > [  132.592009]
> > > > >                the existing dependency chain (in reverse order) is:
> > > > > [  132.593507]
> > > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.621903]
> > > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.643022]
> > > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > [  132.655598]
> > > > >                other info that might help us debug this:
> > > > >
> > > > > [  132.657739] Chain exists of:
> > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > >
> > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > >
> > > > > [  132.661416]        CPU0                    CPU1
> > > > > [  132.662126]        ----                    ----
> > > > > [  132.662847]   lock(dma_fence_map);
> > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > [  132.665799]
> > > > >                 *** DEADLOCK ***
> > > > >
> > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > [  132.671902]
> > > > >                stack backtrace:
> > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > [  132.676046] Call Trace:
> > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > >
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Cc: linux-rdma@vger.kernel.org
> > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > >                 /* displays are handled separately */
> > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > >                         /* XXX handle errors */
> > > > > +
> > > > > +                       /*
> > > > > +                        * This is dm_suspend, which calls modeset locks, and
> > > > > +                        * that a pretty good inversion against dma_fence_signal
> > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > +                        *
> > > > > +                        * Dont ask me how to fix this.
> > > > > +                        */
> > > >
> > > > We actually have a fix for this.  Will be out shortly.
> > >
> > > Spoilers? Solid way is to sidesteck the entire thing by avoiding to reset
> > > the display block entirely. Fixing the locking while still resetting the
> > > display is going to be really hard otoh ...
> >
> > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > a full asic reset.  Mostly just skips the modeset and does the minimum
> > amount necessary to get the display block into a good state for reset.
>
> But how do you restore the display afterwards? "[RFC 13/17]
> drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> series has some ideas from me for at least some of the problems for
> tdr when the display gets reset along with the GPU. Whacking the
> display while a modeset/flip/whatever is ongoing concurrently doesn't
> sound like a good idea, so I'm not sure how you can do that without
> taking the drm_modeset_locks. And once you do that, it's deadlock
> time.

We cache the current display hw state and restore it after the reset
without going through the atomic interfaces, so everything is back the
way it was before the reset.  IIRC, when we reset the rest of the
GPU, we disconnect the fences, and then re-attach them after a reset.
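
In rough pseudo-code the flow would look something like the sketch
below; all the example_* helpers are invented for illustration and are
not the actual amdgpu functions:

struct amdgpu_device;
struct example_dc_snapshot;

/* invented helpers, not real amdgpu API */
struct example_dc_snapshot *example_dc_cache_hw_state(struct amdgpu_device *adev);
void example_dc_restore_hw_state(struct amdgpu_device *adev,
                                 struct example_dc_snapshot *snap);
void example_detach_ring_fences(struct amdgpu_device *adev);
void example_reattach_ring_fences(struct amdgpu_device *adev);
void example_full_asic_reset(struct amdgpu_device *adev);

static int example_gpu_recover(struct amdgpu_device *adev)
{
        struct example_dc_snapshot *snap;

        /* save the current display hw state directly, without taking
         * drm_modeset_lock or going through an atomic commit */
        snap = example_dc_cache_hw_state(adev);

        /* detach the hw fences so the asic reset doesn't complete them
         * behind the scheduler's back */
        example_detach_ring_fences(adev);

        example_full_asic_reset(adev);

        /* reprogram the display from the cached state, again bypassing
         * the atomic interfaces */
        example_dc_restore_hw_state(adev, snap);

        /* re-attach the fences; the re-submitted jobs will signal them */
        example_reattach_ring_fences(adev);

        return 0;
}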

Alex

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [Intel-gfx] [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
@ 2020-05-12 13:29             ` Alex Deucher
  0 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 13:29 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: linux-rdma, Intel Graphics Development, LKML, amd-gfx list,
	Chris Wilson, moderated list:DMA BUFFER SHARING FRAMEWORK,
	DRI Development, Daniel Vetter, Christian König,
	linux-media

On Tue, May 12, 2020 at 9:17 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > I think it's time to stop this little exercise.
> > > > >
> > > > > The lockdep splat, for the record:
> > > > >
> > > > > [  132.583381] ======================================================
> > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > [  132.585461] ------------------------------------------------------
> > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.587569]
> > > > >                but task is already holding lock:
> > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.589803]
> > > > >                which lock already depends on the new lock.
> > > > >
> > > > > [  132.592009]
> > > > >                the existing dependency chain (in reverse order) is:
> > > > > [  132.593507]
> > > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.621903]
> > > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > [  132.643022]
> > > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > [  132.655598]
> > > > >                other info that might help us debug this:
> > > > >
> > > > > [  132.657739] Chain exists of:
> > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > >
> > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > >
> > > > > [  132.661416]        CPU0                    CPU1
> > > > > [  132.662126]        ----                    ----
> > > > > [  132.662847]   lock(dma_fence_map);
> > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > [  132.665799]
> > > > >                 *** DEADLOCK ***
> > > > >
> > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > [  132.671902]
> > > > >                stack backtrace:
> > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > [  132.676046] Call Trace:
> > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > [  132.698057]  ret_from_fork+0x27/0x50
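For reference, the dma_fence_map entry in the chain above comes from the fence-signalling annotations this series adds: dma_fence_begin_signalling() behaves like taking a pseudo-lock, so calling it while already holding a crtc ww_mutex (as the fbdev commit in chain #2 does) records crtc_ww_class_mutex -> dma_fence_map, and the tdr path, which holds dma_fence_map, then closes the cycle when it asks for modeset locks. A minimal sketch of the annotation pattern (commit_done_fence is only a placeholder for whatever fence the commit eventually signals):

    bool cookie;

    /* lockdep treats everything from here to end_signalling as running
     * with the dma_fence_map pseudo-lock held */
    cookie = dma_fence_begin_signalling();

    /* ... kick off the hw work that completes the commit ... */

    /* placeholder for whatever fence gets signalled at the end */
    dma_fence_signal(commit_done_fence);

    dma_fence_end_signalling(cookie);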
> > > > >
> > > > > Cc: linux-media@vger.kernel.org
> > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > Cc: linux-rdma@vger.kernel.org
> > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > >                 /* displays are handled separately */
> > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > >                         /* XXX handle errors */
> > > > > +
> > > > > +                       /*
> > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > +                        *
> > > > > +                        * Don't ask me how to fix this.
> > > > > +                        */
> > > >
> > > > We actually have a fix for this.  Will be out shortly.
> > >
> > > Spoilers? The solid way is to sidestep the entire thing by avoiding resetting
> > > the display block entirely. Fixing the locking while still resetting the
> > > display is going to be really hard otoh ...
> >
> > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > a full asic reset.  Mostly just skips the modeset and does the minimum
> > amount necessary to get the display block into a good state for reset.
>
> But how do you restore the display afterwards? "[RFC 13/17]
> drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> series has some ideas from me for at least
> some of the problems for tdr when the display gets reset along with it.
> Whacking the display while a modeset/flip/whatever is ongoing
> concurrently doesn't sound like a good idea, so not sure how you can
> do that without taking the drm_modeset_locks. And once you do that,
> it's deadlock time.

We cache the current display hw state and restore it after the reset
without going through the atomic interfaces so everything is back the
way it was before the reset.  IIRC, when we reset the rest of the
GPU, we disconnect the fences, and then re-attach them after the reset.
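A very rough sketch of what that flow could look like, purely to illustrate the idea; cache_display_hw_state()/restore_display_hw_state() are invented names here, not the actual amdgpu entry points:

    struct display_hw_state *state;

    /* snapshot the current display hw state (hypothetical helper) */
    state = cache_display_hw_state(adev);

    /* ... stop the schedulers, do the full asic reset ... */

    /* reprogram the registers directly, without an atomic commit and
     * therefore without taking any drm_modeset_locks (hypothetical helper) */
    restore_display_hw_state(adev, state);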

Alex

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 13:29             ` Alex Deucher
@ 2020-05-12 13:45               ` Daniel Vetter
  -1 siblings, 0 replies; 203+ messages in thread
From: Daniel Vetter @ 2020-05-12 13:45 UTC (permalink / raw)
  To: Alex Deucher
  Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 3:29 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 9:17 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > I think it's time to stop this little exercise.
> > > > > >
> > > > > > The lockdep splat, for the record:
> > > > > >
> > > > > > [  132.583381] ======================================================
> > > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > > [  132.585461] ------------------------------------------------------
> > > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.587569]
> > > > > >                but task is already holding lock:
> > > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > [  132.589803]
> > > > > >                which lock already depends on the new lock.
> > > > > >
> > > > > > [  132.592009]
> > > > > >                the existing dependency chain (in reverse order) is:
> > > > > > [  132.593507]
> > > > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > [  132.621903]
> > > > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > [  132.643022]
> > > > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > > [  132.655598]
> > > > > >                other info that might help us debug this:
> > > > > >
> > > > > > [  132.657739] Chain exists of:
> > > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > > >
> > > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > > >
> > > > > > [  132.661416]        CPU0                    CPU1
> > > > > > [  132.662126]        ----                    ----
> > > > > > [  132.662847]   lock(dma_fence_map);
> > > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > > [  132.665799]
> > > > > >                 *** DEADLOCK ***
> > > > > >
> > > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > > [  132.671902]
> > > > > >                stack backtrace:
> > > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > > [  132.676046] Call Trace:
> > > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > > >
> > > > > > Cc: linux-media@vger.kernel.org
> > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > Cc: linux-rdma@vger.kernel.org
> > > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > > >                 /* displays are handled separately */
> > > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > > >                         /* XXX handle errors */
> > > > > > +
> > > > > > +                       /*
> > > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > > +                        *
> > > > > > +                        * Don't ask me how to fix this.
> > > > > > +                        */
> > > > >
> > > > > We actually have a fix for this.  Will be out shortly.
> > > >
> > > > Spoilers? The solid way is to sidestep the entire thing by avoiding resetting
> > > > the display block entirely. Fixing the locking while still resetting the
> > > > display is going to be really hard otoh ...
> > >
> > > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > > a full asic reset.  Mostly just skips the modeset and does the minimum
> > > amount necessary to get the display block into a good state for reset.
> >
> > But how do you restore the display afterwards? "[RFC 13/17]
> > drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> > series has some ideas from me for at least
> > some of the problems for tdr when the display gets reset along with it.
> > Whacking the display while a modeset/flip/whatever is ongoing
> > concurrently doesn't sound like a good idea, so not sure how you can
> > do that without taking the drm_modeset_locks. And once you do that,
> > it's deadlock time.
>
> We cache the current display hw state and restore it after the reset
> without going through the atomic interfaces so everything is back the
> way it was before the reset.

Hm this sounds interesting ... how do you make sure a concurrent
atomic update doesn't trample over the same mmio registers while you
do that dance?

> IIRC, when we reset the rest of the
> GPU, we disconnect the fences, and then re-attach them after the reset.

Where is that code? Since I'm not sure how you can make that work
without getting stuck in another kind of deadlock in tdr. But maybe
the code has some clever trick to pull that off somehow.
-Daniel
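For what it's worth, the "disconnect the fences ... re-attach them" step usually maps onto the generic drm scheduler recovery sequence, roughly as below; this is a sketch of the common drm_sched pattern, not a claim about what the upcoming amdgpu fix actually does:

    /* for each affected scheduler/ring (iteration elided): */

    /* park the scheduler and drop the callbacks from pending hw fences */
    drm_sched_stop(sched, bad_job);

    /* ... asic reset ... */

    /* run the pending jobs again, which creates fresh hw fences */
    drm_sched_resubmit_jobs(sched);

    /* re-add the fence callbacks and unpark the scheduler */
    drm_sched_start(sched, true);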

>
> Alex
>
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

> > > > > > [  132.676046] Call Trace:
> > > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > > >
> > > > > > Cc: linux-media@vger.kernel.org
> > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > Cc: linux-rdma@vger.kernel.org
> > > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > > >                 /* displays are handled separately */
> > > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > > >                         /* XXX handle errors */
> > > > > > +
> > > > > > +                       /*
> > > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > > +                        *
> > > > > > +                        * Don't ask me how to fix this.
> > > > > > +                        */
> > > > >
> > > > > We actually have a fix for this.  Will be out shortly.
> > > >
> > > > Spoilers? A solid way is to sidestep the entire thing by avoiding resetting
> > > > the display block entirely. Fixing the locking while still resetting the
> > > > display is going to be really hard otoh ...
> > >
> > > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > > a full asic reset.  We mostly just skip the modeset and do the minimum
> > > amount necessary to get the display block into a good state for the reset.
> >
> > But how do you restore the display afterwards? "[RFC 13/17]
> > drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> > series has some ideas from me for at least some of the problems tdr
> > runs into when the display gets reset along with the rest of the GPU.
> > Whacking the display while a modeset/flip/whatever is ongoing
> > concurrently doesn't sound like a good idea, so I'm not sure how you
> > can do that without taking the drm_modeset_locks. And once you do
> > that, it's deadlock time.
>
> We cache the current display hw state and restore it after the reset
> without going through the atomic interfaces so everything is back the
> way it was before the reset.

Hm, this sounds interesting ... how do you make sure a concurrent
atomic update doesn't trample over the same mmio registers while you
do that dance?

> IIRC, when we reset the rest of the
> GPU, we disconnect the fences, and then re-attach them after the reset.

Where is that code? I'm not sure how you can make that work without
getting stuck in another kind of deadlock in tdr, but maybe the code
has some clever trick to pull that off somehow.
-Daniel

>
> Alex
>
> > -Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
  2020-05-12 13:45               ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12 14:24                 ` Alex Deucher
  -1 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 14:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: DRI Development, linux-rdma, Intel Graphics Development,
	Maarten Lankhorst, LKML, amd-gfx list, Chris Wilson,
	moderated list:DMA BUFFER SHARING FRAMEWORK, Daniel Vetter,
	Christian König, linux-media

On Tue, May 12, 2020 at 9:45 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 3:29 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 9:17 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > > >
> > > > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > I think it's time to stop this little exercise.
> > > > > > >
> > > > > > > The lockdep splat, for the record:
> > > > > > >
> > > > > > > [  132.583381] ======================================================
> > > > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > > > [  132.585461] ------------------------------------------------------
> > > > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.587569]
> > > > > > >                but task is already holding lock:
> > > > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > > [  132.589803]
> > > > > > >                which lock already depends on the new lock.
> > > > > > >
> > > > > > > [  132.592009]
> > > > > > >                the existing dependency chain (in reverse order) is:
> > > > > > > [  132.593507]
> > > > > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > > [  132.621903]
> > > > > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > > [  132.643022]
> > > > > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > > > [  132.655598]
> > > > > > >                other info that might help us debug this:
> > > > > > >
> > > > > > > [  132.657739] Chain exists of:
> > > > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > > > >
> > > > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > > > >
> > > > > > > [  132.661416]        CPU0                    CPU1
> > > > > > > [  132.662126]        ----                    ----
> > > > > > > [  132.662847]   lock(dma_fence_map);
> > > > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > > > [  132.665799]
> > > > > > >                 *** DEADLOCK ***
> > > > > > >
> > > > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > > > [  132.671902]
> > > > > > >                stack backtrace:
> > > > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > > > [  132.676046] Call Trace:
> > > > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > > > >
> > > > > > > Cc: linux-media@vger.kernel.org
> > > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > > Cc: linux-rdma@vger.kernel.org
> > > > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > > > >                 /* displays are handled separately */
> > > > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > > > >                         /* XXX handle errors */
> > > > > > > +
> > > > > > > +                       /*
> > > > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > > > +                        *
> > > > > > > +                        * Don't ask me how to fix this.
> > > > > > > +                        */
> > > > > >
> > > > > > We actually have a fix for this.  Will be out shortly.
> > > > >
> > > > > Spoilers? A solid way is to sidestep the entire thing by avoiding resetting
> > > > > the display block entirely. Fixing the locking while still resetting the
> > > > > display is going to be really hard otoh ...
> > > >
> > > > There's no way to avoid that.  On dGPUs at least a full asic reset is
> > > > a full asic reset.  We mostly just skip the modeset and do the minimum
> > > > amount necessary to get the display block into a good state for the reset.
> > >
> > > But how do you restore the display afterwards? "[RFC 13/17]
> > > drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> > > series has some ideas from me for at least some of the problems tdr
> > > runs into when the display gets reset along with the rest of the GPU.
> > > Whacking the display while a modeset/flip/whatever is ongoing
> > > concurrently doesn't sound like a good idea, so I'm not sure how you
> > > can do that without taking the drm_modeset_locks. And once you do
> > > that, it's deadlock time.
> >
> > We cache the current display hw state and restore it after the reset
> > without going through the atomic interfaces so everything is back the
> > way it was before the reset.
>
> Hm, this sounds interesting ... how do you make sure a concurrent
> atomic update doesn't trample over the same mmio registers while you
> do that dance?

We take the dm->dc_lock across the reset.
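
A rough sketch of what that pattern looks like, purely for
illustration - dm_cache_hw_state() and dm_restore_hw_state() are made
up names here, not the actual DC entry points; the point is only that
the whole cache/reset/restore sequence runs under dm->dc_lock, the
same mutex a concurrent atomic commit has to take before it touches
the display registers:

static void dm_display_reset_sketch(struct amdgpu_display_manager *dm)
{
        mutex_lock(&dm->dc_lock);

        /* snapshot whatever display hw state we need to bring back */
        dm_cache_hw_state(dm);

        /* ... the full asic reset happens here, display block included ... */

        /* re-program the cached state, bypassing the atomic interfaces */
        dm_restore_hw_state(dm);

        mutex_unlock(&dm->dc_lock);
}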

>
> > IIRC, when we reset the rest of the
> > GPU, we disconnect the fences, and then re-attach them after the reset.
>
> Where is that code? I'm not sure how you can make that work without
> getting stuck in another kind of deadlock in tdr, but maybe the code
> has some clever trick to pull that off somehow.

The GPU scheduler.  drm_sched_stop, drm_sched_resubmit_jobs, and
drm_sched_start.
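
For anyone following along, the usual shape of that dance in a
driver's recovery path is roughly the sketch below. drm_sched_stop(),
drm_sched_resubmit_jobs() and drm_sched_start() are the real scheduler
entry points; the wrapper function and the single-ring flow are a
simplified illustration, the real amdgpu_device_gpu_recover() iterates
over all rings and has a lot more error handling:

static void gpu_recover_sketch(struct drm_gpu_scheduler *sched,
                               struct drm_sched_job *bad_job)
{
        /*
         * Park the scheduler and detach the hw fences from the pending
         * jobs so nothing completes behind our back during the reset.
         */
        drm_sched_stop(sched, bad_job);

        /* ... hardware reset of the asic goes here ... */

        /*
         * Re-create hw fences for the still-pending jobs and push them
         * to the hw ring again.
         */
        drm_sched_resubmit_jobs(sched);

        /* Unpark the scheduler; fence signalling and timeout handling resume. */
        drm_sched_start(sched, true);
}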

Alex


> -Daniel
>
> >
> > Alex
> >
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [Intel-gfx] [RFC 16/17] drm/amdgpu: gpu recovery does full modesets
@ 2020-05-12 14:24                 ` Alex Deucher
  0 siblings, 0 replies; 203+ messages in thread
From: Alex Deucher @ 2020-05-12 14:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: linux-rdma, Intel Graphics Development, LKML, amd-gfx list,
	Chris Wilson, moderated list:DMA BUFFER SHARING FRAMEWORK,
	DRI Development, Daniel Vetter, Christian König,
	linux-media

On Tue, May 12, 2020 at 9:45 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Tue, May 12, 2020 at 3:29 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 9:17 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > On Tue, May 12, 2020 at 3:12 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> > > >
> > > > On Tue, May 12, 2020 at 8:58 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > >
> > > > > On Tue, May 12, 2020 at 08:54:45AM -0400, Alex Deucher wrote:
> > > > > > On Tue, May 12, 2020 at 5:00 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > I think it's time to stop this little exercise.
> > > > > > >
> > > > > > > The lockdep splat, for the record:
> > > > > > >
> > > > > > > [  132.583381] ======================================================
> > > > > > > [  132.584091] WARNING: possible circular locking dependency detected
> > > > > > > [  132.584775] 5.7.0-rc3+ #346 Tainted: G        W
> > > > > > > [  132.585461] ------------------------------------------------------
> > > > > > > [  132.586184] kworker/2:3/865 is trying to acquire lock:
> > > > > > > [  132.586857] ffffc90000677c70 (crtc_ww_class_acquire){+.+.}-{0:0}, at: drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.587569]
> > > > > > >                but task is already holding lock:
> > > > > > > [  132.589044] ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > > [  132.589803]
> > > > > > >                which lock already depends on the new lock.
> > > > > > >
> > > > > > > [  132.592009]
> > > > > > >                the existing dependency chain (in reverse order) is:
> > > > > > > [  132.593507]
> > > > > > >                -> #2 (dma_fence_map){++++}-{0:0}:
> > > > > > > [  132.595019]        dma_fence_begin_signalling+0x50/0x60
> > > > > > > [  132.595767]        drm_atomic_helper_commit+0xa1/0x180 [drm_kms_helper]
> > > > > > > [  132.596567]        drm_client_modeset_commit_atomic+0x1ea/0x250 [drm]
> > > > > > > [  132.597420]        drm_client_modeset_commit_locked+0x55/0x190 [drm]
> > > > > > > [  132.598178]        drm_client_modeset_commit+0x24/0x40 [drm]
> > > > > > > [  132.598948]        drm_fb_helper_restore_fbdev_mode_unlocked+0x4b/0xa0 [drm_kms_helper]
> > > > > > > [  132.599738]        drm_fb_helper_set_par+0x30/0x40 [drm_kms_helper]
> > > > > > > [  132.600539]        fbcon_init+0x2e8/0x660
> > > > > > > [  132.601344]        visual_init+0xce/0x130
> > > > > > > [  132.602156]        do_bind_con_driver+0x1bc/0x2b0
> > > > > > > [  132.602970]        do_take_over_console+0x115/0x180
> > > > > > > [  132.603763]        do_fbcon_takeover+0x58/0xb0
> > > > > > > [  132.604564]        register_framebuffer+0x1ee/0x300
> > > > > > > [  132.605369]        __drm_fb_helper_initial_config_and_unlock+0x36e/0x520 [drm_kms_helper]
> > > > > > > [  132.606187]        amdgpu_fbdev_init+0xb3/0xf0 [amdgpu]
> > > > > > > [  132.607032]        amdgpu_device_init.cold+0xe90/0x1677 [amdgpu]
> > > > > > > [  132.607862]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > > [  132.608697]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > > [  132.609511]        local_pci_probe+0x42/0x80
> > > > > > > [  132.610324]        pci_device_probe+0x104/0x1a0
> > > > > > > [  132.611130]        really_probe+0x147/0x3c0
> > > > > > > [  132.611939]        driver_probe_device+0xb6/0x100
> > > > > > > [  132.612766]        device_driver_attach+0x53/0x60
> > > > > > > [  132.613593]        __driver_attach+0x8c/0x150
> > > > > > > [  132.614419]        bus_for_each_dev+0x7b/0xc0
> > > > > > > [  132.615249]        bus_add_driver+0x14c/0x1f0
> > > > > > > [  132.616071]        driver_register+0x6c/0xc0
> > > > > > > [  132.616902]        do_one_initcall+0x5d/0x2f0
> > > > > > > [  132.617731]        do_init_module+0x5c/0x230
> > > > > > > [  132.618560]        load_module+0x2981/0x2bc0
> > > > > > > [  132.619391]        __do_sys_finit_module+0xaa/0x110
> > > > > > > [  132.620228]        do_syscall_64+0x5a/0x250
> > > > > > > [  132.621064]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > > [  132.621903]
> > > > > > >                -> #1 (crtc_ww_class_mutex){+.+.}-{3:3}:
> > > > > > > [  132.623587]        __ww_mutex_lock.constprop.0+0xcc/0x10c0
> > > > > > > [  132.624448]        ww_mutex_lock+0x43/0xb0
> > > > > > > [  132.625315]        drm_modeset_lock+0x44/0x120 [drm]
> > > > > > > [  132.626184]        drmm_mode_config_init+0x2db/0x8b0 [drm]
> > > > > > > [  132.627098]        amdgpu_device_init.cold+0xbd1/0x1677 [amdgpu]
> > > > > > > [  132.628007]        amdgpu_driver_load_kms+0x5a/0x200 [amdgpu]
> > > > > > > [  132.628920]        amdgpu_pci_probe+0xf7/0x180 [amdgpu]
> > > > > > > [  132.629804]        local_pci_probe+0x42/0x80
> > > > > > > [  132.630690]        pci_device_probe+0x104/0x1a0
> > > > > > > [  132.631583]        really_probe+0x147/0x3c0
> > > > > > > [  132.632479]        driver_probe_device+0xb6/0x100
> > > > > > > [  132.633379]        device_driver_attach+0x53/0x60
> > > > > > > [  132.634275]        __driver_attach+0x8c/0x150
> > > > > > > [  132.635170]        bus_for_each_dev+0x7b/0xc0
> > > > > > > [  132.636069]        bus_add_driver+0x14c/0x1f0
> > > > > > > [  132.636974]        driver_register+0x6c/0xc0
> > > > > > > [  132.637870]        do_one_initcall+0x5d/0x2f0
> > > > > > > [  132.638765]        do_init_module+0x5c/0x230
> > > > > > > [  132.639654]        load_module+0x2981/0x2bc0
> > > > > > > [  132.640522]        __do_sys_finit_module+0xaa/0x110
> > > > > > > [  132.641372]        do_syscall_64+0x5a/0x250
> > > > > > > [  132.642203]        entry_SYSCALL_64_after_hwframe+0x49/0xb3
> > > > > > > [  132.643022]
> > > > > > >                -> #0 (crtc_ww_class_acquire){+.+.}-{0:0}:
> > > > > > > [  132.644643]        __lock_acquire+0x1241/0x23f0
> > > > > > > [  132.645469]        lock_acquire+0xad/0x370
> > > > > > > [  132.646274]        drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > > [  132.647071]        drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.647902]        dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.648698]        amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > > [  132.649498]        amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.650300]        amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > > [  132.651084]        amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > > [  132.651825]        drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > > [  132.652594]        process_one_work+0x23c/0x580
> > > > > > > [  132.653402]        worker_thread+0x50/0x3b0
> > > > > > > [  132.654139]        kthread+0x12e/0x150
> > > > > > > [  132.654868]        ret_from_fork+0x27/0x50
> > > > > > > [  132.655598]
> > > > > > >                other info that might help us debug this:
> > > > > > >
> > > > > > > [  132.657739] Chain exists of:
> > > > > > >                  crtc_ww_class_acquire --> crtc_ww_class_mutex --> dma_fence_map
> > > > > > >
> > > > > > > [  132.659877]  Possible unsafe locking scenario:
> > > > > > >
> > > > > > > [  132.661416]        CPU0                    CPU1
> > > > > > > [  132.662126]        ----                    ----
> > > > > > > [  132.662847]   lock(dma_fence_map);
> > > > > > > [  132.663574]                                lock(crtc_ww_class_mutex);
> > > > > > > [  132.664319]                                lock(dma_fence_map);
> > > > > > > [  132.665063]   lock(crtc_ww_class_acquire);
> > > > > > > [  132.665799]
> > > > > > >                 *** DEADLOCK ***
> > > > > > >
> > > > > > > [  132.667965] 4 locks held by kworker/2:3/865:
> > > > > > > [  132.668701]  #0: ffff8887fb81c938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > > [  132.669462]  #1: ffffc90000677e58 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x580
> > > > > > > [  132.670242]  #2: ffffffff82318c80 (dma_fence_map){++++}-{0:0}, at: drm_sched_job_timedout+0x25/0xf0 [gpu_sched]
> > > > > > > [  132.671039]  #3: ffff8887b84a1748 (&adev->lock_reset){+.+.}-{3:3}, at: amdgpu_device_gpu_recover.cold+0x59e/0xe64 [amdgpu]
> > > > > > > [  132.671902]
> > > > > > >                stack backtrace:
> > > > > > > [  132.673515] CPU: 2 PID: 865 Comm: kworker/2:3 Tainted: G        W         5.7.0-rc3+ #346
> > > > > > > [  132.674347] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> > > > > > > [  132.675194] Workqueue: events drm_sched_job_timedout [gpu_sched]
> > > > > > > [  132.676046] Call Trace:
> > > > > > > [  132.676897]  dump_stack+0x8f/0xd0
> > > > > > > [  132.677748]  check_noncircular+0x162/0x180
> > > > > > > [  132.678604]  ? stack_trace_save+0x4b/0x70
> > > > > > > [  132.679459]  __lock_acquire+0x1241/0x23f0
> > > > > > > [  132.680311]  lock_acquire+0xad/0x370
> > > > > > > [  132.681163]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.682021]  ? cpumask_next+0x16/0x20
> > > > > > > [  132.682880]  ? module_assert_mutex_or_preempt+0x14/0x40
> > > > > > > [  132.683737]  ? __module_address+0x28/0xf0
> > > > > > > [  132.684601]  drm_modeset_acquire_init+0xd2/0x100 [drm]
> > > > > > > [  132.685466]  ? drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.686335]  drm_atomic_helper_suspend+0x38/0x120 [drm_kms_helper]
> > > > > > > [  132.687255]  dm_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.688152]  amdgpu_device_ip_suspend_phase1+0x83/0xe0 [amdgpu]
> > > > > > > [  132.689057]  ? amdgpu_fence_process+0x4c/0x150 [amdgpu]
> > > > > > > [  132.689963]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
> > > > > > > [  132.690893]  amdgpu_device_gpu_recover.cold+0x4e6/0xe64 [amdgpu]
> > > > > > > [  132.691818]  amdgpu_job_timedout+0xfb/0x150 [amdgpu]
> > > > > > > [  132.692707]  drm_sched_job_timedout+0x8a/0xf0 [gpu_sched]
> > > > > > > [  132.693597]  process_one_work+0x23c/0x580
> > > > > > > [  132.694487]  worker_thread+0x50/0x3b0
> > > > > > > [  132.695373]  ? process_one_work+0x580/0x580
> > > > > > > [  132.696264]  kthread+0x12e/0x150
> > > > > > > [  132.697154]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > > > > [  132.698057]  ret_from_fork+0x27/0x50
> > > > > > >
> > > > > > > Cc: linux-media@vger.kernel.org
> > > > > > > Cc: linaro-mm-sig@lists.linaro.org
> > > > > > > Cc: linux-rdma@vger.kernel.org
> > > > > > > Cc: amd-gfx@lists.freedesktop.org
> > > > > > > Cc: intel-gfx@lists.freedesktop.org
> > > > > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > > > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > > > > > Cc: Christian König <christian.koenig@amd.com>
> > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++++++
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > index 3584e29323c0..b3b84a0d3baf 100644
> > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > @@ -2415,6 +2415,14 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
> > > > > > >                 /* displays are handled separately */
> > > > > > >                 if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE) {
> > > > > > >                         /* XXX handle errors */
> > > > > > > +
> > > > > > > +                       /*
> > > > > > > +                        * This is dm_suspend, which takes modeset locks, and
> > > > > > > +                        * that's a pretty good inversion against dma_fence_signal,
> > > > > > > +                        * which gpu recovery is supposed to guarantee.
> > > > > > > +                        *
> > > > > > > +                        * Don't ask me how to fix this.
> > > > > > > +                        */
> > > > > >
> > > > > > We actually have a fix for this.  Will be out shortly.
> > > > >
> > > > > Spoilers? The solid way is to sidestep the entire thing by avoiding
> > > > > resetting the display block entirely. Fixing the locking while still
> > > > > resetting the display is going to be really hard otoh ...
> > > >
> > > > There's no way to avoid that.  On dGPUs at least, a full asic reset is
> > > > a full asic reset.  The fix mostly just skips the modeset and does the
> > > > minimum amount necessary to get the display block into a good state for reset.
> > >
> > > But how do you restore the display afterwards? "[RFC 13/17]
> > > drm/scheduler: use dma-fence annotations in tdr work" earlier in the
> > > series has some ideas from me for at least some of the problems with
> > > tdr when the display gets reset along with the rest of the gpu.
> > > Whacking the display while a modeset/flip/whatever is ongoing
> > > concurrently doesn't sound like a good idea, so I'm not sure how you
> > > can do that without taking the drm_modeset_locks. And once you do
> > > that, it's deadlock time.
> >
> > We cache the current display hw state and restore it after the reset
> > without going through the atomic interfaces so everything is back the
> > way it was before the reset.
>
> Hm this sounds interesting ... how do you make sure a concurrent
> atomic update doesn't trample over the same mmio registers while you
> do that dance?

We take the dm->dc_lock across the reset.
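
A rough sketch of that scheme, for illustration only: dm->dc_lock is the
real mutex in amdgpu's display manager, while the struct and helper names
below are hypothetical placeholders, not actual amdgpu/DC code.

#include <linux/mutex.h>

struct cached_display_state;	/* placeholder for the saved display hw state */

static void reset_with_cached_display_state(struct amdgpu_display_manager *dm,
					    struct amdgpu_device *adev)
{
	struct cached_display_state *state;

	/* Keep concurrent atomic commits away from the display hw. */
	mutex_lock(&dm->dc_lock);

	state = cache_display_state(dm);	/* hypothetical helper */
	do_full_asic_reset(adev);		/* hypothetical helper */
	restore_display_state(dm, state);	/* hypothetical helper */

	mutex_unlock(&dm->dc_lock);
}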

>
> > IIRC, when we reset the rest of the
> > GPU, we disconnect the fences, and then re-attach them after the reset.
>
> Where is that code? I'm not sure how you can make that work without
> getting stuck in another kind of deadlock in tdr, but maybe the code
> has some clever trick to pull that off somehow.

The GPU scheduler.  drm_sched_stop, drm_sched_resubmit_jobs, and
drm_sched_start.
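
For reference, a rough sketch of how those three helpers fit together in a
timeout handler, going by their signatures around v5.7; this is an
illustration, not the actual amdgpu recovery code:

#include <drm/gpu_scheduler.h>

static void example_timedout_recovery(struct drm_gpu_scheduler *sched,
				      struct drm_sched_job *bad_job)
{
	/* Park the scheduler and detach the hw fences from the pending jobs. */
	drm_sched_stop(sched, bad_job);

	/* ... the actual asic reset would go here ... */

	/* Recreate the hw fences and push the pending jobs to the ring again. */
	drm_sched_resubmit_jobs(sched);

	/* Restart job submission; 'true' requests full recovery. */
	drm_sched_start(sched, true);
}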

Alex


> -Daniel
>
> >
> > Alex
> >
> > > -Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC 10/17] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code
  2020-05-12  8:59   ` Daniel Vetter
  (?)
  (?)
@ 2020-05-12 15:56     ` Christian König
  -1 siblings, 0 replies; 203+ messages in thread
From: Christian König @ 2020-05-12 15:56 UTC (permalink / raw)
  To: Daniel Vetter, DRI Development
  Cc: LKML, linux-media, linaro-mm-sig, linux-rdma, amd-gfx, intel-gfx,
	Chris Wilson, Maarten Lankhorst, Daniel Vetter

Huh, what? Offhand that doesn't look correct to me.

Why the heck should this be an atomic context? If that's correct,
allocating memory is the least of the problems we have.

Regards,
Christian.
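
For context, a minimal illustration of the trade-off this patch appears to
rest on (not code from the series): GFP_KERNEL allocations may enter direct
reclaim, and reclaim can end up waiting on dma_fences, so a path that has
to make progress before a fence can signal must not block there; a
non-blocking flag like GFP_ATOMIC trades that for a higher chance of
allocation failure.

#include <linux/slab.h>

/*
 * Illustration only: in a path that dma_fence signalling depends on,
 * blocking on reclaim is not allowed, since reclaim may itself wait on
 * fences. A non-blocking allocation can fail, so callers must cope.
 */
static void *alloc_on_fence_signalling_path(size_t size)
{
	return kmalloc(size, GFP_ATOMIC);
}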

On 12.05.20 at 10:59, Daniel Vetter wrote:
> My dma-fence lockdep annotations caught an inversion