* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
@ 2017-04-18 15:14 bugzilla-daemon
  2017-04-18 15:15 ` bugzilla-daemon
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-18 15:14 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1806 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

            Bug ID: 100712
           Summary: ring 0 stalled after bytes_moved_threshold reached -
                    Cap Verde - HD 7770
           Product: DRI
           Version: DRI git
          Hardware: Other
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/Radeon
          Assignee: dri-devel@lists.freedesktop.org
          Reporter: julien.isorce@gmail.com

Kernel 4.9 from
https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.9 and latest
mesa. (same result with drm-next-4.12 branch)
Same result with kernel 4.8 and mesa 12.0.6.

In kernel radeon_object.c::radeon_bo_list_validate, once "bytes_moved >
bytes_moved_threshold" is reached (this is the case for 850 BOs in the same
list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence that
takes longer than radeon.lockup_timeout to be signaled.

In radeon_fence_activity, I checked that "last_emitted" is the seq number
of this last emitted fence, and that last_seq is equal to last_emitted-1.
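
As an illustration of what that check means, here is a minimal user-space sketch, not code from the driver. The struct and function names are hypothetical; only the two counter names come from the report. With last_seq equal to last_emitted-1, exactly one fence is still outstanding.

```c
#include <stdint.h>

/* Hypothetical model of the two per-ring counters named in the report. */
struct fence_state {
    uint64_t last_seq;      /* last sequence number observed as signaled */
    uint64_t last_emitted;  /* last sequence number emitted on the ring */
};

/* Number of fences that were emitted but are not yet signaled. */
static uint64_t fences_pending(const struct fence_state *s)
{
    return s->last_emitted - s->last_seq;
}
```

A result of 1 corresponds to the situation described above: everything up to the previous fence completed, and only the newest fence is late.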

Then the next call to ttm_bo_wait blocks (15 * HZ > radeon.lockup_timeout)
until the GPU lockup is detected, which leads to a GPU reset.

Also, it seems the fence is signaled by swapper after more than 10 seconds, but
by then it is too late. Seeing that requires reducing the "15" parameter above to 4.
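
The timing relation can be sketched as follows. This is only an arithmetic model with a hypothetical helper name. It assumes that radeon.lockup_timeout is expressed in milliseconds and that its value is 10000, which is consistent with the "more than 10 seconds" above but is not stated in this report.

```c
#include <stdbool.h>

/*
 * Returns true when a waiter that blocks for wait_s seconds would still be
 * blocked at the moment the lockup detection fires.
 * lockup_timeout_ms stands in for radeon.lockup_timeout (assumed to be in ms).
 */
static bool waiter_outlives_lockup_check(unsigned wait_s,
                                         unsigned lockup_timeout_ms)
{
    return wait_s * 1000u > lockup_timeout_ms;
}
```

With a 15 second wait and an assumed 10000 ms lockup timeout the function returns true, so the lockup is reported before the wait gives up. With the wait reduced to 4 seconds it returns false, which matches the experiment described above.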

Is it normal that radeon_bo_list_validate still tries to move the BO if
bytes_moved_threshold is reached? Indeed, ttm_bo_validate is always called (it
blits from VRAM to VRAM).
Is it also normal that ttm_bo_validate is called with the evict flag set to true
once bytes_moved_threshold is reached?

-- 
You are receiving this mail because:
You are the assignee for the bug.

[-- Attachment #1.2: Type: text/html, Size: 3220 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
@ 2017-04-18 15:15 ` bugzilla-daemon
  2017-04-18 15:15 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-18 15:15 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 348 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #1 from Julien Isorce <julien.isorce@gmail.com> ---
Created attachment 130902
  --> https://bugs.freedesktop.org/attachment.cgi?id=130902&action=edit
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
  2017-04-18 15:15 ` bugzilla-daemon
@ 2017-04-18 15:15 ` bugzilla-daemon
  2017-04-18 15:16 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-18 15:15 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 348 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #2 from Julien Isorce <julien.isorce@gmail.com> ---
Created attachment 130903
  --> https://bugs.freedesktop.org/attachment.cgi?id=130903&action=edit
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
  2017-04-18 15:15 ` bugzilla-daemon
  2017-04-18 15:15 ` bugzilla-daemon
@ 2017-04-18 15:16 ` bugzilla-daemon
  2017-04-19  3:48 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-18 15:16 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 646 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

Julien Isorce <julien.isorce@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #130903|0                           |1
        is obsolete|                            |

--- Comment #3 from Julien Isorce <julien.isorce@gmail.com> ---
Created attachment 130904
  --> https://bugs.freedesktop.org/attachment.cgi?id=130904&action=edit
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2017-04-18 15:16 ` bugzilla-daemon
@ 2017-04-19  3:48 ` bugzilla-daemon
  2017-04-19 12:03 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-19  3:48 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1736 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #4 from Michel Dänzer <michel@daenzer.net> ---
(In reply to Julien Isorce from comment #0)
> In kernel radeon_object.c::radeon_bo_list_validate, once "bytes_moved >
> bytes_moved_threshold" is reached (this is the case for 850 bo in the same
> list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence
> that it takes more than the radeon.lockup_timeout to be signaled.

radeon_ib_schedule is called for submitting the command stream from userspace,
not for any BO moves directly, right?

How did you determine that this hang is directly related to bytes_moved /
bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the
threshold preventing a BO from being moved to VRAM despite userspace's
preference.


> Also it seems the fence is signaled by swapper after more than 10 seconds
> but it is too late. I requires to reduce the "15" param above to 4 to see
> that.

How does "swapper" (what is that exactly?) signal the fence?


> Is it normal that radeon_bo_list_validate still tries to move the bo if
> bytes_moved_threshold is reached ?

There are circumstances where a BO has to be moved even though the threshold is
reached.


> Indeed ttm_bo_validate is always called

ttm_bo_validate must be called for every BO referenced by the command stream
from userspace for correct lifetime management of its memory.


> (it blits from vram to vram).

It might be worth looking into why this happens, though. If domain ==
current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate to
trigger a blit.


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (3 preceding siblings ...)
  2017-04-19  3:48 ` bugzilla-daemon
@ 2017-04-19 12:03 ` bugzilla-daemon
  2017-04-20 15:15 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-19 12:03 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 3184 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #5 from Julien Isorce <julien.isorce@gmail.com> ---
(In reply to Michel Dänzer from comment #4)
> (In reply to Julien Isorce from comment #0)
> > In kernel radeon_object.c::radeon_bo_list_validate, once "bytes_moved >
> > bytes_moved_threshold" is reached (this is the case for 850 bo in the same
> > list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence
> > that it takes more than the radeon.lockup_timeout to be signaled.
> 
> radeon_ib_schedule is called for submitting the command stream from
> userspace, not for any BO moves directly, right?
> 
> How did you determine that this hang is directly related to bytes_moved /
> bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the
> threshold preventing a BO from being moved to VRAM despite userspace's
> preference.
> 

I added a trace, and the fence that is not signaled in time is always the one
emitted by radeon_ib_schedule after bytes_moved_threshold is reached.
But you are right, it could be only indirectly related.

Here is the sequence I have:

ioctl_radeon_cs
  radeon_bo_list_validate
    bytes_moved > bytes_moved_threshold(=1024*1024ull)
    800 bo are not moved from gtt to vram because of that.
  radeon_cs_ib_vm_chunk
    radeon_ib_schedule(rdev, &parser->ib, NULL, true);
      radeon_fence_emit on ring 0
      r600_mmio_hdp_flush
/ioctl_radeon_cs
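
The validation step in the sequence above can be modeled in user space roughly as follows. This is a simplified sketch, not the driver's code: the struct, the enum and the function name are hypothetical, and the only detail taken from the thread is the strict "bytes_moved > bytes_moved_threshold" test.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the memory domains discussed in the thread. */
enum domain { DOMAIN_GTT = 1, DOMAIN_VRAM = 2 };

struct bo {
    uint64_t size;              /* size of the buffer object in bytes */
    unsigned current_domain;    /* where the BO lives now */
    unsigned preferred_domain;  /* where userspace would like it to be */
};

/*
 * Walks the list once and moves each BO to its preferred domain until the
 * running total of moved bytes exceeds the threshold. After that, BOs stay
 * where they are. Returns the number of BOs that were left in place
 * although they would have preferred another domain.
 */
static size_t validate_list(struct bo *list, size_t n, uint64_t threshold)
{
    uint64_t bytes_moved = 0;
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        struct bo *b = &list[i];

        if (b->preferred_domain == b->current_domain)
            continue;           /* nothing to move */

        if (bytes_moved > threshold) {
            kept++;             /* budget exhausted: do not move it */
            continue;
        }

        b->current_domain = b->preferred_domain;
        bytes_moved += b->size;
    }
    return kept;
}
```

With a 1 MiB threshold and BOs of 1 MiB each, the first two BOs are moved and all later ones stay in GTT, because the check only trips once bytes_moved is strictly greater than the threshold.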

Then anything calling ttm_bo_wait will block for longer than
radeon.lockup_timeout because the above fence is not signaled in time.
Could it be that something is not flushed properly? (ref:
https://patchwork.kernel.org/patch/5807141/ ? tlb_flush?)

Are you saying that some BOs must be moved from GTT to VRAM in order
for this fence to be signaled?

As you can see above, it happens when vram_usage >= half_vram, so
radeon_bo_get_threshold_for_moves returns 1024*1024. That explains why only 1
or 2 BOs can be moved from GTT to VRAM in that case and why all the others are
forced to stay in GTT.
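
For reference, the threshold heuristic can be approximated in user space like this. It is a sketch under assumptions. The thread only confirms that the result is 1024*1024 once vram_usage >= half_vram. The "half of the free half" computation for the other case is my recollection of the driver and should be checked against radeon_object.c. The function name is hypothetical.

```c
#include <stdint.h>

/* Floor value quoted in the thread. */
#define MIN_MOVE_THRESHOLD (1024 * 1024ull)

/*
 * Approximation of the per-submission move budget.
 * real_vram_size and vram_usage are in bytes.
 */
static uint64_t threshold_for_moves(uint64_t real_vram_size,
                                    uint64_t vram_usage)
{
    uint64_t half_vram = real_vram_size >> 1;
    /* Assumption: the budget scales with the unused part of the lower half. */
    uint64_t half_free_vram =
        vram_usage >= half_vram ? 0 : half_vram - vram_usage;
    uint64_t threshold = half_free_vram >> 1;

    return threshold > MIN_MOVE_THRESHOLD ? threshold : MIN_MOVE_THRESHOLD;
}
```

For a 2048 MiB card with 1536 MiB in use, this returns the 1 MiB floor, which is the situation reported here.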

In the same run of radeon_bo_list_validate there are many calls to
ttm_bo_validate with both domain and current_domain set to VRAM; this is the
case for around 400 BOs. Maybe this delays the signaling of this fence,
provided that VRAM usage is high too.

> 
> > Also it seems the fence is signaled by swapper after more than 10 seconds
> > but it is too late. I requires to reduce the "15" param above to 4 to see
> > that.
> 
> How does "swapper" (what is that exactly?) signal the fence?

My wording was wrong, sorry. I should have said "the first entity noticing that
the fence is signaled" by calling radeon_fence_activity. swapper is the name
of process 0 (idle). I changed the DRM logging to print the process name and id
(current->comm, current->pid).

> 
> It might be worth looking into why this happens, though. If domain ==
> current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate
> to trigger a blit.

I will check, though I think I just got confused by a previous trace.


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (4 preceding siblings ...)
  2017-04-19 12:03 ` bugzilla-daemon
@ 2017-04-20 15:15 ` bugzilla-daemon
  2017-04-20 15:15 ` [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000 bugzilla-daemon
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-20 15:15 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 339 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #6 from Julien Isorce <julien.isorce@gmail.com> ---
Created attachment 130947
  --> https://bugs.freedesktop.org/attachment.cgi?id=130947&action=edit
dmesg_W9000_with_custom_fence_debug.log


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (5 preceding siblings ...)
  2017-04-20 15:15 ` bugzilla-daemon
@ 2017-04-20 15:15 ` bugzilla-daemon
  2017-04-20 15:23 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-20 15:15 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 603 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

Julien Isorce <julien.isorce@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|ring 0 stalled after        |ring 0 stalled after
                   |bytes_moved_threshold       |bytes_moved_threshold
                   |reached - Cap Verde - HD    |reached - CAPVERDE/HD7770 -
                   |7770                        |TAHITI/W9000


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (6 preceding siblings ...)
  2017-04-20 15:15 ` [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000 bugzilla-daemon
@ 2017-04-20 15:23 ` bugzilla-daemon
  2017-04-24 11:02 ` bugzilla-daemon
  2019-11-19  9:28 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-20 15:23 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1147 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #7 from Julien Isorce <julien.isorce@gmail.com> ---
I made two apitraces using Zach's test mentioned here
https://bugs.freedesktop.org/show_bug.cgi?id=100465#c24 . This test is also
a good way to reproduce this "ring 0 stalled" issue.

1: apitrace ideal for vram size 2048 (ex: HD7770)
https://drive.google.com/file/d/0Bzat_iFKrgCWYzBlZFFLQjgyRU0/view?usp=sharing

2: apitrace ideal for vram size 6144 (ex: W9000)
https://drive.google.com/file/d/0Bzat_iFKrgCWczgzM2FzaVFTUXc/view?usp=sharing

DISPLAY=:0 apitrace replay thrash.trace

Also, I have attached the log (dmesg_W9000_with_custom_fence_debug.log) that I
get with my dev branch here
https://github.com/CapOM/linux/commits/amd-staging-4.9_add_debug_fences where I
added traces to debug "ring N stalled" issues. It prints the backtrace of the
place where the fence is waited on, and also the backtrace of the place where
that fence was emitted.

Also note that setting R600_DEBUG=nowc avoids this "ring N stalled" issue (so
the otherwise endless fence is signaled).


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (7 preceding siblings ...)
  2017-04-20 15:23 ` bugzilla-daemon
@ 2017-04-24 11:02 ` bugzilla-daemon
  2019-11-19  9:28 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2017-04-24 11:02 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 413 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

--- Comment #8 from Julien Isorce <julien.isorce@gmail.com> ---
A hack has been submitted here https://patchwork.kernel.org/patch/9695945/. It
contains some info in the commit message and in the replies. For those who want
to try, it is easier to just set R600_DEBUG=nowc as mentioned in comment #7.


* [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
  2017-04-18 15:14 [Bug 100712] ring 0 stalled after bytes_moved_threshold reached - Cap Verde - HD 7770 bugzilla-daemon
                   ` (8 preceding siblings ...)
  2017-04-24 11:02 ` bugzilla-daemon
@ 2019-11-19  9:28 ` bugzilla-daemon
  9 siblings, 0 replies; 11+ messages in thread
From: bugzilla-daemon @ 2019-11-19  9:28 UTC (permalink / raw)
  To: dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 805 bytes --]

https://bugs.freedesktop.org/show_bug.cgi?id=100712

Martin Peres <martin.peres@free.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |MOVED
             Status|NEW                         |RESOLVED

--- Comment #9 from Martin Peres <martin.peres@free.fr> ---
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been
closed from further activity.

You can subscribe and participate further through the new bug through this link
to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/793.

