* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2011-12-16  8:42 chenhc
  2011-12-16 10:53 ` Michel Dänzer
  2011-12-16 15:46 ` Jerome Glisse
  0 siblings, 2 replies; 27+ messages in thread
From: chenhc @ 2011-12-16  8:42 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: yanh, dri-devel, Chen Jie

> On Thu, 2011-12-08 at 19:35 +0800, chenhc@lemote.com wrote:
>>
>> I found CP_RB_WPTR has changed when "ring test failed", so I think the CP is
>> active, but what it gets from the ring buffer is wrong.
>
> CP_RB_WPTR is normally only changed by the CPU after adding commands to
> the ring buffer, so I'm afraid that may not be a valid conclusion.
>
>
I'm sorry, I made a typo earlier. In fact, both CP_RB_RPTR and
CP_RB_WPTR changed, so I think the CP is active.
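A minimal sketch of the check, using the r600d.h register names (the 10ms
delay is an arbitrary choice):

	u32 rptr = RREG32(CP_RB_RPTR);
	u32 wptr = RREG32(CP_RB_WPTR);

	mdelay(10);
	/* if either pointer moved after the failed ring test, the CP is fetching */
	if (RREG32(CP_RB_RPTR) != rptr || RREG32(CP_RB_WPTR) != wptr)
		DRM_INFO("CP still active: RPTR/WPTR advanced\n");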

>> Then, I want to know whether there is a way to check the content that
>> the GPU gets from the ring buffer.
>
> See the r100_debugfs_cp_csq_fifo() function, which generates the output
> for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
>
Hmmm, I don't think this function can be used on r600 (nor can a similar
one be written for R600), because I haven't found the CSQ registers in the r600 code.

>
>> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
>> /sys/power/state" to do a hibernation, there is occasionally a "GPU
>> reset", just like with suspend. However, if I use "echo reboot >
>> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
>> wake up automatically, there is no "GPU reset" after hundreds of tests.
>> What does this imply? Does the power loss break something?
>
> Yeah, it sounds like the resume code doesn't properly re-initialize
> something that's preserved on a warm boot but lost on a cold boot.
>
>
> --
> Earthling Michel Dänzer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-12-16  8:42 [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend chenhc
@ 2011-12-16 10:53 ` Michel Dänzer
  2011-12-16 15:46 ` Jerome Glisse
  1 sibling, 0 replies; 27+ messages in thread
From: Michel Dänzer @ 2011-12-16 10:53 UTC (permalink / raw)
  To: chenhc; +Cc: yanh, dri-devel, Chen Jie

On Fri, 2011-12-16 at 16:42 +0800, chenhc@lemote.com wrote:
> > On Thu, 2011-12-08 at 19:35 +0800, chenhc@lemote.com wrote:
> >>
> >> I found CP_RB_WPTR has changed when "ring test failed", so I think the CP is
> >> active, but what it gets from the ring buffer is wrong.
> >
> > CP_RB_WPTR is normally only changed by the CPU after adding commands to
> > the ring buffer, so I'm afraid that may not be a valid conclusion.
> >
> >
> I'm sorry, I made a typo earlier. In fact, both CP_RB_RPTR and
> CP_RB_WPTR changed, so I think the CP is active.

I see. However, I think this actually makes it unlikely that the problem
is the CP reading wrong values from the ring, as otherwise the CP itself
would likely get stuck sooner or later.


> >> Then, I want to know whether there is a way to check the content that
> >> the GPU gets from the ring buffer.
> >
> > See the r100_debugfs_cp_csq_fifo() function, which generates the output
> > for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
> >
> Hmmm, I don't think this function can be used on r600 (nor can a similar
> one be written for R600), because I haven't found the CSQ registers in the r600 code.

Hmm yeah, looks like the registers for this have changed.
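A rough r600 equivalent could dump the ring around the pointers instead, the
way r100_debugfs_cp_ring_info() does (a sketch only; note this shows the CPU's
view of the ring through the kernel mapping, not what the CP actually fetched
across the GART):

static int r600_debugfs_cp_ring_info(struct seq_file *m, void *data)
{
	struct drm_info_node *node = (struct drm_info_node *) m->private;
	struct radeon_device *rdev = node->minor->dev->dev_private;
	uint32_t rdp = RREG32(CP_RB_RPTR);
	uint32_t wdp = RREG32(CP_RB_WPTR);
	unsigned count = (wdp - rdp) & rdev->cp.ptr_mask;
	unsigned i, j;

	seq_printf(m, "CP_RB_WPTR 0x%08x\n", wdp);
	seq_printf(m, "CP_RB_RPTR 0x%08x\n", rdp);
	seq_printf(m, "%u dwords in ring\n", count);
	/* dump from the read pointer onwards, wrapping at the ring size */
	for (j = 0; j <= count; j++) {
		i = (rdp + j) & rdev->cp.ptr_mask;
		seq_printf(m, "r[%05d]=0x%08x\n", i, rdev->cp.ring[i]);
	}
	return 0;
}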


-- 
Earthling Michel Dänzer           |                   http://www.amd.com
Libre software enthusiast         |          Debian, X and DRI developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-12-16  8:42 [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend chenhc
  2011-12-16 10:53 ` Michel Dänzer
@ 2011-12-16 15:46 ` Jerome Glisse
  1 sibling, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2011-12-16 15:46 UTC (permalink / raw)
  To: chenhc; +Cc: Michel Dänzer, yanh, dri-devel, Chen Jie

2011/12/16  <chenhc@lemote.com>:
>> On Thu, 2011-12-08 at 19:35 +0800, chenhc@lemote.com wrote:
>>>
>>> I found CP_RB_WPTR has changed when "ring test failed", so I think the CP is
>>> active, but what it gets from the ring buffer is wrong.
>>
>> CP_RB_WPTR is normally only changed by the CPU after adding commands to
>> the ring buffer, so I'm afraid that may not be a valid conclusion.
>>
>>
> I'm sorry, I made a typo earlier. In fact, both CP_RB_RPTR and
> CP_RB_WPTR changed, so I think the CP is active.
>
>>> Then, I want to know whether there is a way to check the content that
>>> the GPU gets from the ring buffer.
>>
>> See the r100_debugfs_cp_csq_fifo() function, which generates the output
>> for /sys/kernel/debug/dri/0/r100_cp_csq_fifo.
>>
> Hmmm, I don't think this function can be used on r600 (nor can a similar
> one be written for R600), because I haven't found the CSQ registers in the r600 code.
>
>>
>>> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
>>> /sys/power/state" to do a hibernation, there is occasionally a "GPU
>>> reset", just like with suspend. However, if I use "echo reboot >
>>> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
>>> wake up automatically, there is no "GPU reset" after hundreds of tests.
>>> What does this imply? Does the power loss break something?
>>
>> Yeah, it sounds like the resume code doesn't properly re-initialize
>> something that's preserved on a warm boot but lost on a cold boot.
>>
>>
>> --
>> Earthling Michel Dänzer
>

It might be a PCI issue; you should check the PCI configuration before and after
suspend, though the kernel should properly restore things. The CP might still
be reading random data that doesn't lock up the CP (I've seen it happen more
than once).
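Something like this before suspend and again after resume is enough to diff (a
sketch; dumping 256 bytes also covers the PM and PCIe capabilities, not just
the standard 64-byte header):

	u32 val;
	int off;

	/* log the whole config space; run once before suspend, once after resume */
	for (off = 0; off < 256; off += 4) {
		pci_read_config_dword(rdev->pdev, off, &val);
		DRM_INFO("pci cfg 0x%02x: 0x%08x\n", off, val);
	}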

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-03-01  9:11 chenhc
@ 2012-03-01 17:19 ` Alex Deucher
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Deucher @ 2012-03-01 17:19 UTC (permalink / raw)
  To: chenhc; +Cc: michel, zhangfx, dri-devel, Chen Jie, yanh

2012/3/1  <chenhc@lemote.com>:
> Status update:
> In r600.c I found that for RS780 the num_*_threads are set like this:
>            sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
>                                       NUM_VS_THREADS(78) |
>                                       NUM_GS_THREADS(4) |
>                                       NUM_ES_THREADS(31));
>
> But according to the documents, each of them should be a multiple of 4. And in
> r600_blit_kms.c, they are 136, 48, 4, 4. I want to know why
> 79, 78, 4 and 31 are used here.

You can try changing them, but I don't think it will make a difference.

Alex

>
> Huacai Chen
>
>> On Wed, 2012-02-29 at 12:49 +0800, chenhc@lemote.com wrote:
>>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>>> >> On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
>>> >> >> One good way to test the GART is to go over the GPU GART table and write a
>>> >> >> dword using the GPU at the end of each page, something like 0xCAFEDEAD
>>> >> >> or some value that is unlikely to be already set. Then go over
>>> >> >> all the pages and check that the GPU writes succeeded. Abusing the scratch
>>> >> >> register write-back feature is the easiest way to try that.
>>> >> > I'm planning to add a GART table check procedure on resume, which
>>> >> > will go over the GPU GART table:
>>> >> > 1. read (backup) a dword at the end of each GPU page
>>> >> > 2. write a mark with the GPU and check it
>>> >> > 3. restore the original dword
>>> >> Attachment validateGART.patch does the job:
>>> >> * It currently only works on the mips64 platform.
>>> >> * To use it, apply all_in_vram.patch first, which allocates the CP
>>> >> ring, ih, and ib in VRAM and hard-codes no_wb=1.
>>> >>
>>> >> The GART test routine will be invoked in r600_resume. We've tried it,
>>> >> and found that when the lockup happened the GART table was good before
>>> >> userspace restarted. The related dmesg follows:
>>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 9000000040040000, 32768 entries, Dummy
>>> >> Page[0x000000000e004000-0x000000000e007fff]
>>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>>> >> entries(valid=8544, invalid=24224, total=32768).
>>> >> ...
>>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>>> >> [ 1532.152343] Restarting tasks ... done.
>>> >> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>> >> 10003msec
>>> >> [ 1544.472656] ------------[ cut here ]------------
>>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>>> >> radeon_fence_wait+0x25c/0x314()
>>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>>> >> 0x0002136A)
>>> >> ...
>>> >> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.062500] radeon 0000:01:05.0: WB disabled
>>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>>> >> [ 1545.109375] [drm] Enabling audio support
>>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 9000000040040000, 32768 entries, Dummy
>>> >> Page[0x000000000e004000-0x000000000e007fff]
>>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>>> >> entry=0x000000000e008067, orignal=0x745aaad1
>>> >> ...
>>> >> /* System blocked here. */
>>> >>
>>> >> Any idea?
>>> >
>>> > I know lockups are frustrating; my only idea is that the memory controller
>>> > is locked up because of some failing pci <-> system ram transaction.
>>> >
>>> >>
>>> >> BTW, we find the following in r600_pcie_gart_enable()
>>> >> (drivers/gpu/drm/radeon/r600.c):
>>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>>> >> (u32)(rdev->dummy_page.addr >> 12));
>>> >>
>>> >> On our platform, PAGE_SIZE is 16K; is there any problem with that?
>>> >
>>> > No, this should be handled properly.
>>> >
>>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>>> >> should change to:
>>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>> >>           radeon_gart_set_page(rdev, t, page_base);
>>> >> -         page_base += RADEON_GPU_PAGE_SIZE;
>>> >> +         if (page_base != rdev->dummy_page.addr)
>>> >> +                 page_base += RADEON_GPU_PAGE_SIZE;
>>> >>   }
>>> >> ???
>>> >
>>> > No need to do so; the dummy page will be 16K too, so it's fine.
>>> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page
>>> is 0x8e004000; then there are four kinds of addresses in the GART: 0x8e004000,
>>> 0x8e005000, 0x8e006000, 0x8e007000. The value written to
>>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000>>12). I
>>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
>>
>> When radeon_gart_unbind() initializes the gart entries to point to the dummy
>> page, it's just to have something safe in the GART table.
>>
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
>> a fault happens. It's like a sandbox for the MC. It doesn't
>> conflict in any way to have gart table entries point to the same page.
>>
>> Cheers,
>> Jerome
>>
>>
>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2012-03-01  9:11 chenhc
  2012-03-01 17:19 ` Alex Deucher
  0 siblings, 1 reply; 27+ messages in thread
From: chenhc @ 2012-03-01  9:11 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: michel, zhangfx, yanh, dri-devel, Chen Jie

Status update:
In r600.c I found that for RS780 the num_*_threads are set like this:
            sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
                                       NUM_VS_THREADS(78) |
                                       NUM_GS_THREADS(4) |
                                       NUM_ES_THREADS(31));

But according to the documents, each of them should be a multiple of 4. And in
r600_blit_kms.c, they are 136, 48, 4, 4. I want to know why
79, 78, 4 and 31 are used here.
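For comparison, composing the r600_blit_kms.c numbers with the same macros
would look like this (a sketch only; I haven't verified that the CP/SQ
actually behaves differently):

            sq_thread_resource_mgmt = (NUM_PS_THREADS(136) |
                                       NUM_VS_THREADS(48) |
                                       NUM_GS_THREADS(4) |
                                       NUM_ES_THREADS(4));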

Huacai Chen

> On Wed, 2012-02-29 at 12:49 +0800, chenhc@lemote.com wrote:
>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> >> On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
>> >> >> One good way to test the GART is to go over the GPU GART table and write a
>> >> >> dword using the GPU at the end of each page, something like 0xCAFEDEAD
>> >> >> or some value that is unlikely to be already set. Then go over
>> >> >> all the pages and check that the GPU writes succeeded. Abusing the scratch
>> >> >> register write-back feature is the easiest way to try that.
>> >> > I'm planning to add a GART table check procedure on resume, which
>> >> > will go over the GPU GART table:
>> >> > 1. read (backup) a dword at the end of each GPU page
>> >> > 2. write a mark with the GPU and check it
>> >> > 3. restore the original dword
>> >> Attachment validateGART.patch does the job:
>> >> * It currently only works on the mips64 platform.
>> >> * To use it, apply all_in_vram.patch first, which allocates the CP
>> >> ring, ih, and ib in VRAM and hard-codes no_wb=1.
>> >>
>> >> The GART test routine will be invoked in r600_resume. We've tried it,
>> >> and found that when the lockup happened the GART table was good before
>> >> userspace restarted. The related dmesg follows:
>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 9000000040040000, 32768 entries, Dummy
>> >> Page[0x000000000e004000-0x000000000e007fff]
>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> >> entries(valid=8544, invalid=24224, total=32768).
>> >> ...
>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> >> [ 1532.152343] Restarting tasks ... done.
>> >> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
>> >> 10003msec
>> >> [ 1544.472656] ------------[ cut here ]------------
>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> >> radeon_fence_wait+0x25c/0x314()
>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> >> 0x0002136A)
>> >> ...
>> >> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
>> >> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
>> >> [ 1545.062500] radeon 0000:01:05.0: WB disabled
>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> >> [ 1545.109375] [drm] Enabling audio support
>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> >> at 9000000040040000, 32768 entries, Dummy
>> >> Page[0x000000000e004000-0x000000000e007fff]
>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> >> entry=0x000000000e008067, orignal=0x745aaad1
>> >> ...
>> >> /* System blocked here. */
>> >>
>> >> Any idea?
>> >
>> > I know lockups are frustrating; my only idea is that the memory controller
>> > is locked up because of some failing pci <-> system ram transaction.
>> >
>> >>
>> >> BTW, we find the following in r600_pcie_gart_enable()
>> >> (drivers/gpu/drm/radeon/r600.c):
>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> >> (u32)(rdev->dummy_page.addr >> 12));
>> >>
>> >> On our platform, PAGE_SIZE is 16K; is there any problem with that?
>> >
>> > No, this should be handled properly.
>> >
>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> >> should change to:
>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>> >>           radeon_gart_set_page(rdev, t, page_base);
>> >> -         page_base += RADEON_GPU_PAGE_SIZE;
>> >> +         if (page_base != rdev->dummy_page.addr)
>> >> +                 page_base += RADEON_GPU_PAGE_SIZE;
>> >>   }
>> >> ???
>> >
>> > No need to do so; the dummy page will be 16K too, so it's fine.
>> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page
>> is 0x8e004000; then there are four kinds of addresses in the GART: 0x8e004000,
>> 0x8e005000, 0x8e006000, 0x8e007000. The value written to
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000>>12). I
>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
>
> When radeon_gart_unbind() initializes the gart entries to point to the dummy
> page, it's just to have something safe in the GART table.
>
> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
> a fault happens. It's like a sandbox for the MC. It doesn't
> conflict in any way to have gart table entries point to the same page.
>
> Cheers,
> Jerome
>
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-29  4:49 chenhc
@ 2012-02-29 17:50 ` Jerome Glisse
  0 siblings, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2012-02-29 17:50 UTC (permalink / raw)
  To: chenhc; +Cc: michel, zhangfx, yanh, dri-devel, Chen Jie

On Wed, 2012-02-29 at 12:49 +0800, chenhc@lemote.com wrote:
> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
> >> On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
> >> >> One good way to test the GART is to go over the GPU GART table and write a
> >> >> dword using the GPU at the end of each page, something like 0xCAFEDEAD
> >> >> or some value that is unlikely to be already set. Then go over
> >> >> all the pages and check that the GPU writes succeeded. Abusing the scratch
> >> >> register write-back feature is the easiest way to try that.
> >> > I'm planning to add a GART table check procedure on resume, which
> >> > will go over the GPU GART table:
> >> > 1. read (backup) a dword at the end of each GPU page
> >> > 2. write a mark with the GPU and check it
> >> > 3. restore the original dword
> >> Attachment validateGART.patch does the job:
> >> * It currently only works on the mips64 platform.
> >> * To use it, apply all_in_vram.patch first, which allocates the CP
> >> ring, ih, and ib in VRAM and hard-codes no_wb=1.
> >>
> >> The GART test routine will be invoked in r600_resume. We've tried it,
> >> and found that when the lockup happened the GART table was good before
> >> userspace restarted. The related dmesg follows:
> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
> >> at 9000000040040000, 32768 entries, Dummy
> >> Page[0x000000000e004000-0x000000000e007fff]
> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
> >> entries(valid=8544, invalid=24224, total=32768).
> >> ...
> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
> >> [ 1532.152343] Restarting tasks ... done.
> >> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
> >> 10003msec
> >> [ 1544.472656] ------------[ cut here ]------------
> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
> >> radeon_fence_wait+0x25c/0x314()
> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
> >> 0x0002136A)
> >> ...
> >> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
> >> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
> >> [ 1545.062500] radeon 0000:01:05.0: WB disabled
> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
> >> [ 1545.109375] [drm] Enabling audio support
> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
> >> at 9000000040040000, 32768 entries, Dummy
> >> Page[0x000000000e004000-0x000000000e007fff]
> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
> >> entry=0x000000000e008067, orignal=0x745aaad1
> >> ...
> >> /* System blocked here. */
> >>
> >> Any idea?
> >
> > I know lockups are frustrating; my only idea is that the memory controller
> > is locked up because of some failing pci <-> system ram transaction.
> >
> >>
> >> BTW, we find the following in r600_pcie_gart_enable()
> >> (drivers/gpu/drm/radeon/r600.c):
> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
> >> (u32)(rdev->dummy_page.addr >> 12));
> >>
> >> On our platform, PAGE_SIZE is 16K; is there any problem with that?
> >
> > No, this should be handled properly.
> >
> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
> >> should change to:
> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
> >>           radeon_gart_set_page(rdev, t, page_base);
> >> -         page_base += RADEON_GPU_PAGE_SIZE;
> >> +         if (page_base != rdev->dummy_page.addr)
> >> +                 page_base += RADEON_GPU_PAGE_SIZE;
> >>   }
> >> ???
> >
> > No need to do so; the dummy page will be 16K too, so it's fine.
> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page
> is 0x8e004000; then there are four kinds of addresses in the GART: 0x8e004000,
> 0x8e005000, 0x8e006000, 0x8e007000. The value written to
> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000>>12). I
> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

When radeon_gart_unbind() initializes the gart entries to point to the dummy
page, it's just to have something safe in the GART table.

VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
a fault happens. It's like a sandbox for the MC. It doesn't
conflict in any way to have gart table entries point to the same page.

Cheers,
Jerome

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2012-02-29  4:59 chenhc
  0 siblings, 0 replies; 27+ messages in thread
From: chenhc @ 2012-02-29  4:59 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: michel, Chen Jie, dri-devel, zhangfx, yanhua

> On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
>> Hi,
>>
>> For this occasional GPU lockup when returning from STR/STD, I find
>> the following (when the problem happens):
>>
>> The value of SRBM_STATUS is either 0x20002040 or 0x20003040.
>> Which means:
>> * HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
>> * MCDW_BUSY(Memory Controller Block is Busy)
>> * BIF_BUSY(Bus Interface is Busy)
>> * MCDX_BUSY(Memory Controller Block is Busy) if is 0x20003040
>> Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
>> relationship among GART mapped memory, On-board video memory and MCDX,
>> MCDW?
>>
>> CP_STAT: the CSF_RING_BUSY is always set.
>
> Once the memory controller fails to do a pci transaction, the CP
> will be stuck. At least if the ring is in system memory; if the ring is
> in vram the CP might be stuck too, because everything goes
> through the MC anyway.
>
I've tried the rs600 method for GPU reset (using rs600_bm_disable() to
disable the PCI MASTER bit and re-enable it after reset), but it doesn't solve
the problem. Then I found that r100_bm_disable() does more things,
e.g. writing the GPU register R_000030_BUS_CNTL. In r600_reg.h there is
a register R600_BUS_CNTL; does this register have a similar function?
But I don't know how to use it...
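For reference, the portable way to toggle bus mastering without touching GPU
registers is through the PCI core (a sketch; what to put between the two calls
is exactly the open question):

	pci_clear_master(rdev->pdev);	/* clears the PCI_COMMAND master bit */
	/* ... perform the GPU soft reset here ... */
	pci_set_master(rdev->pdev);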

Huacai Chen

>>
>> There are many CP_PACKET2 (0x80000000) entries in the CP ring (more than
>> three hundred), e.g.
>> r[131800]=0x00028000
>> r[131801]=0xc0016800
>> r[131802]=0x00000140
>> r[131803]=0x000079c5
>> r[131804]=0x0000304a
>> r[131805] ... r[132143]=0x80000000
>> r[132144]=0xffff0000
>> After the first reset, the GPU will lock up again; this time there are
>> typically 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00
>> at the end.
>> Are these normal?
>>
>> BTW, is there any way for X to switch to NOACCEL mode when the problem
>> happens? That way users would have a chance to save their documents and
>> then reboot the machine.
>
> I have been meaning to patch the ddx to fall back to sw after a GPU lockup.
> But this is useless in today's world, where everything is composited, i.e.
> the screen is updated using the 3D driver, for which there is no easy
> way to suddenly migrate to software rendering. I will still probably
> do the ddx patch at some point.
>
> Cheers,
> Jerome
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2012-02-29  4:49 chenhc
  2012-02-29 17:50 ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: chenhc @ 2012-02-29  4:49 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: michel, zhangfx, yanh, dri-devel, Chen Jie

> On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
>> >> One good way to test the GART is to go over the GPU GART table and write a
>> >> dword using the GPU at the end of each page, something like 0xCAFEDEAD
>> >> or some value that is unlikely to be already set. Then go over
>> >> all the pages and check that the GPU writes succeeded. Abusing the scratch
>> >> register write-back feature is the easiest way to try that.
>> > I'm planning to add a GART table check procedure on resume, which
>> > will go over the GPU GART table:
>> > 1. read (backup) a dword at the end of each GPU page
>> > 2. write a mark with the GPU and check it
>> > 3. restore the original dword
>> Attachment validateGART.patch does the job:
>> * It currently only works on the mips64 platform.
>> * To use it, apply all_in_vram.patch first, which allocates the CP
>> ring, ih, and ib in VRAM and hard-codes no_wb=1.
>>
>> The GART test routine will be invoked in r600_resume. We've tried it,
>> and found that when the lockup happened the GART table was good before
>> userspace restarted. The related dmesg follows:
>> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> at 9000000040040000, 32768 entries, Dummy
>> Page[0x000000000e004000-0x000000000e007fff]
>> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> entries(valid=8544, invalid=24224, total=32768).
>> ...
>> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> [ 1532.152343] Restarting tasks ... done.
>> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
>> 10003msec
>> [ 1544.472656] ------------[ cut here ]------------
>> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> radeon_fence_wait+0x25c/0x314()
>> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> 0x0002136A)
>> ...
>> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [ 1545.062500] radeon 0000:01:05.0: WB disabled
>> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> [ 1545.109375] [drm] Enabling audio support
>> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> at 9000000040040000, 32768 entries, Dummy
>> Page[0x000000000e004000-0x000000000e007fff]
>> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> entry=0x000000000e008067, orignal=0x745aaad1
>> ...
>> /* System blocked here. */
>>
>> Any idea?
>
> > I know lockups are frustrating; my only idea is that the memory controller
> > is locked up because of some failing pci <-> system ram transaction.
>
>>
>> BTW, we find the following in r600_pcie_gart_enable()
>> (drivers/gpu/drm/radeon/r600.c):
>> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> (u32)(rdev->dummy_page.addr >> 12));
>>
> >> On our platform, PAGE_SIZE is 16K; is there any problem with that?
>
> > No, this should be handled properly.
>
>> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> should change to:
>>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>           radeon_gart_set_page(rdev, t, page_base);
>> -         page_base += RADEON_GPU_PAGE_SIZE;
>> +         if (page_base != rdev->dummy_page.addr)
>> +                 page_base += RADEON_GPU_PAGE_SIZE;
>>   }
>> ???
>
> > No need to do so; the dummy page will be 16K too, so it's fine.
Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy page
is 0x8e004000; then there are four kinds of addresses in the GART: 0x8e004000,
0x8e005000, 0x8e006000, 0x8e007000. The value written to
VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000>>12). I
don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

>
> Cheers,
> Jerome
>
>

Huacai Chen

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-27  2:44               ` Chen Jie
@ 2012-02-27 18:41                 ` Jerome Glisse
  0 siblings, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2012-02-27 18:41 UTC (permalink / raw)
  To: Chen Jie; +Cc: chenhc, michel, yanhua, dri-devel

On Mon, 2012-02-27 at 10:44 +0800, Chen Jie wrote:
> Hi,
> 
> For this occasional GPU lockup when returning from STR/STD, I find
> the following (when the problem happens):
> 
> The value of SRBM_STATUS is either 0x20002040 or 0x20003040.
> Which means:
> * HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
> * MCDW_BUSY(Memory Controller Block is Busy)
> * BIF_BUSY(Bus Interface is Busy)
> * MCDX_BUSY(Memory Controller Block is Busy) if is 0x20003040
> Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
> relationship among GART mapped memory, On-board video memory and MCDX,
> MCDW?
> 
> CP_STAT: the CSF_RING_BUSY is always set.

Once the memory controller fails to do a pci transaction, the CP
will be stuck. At least if the ring is in system memory; if the ring is
in vram the CP might be stuck too, because everything goes
through the MC anyway.

> 
> There are many CP_PACKET2 (0x80000000) entries in the CP ring (more than three hundred), e.g.
> r[131800]=0x00028000
> r[131801]=0xc0016800
> r[131802]=0x00000140
> r[131803]=0x000079c5
> r[131804]=0x0000304a
> r[131805] ... r[132143]=0x80000000
> r[132144]=0xffff0000
> After the first reset, the GPU will lock up again; this time there are
> typically 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00
> at the end.
> Are these normal?
> 
> BTW, is there any way for X to switch to NOACCEL mode when the problem
> happens? That way users would have a chance to save their documents and
> then reboot the machine.

I have been meaning to patch the ddx to fall back to sw after a GPU lockup.
But this is useless in today's world, where everything is composited, i.e.
the screen is updated using the 3D driver, for which there is no easy
way to suddenly migrate to software rendering. I will still probably
do the ddx patch at some point.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-21 10:37             ` Chen Jie
  2012-02-27  2:44               ` Chen Jie
@ 2012-02-27 18:38               ` Jerome Glisse
  1 sibling, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2012-02-27 18:38 UTC (permalink / raw)
  To: Chen Jie; +Cc: chenhc, michel, dri-devel

On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
> On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
> >> One good way to test the GART is to go over the GPU GART table and write a
> >> dword using the GPU at the end of each page, something like 0xCAFEDEAD
> >> or some value that is unlikely to be already set. Then go over
> >> all the pages and check that the GPU writes succeeded. Abusing the scratch
> >> register write-back feature is the easiest way to try that.
> > I'm planning to add a GART table check procedure on resume, which
> > will go over the GPU GART table:
> > 1. read (backup) a dword at the end of each GPU page
> > 2. write a mark with the GPU and check it
> > 3. restore the original dword
> Attachment validateGART.patch does the job:
> * It currently only works on the mips64 platform.
> * To use it, apply all_in_vram.patch first, which allocates the CP
> ring, ih, and ib in VRAM and hard-codes no_wb=1.
> 
> The GART test routine will be invoked in r600_resume. We've tried it,
> and found that when the lockup happened the GART table was good before
> userspace restarted. The related dmesg follows:
> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
> at 9000000040040000, 32768 entries, Dummy
> Page[0x000000000e004000-0x000000000e007fff]
> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
> entries(valid=8544, invalid=24224, total=32768).
> ...
> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
> [ 1532.152343] Restarting tasks ... done.
> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than 10003msec
> [ 1544.472656] ------------[ cut here ]------------
> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
> radeon_fence_wait+0x25c/0x314()
> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
> ...
> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
> [ 1545.062500] radeon 0000:01:05.0: WB disabled
> [ 1545.097656] [drm] ring test succeeded in 0 usecs
> [ 1545.105468] [drm] ib test succeeded in 0 usecs
> [ 1545.109375] [drm] Enabling audio support
> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
> at 9000000040040000, 32768 entries, Dummy
> Page[0x000000000e004000-0x000000000e007fff]
> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
> unexpected value 0x745aaad1(expect 0xDEADBEEF)
> entry=0x000000000e008067, orignal=0x745aaad1
> ...
> /* System blocked here. */
> 
> Any idea?

I know lockups are frustrating; my only idea is that the memory controller
is locked up because of some failing pci <-> system ram transaction.

> 
> BTW, we find the following in r600_pcie_gart_enable()
> (drivers/gpu/drm/radeon/r600.c):
> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
> (u32)(rdev->dummy_page.addr >> 12));
> 
> On our platform, PAGE_SIZE is 16K; is there any problem with that?

No, this should be handled properly.

> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
> should change to:
>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>           radeon_gart_set_page(rdev, t, page_base);
> -         page_base += RADEON_GPU_PAGE_SIZE;
> +         if (page_base != rdev->dummy_page.addr)
> +                 page_base += RADEON_GPU_PAGE_SIZE;
>   }
> ???

No need to do so; the dummy page will be 16K too, so it's fine.

Cheers,
Jerome

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-21 10:37             ` Chen Jie
@ 2012-02-27  2:44               ` Chen Jie
  2012-02-27 18:41                 ` Jerome Glisse
  2012-02-27 18:38               ` Jerome Glisse
  1 sibling, 1 reply; 27+ messages in thread
From: Chen Jie @ 2012-02-27  2:44 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, yanhua, dri-devel

Hi,

For this occasional GPU lockup when returning from STR/STD, I find
the following (when the problem happens):

The value of SRBM_STATUS is either 0x20002040 or 0x20003040.
Which means:
* HI_RQ_PENDING(There is a HI/BIF request pending in the SRBM)
* MCDW_BUSY(Memory Controller Block is Busy)
* BIF_BUSY(Bus Interface is Busy)
* MCDX_BUSY(Memory Controller Block is Busy) if is 0x20003040
Are MCDW_BUSY and MCDX_BUSY two memory channels? What is the
relationship among GART mapped memory, On-board video memory and MCDX,
MCDW?
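The two observed values differ in exactly one bit, so by the decode above that
bit must be MCDX_BUSY (a worked check; the raw masks come from the values
above and are not re-verified against the r600d.h field macros):

	u32 status = RREG32(R_000E50_SRBM_STATUS);

	/* 0x20002040 ^ 0x20003040 == 0x00001000, the MCDX_BUSY bit; the
	 * common bits 0x20002040 cover HI_RQ_PENDING, MCDW_BUSY and BIF_BUSY */
	if (status & 0x00001000)
		DRM_INFO("MCDX also busy, SRBM_STATUS=0x%08x\n", status);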

CP_STAT: the CSF_RING_BUSY is always set.

There are many CP_PACKET2 (0x80000000) entries in the CP ring (more than three hundred), e.g.
r[131800]=0x00028000
r[131801]=0xc0016800
r[131802]=0x00000140
r[131803]=0x000079c5
r[131804]=0x0000304a
r[131805] ... r[132143]=0x80000000
r[132144]=0xffff0000
After the first reset, the GPU will lock up again; this time there are
typically 320 dwords in the CP ring -- 319 CP_PACKET2 with 0xc0033d00
at the end.
Are these normal?
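For reference, such dumps can be decoded by hand from the PM4 header layout (a
sketch; the opcode names are from the r600d.h PACKET3 table as I read it):

/* PM4 header: bits 31:30 = type; for type 3, bits 15:8 = opcode and
 * bits 29:16 = payload dword count - 1.  So in the dump above:
 *   0x80000000 -> type 2, a filler/NOP
 *   0xc0016800 -> type 3, opcode 0x68 (SET_CONFIG_REG), 2 payload dwords
 *   0xc0033d00 -> type 3, opcode 0x3d (MEM_WRITE), 4 payload dwords
 */
static void decode_pm4_header(u32 hdr)
{
	u32 type = (hdr >> 30) & 0x3;

	if (type == 3)
		printk("PACKET3 opcode 0x%02x, %u payload dwords\n",
		       (hdr >> 8) & 0xff, ((hdr >> 16) & 0x3fff) + 1);
	else
		printk("type %u packet, header 0x%08x\n", type, hdr);
}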

BTW, is there any way for X to switch to NOACCEL mode when the problem
happens? That way users would have a chance to save their documents and
then reboot the machine.


Regards,
-- Chen Jie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-17  9:27           ` Chen Jie
@ 2012-02-21 10:37             ` Chen Jie
  2012-02-27  2:44               ` Chen Jie
  2012-02-27 18:38               ` Jerome Glisse
  0 siblings, 2 replies; 27+ messages in thread
From: Chen Jie @ 2012-02-21 10:37 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, dri-devel

[-- Attachment #1: Type: text/plain, Size: 3073 bytes --]

On Feb 17, 2012 at 5:27 PM, Chen Jie <chenj@lemote.com> wrote:
>> One good way to test the GART is to go over the GPU GART table and write a
>> dword using the GPU at the end of each page, something like 0xCAFEDEAD
>> or some value that is unlikely to be already set. Then go over
>> all the pages and check that the GPU writes succeeded. Abusing the scratch
>> register write-back feature is the easiest way to try that.
> I'm planning to add a GART table check procedure on resume, which
> will go over the GPU GART table:
> 1. read (backup) a dword at the end of each GPU page
> 2. write a mark with the GPU and check it
> 3. restore the original dword
Attachment validateGART.patch does the job:
* It currently only works on the mips64 platform.
* To use it, apply all_in_vram.patch first, which allocates the CP
ring, ih, and ib in VRAM and hard-codes no_wb=1.

The GART test routine will be invoked in r600_resume. We've tried it,
and found that when the lockup happened the GART table was good before
userspace restarted. The related dmesg follows:
[ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
at 9000000040040000, 32768 entries, Dummy
Page[0x000000000e004000-0x000000000e007fff]
[ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
entries(valid=8544, invalid=24224, total=32768).
...
[ 1531.156250] PM: resume of devices complete after 9396.588 msecs
[ 1532.152343] Restarting tasks ... done.
[ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than 10003msec
[ 1544.472656] ------------[ cut here ]------------
[ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
radeon_fence_wait+0x25c/0x314()
[ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id 0x0002136A)
...
[ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
[ 1545.062500] radeon 0000:01:05.0: WB disabled
[ 1545.097656] [drm] ring test succeeded in 0 usecs
[ 1545.105468] [drm] ib test succeeded in 0 usecs
[ 1545.109375] [drm] Enabling audio support
[ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
at 9000000040040000, 32768 entries, Dummy
Page[0x000000000e004000-0x000000000e007fff]
[ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
unexpected value 0x745aaad1(expect 0xDEADBEEF)
entry=0x000000000e008067, orignal=0x745aaad1
...
/* System blocked here. */

Any idea?

BTW, we find the following in r600_pcie_gart_enable()
(drivers/gpu/drm/radeon/r600.c):
WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
(u32)(rdev->dummy_page.addr >> 12));

On our platform, PAGE_SIZE is 16K; is there any problem with that?

Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
should change to:
  for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
          radeon_gart_set_page(rdev, t, page_base);
-         page_base += RADEON_GPU_PAGE_SIZE;
+         if (page_base != rdev->dummy_page.addr)
+                 page_base += RADEON_GPU_PAGE_SIZE;
  }
???



Regards,
-- Chen Jie

[-- Attachment #2: all_in_vram.patch --]
[-- Type: text/x-patch, Size: 3972 bytes --]

diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index 53dbf50..e5961ed 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -2215,6 +2218,8 @@ int r600_cp_resume(struct radeon_device *rdev)
 
 void r600_cp_commit(struct radeon_device *rdev)
 {
+	if ((rdev->cp.ring_obj->tbo.mem.placement &  TTM_PL_MASK_MEM) == TTM_PL_FLAG_VRAM)
+		WREG32(R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL, 0x1);
 	WREG32(CP_RB_WPTR, rdev->cp.wptr);
 	(void)RREG32(CP_RB_WPTR);
 }
@@ -2754,7 +2764,7 @@ static int r600_ih_ring_alloc(struct radeon_device *rdev)
 	if (rdev->ih.ring_obj == NULL) {
 		r = radeon_bo_create(rdev, NULL, rdev->ih.ring_size,
 				     true,
-				     RADEON_GEM_DOMAIN_GTT,
+				     RADEON_GEM_DOMAIN_VRAM,
 				     &rdev->ih.ring_obj);
 		if (r) {
 			DRM_ERROR("radeon: failed to create ih ring buffer (%d).\n", r);
@@ -2764,7 +2774,7 @@ static int r600_ih_ring_alloc(struct radeon_device *rdev)
 		if (unlikely(r != 0))
 			return r;
 		r = radeon_bo_pin(rdev->ih.ring_obj,
-				  RADEON_GEM_DOMAIN_GTT,
+				  RADEON_GEM_DOMAIN_VRAM,
 				  &rdev->ih.gpu_addr);
 		if (r) {
 			radeon_bo_unreserve(rdev->ih.ring_obj);
@@ -3444,6 +3454,8 @@ restart_ih:
 	if (queue_hotplug)
 		queue_work(rdev->wq, &rdev->hotplug_work);
 	rdev->ih.rptr = rptr;
+	if ((rdev->ih.ring_obj->tbo.mem.placement &  TTM_PL_MASK_MEM) == TTM_PL_FLAG_VRAM)
+		WREG32(R_005480_HDP_MEM_COHERENCY_FLUSH_CNTL, 0x1);
 	WREG32(IH_RB_RPTR, rdev->ih.rptr);
 	spin_unlock_irqrestore(&rdev->ih.lock, flags);
 	return IRQ_HANDLED;
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 795403b..c5326e0 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -82,13 +82,13 @@ void radeon_debugfs_cleanup(struct drm_minor *minor);
 #endif
 
 
-int radeon_no_wb;
+int radeon_no_wb = 1;
 int radeon_modeset = -1;
 int radeon_dynclks = -1;
 int radeon_r4xx_atom = 0;
 int radeon_agpmode = 0;
 int radeon_vram_limit = 0;
-int radeon_gart_size = 512; /* default gart size */
+int radeon_gart_size = 128; /* default gart size */
 int radeon_benchmarking = 0;
 int radeon_testing = 0;
 int radeon_connector_table = 0;
diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
index 6ea798c..608d2fe 100644
--- a/drivers/gpu/drm/radeon/radeon_ring.c
+++ b/drivers/gpu/drm/radeon/radeon_ring.c
@@ -176,7 +180,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 	INIT_LIST_HEAD(&rdev->ib_pool.bogus_ib);
 	/* Allocate 1M object buffer */
 	r = radeon_bo_create(rdev, NULL,  RADEON_IB_POOL_SIZE*64*1024,
-				true, RADEON_GEM_DOMAIN_GTT,
+				true, RADEON_GEM_DOMAIN_VRAM,
 				&rdev->ib_pool.robj);
 	if (r) {
 		DRM_ERROR("radeon: failed to ib pool (%d).\n", r);
@@ -185,7 +189,7 @@ int radeon_ib_pool_init(struct radeon_device *rdev)
 	r = radeon_bo_reserve(rdev->ib_pool.robj, false);
 	if (unlikely(r != 0))
 		return r;
-	r = radeon_bo_pin(rdev->ib_pool.robj, RADEON_GEM_DOMAIN_GTT, &gpu_addr);
+	r = radeon_bo_pin(rdev->ib_pool.robj, RADEON_GEM_DOMAIN_VRAM, &gpu_addr);
 	if (r) {
 		radeon_bo_unreserve(rdev->ib_pool.robj);
 		DRM_ERROR("radeon: failed to pin ib pool (%d).\n", r);
@@ -333,7 +337,7 @@ int radeon_ring_init(struct radeon_device *rdev, unsigned ring_size)
 	/* Allocate ring buffer */
 	if (rdev->cp.ring_obj == NULL) {
 		r = radeon_bo_create(rdev, NULL, rdev->cp.ring_size, true,
-					RADEON_GEM_DOMAIN_GTT,
+					RADEON_GEM_DOMAIN_VRAM,
 					&rdev->cp.ring_obj);
 		if (r) {
 			dev_err(rdev->dev, "(%d) ring create failed\n", r);
@@ -342,7 +346,7 @@ int radeon_ring_init(struct radeon_device *rdev, unsigned ring_size)
 		r = radeon_bo_reserve(rdev->cp.ring_obj, false);
 		if (unlikely(r != 0))
 			return r;
-		r = radeon_bo_pin(rdev->cp.ring_obj, RADEON_GEM_DOMAIN_GTT,
+		r = radeon_bo_pin(rdev->cp.ring_obj, RADEON_GEM_DOMAIN_VRAM,
 					&rdev->cp.gpu_addr);
 		if (r) {
 			radeon_bo_unreserve(rdev->cp.ring_obj);
 

[-- Attachment #3: validateGART.patch --]
[-- Type: text/x-patch, Size: 3948 bytes --]

diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
index fb45a3f..7d55085 100644
--- a/drivers/gpu/drm/radeon/r600.c
+++ b/drivers/gpu/drm/radeon/r600.c
@@ -2445,6 +2447,116 @@ void r600_vga_set_state(struct radeon_device *rdev, bool state)
 	WREG32(CONFIG_CNTL, temp);
 }
 
+int r600_gart_table_validate(struct radeon_device *rdev)
+{
+	int r, i, invalid_count, count, total_entries;
+	u64 dummy_page_start, dummy_page_end;
+	struct radeon_fence *fence = NULL;
+
+	invalid_count = count = 0;
+	total_entries = rdev->gart.table_size >> 3;
+
+	dummy_page_start = (u64) rdev->dummy_page.addr;
+	dummy_page_end = dummy_page_start + PAGE_SIZE - 1;
+
+	DRM_INFO("%s(): Validate GART Table at %p, %d entries, Dummy Page[0x%016llx-0x%016llx]\n",
+		 __func__, rdev->gart.table.vram.ptr, total_entries,
+		 dummy_page_start, dummy_page_end);
+
+	for (i = 0; i < total_entries; i++) {
+		void __iomem *ptr;
+
+		u64 entry_val;
+		u64 bus_addr, paddr;
+		volatile void *vaddr;
+		u64 gpu_addr;
+		u32 backup, what_read;
+
+		ptr = ((void __iomem *) rdev->gart.table.vram.ptr) + i * 8;
+		entry_val = readq(ptr);
+
+		bus_addr = entry_val & 0xFFFFFFFFFFFFF000ULL;
+
+		/* For loongson, PAGE_SIZE=16K */
+		if (bus_addr >= dummy_page_start && bus_addr <= dummy_page_end) {
+			if (bus_addr + RADEON_GPU_PAGE_SIZE - 1 > dummy_page_end)
+				DRM_ERROR("Iter=%d: dummy page intersects with normal page(entry=%016llx)!\n",
+					  i, entry_val);
+
+			invalid_count++;
+			continue;
+		}
+
+		/* paddr == bus_addr */
+		paddr = bus_addr;
+		/* mips64: map to xkphys: unmapped cached window */
+		vaddr = (volatile void *) (paddr | 0x9800000000000000ULL);
+
+		backup = *((volatile u32 *) (vaddr + RADEON_GPU_PAGE_SIZE - sizeof(u32)));
+
+		gpu_addr = rdev->mc.gtt_start + i*RADEON_GPU_PAGE_SIZE +
+			   RADEON_GPU_PAGE_SIZE - sizeof(u32);
+
+		r = radeon_fence_create(rdev, &fence);
+		if (r) {
+			DRM_ERROR("Iter=%d: failed to create fence.\n", i);
+			break;
+		}
+
+		r = radeon_ring_lock(rdev, 16 /* fence emit */ + 5);
+		if (r) {
+			DRM_ERROR("Iter=%d: cp failed to lock ring (%d).\n", i, r);
+			break;
+		}
+
+		radeon_ring_write(rdev, PACKET3(PACKET3_MEM_WRITE, 3));
+
+#define MY_GPU_ADDR_LO32_ALIGN32(gpu_addr) ((u32) ((gpu_addr) & 0xfffffffc))
+#define MY_GPU_ADDR_HI8(gpu_addr) ((u32) ((((gpu_addr) >> 32) & 0xff)))
+#define MY_DATA32_MODE (1<<18)
+
+		radeon_ring_write(rdev, 
+#ifdef __BIG_ENDIAN 
+			(2 << 0) | 
+#endif
+			MY_GPU_ADDR_LO32_ALIGN32(gpu_addr));
+		radeon_ring_write(rdev, MY_GPU_ADDR_HI8(gpu_addr) | MY_DATA32_MODE);
+		radeon_ring_write(rdev, 0xDEADBEEF);
+		radeon_ring_write(rdev, 0x0); /* Discarded */
+		radeon_fence_emit(rdev, fence);
+		radeon_ring_unlock_commit(rdev);
+
+		r = radeon_fence_wait(fence, false);
+                radeon_fence_unref(&fence);
+		if (r) {
+			DRM_ERROR("Iter=%d: failed to wait for fence.\n", i);
+			*((volatile u32 *) (vaddr + RADEON_GPU_PAGE_SIZE - sizeof(u32))) = backup;
+			break;
+		}
+
+		what_read = *((volatile u32 *) (vaddr + RADEON_GPU_PAGE_SIZE - sizeof(u32)));
+		if (what_read != 0xDEADBEEF) {
+			DRM_ERROR("Iter=%d: unexpected value 0x%08x(expect 0xDEADBEEF) "
+				  "entry=0x%016llx, orignal=0x%08x\n", 
+				  i, what_read, entry_val, backup);
+			// *((volatile u32 *) (vaddr + RADEON_GPU_PAGE_SIZE - sizeof(u32))) = backup;
+			// break;
+		}
+
+		*((volatile u32 *) (vaddr + RADEON_GPU_PAGE_SIZE - sizeof(u32))) = backup;
+		count++;
+	}
+
+
+	DRM_INFO("%s(): Sweep %d entries(valid=%d, invalid=%d, total=%d).\n",
+		 __func__, i, count, invalid_count, total_entries);
+
+	if (fence)
+		radeon_fence_unref(&fence);
+
+	return 0;
+}
+
 int r600_resume(struct radeon_device *rdev)
 {
 	int r;
@@ -2474,6 +2586,12 @@ int r600_resume(struct radeon_device *rdev)
 		return r;
 	}
 
+	r = r600_gart_table_validate(rdev);
+	if (r) {
+		DRM_ERROR("radeon: GART validation failed\n");
+		return r;
+	}
+
 	return r;
 }
 

[-- Attachment #4: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-16 10:16         ` Chen Jie
@ 2012-02-17 10:42           ` Chen Jie
  0 siblings, 0 replies; 27+ messages in thread
From: Chen Jie @ 2012-02-17 10:42 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, dri-devel

>> On Feb 15, 2012 at 11:53 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> To me it looks like the CP is trying to fetch memory but the
>>> GPU memory controller fails to fulfill the cp request. Did you
>>> check the PCI configuration before & after (when things don't
>>> work)? My best guess is PCI bus mastering is not properly working
>>> or the PCIE GPU gart table has wrong data.
>>>
>>> Maybe one needs to drop bus master and re-enable bus master to
>>> work around some bug...
>> Thanks for your suggestion. We've tried the 'drop and re-enable master'
>> trick; unfortunately it doesn't work.
>> The PCI configuration comparison will be done later.
> Update: We've checked the first 64 bytes of PCI configuration space
> before & after, and didn't find any difference.
Hi,

Status update:
We tried to analyze the GPU instruction stream during the lockup today. The
lockup always occurs after tasks are restarted, so the related
instructions should reside in an ib, as indicated by the dmesg:
[ 2456.585937] GPU lockup (waiting for 0x0002F98B last fence id 0x0002F98A)

Printing the instructions in the related ib:
[ 2462.492187] PM4 block 10 has 115 instructions, with fence seq 2f98b
....
[ 2462.976562] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2462.984375] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2462.988281] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2462.992187] Type3:PACKET3_SET_ALU_CONST ref_addr  <not interpreted>
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
[ 2463.003906] Type3:PACKET3_SET_RESOURCE ref_addr  <not interpreted>
[ 2463.007812] Type3:PACKET3_SET_CONFIG_REG ref_addr  <not interpreted>
[ 2463.011718] Type3:PACKET3_INDEX_TYPE ref_addr  <not interpreted>
[ 2463.015625] Type3:PACKET3_NUM_INSTANCES ref_addr  <not interpreted>
[ 2463.019531] Type3:PACKET3_DRAW_INDEX_AUTO ref_addr  <not interpreted>
[ 2463.027343] Type3:PACKET3_EVENT_WRITE ref_addr  <not interpreted>
[ 2463.031250] Type3:PACKET3_SET_CONFIG_REG ref_addr  <not interpreted>
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680
[ 2463.039062] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2463.046875] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2463.050781] Type3:PACKET3_SET_CONTEXT_REG ref_addr  <not interpreted>
[ 2463.054687] Type3:PACKET3_SET_BOOL_CONST ref_addr  <not interpreted>
[ 2463.062500] Type3:PACKET3_SURFACE_SYNC ref_addr 10668e

CP_COHER_BASE was 0x0018C880, so the instruction which caused the lockup
should be among:
[ 2462.996093] Type3:PACKET3_SURFACE_SYNC ref_addr 18c880
...
[ 2463.035156] Type3:PACKET3_SURFACE_SYNC ref_addr 10f680

Here, only SURFACE_SYNC, SET_RESOURCE and EVENT_WRITE will access GPU memory.
We guess it may be SURFACE_SYNC?
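For reference, a SURFACE_SYNC is emitted like this (the cp_set_surface_sync()
pattern from r600_blit_kms.c; base and size travel in 256-byte units, which is
why ref_addr 18c880 matches CP_COHER_BASE 0x0018C880, i.e. a surface starting
at GPU address 0x18c88000):

	radeon_ring_write(rdev, PACKET3(PACKET3_SURFACE_SYNC, 3));
	radeon_ring_write(rdev, sync_type);		/* CP_COHER_CNTL: which caches to flush */
	radeon_ring_write(rdev, (size + 255) >> 8);	/* CP_COHER_SIZE, 256-byte units */
	radeon_ring_write(rdev, mc_addr >> 8);		/* CP_COHER_BASE, 256-byte units */
	radeon_ring_write(rdev, 10);			/* POLL_INTERVAL */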

BTW, when the lockup happens, if the CP ring is placed in vram, ring_test
will pass but ib_test fails -- which suggests the ME fails to feed the CP
during the lockup? Might an earlier SURFACE_SYNC block the MC?

P.S. In today's debugging we hacked the driver to place the CP ring, ib and ih
in vram and disable wb (radeon_no_wb=1).

Any idea?



Regards,
-- Chen Jie
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-16 16:32         ` Jerome Glisse
@ 2012-02-17  9:27           ` Chen Jie
  2012-02-21 10:37             ` Chen Jie
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Jie @ 2012-02-17  9:27 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, dri-devel

On Feb 17, 2012 at 12:32 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> Ok, let's start from the beginning. I'm convinced it's related to the GPU
> memory controller failing to fulfill some request that hits system
> memory. So in another mail you wrote:
>
>> BTW, I found radeon_gart_bind() will call pci_map_page(), which hooks
>> to swiotlb_map_page on our platform; that seems to allocate and return
>> the dma_addr_t of a new page from a pool if the dma_mask is not met. Seems
>> like a bug, since the BO is backed by one set of pages, but what is mapped
>> into the GART is another set of pages?
>
> Is this still the case? As this is obviously wrong, we fixed that
> recently. What drm code are you using? The rs780 dma mask is something
> like 40 bits iirc, so you should never have an issue on your system with
> 1G of memory, right?
Right.

>
> If you have an iommu, what happens on resume? Are all pages previously
> mapped with pci_map_page still valid?
Physical addresses are directly mapped to bus addresses, so the iommu does
nothing on resume; the pages should be valid?

>
> One good way to test the GART is to go over the GPU GART table and write a
> dword using the GPU at the end of each page, something like 0xCAFEDEAD
> or some value that is unlikely to be already set. Then go over
> all the pages and check that the GPU writes succeeded. Abusing the scratch
> register write-back feature is the easiest way to try that.
I'm planning to add a GART table check procedure on resume, which
will go over the GPU GART table:
1. read (backup) a dword at the end of each GPU page
2. write a mark with the GPU and check it
3. restore the original dword

Hopefully, this can help.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-16  9:21       ` Chen Jie
  2012-02-16 10:16         ` Chen Jie
@ 2012-02-16 16:32         ` Jerome Glisse
  2012-02-17  9:27           ` Chen Jie
  1 sibling, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2012-02-16 16:32 UTC (permalink / raw)
  To: Chen Jie; +Cc: chenhc, michel, dri-devel

On Thu, Feb 16, 2012 at 05:21:10PM +0800, Chen Jie wrote:
> Hi,
> 
> On Feb 15, 2012 at 11:53 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > To me it looks like the CP is trying to fetch memory but the
> > GPU memory controller fails to fulfill the cp request. Did you
> > check the PCI configuration before & after (when things don't
> > work)? My best guess is PCI bus mastering is not properly working
> > or the PCIE GPU gart table has wrong data.
> >
> > Maybe one needs to drop bus master and re-enable bus master to
> > work around some bug...
> Thanks for your suggestion. We've tried the 'drop and re-enable master'
> trick; unfortunately it doesn't work.
> The PCI configuration comparison will be done later.
> 
> Some additional information:
> The "GPU Lockup" seems always occur after tasks be restarting -- We
> inserted more ring tests , non of them failed before restarting tasks.
> 
> BTW, I hacked the GART table to try to simulate the problem:
> 1. Changed the system memory address (bus address) of ring_obj to an
> arbitrary value, e.g. 0 or 128M.
> 2. Changed the system memory address of a BO in radeon_test to an
> arbitrary value, e.g. 0.
> 
> Neither of the above led to a GPU lockup:
> Point 1 rendered a black screen;
> Point 2 only the test itself failed
> 
> Any idea?
> 

Ok, let's start from the beginning. I'm convinced it's related to the GPU
memory controller failing to fulfill some request that hits system
memory. So in another mail you wrote:

> BTW, I found radeon_gart_bind() will call pci_map_page(), which hooks
> to swiotlb_map_page on our platform; that seems to allocate and return
> the dma_addr_t of a new page from a pool if the dma_mask is not met. Seems
> like a bug, since the BO is backed by one set of pages, but what is mapped
> into the GART is another set of pages?

Is this still the case? As this is obviously wrong, we fixed that
recently. What drm code are you using? The rs780 dma mask is something
like 40 bits iirc, so you should never have an issue on your system with
1G of memory, right?
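
A quick sketch of how to check that (not the driver's actual code): on a
platform where bus address == physical address, a mismatch between what
pci_map_page() returns and page_to_phys() means swiotlb substituted a
bounce page for the BO page:

#include <linux/pci.h>

static int check_gart_page_mapping(struct pci_dev *pdev, struct page *page)
{
        phys_addr_t phys = page_to_phys(page);
        dma_addr_t dma;

        dma = pci_map_page(pdev, page, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
        if (pci_dma_mapping_error(pdev, dma))
                return -ENOMEM;
        if (dma != (dma_addr_t)phys)  /* bounced: GART would point at a copy */
                printk(KERN_WARNING "gart: dma 0x%llx != phys 0x%llx\n",
                       (unsigned long long)dma, (unsigned long long)phys);
        pci_unmap_page(pdev, dma, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
        return 0;
}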

If you have an iommu, what happens on resume? Are all pages previously
mapped with pci_map_page() still valid?

One good way to test the GART is to go over the GPU GART table and write a
dword using the GPU at the end of each page -- something like 0xCAFEDEAD
or some value that is unlikely to be set already. Then go over
all the pages and check that the GPU writes succeeded. Abusing the scratch
register write-back feature is the easiest way to try that.

Cheers,
Jerome
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-16  9:21       ` Chen Jie
@ 2012-02-16 10:16         ` Chen Jie
  2012-02-17 10:42           ` Chen Jie
  2012-02-16 16:32         ` Jerome Glisse
  1 sibling, 1 reply; 27+ messages in thread
From: Chen Jie @ 2012-02-16 10:16 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, dri-devel

On 2012-02-16 at 5:21 PM, Chen Jie <chenj@lemote.com> wrote:
> Hi,
>
> On 2012-02-15 at 11:53 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> To me it looks like the CP is trying to fetch memory but the
>> GPU memory controller fails to fulfill the CP's requests. Did you
>> check the PCI configuration before & after (when things don't
>> work)? My best guess is PCI bus mastering is not properly working,
>> or the PCIe GPU GART table has wrong data.
>>
>> Maybe one needs to drop bus master and reenable bus master to
>> work around some bug...
> Thanks for your suggestion. We've tried the 'drop and reenable master'
> trick; unfortunately it doesn't work.
> The PCI configuration comparison will be done later.
Update: We've checked the first 64 bytes of PCI configuration space
before & after, and didn't find any difference.
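
For reference, a minimal sketch of one way to take such a snapshot (dump
the standard 64-byte config header before suspend and again after resume,
then memcmp the two buffers):

#include <linux/pci.h>

/* fill buf[0..15] with the first 64 bytes of pdev's config space */
static void pci_cfg_snapshot(struct pci_dev *pdev, u32 buf[16])
{
        int off;

        for (off = 0; off < 64; off += 4)
                pci_read_config_dword(pdev, off, &buf[off / 4]);
}

/* usage: pci_cfg_snapshot(pdev, before); ... suspend/resume ...
 *        pci_cfg_snapshot(pdev, after);
 *        if (memcmp(before, after, 64)) print the differing dwords */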



Regards,
-- Chen Jie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-15 15:53     ` Jerome Glisse
@ 2012-02-16  9:21       ` Chen Jie
  2012-02-16 10:16         ` Chen Jie
  2012-02-16 16:32         ` Jerome Glisse
  0 siblings, 2 replies; 27+ messages in thread
From: Chen Jie @ 2012-02-16  9:21 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: chenhc, michel, dri-devel

Hi,

On 2012-02-15 at 11:53 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> To me it looks like the CP is trying to fetch memory but the
> GPU memory controller fails to fulfill the CP's requests. Did you
> check the PCI configuration before & after (when things don't
> work)? My best guess is PCI bus mastering is not properly working,
> or the PCIe GPU GART table has wrong data.
>
> Maybe one needs to drop bus master and reenable bus master to
> work around some bug...
Thanks for your suggestion. We've tried the 'drop and reenable master'
trick; unfortunately it doesn't work.
The PCI configuration comparison will be done later.

Some additional information:
The "GPU Lockup" seems always occur after tasks be restarting -- We
inserted more ring tests , non of them failed before restarting tasks.

BTW, I hacked the GART table to try to simulate the problem:
1. Changed the system memory address (bus address) of ring_obj to an
arbitrary value, e.g. 0 or 128M.
2. Changed the system memory address of a BO in radeon_test to an
arbitrary value, e.g. 0.

Neither of the above led to a GPU lockup:
Point 1 rendered a black screen;
Point 2 only made the test itself fail.

Any idea?


Regards,
-- Chen Jie
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2012-02-15  9:32   ` Chen Jie
@ 2012-02-15 15:53     ` Jerome Glisse
  2012-02-16  9:21       ` Chen Jie
  0 siblings, 1 reply; 27+ messages in thread
From: Jerome Glisse @ 2012-02-15 15:53 UTC (permalink / raw)
  To: Chen Jie; +Cc: chenhc, Michel Dänzer, dri-devel

On Wed, Feb 15, 2012 at 05:32:35PM +0800, Chen Jie wrote:
> Hi,
> 
> Status update about the problem 'Occasionally "GPU lockup" after
> resuming from suspend.'
> 
> First, this can happen when the system returns from STR (suspend to
> RAM) or STD (suspend to disk, aka hibernation).
> When returning from STD, the initialization process is most similar to
> a normal boot.
> Standby is OK; it is similar to STR, except that standby does
> not shut down the power of the CPU, GPU, etc.
> 
> We've dumped and compared the registers, and found something:
> CP_STAT
> normal value: 0x00000000
> value when this problem occurred: 0x802100C1 or 0x802300C1
> 
> CP_ME_CNTL
> normal value: 0x000000FF
> value when this problem occurred: always 0x200000FF in our test
> 
> Questions:
> According to the manual,
> CP_STAT = 0x802100C1 means
> 	CSF_RING_BUSY(bit 0):
> 		The Ring fetcher still has command buffer data to fetch, or the PFP
> still has data left to process from the reorder queue.
> 	CSF_BUSY(bit 6):
> 		The input FIFOs have command buffers to fetch, or one or more of the
> fetchers are busy, or the arbiter has a request to send to the MIU.
> 	MIU_RDREQ_BUSY(bit 7):
> 		The read path logic inside the MIU is busy.
> 	MEQ_BUSY(bit 16):
> 		The PFP-to-ME queue has valid data in it.
> 	SURFACE_SYNC_BUSY(bit 21):
> 		The Surface Sync unit is busy.
> 	CP_BUSY(bit 31):
> 		Any block in the CP is busy.
> What does it suggest?
> 
> What does it mean if bit 29 of CP_ME_CNTL is set?
> 
> BTW, how does the dummy page work in GART?
> 
> 
> Regards,
> -- Chen Jie

To me it looks like the CP is trying to fetch memory but the
GPU memory controller fails to fulfill the CP's requests. Did you
check the PCI configuration before & after (when things don't
work)? My best guess is PCI bus mastering is not properly working,
or the PCIe GPU GART table has wrong data.

Maybe one needs to drop bus master and reenable bus master to
work around some bug...

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-12-07 14:21 ` Alex Deucher
@ 2012-02-15  9:32   ` Chen Jie
  2012-02-15 15:53     ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Jie @ 2012-02-15  9:32 UTC (permalink / raw)
  To: chenhc; +Cc: Michel Dänzer, dri-devel

Hi,

Status update about the problem 'Occasionally "GPU lockup" after
resuming from suspend.'

First, this can happen when the system returns from STR (suspend to
RAM) or STD (suspend to disk, aka hibernation).
When returning from STD, the initialization process is most similar to
a normal boot.
Standby is OK; it is similar to STR, except that standby does
not shut down the power of the CPU, GPU, etc.

We've dumped and compared the registers, and found something:
CP_STAT
normal value: 0x00000000
value when this problem occurred: 0x802100C1 or 0x802300C1

CP_ME_CNTL
normal value: 0x000000FF
value when this problem occurred: always 0x200000FF in our test

Questions:
According to the manual,
CP_STAT = 0x802100C1 means
	CSF_RING_BUSY(bit 0):
		The Ring fetcher still has command buffer data to fetch, or the PFP
still has data left to process from the reorder queue.
	CSF_BUSY(bit 6):
		The input FIFOs have command buffers to fetch, or one or more of the
fetchers are busy, or the arbiter has a request to send to the MIU.
	MIU_RDREQ_BUSY(bit 7):
		The read path logic inside the MIU is busy.
	MEQ_BUSY(bit 16):
		The PFP-to-ME queue has valid data in it.
	SURFACE_SYNC_BUSY(bit 21):
		The Surface Sync unit is busy.
	CP_BUSY(bit 31):
		Any block in the CP is busy.
What does it suggest?
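
For reference, a small decode sketch built only from the bit list above
(0x802100C1 yields exactly the six names listed):

static const struct { int bit; const char *name; } cp_stat_bits[] = {
        {  0, "CSF_RING_BUSY" },
        {  6, "CSF_BUSY" },
        {  7, "MIU_RDREQ_BUSY" },
        { 16, "MEQ_BUSY" },
        { 21, "SURFACE_SYNC_BUSY" },
        { 31, "CP_BUSY" },
};

static void decode_cp_stat(u32 val)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(cp_stat_bits); i++)
                if (val & (1u << cp_stat_bits[i].bit))
                        printk(KERN_INFO "CP_STAT: %s\n", cp_stat_bits[i].name);
}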

What does it mean if bit 29 of CP_ME_CNTL is set?

BTW, how does the dummy page work in GART?


Regards,
-- Chen Jie

在 2011年12月7日 下午10:21,Alex Deucher <alexdeucher@gmail.com> 写道:
> 2011/12/7  <chenhc@lemote.com>:
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
>> #define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
>>
>> Could you please tell me what they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know offhand what the names mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET                       0x0E60
>> #define R_000E50_SRBM_STATUS                           0x0E50
>> #define R_008020_GRBM_SOFT_RESET                0x8020
>> #define R_008010_GRBM_STATUS                    0x8010
>> #define R_008014_GRBM_STATUS2                   0x8014
>>
>> A bit more info: If I reset the MC after resetting the CP (this is what
>> Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" will
>> disappear, but there is still a "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  <chenhc@lemote.com>:
>>>> And, I want to know something:
>>>> 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
>>>> 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> Some status update.
>>>>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>>>>>> Hi,
>>>>>> Add more information.
>>>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>>>>>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>>>>>> 3.1.0-rc8,
>>>>>> 64-bit).  Related kernel messages:
>>>>>> /* return from STR */
>>>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>>>>> ehci_hcd
>>>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>>>>> ohci_hcd
>>>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>>>>> ohci_hcd
>>>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>>> [  165.628906] Restarting tasks ... done.
>>>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>>>>> 10019msec
>>>>>> [  177.089843] ------------[ cut here ]------------
>>>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>>>>> radeon_fence_wait+0x25c/0x33c()
>>>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>>>>>> 0x000013AD)
>>>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>>>> [  177.117187] Call Trace:
>>>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>>> [  177.148437] [<ffffffff8053b478>]
>>>>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>>>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>>>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.367187] radeon 0000:01:05.0:
>>>>>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>> [  177.390625] radeon 0000:01:05.0:
>>>>>> R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>>>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>>>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>>>>> (scratch(0x8504)=0xCAFEDEAD)
>>>>> After pinning the ring in VRAM, it warned of an ib test failure. It seems
>>>>> something is wrong with accesses that go through the GTT.
>>>>>
>>>>> We dumped the GART table just after stopping the CP, compared it with
>>>>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>>>>> difference.
>>>>>
>>>>> Any idea?
>>>>>
>>>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>>> schedule
>>>>>> IB(5).
>>>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>>> schedule
>>>>>> IB(6).
>>>>>> ...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-12-08 11:35 chenhc
@ 2011-12-15 15:50 ` Michel Dänzer
  0 siblings, 0 replies; 27+ messages in thread
From: Michel Dänzer @ 2011-12-15 15:50 UTC (permalink / raw)
  To: chenhc; +Cc: yanh, dri-devel, Chen Jie

On Don, 2011-12-08 at 19:35 +0800, chenhc@lemote.com wrote:
> 
> I found CP_RB_WPTR has changed when the "ring test failed", so I think the
> CP is active, but what it gets from the ring buffer is wrong.

CP_RB_WPTR is normally only changed by the CPU after adding commands to
the ring buffer, so I'm afraid that may not be a valid conclusion. 


> Then, I want to know whether there is a way to check the content that
> the GPU gets from the ring buffer.

See the r100_debugfs_cp_csq_fifo() function, which generates the output
for /sys/kernel/debug/dri/0/r100_cp_csq_fifo. 


> BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
> /sys/power/state" to do a hibernation, there will be occasionally "GPU
> reset" just like suspend. However, if I use "echo reboot >
> /sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
> wakeup automatically, there is no "GPU reset" after hundreds of tests.
> What does this imply? Power loss cause something break?

Yeah, it sounds like the resume code doesn't properly re-initialize
something that's preserved on a warm boot but lost on a cold boot. 


-- 
Earthling Michel Dänzer           |                   http://www.amd.com
Libre software enthusiast         |          Debian, X and DRI developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2011-12-08 11:35 chenhc
  2011-12-15 15:50 ` Michel Dänzer
  0 siblings, 1 reply; 27+ messages in thread
From: chenhc @ 2011-12-08 11:35 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Michel Dänzer, yanh, dri-devel, Chen Jie

Thank you for your reply.

I found CP_RB_WPTR has changed when the "ring test failed", so I think the
CP is active, but what it gets from the ring buffer is wrong. Then, I want
to know whether there is a way to check the content that the GPU gets from
the ring buffer.

BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to do a hibernation, there will be occasionally "GPU
reset" just like suspend. However, if I use "echo reboot >
/sys/power/disk; echo disk > /sys/power/state" to do a hibernation and
wakeup automatically, there is no "GPU reset" after hundreds of tests.
What does this imply? Power loss cause something break?

Best regards,

Huacai Chen


> 2011/12/7  <chenhc@lemote.com>:
>> When "MC timeout" happens at GPU reset, we found the 12th and 13th
>> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
>> two bits are like this:
>> #define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
>> #define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
>>
>> Could you please tell me what they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know offhand what the names mean.
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET                       0x0E60
>> #define R_000E50_SRBM_STATUS                           0x0E50
>> #define R_008020_GRBM_SOFT_RESET                0x8020
>> #define R_008010_GRBM_STATUS                    0x8010
>> #define R_008014_GRBM_STATUS2                   0x8014
>>
>> A bit more info: If I reset the MC after resetting the CP (this is what
>> Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" will
>> disappear, but there is still a "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  <chenhc@lemote.com>:
>>>> And, I want to know something:
>>>> 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
>>>> 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> Some status update.
>>>>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>>>>>> Hi,
>>>>>> Add more information.
>>>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a
>>>>>> mipsel
>>>>>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>>>>>> 3.1.0-rc8,
>>>>>> 64-bit).  Related kernel messages:
>>>>>> /* return from STR */
>>>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl
>>>>>> 300)
>>>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>>>>> ehci_hcd
>>>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>>>>> ohci_hcd
>>>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>>>>> ohci_hcd
>>>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>>> [  165.628906] Restarting tasks ... done.
>>>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more
>>>>>> than
>>>>>> 10019msec
>>>>>> [  177.089843] ------------[ cut here ]------------
>>>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>>>>> radeon_fence_wait+0x25c/0x33c()
>>>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>>>>>> 0x000013AD)
>>>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>>>> [  177.117187] Call Trace:
>>>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>>> [  177.148437] [<ffffffff8053b478>]
>>>>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>>> [  177.160156] [<ffffffff805a1820>]
>>>>>> radeon_kms_compat_ioctl+0x28/0x38
>>>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>>>> [  177.191406] radeon 0000:01:05.0:
>>>>>> R_008010_GRBM_STATUS=0xF57C2030
>>>>>> [  177.195312] radeon 0000:01:05.0:
>>>>>> R_008014_GRBM_STATUS2=0x00111103
>>>>>> [  177.203125] radeon 0000:01:05.0:
>>>>>> R_000E50_SRBM_STATUS=0x20023040
>>>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.367187] radeon 0000:01:05.0:
>>>>>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>> [  177.390625] radeon 0000:01:05.0:
>>>>>> R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>> [  177.414062] radeon 0000:01:05.0:
>>>>>> R_008010_GRBM_STATUS=0xA0003030
>>>>>> [  177.417968] radeon 0000:01:05.0:
>>>>>> R_008014_GRBM_STATUS2=0x00000003
>>>>>> [  177.425781] radeon 0000:01:05.0:
>>>>>> R_000E50_SRBM_STATUS=0x2002B040
>>>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>>>>> (scratch(0x8504)=0xCAFEDEAD)
>>>>> After pinning the ring in VRAM, it warned of an ib test failure. It seems
>>>>> something is wrong with accesses that go through the GTT.
>>>>>
>>>>> We dumped the GART table just after stopping the CP, compared it with
>>>>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>>>>> difference.
>>>>>
>>>>> Any idea?
>>>>>
>>>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on
>>>>>> resume
>>>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>>> schedule
>>>>>> IB(5).
>>>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>>> schedule
>>>>>> IB(6).
>>>>>> ...
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> -- Chen Jie
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-12-07 11:48 chenhc
@ 2011-12-07 14:21 ` Alex Deucher
  2012-02-15  9:32   ` Chen Jie
  0 siblings, 1 reply; 27+ messages in thread
From: Alex Deucher @ 2011-12-07 14:21 UTC (permalink / raw)
  To: chenhc; +Cc: Michel Dänzer, dri-devel, Chen Jie

2011/12/7  <chenhc@lemote.com>:
> When "MC timeout" happens at GPU reset, we found the 12th and 13th
> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
> two bits are like this:
> #define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
> #define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
>
> Could you please tell me what they mean? And if possible,

They refer to sub-blocks in the memory controller.  I don't really
know offhand what the names mean.

> I want to know the functionalities of these 5 registers in detail:
> #define R_000E60_SRBM_SOFT_RESET                       0x0E60
> #define R_000E50_SRBM_STATUS                           0x0E50
> #define R_008020_GRBM_SOFT_RESET                0x8020
> #define R_008010_GRBM_STATUS                    0x8010
> #define R_008014_GRBM_STATUS2                   0x8014
>
> A bit more info: If I reset the MC after resetting the CP (this is what
> Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" will
> disappear, but there is still a "ring test failed".

The bits are defined in r600d.h.  As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller

When you reset the MC, you will probably have to reset just about
everything else since most blocks depend on the MC for access to
memory.  If you do reset the MC, you should do it prior to calling
asic_init so you make sure all the hw gets re-initialized properly.
Additionally, you should probably reset the GRBM either via
SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
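
A sketch of that ordering (the register offset is the one quoted earlier
in this thread; the 0x00007FEE mask is just the value from your log, not a
recommendation; WREG32/RREG32 are the driver's usual accessors):

static void grbm_reset_then_reinit(struct radeon_device *rdev) /* sketch */
{
        WREG32(R_008020_GRBM_SOFT_RESET, 0x00007FEE); /* assert resets  */
        RREG32(R_008020_GRBM_SOFT_RESET);             /* post the write */
        udelay(50);
        WREG32(R_008020_GRBM_SOFT_RESET, 0);          /* release resets */
        RREG32(R_008020_GRBM_SOFT_RESET);

        /* ... MC reset would go here ... then re-run asic init so every
         * block that depends on the MC gets re-programmed */
        atom_asic_init(rdev->mode_info.atom_context);
}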

Alex

>
> Huacai Chen
>
>> 2011/11/8  <chenhc@lemote.com>:
>>> And, I want to know something:
>>> 1, Does GPU use MC to access GTT?
>>
>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>> memory (vram or gart).
>>
>>> 2, What can cause MC timeout?
>>
>> Lots of things.  Some GPU client still active, some GPU client hung or
>> not properly initialized.
>>
>> Alex
>>
>>>
>>>> Hi,
>>>>
>>>> Some status update.
>>>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>>>>> Hi,
>>>>> Add more information.
>>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>>>>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>>>>> 3.1.0-rc8,
>>>>> 64-bit).  Related kernel messages:
>>>>> /* return from STR */
>>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>>>> ehci_hcd
>>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>>>> ohci_hcd
>>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>>>> ohci_hcd
>>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>> [  165.628906] Restarting tasks ... done.
>>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>>>> 10019msec
>>>>> [  177.089843] ------------[ cut here ]------------
>>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>>>> radeon_fence_wait+0x25c/0x33c()
>>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>>>>> 0x000013AD)
>>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>>> [  177.117187] Call Trace:
>>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>> [  177.148437] [<ffffffff8053b478>]
>>>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [  177.367187] radeon 0000:01:05.0:
>>>>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>> [  177.390625] radeon 0000:01:05.0:
>>>>> R_008020_GRBM_SOFT_RESET=0x00000001
>>>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>>>> (scratch(0x8504)=0xCAFEDEAD)
>>>> After pinning the ring in VRAM, it warned of an ib test failure. It seems
>>>> something is wrong with accesses that go through the GTT.
>>>>
>>>> We dumped the GART table just after stopping the CP, compared it with
>>>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>>>> difference.
>>>>
>>>> Any idea?
>>>>
>>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>> schedule
>>>>> IB(5).
>>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>>> schedule
>>>>> IB(6).
>>>>> ...
>>>>
>>>>
>>>>
>>>> Regards,
>>>> -- Chen Jie
>>>>
>>>
>>>
>>>
>>
>
>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2011-12-07 11:48 chenhc
  2011-12-07 14:21 ` Alex Deucher
  0 siblings, 1 reply; 27+ messages in thread
From: chenhc @ 2011-12-07 11:48 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Michel Dänzer, dri-devel, Chen Jie

When "MC timeout" happens at GPU reset, we found the 12th and 13th
bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
two bits are like this:
#define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
#define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)

Could you please tell me what they mean? And if possible,
I want to know the functionalities of these 5 registers in detail:
#define R_000E60_SRBM_SOFT_RESET                       0x0E60
#define R_000E50_SRBM_STATUS                           0x0E50
#define R_008020_GRBM_SOFT_RESET                0x8020
#define R_008010_GRBM_STATUS                    0x8010
#define R_008014_GRBM_STATUS2                   0x8014

A bit more info: If I reset the MC after resetting the CP (this is what
Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout" will
disappear, but there is still a "ring test failed".

Huacai Chen

> 2011/11/8  <chenhc@lemote.com>:
>> And, I want to know something:
>> 1, Does GPU use MC to access GTT?
>
> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
> memory (vram or gart).
>
>> 2, What can cause MC timeout?
>
> Lots of things.  Some GPU client still active, some GPU client hung or
> not properly initialized.
>
> Alex
>
>>
>>> Hi,
>>>
>>> Some status update.
>>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>>>> Hi,
>>>> Add more information.
>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>>>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>>>> 3.1.0-rc8,
>>>> 64-bit).  Related kernel messages:
>>>> /* return from STR */
>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>>> ehci_hcd
>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>>> ohci_hcd
>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>>> ohci_hcd
>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>> [  165.628906] Restarting tasks ... done.
>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>>> 10019msec
>>>> [  177.089843] ------------[ cut here ]------------
>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>>> radeon_fence_wait+0x25c/0x33c()
>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>>>> 0x000013AD)
>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>> [  177.117187] Call Trace:
>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>> [  177.148437] [<ffffffff8053b478>]
>>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.367187] radeon 0000:01:05.0:
>>>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>> [  177.390625] radeon 0000:01:05.0:
>>>> R_008020_GRBM_SOFT_RESET=0x00000001
>>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>>> (scratch(0x8504)=0xCAFEDEAD)
>>> After pinning the ring in VRAM, it warned of an ib test failure. It seems
>>> something is wrong with accesses that go through the GTT.
>>>
>>> We dumped the GART table just after stopping the CP, compared it with
>>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>>> difference.
>>>
>>> Any idea?
>>>
>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>> schedule
>>>> IB(5).
>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>>> schedule
>>>> IB(6).
>>>> ...
>>>
>>>
>>>
>>> Regards,
>>> -- Chen Jie
>>>
>>
>>
>>
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-11-08  7:33 ` [mipsel+rs780e]Occasionally " Chen Jie
@ 2011-11-08 15:14   ` Jerome Glisse
  0 siblings, 0 replies; 27+ messages in thread
From: Jerome Glisse @ 2011-11-08 15:14 UTC (permalink / raw)
  To: Chen Jie; +Cc: chenhc, Michel Dänzer, dri-devel

On Tue, Nov 08, 2011 at 03:33:03PM +0800, Chen Jie wrote:
> Hi,
> 
> Some status update.
> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
> > Hi,
> > Add more information.
> > We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
> > platform with a mips64-compatible CPU and an rs780e; the kernel is 3.1.0-rc8,
> > 64-bit).  Related kernel messages:
> > /* return from STR */
> > [  156.152343] radeon 0000:01:05.0: WB enabled
> > [  156.187500] [drm] ring test succeeded in 0 usecs
> > [  156.187500] [drm] ib test succeeded in 0 usecs
> > [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> > [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [  156.597656] ata1.00: configured for UDMA/133
> > [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
> > [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
> > [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
> > [  157.683593] r8169 0000:02:00.0: eth0: link up
> > [  165.621093] PM: resume of devices complete after 9679.556 msecs
> > [  165.628906] Restarting tasks ... done.
> > [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
> > 10019msec
> > [  177.089843] ------------[ cut here ]------------
> > [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> > radeon_fence_wait+0x25c/0x33c()
> > [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
> > [  177.113281] Modules linked in: psmouse serio_raw
> > [  177.117187] Call Trace:
> > [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
> > [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
> > [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
> > [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
> > [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
> > [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
> > [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
> > [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
> > [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
> > [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
> > [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> > [  177.187500] radeon 0000:01:05.0: GPU softreset
> > [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> > [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
> > [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> > [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.367187] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
> > [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
> > [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
> > [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
> > [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
> > [  177.433593] radeon 0000:01:05.0: GPU reset succeed
> > [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.804687] radeon 0000:01:05.0: WB enabled
> > [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
> > (scratch(0x8504)=0xCAFEDEAD)
> After pinning the ring in VRAM, it warned of an ib test failure. It seems
> something is wrong with accesses that go through the GTT.
>
> We dumped the GART table just after stopping the CP, compared it with
> the one dumped just after r600_pcie_gart_enable, and didn't find any
> difference.
> 
> Any idea?
> 
> > [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
> > [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(5).
> > [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
> > [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(6).
> > ...
> 
> 

Do you have any kind of iommu? Is the GART table programmed with the proper
physical address for each page? Is the GPU a PCI bus master (iirc a PCI
device needs to be a master to be able to initiate requests to memory)?
Then there could be a lot of other PCI things getting in the way.
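
As a sketch, one way to answer the GART-table question is to walk the
table and compare each entry with the dma address the driver recorded at
bind time; radeon_gart_read_entry() is a hypothetical readback helper, the
rest is the driver's own bookkeeping (assuming 4K pages, so GPU pages and
CPU pages coincide):

static int radeon_gart_table_check(struct radeon_device *rdev)
{
        unsigned i;
        u64 entry;

        for (i = 0; i < rdev->gart.num_gpu_pages; i++) {
                entry = radeon_gart_read_entry(rdev, i); /* hypothetical */
                /* the low 12 bits of an entry hold flags; compare frames */
                if ((entry & ~0xFFFULL) !=
                    ((u64)rdev->gart.pages_addr[i] & ~0xFFFULL))
                        return -EINVAL;
        }
        return 0;
}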

Cheers,
Jerome
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-11-08  7:54 chenhc
@ 2011-11-08 14:25 ` Alex Deucher
  0 siblings, 0 replies; 27+ messages in thread
From: Alex Deucher @ 2011-11-08 14:25 UTC (permalink / raw)
  To: chenhc; +Cc: Michel Dänzer, dri-devel, Chen Jie

2011/11/8  <chenhc@lemote.com>:
> And, I want to know something:
> 1, Does GPU use MC to access GTT?

Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
memory (vram or gart).

> 2, What can cause MC timeout?

Lots of things.  Some GPU client still active, some GPU client hung or
not properly initialized.

Alex

>
>> Hi,
>>
>> Some status update.
>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>>> Hi,
>>> Add more information.
>>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>>> 3.1.0-rc8,
>>> 64-bit).  Related kernel messages:
>>> /* return from STR */
>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>> [  156.597656] ata1.00: configured for UDMA/133
>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>>> ehci_hcd
>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>>> ohci_hcd
>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>>> ohci_hcd
>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>> [  165.628906] Restarting tasks ... done.
>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>> 10019msec
>>> [  177.089843] ------------[ cut here ]------------
>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>>> radeon_fence_wait+0x25c/0x33c()
>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>>> 0x000013AD)
>>> [  177.113281] Modules linked in: psmouse serio_raw
>>> [  177.117187] Call Trace:
>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>> [  177.148437] [<ffffffff8053b478>]
>>> radeon_gem_wait_idle_ioctl+0x80/0x114
>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> [  177.367187] radeon 0000:01:05.0:
>>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>>> [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>>> (scratch(0x8504)=0xCAFEDEAD)
>> After pinning the ring in VRAM, it warned of an ib test failure. It seems
>> something is wrong with accesses that go through the GTT.
>>
>> We dumped the GART table just after stopping the CP, compared it with
>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>> difference.
>>
>> Any idea?
>>
>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>> schedule
>>> IB(5).
>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>>> schedule
>>> IB(6).
>>> ...
>>
>>
>>
>> Regards,
>> -- Chen Jie
>>
>
>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
@ 2011-11-08  7:54 chenhc
  2011-11-08 14:25 ` Alex Deucher
  0 siblings, 1 reply; 27+ messages in thread
From: chenhc @ 2011-11-08  7:54 UTC (permalink / raw)
  To: Chen Jie; +Cc: Michel Dänzer, dri-devel

And, I want to know something:
1, Does GPU use MC to access GTT?
2, What can cause MC timeout?

> Hi,
>
> Some status update.
> On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
>> Hi,
>> Add more information.
>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>> platform with a mips64-compatible CPU and an rs780e; the kernel is
>> 3.1.0-rc8,
>> 64-bit).  Related kernel messages:
>> /* return from STR */
>> [  156.152343] radeon 0000:01:05.0: WB enabled
>> [  156.187500] [drm] ring test succeeded in 0 usecs
>> [  156.187500] [drm] ib test succeeded in 0 usecs
>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [  156.597656] ata1.00: configured for UDMA/133
>> [  156.613281] usb 1-5: reset high speed USB device number 4 using
>> ehci_hcd
>> [  157.027343] usb 3-2: reset low speed USB device number 2 using
>> ohci_hcd
>> [  157.609375] usb 3-3: reset low speed USB device number 3 using
>> ohci_hcd
>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>> [  165.628906] Restarting tasks ... done.
>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
>> 10019msec
>> [  177.089843] ------------[ cut here ]------------
>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
>> radeon_fence_wait+0x25c/0x33c()
>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id
>> 0x000013AD)
>> [  177.113281] Modules linked in: psmouse serio_raw
>> [  177.117187] Call Trace:
>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>> [  177.148437] [<ffffffff8053b478>]
>> radeon_gem_wait_idle_ioctl+0x80/0x114
>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [  177.367187] radeon 0000:01:05.0:
>> R_008020_GRBM_SOFT_RESET=0x00007FEE
>> [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [  177.804687] radeon 0000:01:05.0: WB enabled
>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
>> (scratch(0x8504)=0xCAFEDEAD)
> After pinning the ring in VRAM, it warned of an ib test failure. It seems
> something is wrong with accesses that go through the GTT.
>
> We dumped the GART table just after stopping the CP, compared it with
> the one dumped just after r600_pcie_gart_enable, and didn't find any
> difference.
>
> Any idea?
>
>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(5).
>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't
>> schedule
>> IB(6).
>> ...
>
>
>
> Regards,
> -- Chen Jie
>


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.
  2011-09-29  9:17 Chen Jie
@ 2011-11-08  7:33 ` Chen Jie
  2011-11-08 15:14   ` Jerome Glisse
  0 siblings, 1 reply; 27+ messages in thread
From: Chen Jie @ 2011-11-08  7:33 UTC (permalink / raw)
  To: Alex Deucher; +Cc: chenhc, Michel Dänzer, dri-devel

Hi,

Some status update.
On 2011-09-29 at 5:17 PM, Chen Jie <chenj@lemote.com> wrote:
> Hi,
> Add more information.
> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
> platform with a mips64-compatible CPU and an rs780e; the kernel is 3.1.0-rc8,
> 64-bit).  Related kernel messages:
> /* return from STR */
> [  156.152343] radeon 0000:01:05.0: WB enabled
> [  156.187500] [drm] ring test succeeded in 0 usecs
> [  156.187500] [drm] ib test succeeded in 0 usecs
> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> [  156.597656] ata1.00: configured for UDMA/133
> [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
> [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
> [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
> [  157.683593] r8169 0000:02:00.0: eth0: link up
> [  165.621093] PM: resume of devices complete after 9679.556 msecs
> [  165.628906] Restarting tasks ... done.
> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
> 10019msec
> [  177.089843] ------------[ cut here ]------------
> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> radeon_fence_wait+0x25c/0x33c()
> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
> [  177.113281] Modules linked in: psmouse serio_raw
> [  177.117187] Call Trace:
> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
> [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> [  177.187500] radeon 0000:01:05.0: GPU softreset
> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
> [  177.367187] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
> [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
> [  177.804687] radeon 0000:01:05.0: WB enabled
> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
> (scratch(0x8504)=0xCAFEDEAD)
After pinning the ring in VRAM, it warned of an ib test failure. It seems
something is wrong with accesses that go through the GTT.

We dumped the GART table just after stopping the CP, compared it with
the one dumped just after r600_pcie_gart_enable, and didn't find any
difference.

Any idea?

> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> IB(5).
> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> IB(6).
> ...



Regards,
-- Chen Jie
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-03-01 17:19 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-16  8:42 [mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend chenhc
2011-12-16 10:53 ` Michel Dänzer
2011-12-16 15:46 ` Jerome Glisse
  -- strict thread matches above, loose matches on Subject: below --
2012-03-01  9:11 chenhc
2012-03-01 17:19 ` Alex Deucher
2012-02-29  4:59 chenhc
2012-02-29  4:49 chenhc
2012-02-29 17:50 ` Jerome Glisse
2011-12-08 11:35 chenhc
2011-12-15 15:50 ` Michel Dänzer
2011-12-07 11:48 chenhc
2011-12-07 14:21 ` Alex Deucher
2012-02-15  9:32   ` Chen Jie
2012-02-15 15:53     ` Jerome Glisse
2012-02-16  9:21       ` Chen Jie
2012-02-16 10:16         ` Chen Jie
2012-02-17 10:42           ` Chen Jie
2012-02-16 16:32         ` Jerome Glisse
2012-02-17  9:27           ` Chen Jie
2012-02-21 10:37             ` Chen Jie
2012-02-27  2:44               ` Chen Jie
2012-02-27 18:41                 ` Jerome Glisse
2012-02-27 18:38               ` Jerome Glisse
2011-11-08  7:54 chenhc
2011-11-08 14:25 ` Alex Deucher
2011-09-29  9:17 Chen Jie
2011-11-08  7:33 ` [mipsel+rs780e]Occasionally " Chen Jie
2011-11-08 15:14   ` Jerome Glisse
