All of lore.kernel.org
* [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Zachary Michaels @ 2017-03-22 18:19 UTC (permalink / raw)
  To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org
  Cc: Zachary Michaels, Christian König, Michel Dänzer,
	Julien Isorce

We were experiencing an infinite loop due to VRAM BOs getting added back
to the VRAM LRU on eviction via ttm_bo_mem_force_space, and reverting
this commit solves the problem.

Signed-off-by: Zachary Michaels <zmichaels@oblong.com>
Signed-off-by: Julien Isorce <jisorce@oblong.com>
---
 drivers/gpu/drm/radeon/radeon_ttm.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 0cf03ccbf0a7..d50777f1b48e 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -198,30 +198,7 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
 	case TTM_PL_VRAM:
 		if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
 			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
-		else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
-			 bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
-			unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
-			int i;
-
-			/* Try evicting to the CPU inaccessible part of VRAM
-			 * first, but only set GTT as busy placement, so this
-			 * BO will be evicted to GTT rather than causing other
-			 * BOs to be evicted from VRAM
-			 */
-			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_VRAM |
-							 RADEON_GEM_DOMAIN_GTT);
-			rbo->placement.num_busy_placement = 0;
-			for (i = 0; i < rbo->placement.num_placement; i++) {
-				if (rbo->placements[i].flags & TTM_PL_FLAG_VRAM) {
-					if (rbo->placements[0].fpfn < fpfn)
-						rbo->placements[0].fpfn = fpfn;
-				} else {
-					rbo->placement.busy_placement =
-						&rbo->placements[i];
-					rbo->placement.num_busy_placement = 1;
-				}
-			}
-		} else
+		else
 			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_GTT);
 		break;
 	case TTM_PL_TT:
-- 
2.11.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Michel Dänzer @ 2017-03-23  8:10 UTC (permalink / raw)
  To: Zachary Michaels
  Cc: Christian König, amd-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, Julien Isorce

On 23/03/17 03:19 AM, Zachary Michaels wrote:
> We were experiencing an infinite loop due to VRAM bos getting added back
> to the VRAM lru on eviction via ttm_bo_mem_force_space,

Can you share more details about what happened? I can imagine that
moving a BO from CPU visible to CPU invisible VRAM would put it back on
the LRU, but next time around it shouldn't hit this code anymore but get
evicted to GTT directly.

Was userspace maybe performing concurrent CPU access to the BOs in question?


> and reverting this commit solves the problem.

I hope we can find a better solution.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer



* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Julien Isorce @ 2017-03-23  9:26 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Christian König, Zachary Michaels,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org



Hi Michel,

When it happens, the main thread of our GL-based app is stuck on an
ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging, but the
same thing happens if true. The other threads are only si_shader:0,1,2,3
and are doing nothing, just waiting for jobs. I can also do sudo gdb -p
$(pidof Xorg) to block the X11 server, to make sure there is no ping-pong
between 2 processes. No other process is loading dri/radeonsi_dri.so. And
adding a few traces shows that the above ioctl call is looping forever on
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819
and comes from mesa:
https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454

After adding even more traces I can see that the BO which is being
indefinitely evicted has the flag RADEON_GEM_NO_CPU_ACCESS.
It gets 3 potential placements after calling "radeon_evict_flags":
 1: VRAM CPU inaccessible, fpfn is 65536
 2: VRAM CPU accessible, fpfn is 0
 3: GTT, fpfn is 0

And it looks like it continuously succeeds in moving to the second
placement. So I might be wrong, but it looks like it is not even a
ping-pong between VRAM accessible / not accessible; it just keeps being
blitted within the CPU accessible part of the VRAM.

Maybe radeon_evict_flags should just not add the second placement if the
current placement is already CPU accessible VRAM. Or it could be a bug in
get_node, which should not succeed in that case.

Note that this happens when the VRAM is nearly full.

FWIW I noticed that amdgpu is doing something different:
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205
vs
https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198


Finally, the NMI watchdog and the kernel soft and hard lockup detectors do
not detect this looping in that ioctl(RADEON_CS), maybe because they
estimate it is doing real work. Same for radeon_lockup_timeout: it does not
detect it either.

The gpu is a FirePro W600 Cape Verde 2048M.

Thx
Julien





* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Zachary Michaels @ 2017-03-23 15:31 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: amd-gfx, dri-devel, Julien Isorce



>
> Was userspace maybe performing concurrent CPU access to the BOs in
> question?


As far as I know Julien has demonstrated that this is not the case.


> I hope we can find a better solution.


Understood -- I thought you might not want to take this patch, but I went
ahead and sent it out because Christian requested it, and it seems like he
doesn't think VRAM BOs should ever evict back to VRAM at all?

Is my understanding of the original commit correct in that it tries to
rewrite the eviction placements of CPU accessible bos so that they are
either size zero (fpfn and lpfn = start of inaccessible VRAM) or they are
in inaccessible VRAM (fpfn = start of inaccessible VRAM and lpfn = 0)?

In this case, to me it seems that the simplest fix would be to iterate
using i to rewrite all the VRAM placements instead of just the first one
(rbo->placements[i] instead of rbo->placements[0]). In the case where
RADEON_GEM_NO_CPU_ACCESS
is set, the second placement will be in CPU accessible VRAM, and that
doesn't seem correct to me as there is no longer any sort of ordering for
evictions. (Unfortunately I'm not currently in a position to test whether
this fixes our issue.) Sorry, I meant to make a note of this originally.

Also, I don't claim to understand this code well enough, but I wonder: if
these sorts of evictions are desirable, would it make more sense to treat
CPU inaccessible/accessible VRAM as distinct entities with their own LRUs?

I should also note that we are experiencing another issue where the kernel
locks up in similar circumstances. As Julien noted, we get no output, and
the watchdogs don't seem to work. It may be the case that Xorg and our
process are calling ttm_bo_mem_force_space concurrently, but I don't think
we have enough information yet to say for sure. Reverting this commit does
not fix that issue. I have some small amount of evidence indicating that
BOs flagged for CPU access are getting placed in CPU inaccessible memory.
Could that cause this sort of kernel lockup?

Thanks for your help.


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Christian König @ 2017-03-23 16:05 UTC (permalink / raw)
  To: Zachary Michaels, Michel Dänzer
  Cc: dri-devel@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org, Julien Isorce



> Understood -- I thought you might not want to take this patch, but I 
> went ahead and sent it out because Christian requested it, and it 
> seems like he doesn't think VRAM bos should ever evict back to VRAM at 
> all?
No, I've requested reverting the patch for now because it causes an 
obvious and rather severe problem. If you guys can quickly find out how to 
fix it, feel free to use that instead.

> Is my understanding of the original commit correct in that it tries to 
> rewrite the eviction placements of CPU accessible bos so that they are 
> either size zero (fpfn and lpfn = start of inaccessible VRAM) or they 
> are in inaccessible VRAM (fpfn = start of inaccessible VRAM and lpfn = 0)?
That for example could work as well, but see below.

> if these sorts of evictions are desirable, would it make more sense to 
> treat CPU inaccessible/accessible VRAM as distinct entities with their 
> own lrus?
Actually I'm pretty sure that it isn't desirable. The evict function 
doesn't know whether we are trying to evict BOs because we need CPU 
accessible VRAM or because we have simply run out of VRAM.

This code only makes sense when we need to move different BOs into the 
CPU accessible part round-robin because they are accessed by the CPU, 
but then it is actually better to move them to GTT sooner or later.

Regards,
Christian.




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Zachary Michaels @ 2017-03-23 16:46 UTC (permalink / raw)
  To: Christian König
  Cc: Michel Dänzer, dri-devel@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org, Julien Isorce



>
> No, I've requested reverting the patch for now because it causes an
> obviously and rather severe problem. If you guys can quickly find how to
> fix it feel free to use that instead.

My mistake! That makes sense. Thanks again.




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Michel Dänzer @ 2017-03-24  9:24 UTC (permalink / raw)
  To: Julien Isorce
  Cc: Christian König, Zachary Michaels,
	dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org

On 23/03/17 06:26 PM, Julien Isorce wrote:
> Hi Michel,
> 
> When it happens, the main thread of our gl based app is stuck on a
> ioctl(RADEON_CS). I set RADEON_THREAD=false to ease the debugging but
> same thing happens if true. Other threads are only si_shader:0,1,2,3 and
> are doing nothing, just waiting for jobs. I can also do sudo gdb -p
> $(pidof Xorg) to block the X11 server, to make sure there is no ping
> pong between 2 processes. All other processes are not loading
> dri/radeonsi_dri.so . And adding a few traces shows that the above ioctl
> call is looping for ever on
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819 and
> comes from
> mesa https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454
> . 
> 
> After adding even more traces I can see that the bo, which is being
> indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
> And it gets 3 potential placements after calling "radeon_evict_flags". 
>  1: VRAM cpu inaccessible, fpfn is 65536
>  2: VRAM cpu accessible, fpfn is 0
>  3: GTT, fpfn is 0
> 
> And it looks like it continuously succeeds to move on the second
> placement. So I might be wrong but it looks it is not even a ping pong
> between VRAM accessible / not accessible, it just keeps being blited in
> the CPU accessible part of the VRAM.

Thanks for the detailed description! AFAICT this can only happen due to
a silly mistake I made in this code. Does this fix it?

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 5c7cf644ba1d..37d68cd1f272 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -213,8 +213,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
                        rbo->placement.num_busy_placement = 0;
                        for (i = 0; i < rbo->placement.num_placement; i++) {
                                if (rbo->placements[i].flags & TTM_PL_FLAG_VRAM) {
-                                       if (rbo->placements[0].fpfn < fpfn)
-                                               rbo->placements[0].fpfn = fpfn;
+                                       if (rbo->placements[i].fpfn < fpfn)
+                                               rbo->placements[i].fpfn = fpfn;
                                } else {
                                        rbo->placement.busy_placement =
                                                &rbo->placements[i];



-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Julien Isorce @ 2017-03-24  9:50 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: amd-gfx, Zachary Michaels, Julien Isorce, dri-devel



Hi Michel,

(Just for other readers: my reply was delayed on the mailing lists and
should have been in second position.)

We had actually spotted this 0 -> i, but somehow I convinced myself it was
intentional. The reason I found was that you wanted to set the fpfn only if
there are 2 placements, which means it will try to move from accessible to
inaccessible.

I will have a go with that change and let you know. I do not remember if I
tried it for this soft lockup. But for sure it does not solve the hard
lockup that Zach also mentioned at the end of his reply. I am saying that
because this other issue has some similarities (same ioctl call).

But in general, isn't "radeon_lockup_timeout" supposed to detect this
situation ?

Thx
Julien





* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Michel Dänzer @ 2017-03-24  9:59 UTC (permalink / raw)
  To: Julien Isorce
  Cc: dri-devel@lists.freedesktop.org, Zachary Michaels,
	Julien Isorce, amd-gfx@lists.freedesktop.org

On 24/03/17 06:50 PM, Julien Isorce wrote:
> Hi Michel,
> 
> (Just for other readers my reply has been delayed on the mailing lists
> and should have been on second position)

It is on https://patchwork.freedesktop.org/patch/145731/ , did you mean
something else?

The delay was because you weren't subscribed to the amd-gfx mailing list
yet, so your post went through the moderation queue.


> I will have a go with that change and let you know. I do not remember if
> I tried it for this soft lockup. But for sure it does not solve the hard
> lockup that Zach also mentioned at the end of his reply.

I'll follow up to his post about that.


> But in general, isn't "radeon_lockup_timeout" supposed to detect this
> situation ?

No, it's for detecting GPU hangs, whereas this is a CPU "hang" (infinite
loop).


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Michel Dänzer @ 2017-03-24 10:03 UTC (permalink / raw)
  To: Zachary Michaels
  Cc: Christian König, dri-devel@lists.freedesktop.org,
	amd-gfx@lists.freedesktop.org, Julien Isorce

On 24/03/17 12:31 AM, Zachary Michaels wrote:
> 
> I should also note that we are experiencing another issue where the
> kernel locks up in similar circumstances. As Julien noted, we get no
> output, and the watchdogs don't seem to work. It may be the case that
> Xorg and our process are calling ttm_bo_mem_force_space concurrently,
> but I don't think we have enough information yet to say for
> sure. Reverting this commit does not fix that issue. I have some small
> amount of evidence indicating that bos flagged for CPU access are
> getting placed in CPU inaccessible memory. Could that cause this sort of
> kernel lockup?

Possibly, does this help?

diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 37d68cd1f272..40d1bb467a71 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -198,7 +198,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
        case TTM_PL_VRAM:
                if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
                        radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
-               else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
+               else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
+                        rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
                         bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
                        unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
                        int i;



-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Julien Isorce @ 2017-03-24 18:59 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: dri-devel, Zachary Michaels, Julien Isorce, amd-gfx



Hi Michel,

I double-checked and you are right: the change 0 -> i works.

Cheers
Julien




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
From: Julien Isorce @ 2017-03-24 19:01 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Zachary Michaels, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Julien Isorce



Hi Michel,

No, this change does not help on the other issue (the hard lockup).
I have not tried it in combination with the 0 -> i change.

Thx anyway.
Julien


On 24 March 2017 at 10:03, Michel Dänzer <michel-otUistvHUpPR7s880joybQ@public.gmane.org> wrote:

> On 24/03/17 12:31 AM, Zachary Michaels wrote:
> >
> > I should also note that we are experiencing another issue where the
> > kernel locks up in similar circumstances. As Julien noted, we get no
> > output, and the watchdogs don't seem to work. It may be the case that
> > Xorg and our process are calling ttm_bo_mem_force_space concurrently,
> > but I don't think we have enough information yet to say for
> > sure. Reverting this commit does not fix that issue. I have some small
> > amount of evidence indicating that bos flagged for CPU access are
> > getting placed in CPU inaccessible memory. Could that cause this sort of
> > kernel lockup?
>
> Possibly, does this help?
>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 37d68cd1f272..40d1bb467a71 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -198,7 +198,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
>  	case TTM_PL_VRAM:
>  		if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
>  			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
> -		else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
> +		else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
> +			 rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
>  			 bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
>  			unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
>  			int i;
>
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
> _______________________________________________
> dri-devel mailing list
> dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-24 18:59                   ` Julien Isorce
@ 2017-03-27  0:59                     ` Michel Dänzer
  0 siblings, 0 replies; 21+ messages in thread
From: Michel Dänzer @ 2017-03-27  0:59 UTC (permalink / raw)
  To: Julien Isorce; +Cc: amd-gfx, Zachary Michaels, Julien Isorce, dri-devel

On 25/03/17 03:59 AM, Julien Isorce wrote:
> Hi Michel,
> 
> I double checked and you are right, the change 0 -> i works. 

Thanks for testing, fix submitted for review.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-24 19:01           ` Julien Isorce
@ 2017-03-28  8:24             ` Julien Isorce
  2017-03-28  9:36               ` Michel Dänzer
  0 siblings, 1 reply; 21+ messages in thread
From: Julien Isorce @ 2017-03-28  8:24 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: Zachary Michaels, amd-gfx, dri-devel, Julien Isorce



Hi Michel,

About the hard lockup: I noticed that I cannot reproduce it when all of
the following conditions hold:

1. the soft lockup fix (the 0 -> i change, which avoids the infinite loop)
2. your suggestion: !(rbo->flags & RADEON_GEM_CPU_ACCESS)
3. radeon.gartsize=512 radeon.vramlimit=1024 (no higher values help, for
   example (1024, 1024) or (1024, 2048))

Without 1 and 2, but with 3, our test reproduces the soft lockup (which we
just discovered a few days ago).
Without 3 (and with or without 1 and 2), our test reproduces the hard
lockup, which does not give any info in kern.log (sometimes some NUL ^@
characters, but not always).

We are converting this repro test to a piglit test in order to share it but
it will take some times. But to simplify it continuously uploads images
with a size picked randomly and up to 4K. So TTM's eviction mechanism is
hit a lot.

(The card is a FirePro W600, Cape Verde, 2048M.)

I am happy to try any other suggestion.

Thx
Julien

On 24 March 2017 at 19:01, Julien Isorce <julien.isorce@gmail.com> wrote:

> Hi Michel,
>
> No this change does not help on the other issue (hard lockup).
> I have no tried it in combination with the 0 -> i change.
>
> Thx anyway.
> Julien
>
>
> On 24 March 2017 at 10:03, Michel Dänzer <michel@daenzer.net> wrote:
>
>> On 24/03/17 12:31 AM, Zachary Michaels wrote:
>> >
>> > I should also note that we are experiencing another issue where the
>> > kernel locks up in similar circumstances. As Julien noted, we get no
>> > output, and the watchdogs don't seem to work. It may be the case that
>> > Xorg and our process are calling ttm_bo_mem_force_space concurrently,
>> > but I don't think we have enough information yet to say for
>> > sure. Reverting this commit does not fix that issue. I have some small
>> > amount of evidence indicating that bos flagged for CPU access are
>> > getting placed in CPU inaccessible memory. Could that cause this sort of
>> > kernel lockup?
>>
>> Possibly, does this help?
>>
>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>> index 37d68cd1f272..40d1bb467a71 100644
>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>> @@ -198,7 +198,8 @@ static void radeon_evict_flags(struct ttm_buffer_object *bo,
>>  	case TTM_PL_VRAM:
>>  		if (rbo->rdev->ring[radeon_copy_ring_index(rbo->rdev)].ready == false)
>>  			radeon_ttm_placement_from_domain(rbo, RADEON_GEM_DOMAIN_CPU);
>> -		else if (rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
>> +		else if (!(rbo->flags & RADEON_GEM_CPU_ACCESS) &&
>> +			 rbo->rdev->mc.visible_vram_size < rbo->rdev->mc.real_vram_size &&
>>  			 bo->mem.start < (rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT)) {
>>  			unsigned fpfn = rbo->rdev->mc.visible_vram_size >> PAGE_SHIFT;
>>  			int i;
>>
>>
>>
>> --
>> Earthling Michel Dänzer               |               http://www.amd.com
>> Libre software enthusiast             |             Mesa and X developer
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>
>




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-28  8:24             ` Julien Isorce
@ 2017-03-28  9:36               ` Michel Dänzer
  2017-03-28 11:00                 ` Julien Isorce
  0 siblings, 1 reply; 21+ messages in thread
From: Michel Dänzer @ 2017-03-28  9:36 UTC (permalink / raw)
  To: Julien Isorce; +Cc: Zachary Michaels, dri-devel, amd-gfx, Julien Isorce

On 28/03/17 05:24 PM, Julien Isorce wrote:
> Hi Michel,
> 
> About the hard lockup, I noticed that I cannot have it with the
> following conditions:
> 
> 1. soft lockup fix (the 0->i change which avoids infinite loop)
> 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> not help, for example (1024, 1024) or (1024, 2048))
> 
> Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> discovered few days ago).
> Without 3 (and with or without 1., 2.), our test reproduces the hard
> lockup which one does not give any info in kern.log (sometimes some NUL
> ^@ characters but not always).

What exactly does "hard lockup" mean? What are the symptoms?


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-28  9:36               ` Michel Dänzer
@ 2017-03-28 11:00                 ` Julien Isorce
  2017-03-29  8:59                   ` Michel Dänzer
  0 siblings, 1 reply; 21+ messages in thread
From: Julien Isorce @ 2017-03-28 11:00 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: Zachary Michaels, dri-devel, amd-gfx, Julien Isorce



On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net> wrote:

> On 28/03/17 05:24 PM, Julien Isorce wrote:
> > Hi Michel,
> >
> > About the hard lockup, I noticed that I cannot have it with the
> > following conditions:
> >
> > 1. soft lockup fix (the 0->i change which avoids infinite loop)
> > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
> > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
> > not help, for example (1024, 1024) or (1024, 2048))
> >
> > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
> > discovered few days ago).
> > Without 3 (and with or without 1., 2.), our test reproduces the hard
> > lockup which one does not give any info in kern.log (sometimes some NUL
> > ^@ characters but not always).
>
> What exactly does "hard lockup" mean? What are the symptoms?
>
>
Screens are frozen, cannot ssh in, no mouse/keyboard, no kernel panic.
It requires a hard reboot.
After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing in
kern.log except sometimes some NUL characters (^@).
Adding traces, it looks like the test app was still in an ioctl(RADEON_CS),
but it is difficult to rely on that since it is called a lot.
Using a serial console did not show any additional debug messages. kgdb was
not useful either, but is probably worth another attempt.




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-28 11:00                 ` Julien Isorce
@ 2017-03-29  8:59                   ` Michel Dänzer
       [not found]                     ` <ed374781-4b6d-93ba-2810-39a5943209b7-otUistvHUpPR7s880joybQ@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Michel Dänzer @ 2017-03-29  8:59 UTC (permalink / raw)
  To: Julien Isorce; +Cc: Zachary Michaels, amd-gfx, dri-devel, Julien Isorce

On 28/03/17 08:00 PM, Julien Isorce wrote:
> 
> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
> <mailto:michel@daenzer.net>> wrote:
> 
>     On 28/03/17 05:24 PM, Julien Isorce wrote:
>     > Hi Michel,
>     >
>     > About the hard lockup, I noticed that I cannot have it with the
>     > following conditions:
>     >
>     > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>     > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>     > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>     > not help, for example (1024, 1024) or (1024, 2048))
>     >
>     > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>     > discovered few days ago).
>     > Without 3 (and with or without 1., 2.), our test reproduces the hard
>     > lockup which one does not give any info in kern.log (sometimes some NUL
>     > ^@ characters but not always).
> 
>     What exactly does "hard lockup" mean? What are the symptoms?
> 
> 
> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
> Requires hard reboot.
> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
> in kern.log except sometimes some nul characters ^@.

Does it still respond to ping when it's hung?


> Using a serial console did not show additional debug messages. kgdb was
> not useful but probably worth another attempt.

Right.

Anyway, I'm afraid it sounds like it's probably not directly related to
the issue I was thinking of for my previous test patch or other similar
ones I was thinking of writing.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer



* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
       [not found]                     ` <ed374781-4b6d-93ba-2810-39a5943209b7-otUistvHUpPR7s880joybQ@public.gmane.org>
@ 2017-03-29  9:07                       ` Christian König
  2017-03-29  9:36                         ` Michel Dänzer
  0 siblings, 1 reply; 21+ messages in thread
From: Christian König @ 2017-03-29  9:07 UTC (permalink / raw)
  To: Michel Dänzer, Julien Isorce
  Cc: Zachary Michaels, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Julien Isorce

Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
> On 28/03/17 08:00 PM, Julien Isorce wrote:
>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>> <mailto:michel@daenzer.net>> wrote:
>>
>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>      > Hi Michel,
>>      >
>>      > About the hard lockup, I noticed that I cannot have it with the
>>      > following conditions:
>>      >
>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values above do
>>      > not help, for example (1024, 1024) or (1024, 2048))
>>      >
>>      > Without 1 and 2, but with 3, our test reproduces the soft lockup (just
>>      > discovered few days ago).
>>      > Without 3 (and with or without 1., 2.), our test reproduces the hard
>>      > lockup which one does not give any info in kern.log (sometimes some NUL
>>      > ^@ characters but not always).
>>
>>      What exactly does "hard lockup" mean? What are the symptoms?
>>
>>
>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>> Requires hard reboot.
>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>> in kern.log except sometimes some nul characters ^@.
> Does it still respond to ping when it's hung?
>
>
>> Using a serial console did not show additional debug messages. kgdb was
>> not useful but probably worth another attempt.
> Right.
>
> Anyway, I'm afraid it sounds like it's probably not directly related to
> the issue I was thinking of for my previous test patch or other similar
> ones I was thinking of writing.

Yeah, agree.

In addition to that, a complete crash where you don't even get anything 
over the serial console is rather unlikely to be caused by something an 
application can do directly.

More likely causes are power management or something completely messing 
up a bus system. Have you tried disabling dpm as well?

Christian.




* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-29  9:07                       ` Christian König
@ 2017-03-29  9:36                         ` Michel Dänzer
  2017-03-29 17:26                           ` Nicolai Hähnle
  0 siblings, 1 reply; 21+ messages in thread
From: Michel Dänzer @ 2017-03-29  9:36 UTC (permalink / raw)
  To: Christian König, Julien Isorce
  Cc: Zachary Michaels, amd-gfx, dri-devel, Julien Isorce

On 29/03/17 06:07 PM, Christian König wrote:
> Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>>> <mailto:michel@daenzer.net>> wrote:
>>>
>>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>      > Hi Michel,
>>>      >
>>>      > About the hard lockup, I noticed that I cannot have it with the
>>>      > following conditions:
>>>      >
>>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values
>>> above do
>>>      > not help, for example (1024, 1024) or (1024, 2048))
>>>      >
>>>      > Without 1 and 2, but with 3, our test reproduces the soft
>>> lockup (just
>>>      > discovered few days ago).
>>>      > Without 3 (and with or without 1., 2.), our test reproduces
>>> the hard
>>>      > lockup which one does not give any info in kern.log (sometimes
>>> some NUL
>>>      > ^@ characters but not always).
>>>
>>>      What exactly does "hard lockup" mean? What are the symptoms?
>>>
>>>
>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>> Requires hard reboot.
>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>> in kern.log except sometimes some nul characters ^@.
>> Does it still respond to ping when it's hung?
>>
>>
>>> Using a serial console did not show additional debug messages. kgdb was
>>> not useful but probably worth another attempt.
>> Right.
>>
>> Anyway, I'm afraid it sounds like it's probably not directly related to
>> the issue I was thinking of for my previous test patch or other similar
>> ones I was thinking of writing.
> 
> Yeah, agree.
> 
> Additional to that a complete crash where you don't even get anything
> over serial console is rather unlikely to be cause by something an
> application can do directly.
> 
> Possible causes are more likely power management or completely messing
> up a bus system. Have you tried disabling dpm as well?

Might also be worth trying the amdgpu kernel driver instead of radeon,
not sure how well the former currently works with Cape Verde though.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
  2017-03-29  9:36                         ` Michel Dänzer
@ 2017-03-29 17:26                           ` Nicolai Hähnle
       [not found]                             ` <de189c17-9904-f240-7569-eb1564ff2810-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 21+ messages in thread
From: Nicolai Hähnle @ 2017-03-29 17:26 UTC (permalink / raw)
  To: Michel Dänzer, Christian König, Julien Isorce
  Cc: Zachary Michaels, dri-devel, amd-gfx, Julien Isorce

On 29.03.2017 11:36, Michel Dänzer wrote:
> On 29/03/17 06:07 PM, Christian König wrote:
>> Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
>>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>>> On 28 March 2017 at 10:36, Michel Dänzer <michel@daenzer.net
>>>> <mailto:michel@daenzer.net>> wrote:
>>>>
>>>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>>      > Hi Michel,
>>>>      >
>>>>      > About the hard lockup, I noticed that I cannot have it with the
>>>>      > following conditions:
>>>>      >
>>>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values
>>>> above do
>>>>      > not help, for example (1024, 1024) or (1024, 2048))
>>>>      >
>>>>      > Without 1 and 2, but with 3, our test reproduces the soft
>>>> lockup (just
>>>>      > discovered few days ago).
>>>>      > Without 3 (and with or without 1., 2.), our test reproduces
>>>> the hard
>>>>      > lockup which one does not give any info in kern.log (sometimes
>>>> some NUL
>>>>      > ^@ characters but not always).
>>>>
>>>>      What exactly does "hard lockup" mean? What are the symptoms?
>>>>
>>>>
>>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>>> Requires hard reboot.
>>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>>> in kern.log except sometimes some nul characters ^@.
>>> Does it still respond to ping when it's hung?
>>>
>>>
>>>> Using a serial console did not show additional debug messages. kgdb was
>>>> not useful but probably worth another attempt.
>>> Right.
>>>
>>> Anyway, I'm afraid it sounds like it's probably not directly related to
>>> the issue I was thinking of for my previous test patch or other similar
>>> ones I was thinking of writing.
>>
>> Yeah, agree.
>>
>> Additional to that a complete crash where you don't even get anything
>> over serial console is rather unlikely to be cause by something an
>> application can do directly.
>>
>> Possible causes are more likely power management or completely messing
>> up a bus system. Have you tried disabling dpm as well?
>
> Might also be worth trying the amdgpu kernel driver instead of radeon,
> not sure how well the former currently works with Cape Verde though.

I've recently used it to experiment with the sparse buffer support. It 
worked well enough for that :)

Cheers,
Nicolai
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.


* Re: [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
       [not found]                             ` <de189c17-9904-f240-7569-eb1564ff2810-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-03-30 11:03                               ` Julien Isorce
  0 siblings, 0 replies; 21+ messages in thread
From: Julien Isorce @ 2017-03-30 11:03 UTC (permalink / raw)
  To: Nicolai Hähnle
  Cc: Zachary Michaels, Michel Dänzer,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Julien Isorce



Thx for the suggestions.

No, it does not respond to ping.

radeon.dpm=0 does not help. But it only tells the driver to use the old
power management, right?
So I tried low, mid and high for /sys/class/drm/card0/device/power_profile
(and setting profile for power_method).
With radeon.dpm=1 I tried all values for power_dpm_state and
power_dpm_force_performance_level. Same results.

I also tried the amd-staging-4.9 branch today, with the same result.

Note that this also happens on the W600 (Verde), W9000 (Tahiti) and W9100
(Hawaii), all with the radeonsi driver.

I have opened a bug here: https://bugs.freedesktop.org/show_bug.cgi?id=100465
It contains a test to reproduce the freeze.

Cheers
Julien

On 29 March 2017 at 18:26, Nicolai Hähnle <nhaehnle-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On 29.03.2017 11:36, Michel Dänzer wrote:
>
>> On 29/03/17 06:07 PM, Christian König wrote:
>>
>>> Am 29.03.2017 um 10:59 schrieb Michel Dänzer:
>>>
>>>> On 28/03/17 08:00 PM, Julien Isorce wrote:
>>>>
>>>>> On 28 March 2017 at 10:36, Michel Dänzer <michel-otUistvHUpPR7s880joybQ@public.gmane.org
>>>>> <mailto:michel-otUistvHUpPR7s880joybQ@public.gmane.org>> wrote:
>>>>>
>>>>>      On 28/03/17 05:24 PM, Julien Isorce wrote:
>>>>>      > Hi Michel,
>>>>>      >
>>>>>      > About the hard lockup, I noticed that I cannot have it with the
>>>>>      > following conditions:
>>>>>      >
>>>>>      > 1. soft lockup fix (the 0->i change which avoids infinite loop)
>>>>>      > 2. Your suggestion: (!(rbo->flags & RADEON_GEM_CPU_ACCESS)
>>>>>      > 3. radeon.gartsize=512 radeon.vramlimit=1024 (any other values
>>>>> above do
>>>>>      > not help, for example (1024, 1024) or (1024, 2048))
>>>>>      >
>>>>>      > Without 1 and 2, but with 3, our test reproduces the soft
>>>>> lockup (just
>>>>>      > discovered few days ago).
>>>>>      > Without 3 (and with or without 1., 2.), our test reproduces
>>>>> the hard
>>>>>      > lockup which one does not give any info in kern.log (sometimes
>>>>> some NUL
>>>>>      > ^@ characters but not always).
>>>>>
>>>>>      What exactly does "hard lockup" mean? What are the symptoms?
>>>>>
>>>>>
>>>>> Screens are frozen, cannot ssh, no mouse/keyboard, no kernel panic.
>>>>> Requires hard reboot.
>>>>> After reboot, nothing in /var/crash, nothing in /sys/fs/pstore, nothing
>>>>> in kern.log except sometimes some nul characters ^@.
>>>>>
>>>> Does it still respond to ping when it's hung?
>>>>
>>>>
>>>> Using a serial console did not show additional debug messages. kgdb was
>>>>> not useful but probably worth another attempt.
>>>>>
>>>> Right.
>>>>
>>>> Anyway, I'm afraid it sounds like it's probably not directly related to
>>>> the issue I was thinking of for my previous test patch or other similar
>>>> ones I was thinking of writing.
>>>>
>>>
>>> Yeah, agree.
>>>
>>> Additional to that a complete crash where you don't even get anything
>>> over serial console is rather unlikely to be cause by something an
>>> application can do directly.
>>>
>>> Possible causes are more likely power management or completely messing
>>> up a bus system. Have you tried disabling dpm as well?
>>>
>>
>> Might also be worth trying the amdgpu kernel driver instead of radeon,
>> not sure how well the former currently works with Cape Verde though.
>>
>
> I've recently used it to experiment with the sparse buffer support. It
> worked well enough for that :)
>
> Cheers,
> Nicolai
> --
> Lerne, wie die Welt wirklich ist,
> Aber vergiss niemals, wie sie sein sollte.
>




end of thread, other threads:[~2017-03-30 11:03 UTC | newest]

Thread overview: 21+ messages
2017-03-22 18:19 [PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first" Zachary Michaels
     [not found] ` <1490206797-15653-1-git-send-email-zmichaels-bXq66PvbRDbQT0dZR+AlfA@public.gmane.org>
2017-03-23  8:10   ` Michel Dänzer
     [not found]     ` <c56f4a6a-a17d-b9b0-311e-7100df8c7cee-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-03-23  9:26       ` Julien Isorce
     [not found]         ` <CAGCDoEGuMVnv6rA-y5XDbc6hiNWiTg6nt4e4u2dZObeY8roKXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-24  9:24           ` Michel Dänzer
2017-03-24  9:50             ` Julien Isorce
     [not found]               ` <CAHWPjbXbAo8CVSJDK1GzN=M4pMobMw9XVQVfVeGVM4H=n36vuQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-24  9:59                 ` Michel Dänzer
2017-03-24 18:59                   ` Julien Isorce
2017-03-27  0:59                     ` Michel Dänzer
2017-03-23 15:31     ` Zachary Michaels
2017-03-23 16:05       ` Christian König
     [not found]         ` <d476dcba-048d-ad4b-b080-49e31e6fb25b-ANTagKRnAhcb1SvskN2V4Q@public.gmane.org>
2017-03-23 16:46           ` Zachary Michaels
2017-03-24 10:03       ` Michel Dänzer
     [not found]         ` <5f26f8a1-e5bd-ca7a-c7ee-08c8f04f2110-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-03-24 19:01           ` Julien Isorce
2017-03-28  8:24             ` Julien Isorce
2017-03-28  9:36               ` Michel Dänzer
2017-03-28 11:00                 ` Julien Isorce
2017-03-29  8:59                   ` Michel Dänzer
     [not found]                     ` <ed374781-4b6d-93ba-2810-39a5943209b7-otUistvHUpPR7s880joybQ@public.gmane.org>
2017-03-29  9:07                       ` Christian König
2017-03-29  9:36                         ` Michel Dänzer
2017-03-29 17:26                           ` Nicolai Hähnle
     [not found]                             ` <de189c17-9904-f240-7569-eb1564ff2810-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-03-30 11:03                               ` Julien Isorce
