* Radeon DRM dom0 issues
@ 2014-01-20 14:58 Michael D Labriola
  2014-01-20 15:14 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael D Labriola @ 2014-01-20 14:58 UTC (permalink / raw)
  To: xen-devel; +Cc: michael.d.labriola

Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent crashes 
with multiple older R600 series (HD 6470 and HD 6570) and unusably slow 
graphics with a newer HD7000 (can see each line refresh individually on 
radeonfb tty).  All 3 systems seem to work fine bare metal.

The R600 crashes happen seemingly randomly when using OpenGL Compositor in 
Enlightenment 0.17.  My dom0 need not even have any domUs running. 
Sometimes it happens within a few minutes, sometimes it will run OK for an 
afternoon or so.  Eventually TTM issues an "unable to get page 0" error 
message, the radeon driver follows that up with a 
"radeon_gem_object_create failed to allocate gem" error message.  Then the 
radeon driver starts spamming that gem failure message until I'm forced to 
reboot.

Behavior is identical with kernel versions 3.8, 3.10, and 3.13-rc8.

I'm using Xen 4.2.1 32bit.

xen command line is:  vga=mode=0x314
kernel command line is:  root=/dev/md0 quiet pci=realloc console=ttyS0,115200n8 console=tty0

Fingers crossed that there's some magic boot parameter I'm missing.  It's 
been a while since I dabbled with this stuff.  ;-)

Thanks!

---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-20 14:58 Radeon DRM dom0 issues Michael D Labriola
@ 2014-01-20 15:14 ` Konrad Rzeszutek Wilk
  2014-01-20 15:26   ` Michael D Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-01-20 15:14 UTC (permalink / raw)
  To: Michael D Labriola; +Cc: michael.d.labriola, xen-devel

On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent crashes 
> with multiple older R600 series (HD 6470 and HD 6570) and unusably slow 
> graphics with a newer HD7000 (can see each line refresh individually on 
> radeonfb tty).  All 3 systems seem to work fine bare metal.

I hadn't been using DRM, just Xserver. Is that what you mean?
> 
> The R600 crashes happen seemingly randomly when using OpenGL Compositor in 
> Enlightenment 0.17.  My dom0 need not even have any domUs running. 
> Sometimes it happens within a few minutes, sometimes it will run OK for an 
> afternoon or so.  Eventually TTM issues an "unable to get page 0" error 
> message, the radeon driver follows that up with a 
> "radeon_gem_object_create failed to allocate gem" error message.  Then the 
> radeon driver starts spamming that gem failure message until I'm forced to 
> reboot.
> 
> Behavior is identical with kernel versions 3.8, 3.10, and 3.13-rc8.
> 
> I'm using Xen 4.2.1 32bit.
> 
> xen command line is:  vga=mode=0x314
> kernel command line is:  root=/dev/md0 quiet pci=realloc 
> console=ttyS0,115200n8 console=tty0
> 
> Fingers crossed that there's some magic boot parameter I'm missing.  It's 
> been a while since I dabbled with this stuff.  ;-)

That should have been working. I have been using Xserver for years now.
Just to make sure - you aren't referring to running with X, right? Just
simple framebuffer?
> 
> Thanks!
> 
> ---
> Michael D Labriola
> Electric Boat
> mlabriol@gdeb.com
> 401-848-8871 (desk)
> 401-848-8513 (lab)
> 401-316-9844 (cell)
> 
> 
>  
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

* Re: Radeon DRM dom0 issues
  2014-01-20 15:14 ` Konrad Rzeszutek Wilk
@ 2014-01-20 15:26   ` Michael D Labriola
  2014-01-20 15:38     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael D Labriola @ 2014-01-20 15:26 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: michael.d.labriola, xen-devel

Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 10:14:36 AM:

> From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> To: Michael D Labriola <mlabriol@gdeb.com>, 
> Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> Date: 01/20/2014 10:14 AM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> 
> On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent 
crashes 
> > with multiple older R600 series (HD 6470 and HD 6570) and unusably 
slow 
> > graphics with a newer HD7000 (can see each line refresh individually on 

> > radeonfb tty).  All 3 systems seem to work fine bare metal.
> 
> I hadn't been using DRM, just Xserver. Is that what you mean?

The R600 problems happen when in X, using OpenGL, on my dom0.  The 
RadeonSI sluggishness is when using the KMS framebuffer device for a plain 
text console login.


> > 
> > The R600 crashes happen seemingly randomly when using OpenGL 
Compositor in 
> > Enlightenment 0.17.  My dom0 need not even have any domUs running. 
> > Sometimes it happens within a few minutes, sometimes it will run OK 
for an 
> > afternoon or so.  Eventually TTM issues an "unable to get page 0" 
error 
> > message, the radeon driver follows that up with a 
> > "radeon_gem_object_create failed to allocate gem" error message.  Then 
the 
> > radeon driver starts spamming that gem failure message until I'm 
forced to 
> > reboot.
> > 
> > Behavior is identical with kernel versions 3.8, 3.10, and 3.13-rc8.
> > 
> > I'm using Xen 4.2.1 32bit.
> > 
> > xen command line is:  vga=mode=0x314
> > kernel command line is:  root=/dev/md0 quiet pci=realloc 
> > console=ttyS0,115200n8 console=tty0
> > 
> > Fingers crossed that there's some magic boot parameter I'm missing. 
It's 
> > been a while since I dabbled with this stuff.  ;-)
> 
> That should have been working. I had been using Xserver for years now.
> Just to make sure - you aren't referring to running with X right? Just
> simple framebuffer?

I'm not sure I understand the delineation between Xserver and X...  I am 
indeed in X, using xf86-video-ati and Mesa for OpenGL support.  I've been 
doing this for years as well, but with nouveau and w/out 3d support.  Was 
kinda hoping that the Radeon cards I got a hold of would allow for 
hardware accelerated 3d on my dom0.

I just tried adding 'nopat' to the kernel command line.  I remember doing 
that a year ago... don't recall why.  Any chance that helps?

---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-20 15:26   ` Michael D Labriola
@ 2014-01-20 15:38     ` Konrad Rzeszutek Wilk
  2014-01-20 20:15       ` Michael D Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-01-20 15:38 UTC (permalink / raw)
  To: Michael D Labriola; +Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel

On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 10:14:36 AM:
> 
> > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > Date: 01/20/2014 10:14 AM
> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > 
> > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent 
> crashes 
> > > with multiple older R600 series (HD 6470 and HD 6570) and unusably 
> slow 
> > > graphics with a newer HD7000 (can see each line refresh individually on 
> 
> > > radeonfb tty).  All 3 systems seem to work fine bare metal.
> > 
> > I hadn't been using DRM, just Xserver. Is that what you mean?
> 
> The R600 problems happen when in X, using OpenGL, on my dom0.  The 
> RadeonSI sluggishness is when using the KMS framebuffer device for a plain 
> text console login.

So the sluggishness is probably due to PAT not being enabled. This patch
should be applied:

lkml.org/lkml/2011/11/8/406

(or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)

and these two reverted:

 "xen/pat: Disable PAT support for now."
 "xen/pat: Disable PAT using pat_enabled value."

Which is to say do:

git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
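
For anyone who has to repeat this on several boxes, the two reverts above can be scripted. What follows is only a minimal sketch in Python, not part of the original advice: the helper names (`revert_commands`, `run_reverts`) are made up for illustration, the `--no-edit` flag is my addition so that git does not stop to open an editor, and the script assumes it is pointed at a kernel tree that actually contains both commits. It does not apply the out-of-tree patch; that still has to be done separately.

```python
import subprocess

# The two commits named above, in the order they were listed.
PAT_DISABLE_COMMITS = [
    "c79c49826270b8b0061b2fca840fc3f013c8a78a",
    "8eaffa67b43e99ae581622c5133e20b0f48bcef1",
]


def revert_commands(shas):
    """Build one 'git revert' argv list per commit (hypothetical helper)."""
    return [["git", "revert", "--no-edit", sha] for sha in shas]


def run_reverts(repo_dir, shas=PAT_DISABLE_COMMITS, dry_run=True):
    """Print each revert command; run it in repo_dir unless dry_run is set."""
    for cmd in revert_commands(shas):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first revert that fails (e.g. a conflict).
            subprocess.run(cmd, cwd=repo_dir, check=True)
```

With the default `dry_run=True` the script only prints what it would do, which makes it easy to review before touching the tree.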

> 
> 
> > > 
> > > The R600 crashes happen seemingly randomly when using OpenGL 
> Compositor in 
> > > Enlightenment 0.17.  My dom0 need not even have any domUs running. 
> > > Sometimes it happens within a few minutes, sometimes it will run OK 
> for an 
> > > afternoon or so.  Eventually TTM issues an "unable to get page 0" 
> error 
> > > message, the radeon driver follows that up with a 
> > > "radeon_gem_object_create failed to allocate gem" error message.  Then 
> the 
> > > radeon driver starts spamming that gem failure message until I'm 
> forced to 
> > > reboot.
> > > 
> > > Behavior is identical with kernel versions 3.8, 3.10, and 3.13-rc8.
> > > 
> > > I'm using Xen 4.2.1 32bit.
> > > 
> > > xen command line is:  vga=mode=0x314
> > > kernel command line is:  root=/dev/md0 quiet pci=realloc 
> > > console=ttyS0,115200n8 console=tty0
> > > 
> > > Fingers crossed that there's some magic boot parameter I'm missing. 
> It's 
> > > been a while since I dabbled with this stuff.  ;-)
> > 
> > That should have been working. I had been using Xserver for years now.
> > Just to make sure - you aren't referring to running with X right? Just
> > simple framebuffer?
> 
> I'm not sure I understand the delineation between Xserver and X...  I am 
> indeed in X, using xf86-video-ati and Mesa for OpenGL support.  I've been 
> doing this for years as well, but with nouveau and w/out 3d support.  Was 
> kinda hoping that the Radeon cards I got a hold of would allow for 
> hardware accelerated 3d on my dom0.

You should be able to use 3D as well - with those said patches.

> 
> I just tried adding 'nopat' to the kernel command line.  I remember doing 
> that a year ago... don't recall why.  Any chance that helps?

Heh. You want the inverse: PAT enabled, together with a patch that stops
it from wreaking havoc.
> 
> ---
> Michael D Labriola
> Electric Boat
> mlabriol@gdeb.com
> 401-848-8871 (desk)
> 401-848-8513 (lab)
> 401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-20 15:38     ` Konrad Rzeszutek Wilk
@ 2014-01-20 20:15       ` Michael D Labriola
  2014-01-21 21:59         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael D Labriola @ 2014-01-20 20:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel

Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 
10:38:27 AM:

> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> To: Michael D Labriola <mlabriol@gdeb.com>, 
> Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> Date: 01/20/2014 10:38 AM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> 
> On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 10:14:36 
AM:
> > 
> > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > Date: 01/20/2014 10:14 AM
> > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > 
> > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent 
> > crashes 
> > > > with multiple older R600 series (HD 6470 and HD 6570) and unusably 

> > slow 
> > > > graphics with a newer HD7000 (can see each line refresh 
individually on 
> > 
> > > > radeonfb tty).  All 3 systems seem to work fine bare metal.
> > > 
> > > I hadn't been using DRM, just Xserver. Is that what you mean?
> > 
> > The R600 problems happen when in X, using OpenGL, on my dom0.  The 
> > RadeonSI sluggishness is when using the KMS framebuffer device for a 
plain 
> > text console login.
> 
> So sluggish is probably due to the PAT not being enabled. This patch
> should be applied:
> 
> lkml.org/lkml/2011/11/8/406
> 
> (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> 
> and these two reverted:
> 
>  "xen/pat: Disable PAT support for now."
>  "xen/pat: Disable PAT using pat_enabled value."
> 
> Which is to say do:
> 
> git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1

Thanks!  I cherry-picked that patch out of your testing tree, reverted 
those 2 commits, recompiled and installed.  Definitely fixed the HD 7000 
sluggishness and appears to have fixed the R600 crashes (although it's 
only been running a few hours).

How come that patch didn't get into mainline?  It looks pretty innocuous 
to me...

---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-20 20:15       ` Michael D Labriola
@ 2014-01-21 21:59         ` Konrad Rzeszutek Wilk
  2014-01-23 16:54           ` Michael D Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-01-21 21:59 UTC (permalink / raw)
  To: Michael D Labriola; +Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel

On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 
> 10:38:27 AM:
> 
> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > Date: 01/20/2014 10:38 AM
> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > 
> > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 10:14:36 
> AM:
> > > 
> > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > > Date: 01/20/2014 10:14 AM
> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > 
> > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola wrote:
> > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having consistent 
> > > crashes 
> > > > > with multiple older R600 series (HD 6470 and HD 6570) and unusably 
> 
> > > slow 
> > > > > graphics with a newer HD7000 (can see each line refresh 
> individually on 
> > > 
> > > > > radeonfb tty).  All 3 systems seem to work fine bare metal.
> > > > 
> > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> > > 
> > > The R600 problems happen when in X, using OpenGL, on my dom0.  The 
> > > RadeonSI sluggishness is when using the KMS framebuffer device for a 
> plain 
> > > text console login.
> > 
> > So sluggish is probably due to the PAT not being enabled. This patch
> > should be applied:
> > 
> > lkml.org/lkml/2011/11/8/406
> > 
> > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > 
> > and these two reverted:
> > 
> >  "xen/pat: Disable PAT support for now."
> >  "xen/pat: Disable PAT using pat_enabled value."
> > 
> > Which is to say do:
> > 
> > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> 
> Thanks!  I cherry-picked that patch out of your testing tree, reverted 
> those 2 commits, recompiled and installed.  Definitely fixed the HD 7000 
> sluggishness and appears to have fixed the R600 crashes (although it's 
> only been running a few hours).
> 
> How come that patch didn't get into mainline?  It looks pretty innocuous 
> to me...

<Sigh> the x86 maintainers wanted a different route, and I haven't had
the chance or the time to implement it.


> 
> ---
> Michael D Labriola
> Electric Boat
> mlabriol@gdeb.com
> 401-848-8871 (desk)
> 401-848-8513 (lab)
> 401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-21 21:59         ` Konrad Rzeszutek Wilk
@ 2014-01-23 16:54           ` Michael D Labriola
  2014-01-24 14:49             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael D Labriola @ 2014-01-23 16:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel-bounces, xen-devel

xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:

> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> To: Michael D Labriola <mlabriol@gdeb.com>, 
> Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> Date: 01/21/2014 04:59 PM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> Sent by: xen-devel-bounces@lists.xen.org
> 
> On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 
> > 10:38:27 AM:
> > 
> > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > Date: 01/20/2014 10:38 AM
> > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > 
> > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 
10:14:36 
> > AM:
> > > > 
> > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > > > Date: 01/20/2014 10:14 AM
> > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > 
> > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola 
wrote:
> > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having 
consistent 
> > > > crashes 
> > > > > > with multiple older R600 series (HD 6470 and HD 6570) and 
unusably 
> > 
> > > > slow 
> > > > > > graphics with a newer HD7000 (can see each line refresh 
> > individually on 
> > > > 
> > > > > > radeonfb tty).  All 3 systems seem to work fine bare metal.
> > > > > 
> > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> > > > 
> > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The 

> > > > RadeonSI sluggishness is when using the KMS framebuffer device for 
a 
> > plain 
> > > > text console login.
> > > 
> > > So sluggish is probably due to the PAT not being enabled. This patch
> > > should be applied:
> > > 
> > > lkml.org/lkml/2011/11/8/406
> > > 
> > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > > 
> > > and these two reverted:
> > > 
> > >  "xen/pat: Disable PAT support for now."
> > >  "xen/pat: Disable PAT using pat_enabled value."
> > > 
> > > Which is to say do:
> > > 
> > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > 
> > Thanks!  I cherry-picked that patch out of your testing tree, reverted 

> > those 2 commits, recompiled and installed.  Definitely fixed the HD 
7000 
> > sluggishness and appears to have fixed the R600 crashes (although it's 

> > only been running a few hours).
> > 
> > How come that patch didn't get into mainline?  It looks pretty 
innocuous 
> > to me...
> 
> <Sigh> the x86 maintainers wanted a different route. And I hadn't had
> the chance nor time to implement it.

I see.  Well, I've got a handful of boxes in my lab that need that patch 
to be usable.  If you do come up with a more mainline-able solution, I'd 
gladly test it for you.  ;-)

Thanks again, by the way.

---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-23 16:54           ` Michael D Labriola
@ 2014-01-24 14:49             ` Konrad Rzeszutek Wilk
  2014-02-11 15:35               ` Michael D Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-01-24 14:49 UTC (permalink / raw)
  To: Michael D Labriola
  Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel-bounces, xen-devel

On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
> 
> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > Date: 01/21/2014 04:59 PM
> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > Sent by: xen-devel-bounces@lists.xen.org
> > 
> > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 
> > > 10:38:27 AM:
> > > 
> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > > Date: 01/20/2014 10:38 AM
> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > 
> > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola wrote:
> > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 
> 10:14:36 
> > > AM:
> > > > > 
> > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > > > > Date: 01/20/2014 10:14 AM
> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > > 
> > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola 
> wrote:
> > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having 
> consistent 
> > > > > crashes 
> > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and 
> unusably 
> > > 
> > > > > slow 
> > > > > > > graphics with a newer HD7000 (can see each line refresh 
> > > individually on 
> > > > > 
> > > > > > > radeonfb tty).  All 3 systems seem to work fine bare metal.
> > > > > > 
> > > > > > I hadn't been using DRM, just Xserver. Is that what you mean?
> > > > > 
> > > > > The R600 problems happen when in X, using OpenGL, on my dom0.  The 
> 
> > > > > RadeonSI sluggishness is when using the KMS framebuffer device for 
> a 
> > > plain 
> > > > > text console login.
> > > > 
> > > > So sluggish is probably due to the PAT not being enabled. This patch
> > > > should be applied:
> > > > 
> > > > lkml.org/lkml/2011/11/8/406
> > > > 
> > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > > > 
> > > > and these two reverted:
> > > > 
> > > >  "xen/pat: Disable PAT support for now."
> > > >  "xen/pat: Disable PAT using pat_enabled value."
> > > > 
> > > > Which is to say do:
> > > > 
> > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > > 
> > > Thanks!  I cherry-picked that patch out of your testing tree, reverted 
> 
> > > those 2 commits, recompiled and installed.  Definitely fixed the HD 
> 7000 
> > > sluggishness and appears to have fixed the R600 crashes (although it's 
> 
> > > only been running a few hours).
> > > 
> > > How come that patch didn't get into mainline?  It looks pretty 
> innocuous 
> > > to me...
> > 
> > <Sigh> the x86 maintainers wanted a different route. And I hadn't had
> > the chance nor time to implement it.
> 
> I see.  Well, I've got a handful of boxes in my lab that need that patch 
> to be usable.  If you do come up with a more mainline-able solution, I'd 
> gladly test it for you.  ;-)

Thank you!
> 
> Thanks again, by the way.
> 
> ---
> Michael D Labriola
> Electric Boat
> mlabriol@gdeb.com
> 401-848-8871 (desk)
> 401-848-8513 (lab)
> 401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-01-24 14:49             ` Konrad Rzeszutek Wilk
@ 2014-02-11 15:35               ` Michael D Labriola
  2014-02-19 17:04                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael D Labriola @ 2014-02-11 15:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel-bounces, xen-devel

Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014 
09:49:38 AM:

> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> To: Michael D Labriola <mlabriol@gdeb.com>, 
> Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
> bounces@lists.xen.org
> Date: 01/24/2014 09:50 AM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> 
> On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
> > 
> > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > Date: 01/21/2014 04:59 PM
> > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > Sent by: xen-devel-bounces@lists.xen.org
> > > 
> > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 

> > > > 10:38:27 AM:
> > > > 
> > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > > > Date: 01/20/2014 10:38 AM
> > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > 
> > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola 
wrote:
> > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 
> > 10:14:36 
> > > > AM:
> > > > > > 
> > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > > > > > Date: 01/20/2014 10:14 AM
> > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > > > 
> > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola 

> > wrote:
> > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having 
> > consistent 
> > > > > > crashes 
> > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and 
> > unusably 
> > > > 
> > > > > > slow 
> > > > > > > > graphics with a newer HD7000 (can see each line refresh 
> > > > individually on 
> > > > > > 
> > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare 
metal.
> > > > > > > 
> > > > > > > I hadn't been using DRM, just Xserver. Is that what you 
mean?
> > > > > > 
> > > > > > The R600 problems happen when in X, using OpenGL, on my dom0. 
The 
> > 
> > > > > > RadeonSI sluggishness is when using the KMS framebuffer device 
for 
> > a 
> > > > plain 
> > > > > > text console login.
> > > > > 
> > > > > So sluggish is probably due to the PAT not being enabled. This 
patch
> > > > > should be applied:
> > > > > 
> > > > > lkml.org/lkml/2011/11/8/406
> > > > > 
> > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > > > > 
> > > > > and these two reverted:
> > > > > 
> > > > >  "xen/pat: Disable PAT support for now."
> > > > >  "xen/pat: Disable PAT using pat_enabled value."
> > > > > 
> > > > > Which is to say do:
> > > > > 
> > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > > > 
> > > > Thanks!  I cherry-picked that patch out of your testing tree, 
reverted 
> > 
> > > > those 2 commits, recompiled and installed.  Definitely fixed the 
HD 
> > 7000 
> > > > sluggishness and appears to have fixed the R600 crashes (although 
it's 
> > 
> > > > only been running a few hours).
> > > > 
> > > > How come that patch didn't get into mainline?  It looks pretty 
> > innocuous 
> > > > to me...
> > > 
> > > <Sigh> the x86 maintainers wanted a different route. And I hadn't 
had
> > > the chance nor time to implement it.
> > 
> > I see.  Well, I've got a handful of boxes in my lab that need that 
patch 
> > to be usable.  If you do come up with a more mainline-able solution, 
I'd 
> > gladly test it for you.  ;-)
> 
> Thank you!

Uh, oh.  Looks like those reverts and patches didn't entirely fix my 
problem.  My box with the HD5450 (r600 gallium3d) started going bonkers 
again yesterday.  After being solid as a rock for 2 weeks as my primary 
workstation, X has crashed a half dozen or so times so far this week.  I've 
been in Xen with 2 paravirtual linux guests running almost constantly for 
this whole period.  I don't understand what's changed, but my system is 
now entirely unstable.  I did recompile my kernel... but all I did was 
merge the v3.13.1 stable commit into my working tree and turn a few 
things on (netfilter, wifi, a couple drivers here and there).  I just 
went and verified that those patches are still applied in my tree 
(i.e., I didn't accidentally undo them).  I'm scratching my head (and 
staring at a TTY login).

When X crashes, my kernel log prints a couple dozen iterations of this. 3d 
acceleration no longer functions unless I reboot.  If memory serves, the 
unpatched behavior upon X crash was that the kernel continued to spew 
these errors until the whole box locked up.  At least that's not happening 
any more... ;-)

[  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
[  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
[  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)

and here's a slightly different variant that happened while I was typing 
this email (on a different machine, luckily):

[ 3107.713039] sdf: detected capacity change from 31625052160 to 0
[ 3114.491717] usb 9-1: USB disconnect, device number 2
[64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
[64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
[64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[64348.297561] [TTM] Buffer eviction failed
[64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
[64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
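
The -12 in both excerpts above is a negated errno value, and errno 12 is ENOMEM, so every failure here is an allocation failure. Below is a minimal Python sketch for pulling these events out of a saved kernel log. The function names are made up for illustration, the sample lines are copied from the first excerpt with the mail-wrapped lines rejoined, and naming the four GEM fields as size, domain, alignment and return code is my assumption from reading the message, not something stated in this thread.

```python
import errno
import re

# Lines copied from the first excerpt above, with wrapped lines rejoined.
SAMPLE_LOG = """\
[  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
[  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
[  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
"""

POOL_RE = re.compile(r"Failed to fill cached pool \(r:(-?\d+)\)")
GEM_RE = re.compile(
    r"Failed to allocate GEM object \((\d+), (\d+), (\d+), (-?\d+)\)"
)


def errno_name(code):
    """Map a negated kernel return code such as -12 to its symbolic name."""
    return errno.errorcode.get(abs(code), "UNKNOWN")


def summarize(log_text):
    """Return a list of (kind, fields) tuples for each failure line found."""
    events = []
    for line in log_text.splitlines():
        match = GEM_RE.search(line)
        if match:
            # Field names are assumed, see the note above the code.
            size, domain, alignment, ret = map(int, match.groups())
            events.append(("gem", {
                "size": size,
                "domain": domain,
                "alignment": alignment,
                "errno": errno_name(ret),
            }))
            continue
        match = POOL_RE.search(line)
        if match:
            events.append(("pool", {"errno": errno_name(int(match.group(1)))}))
    return events
```

Running `summarize()` over a full dmesg capture and counting the events per boot would show whether the failures cluster after long uptimes or start right away.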

Any ideas?


---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)

* Re: Radeon DRM dom0 issues
  2014-02-11 15:35               ` Michael D Labriola
@ 2014-02-19 17:04                 ` Konrad Rzeszutek Wilk
  2014-02-19 19:33                   ` Michael Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-19 17:04 UTC (permalink / raw)
  To: Michael D Labriola
  Cc: Konrad Rzeszutek Wilk, michael.d.labriola, xen-devel-bounces, xen-devel

On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014 
> 09:49:38 AM:
> 
> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
> > bounces@lists.xen.org
> > Date: 01/24/2014 09:50 AM
> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > 
> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
> > > 
> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > > Date: 01/21/2014 04:59 PM
> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > Sent by: xen-devel-bounces@lists.xen.org
> > > > 
> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014 
> 
> > > > > 10:38:27 AM:
> > > > > 
> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>, 
> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > > > > > Date: 01/20/2014 10:38 AM
> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > > 
> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola 
> wrote:
> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014 
> > > 10:14:36 
> > > > > AM:
> > > > > > > 
> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>, 
> > > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> > > > > > > > Date: 01/20/2014 10:14 AM
> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > > > > > > > 
> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola 
> 
> > > wrote:
> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having 
> > > consistent 
> > > > > > > crashes 
> > > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and 
> > > unusably 
> > > > > 
> > > > > > > slow 
> > > > > > > > > graphics with a newer HD7000 (can see each line refresh 
> > > > > indiviually on 
> > > > > > > 
> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare 
> metal.
> > > > > > > > 
> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you 
> mean?
> > > > > > > 
> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0. 
> The 
> > > 
> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device 
> for 
> > > a 
> > > > > plain 
> > > > > > > text console login.
> > > > > > 
> > > > > > So sluggish is probably due to the PAT not being enabled. This 
> patch
> > > > > > should be applied:
> > > > > > 
> > > > > > lkml.org/lkml/2011/11/8/406
> > > > > > 
> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > > > > > 
> > > > > > and these two reverted:
> > > > > > 
> > > > > >  "xen/pat: Disable PAT support for now."
> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> > > > > > 
> > > > > > Which is to say do:
> > > > > > 
> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > > > > 
> > > > > Thanks!  I cherry-picked that patch out of your testing tree, 
> reverted 
> > > 
> > > > > those 2 commits, recompiled and installed.  Definitely fixed the 
> HD 
> > > 7000 
> > > > > sluggishness and appears to have fixed the R600 crashes (although 
> it's 
> > > 
> > > > > only been running a few hours).
> > > > > 
> > > > > How come that patch didn't get into mainline?  It looks pretty 
> > > innocuous 
> > > > > to me...
> > > > 
> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't 
> had
> > > > the chance nor time to implement it.
> > > 
> > > I see.  Well, I've got a handful of boxes in my lab that need that 
> patch 
> > > to be usable.  If you do come up with a more mainline-able solution, 
> I'd 
> > > gladly test it for you.  ;-)
> > 
> > Thank you!
> 
> Uh, oh.  Looks like those reverts and patches didn't entirely fix my 
> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers 
> again yeserday.  After being solid as a rock for 2 weeks as my primary 
> workstation, X has crashed a half dozen or so times so far this week. I've 
> been in Xen with 2 paravirtual linux guests running almost constantly for 
> this whole period.  I don't understand what's changed, but my system has 
> been entirely unstable now.  I did recompile my kernel... but I all did 
> was merge the v3.13.1 stable commit into my working tree and turn a few 
> things on (netfilter, wifi, a couple drivers turned on here and there).  I 
> just went and verified that those patches are still applied in my tree 
> (i.e., I didn't accidentally undo them).  I'm scratching my head (and 
> staring at a TTY login).
> 
> When X crashes, my kernel log prints a couple dozen iterations of this. 3d 
> acceleration no longer functions unless I reboot.  If memory serves, the 
> unpatched behavior upon X crash was that the kernel continued to spew 
> these errors until the whole box locked up.  At least that's not happening 
> any more... ;-)
> 
> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool 
> (r:-12)!
> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool 
> (r:-12)!
> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate 
> GEM object (8192, 2, 4096, -12)
> 
> and here's a slightly different variant that happened while I was typing 
> this email (on a different machine, luckily):
> 
> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool 
> (r:-12)!
> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool 
> (r:-12)!
> [64348.297561] [TTM] Buffer eviction failed
> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool 
> (r:-12)!
> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate 
> GEM object (16384, 2, 4096, -12)
> 
> Any ideas?

Yes. I believe you have a memory leak: some driver (or X) is eating up
memory and not giving enough of it back. That means the TTM layer is
hitting its ceiling on how much memory it can allocate.

Now finding the culprit is going to be a bit hard.

You could use:

[root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool 
         pool      refills   pages freed    inuse available     name
           wc          259           224      808        4 nouveau 0000:05:00.0
       cached      3403058      13561071    51158        3 radeon 0000:01:00.0
       cached           25             0       96        4 nouveau 0000:05:00.0

to check whether my theory is right. If it is, you should see a huge
'inuse' count and almost no 'available'.
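
To compare the 'inuse' and 'available' columns without eyeballing them, the
file can be parsed with a few lines of Python. This is only a sketch: the
column order is taken from the output pasted above, the names PoolRow and
parse_ttm_pool are made up for illustration, and the layout of this debugfs
file may differ between kernel versions.

```python
from typing import NamedTuple

# Sample copied from the ttm_dma_page_pool output shown above.
SAMPLE = """\
         pool      refills   pages freed    inuse available     name
           wc          259           224      808        4 nouveau 0000:05:00.0
       cached      3403058      13561071    51158        3 radeon 0000:01:00.0
       cached           25             0       96        4 nouveau 0000:05:00.0
"""


class PoolRow(NamedTuple):
    pool: str
    refills: int
    pages_freed: int
    inuse: int
    available: int
    name: str


def parse_ttm_pool(text):
    """Parse ttm_dma_page_pool text into a list of PoolRow tuples."""
    rows = []
    for line in text.splitlines():
        # The name column contains a space, so split at most five times.
        fields = line.split(None, 5)
        if len(fields) != 6 or not fields[1].isdigit():
            continue  # header or blank line
        pool, refills, freed, inuse, available, name = fields
        rows.append(PoolRow(pool, int(refills), int(freed),
                            int(inuse), int(available), name.strip()))
    return rows


if __name__ == "__main__":
    # Largest consumers first.
    for row in sorted(parse_ttm_pool(SAMPLE), key=lambda r: r.inuse, reverse=True):
        print(f"{row.name:24} {row.pool:8} inuse={row.inuse} available={row.available}")
```

On a live system the text would come from reading the debugfs file as root
instead of from SAMPLE.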


But that only confirms that yes, memory usage is high and it is
hitting the ceiling.

Now to actually figure out which application is hanging on to these
pages - that I am not sure about. I think there is some drm info tool to
investigate how many pages each application is using. You can leave it
running and see which app is gulping up the memory. But I am not sure
which tool that is (if there is one).

Well, let's do one step at a time - see if my theory is correct first.

> 
> 
> ---
> Michael D Labriola
> Electric Boat
> mlabriol@gdeb.com
> 401-848-8871 (desk)
> 401-848-8513 (lab)
> 401-316-9844 (cell)
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Radeon DRM dom0 issues
  2014-02-19 17:04                 ` Konrad Rzeszutek Wilk
@ 2014-02-19 19:33                   ` Michael Labriola
  2014-02-19 19:57                     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Labriola @ 2014-02-19 19:33 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, xen-devel-bounces, Michael D Labriola, xen-devel

On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014
>> 09:49:38 AM:
>>
>> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > To: Michael D Labriola <mlabriol@gdeb.com>,
>> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
>> > bounces@lists.xen.org
>> > Date: 01/24/2014 09:50 AM
>> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >
>> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
>> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
>> > >
>> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
>> > > > Date: 01/21/2014 04:59 PM
>> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > Sent by: xen-devel-bounces@lists.xen.org
>> > > >
>> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
>> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014
>>
>> > > > > 10:38:27 AM:
>> > > > >
>> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
>> > > > > > Date: 01/20/2014 10:38 AM
>> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > > >
>> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola
>> wrote:
>> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014
>> > > 10:14:36
>> > > > > AM:
>> > > > > > >
>> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
>> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> > > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
>> > > > > > > > Date: 01/20/2014 10:14 AM
>> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> > > > > > > >
>> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola
>>
>> > > wrote:
>> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
>> > > consistent
>> > > > > > > crashes
>> > > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and
>> > > unusably
>> > > > >
>> > > > > > > slow
>> > > > > > > > > graphics with a newer HD7000 (can see each line refresh
>> > > > > indiviually on
>> > > > > > >
>> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare
>> metal.
>> > > > > > > >
>> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you
>> mean?
>> > > > > > >
>> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.
>> The
>> > >
>> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device
>> for
>> > > a
>> > > > > plain
>> > > > > > > text console login.
>> > > > > >
>> > > > > > So sluggish is probably due to the PAT not being enabled. This
>> patch
>> > > > > > should be applied:
>> > > > > >
>> > > > > > lkml.org/lkml/2011/11/8/406
>> > > > > >
>> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
>> > > > > >
>> > > > > > and these two reverted:
>> > > > > >
>> > > > > >  "xen/pat: Disable PAT support for now."
>> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
>> > > > > >
>> > > > > > Which is to say do:
>> > > > > >
>> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
>> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>> > > > >
>> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
>> reverted
>> > >
>> > > > > those 2 commits, recompiled and installed.  Definitely fixed the
>> HD
>> > > 7000
>> > > > > sluggishness and appears to have fixed the R600 crashes (although
>> it's
>> > >
>> > > > > only been running a few hours).
>> > > > >
>> > > > > How come that patch didn't get into mainline?  It looks pretty
>> > > innocuous
>> > > > > to me...
>> > > >
>> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
>> had
>> > > > the chance nor time to implement it.
>> > >
>> > > I see.  Well, I've got a handful of boxes in my lab that need that
>> patch
>> > > to be usable.  If you do come up with a more mainline-able solution,
>> I'd
>> > > gladly test it for you.  ;-)
>> >
>> > Thank you!
>>
>> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
>> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
>> again yeserday.  After being solid as a rock for 2 weeks as my primary
>> workstation, X has crashed a half dozen or so times so far this week. I've
>> been in Xen with 2 paravirtual linux guests running almost constantly for
>> this whole period.  I don't understand what's changed, but my system has
>> been entirely unstable now.  I did recompile my kernel... but I all did
>> was merge the v3.13.1 stable commit into my working tree and turn a few
>> things on (netfilter, wifi, a couple drivers turned on here and there).  I
>> just went and verified that those patches are still applied in my tree
>> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
>> staring at a TTY login).
>>
>> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
>> acceleration no longer functions unless I reboot.  If memory serves, the
>> unpatched behavior upon X crash was that the kernel continued to spew
>> these errors until the whole box locked up.  At least that's not happening
>> any more... ;-)
>>
>> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
>> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> (r:-12)!
>> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> (r:-12)!
>> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
>> GEM object (8192, 2, 4096, -12)
>>
>> and here's a slightly different variant that happened while I was typing
>> this email (on a different machine, luckily):
>>
>> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
>> [ 3114.491717] usb 9-1: USB disconnect, device number 2
>> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
>> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> (r:-12)!
>> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> (r:-12)!
>> [64348.297561] [TTM] Buffer eviction failed
>> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> (r:-12)!
>> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
>> GEM object (16384, 2, 4096, -12)
>>
>> Any ideas?
>
> yes. I believe you have a memory leak. As in, some driver (or X) is
> eating up the memory and not giving up enough. That means the TTM
> layer is hitting its ceiling of how much memory it can allocate.
>
> Now finding the culprit is going to be a bit hard.
>
> You could use:
>
> [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
>          pool      refills   pages freed    inuse available     name
>            wc          259           224      808        4 nouveau 0000:05:00.0
>        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
>        cached           25             0       96        4 nouveau 0000:05:00.0
>
> to figure out if my thinking is really true. You should have a huge
> 'inuse' count and almost no 'available'.

My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
always have the same contents.  Is that normal?

My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist on bare
metal... only under Xen.  Is that normal?

         pool      refills   pages freed    inuse available     name
       cached        15190         59551     1205        4 radeon 0000:01:00.0

If I watch that file while creating xterms, moving them around, etc, I can
see the number available fluctuate between 3 and 6.  This is true, even on
my box w/ the newer R7 card in it, which hasn't gotten that GEM error
message (yet?).
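
For leaving this running until the error shows up, a small polling loop
that logs how 'inuse' changes between samples may be easier than watching
by hand. This is a rough sketch, not a tested tool: the function names and
the injected read_text callable are made up for illustration, the path is
the one mentioned above, and reading it requires root with debugfs mounted.

```python
import time

# Path as mentioned above; needs root and a mounted debugfs.
POOL_FILE = "/sys/kernel/debug/dri/0/ttm_dma_page_pool"


def read_pool_file(path=POOL_FILE):
    with open(path) as f:
        return f.read()


def inuse_by_pool(text):
    """Map (pool, name) to the 'inuse' page count from the file's text."""
    counts = {}
    for line in text.splitlines():
        # Columns: pool, refills, pages freed, inuse, available, name.
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[1].isdigit():
            counts[(fields[0], fields[5].strip())] = int(fields[3])
    return counts


def watch(read_text=read_pool_file, samples=2, interval=1.0):
    """Yield ((pool, name), change in 'inuse') for each pair of samples."""
    previous = inuse_by_pool(read_text())
    for _ in range(samples - 1):
        time.sleep(interval)
        current = inuse_by_pool(read_text())
        for key, inuse in current.items():
            yield key, inuse - previous.get(key, 0)
        previous = current


if __name__ == "__main__":
    for (pool, name), delta in watch(samples=60, interval=10.0):
        print(f"{name} [{pool}] inuse change: {delta:+d}")
```

A count that only ever climbs between samples would fit the leak theory;
one that rises and falls with normal use would argue against it.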


>
> But that will get us just to confirm that yes - you have a big usage
> of memory and it is hitting the ceiling.
>
> Now to actually figure out which application is hanging on these - that
> I am not sure about. I think there is some drm info tool to investigate
> how many pages each application is using. You can leave it running and
> see which app is gulping up the memory. But I am not sure which
> tool that is (if there was one).
>
> Well, lets do one step at a time - see if my theory is correct first.



-- 
Michael D Labriola
21 Rip Van Winkle Cir
Warwick, RI 02886
401-316-9844 (cell)
401-848-8871 (work)
401-234-1306 (home)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Radeon DRM dom0 issues
  2014-02-19 19:33                   ` Michael Labriola
@ 2014-02-19 19:57                     ` Konrad Rzeszutek Wilk
  2014-02-19 20:08                       ` Michael Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-19 19:57 UTC (permalink / raw)
  To: Michael Labriola
  Cc: Konrad Rzeszutek Wilk, xen-devel-bounces, Michael D Labriola, xen-devel

On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> >> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014
> >> 09:49:38 AM:
> >>
> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
> >> > bounces@lists.xen.org
> >> > Date: 01/24/2014 09:50 AM
> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >
> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> >> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
> >> > >
> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> >> > > > Date: 01/21/2014 04:59 PM
> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > Sent by: xen-devel-bounces@lists.xen.org
> >> > > >
> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014
> >>
> >> > > > > 10:38:27 AM:
> >> > > > >
> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> >> > > > > > Date: 01/20/2014 10:38 AM
> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > >
> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola
> >> wrote:
> >> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014
> >> > > 10:14:36
> >> > > > > AM:
> >> > > > > > >
> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> >> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> > > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> >> > > > > > > > Date: 01/20/2014 10:14 AM
> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> > > > > > > >
> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola
> >>
> >> > > wrote:
> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
> >> > > consistent
> >> > > > > > > crashes
> >> > > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and
> >> > > unusably
> >> > > > >
> >> > > > > > > slow
> >> > > > > > > > > graphics with a newer HD7000 (can see each line refresh
> >> > > > > indiviually on
> >> > > > > > >
> >> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare
> >> metal.
> >> > > > > > > >
> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you
> >> mean?
> >> > > > > > >
> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.
> >> The
> >> > >
> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device
> >> for
> >> > > a
> >> > > > > plain
> >> > > > > > > text console login.
> >> > > > > >
> >> > > > > > So sluggish is probably due to the PAT not being enabled. This
> >> patch
> >> > > > > > should be applied:
> >> > > > > >
> >> > > > > > lkml.org/lkml/2011/11/8/406
> >> > > > > >
> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> >> > > > > >
> >> > > > > > and these two reverted:
> >> > > > > >
> >> > > > > >  "xen/pat: Disable PAT support for now."
> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> >> > > > > >
> >> > > > > > Which is to say do:
> >> > > > > >
> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >> > > > >
> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
> >> reverted
> >> > >
> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the
> >> HD
> >> > > 7000
> >> > > > > sluggishness and appears to have fixed the R600 crashes (although
> >> it's
> >> > >
> >> > > > > only been running a few hours).
> >> > > > >
> >> > > > > How come that patch didn't get into mainline?  It looks pretty
> >> > > innocuous
> >> > > > > to me...
> >> > > >
> >> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
> >> had
> >> > > > the chance nor time to implement it.
> >> > >
> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
> >> patch
> >> > > to be usable.  If you do come up with a more mainline-able solution,
> >> I'd
> >> > > gladly test it for you.  ;-)
> >> >
> >> > Thank you!
> >>
> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> >> again yeserday.  After being solid as a rock for 2 weeks as my primary
> >> workstation, X has crashed a half dozen or so times so far this week. I've
> >> been in Xen with 2 paravirtual linux guests running almost constantly for
> >> this whole period.  I don't understand what's changed, but my system has
> >> been entirely unstable now.  I did recompile my kernel... but I all did
> >> was merge the v3.13.1 stable commit into my working tree and turn a few
> >> things on (netfilter, wifi, a couple drivers turned on here and there).  I
> >> just went and verified that those patches are still applied in my tree
> >> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
> >> staring at a TTY login).
> >>
> >> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
> >> acceleration no longer functions unless I reboot.  If memory serves, the
> >> unpatched behavior upon X crash was that the kernel continued to spew
> >> these errors until the whole box locked up.  At least that's not happening
> >> any more... ;-)
> >>
> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> (r:-12)!
> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> (r:-12)!
> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
> >> GEM object (8192, 2, 4096, -12)
> >>
> >> and here's a slightly different variant that happened while I was typing
> >> this email (on a different machine, luckily):
> >>
> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> (r:-12)!
> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> (r:-12)!
> >> [64348.297561] [TTM] Buffer eviction failed
> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> (r:-12)!
> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
> >> GEM object (16384, 2, 4096, -12)
> >>
> >> Any ideas?
> >
> > yes. I believe you have a memory leak. As in, some driver (or X) is
> > eating up the memory and not giving up enough. That means the TTM
> > layer is hitting its ceiling of how much memory it can allocate.
> >
> > Now finding the culprit is going to be a bit hard.
> >
> > You could use:
> >
> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> >          pool      refills   pages freed    inuse available     name
> >            wc          259           224      808        4 nouveau 0000:05:00.0
> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> >        cached           25             0       96        4 nouveau 0000:05:00.0
> >
> > to figure out if my thinking is really true. You should have a huge
> > 'inuse' count and almost no 'available'.
> 
> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
> always have the same contents.  Is that normal?

Yes.
> 
> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
> metal... only in Xen.  Is that normal?

It would show up on bare metal if you boot with 'iommu=soft'.

> 
>          pool      refills   pages freed    inuse available     name
>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> 
> If I watch that file while creating xterms, moving them around, etc, I can
> see the number available fluctuate between 3 and 6.  This is true, even on
> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
> message (yet?).

OK, so let's see what happens when the error shows up. Incidentally, how
much memory does your initial domain have? And is it different than when
you boot it on bare metal?

Thank you.

> 
> 
> >
> > But that will get us just to confirm that yes - you have a big usage
> > of memory and it is hitting the ceiling.
> >
> > Now to actually figure out which application is hanging on these - that
> > I am not sure about. I think there is some drm info tool to investigate
> > how many pages each application is using. You can leave it running and
> > see which app is gulping up the memory. But I am not sure which
> > tool that is (if there was one).
> >
> > Well, lets do one step at a time - see if my theory is correct first.
> 
> 
> 
> -- 
> Michael D Labriola
> 21 Rip Van Winkle Cir
> Warwick, RI 02886
> 401-316-9844 (cell)
> 401-848-8871 (work)
> 401-234-1306 (home)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Radeon DRM dom0 issues
  2014-02-19 19:57                     ` Konrad Rzeszutek Wilk
@ 2014-02-19 20:08                       ` Michael Labriola
  2014-02-19 20:30                         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 15+ messages in thread
From: Michael Labriola @ 2014-02-19 20:08 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, xen-devel-bounces, Michael D Labriola, xen-devel

On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
>> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
>> >> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014
>> >> 09:49:38 AM:
>> >>
>> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> >> > To: Michael D Labriola <mlabriol@gdeb.com>,
>> >> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> >> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
>> >> > bounces@lists.xen.org
>> >> > Date: 01/24/2014 09:50 AM
>> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> >
>> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
>> >> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
>> >> > >
>> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> >> > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> >> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
>> >> > > > Date: 01/21/2014 04:59 PM
>> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > Sent by: xen-devel-bounces@lists.xen.org
>> >> > > >
>> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
>> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014
>> >>
>> >> > > > > 10:38:27 AM:
>> >> > > > >
>> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> >> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
>> >> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
>> >> > > > > > Date: 01/20/2014 10:38 AM
>> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > >
>> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola
>> >> wrote:
>> >> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014
>> >> > > 10:14:36
>> >> > > > > AM:
>> >> > > > > > >
>> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
>> >> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
>> >> > > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
>> >> > > > > > > > Date: 01/20/2014 10:14 AM
>> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
>> >> > > > > > > >
>> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola
>> >>
>> >> > > wrote:
>> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
>> >> > > consistent
>> >> > > > > > > crashes
>> >> > > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and
>> >> > > unusably
>> >> > > > >
>> >> > > > > > > slow
>> >> > > > > > > > > graphics with a newer HD7000 (can see each line refresh
>> >> > > > > indiviually on
>> >> > > > > > >
>> >> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare
>> >> metal.
>> >> > > > > > > >
>> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you
>> >> mean?
>> >> > > > > > >
>> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.
>> >> The
>> >> > >
>> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device
>> >> for
>> >> > > a
>> >> > > > > plain
>> >> > > > > > > text console login.
>> >> > > > > >
>> >> > > > > > So sluggish is probably due to the PAT not being enabled. This
>> >> patch
>> >> > > > > > should be applied:
>> >> > > > > >
>> >> > > > > > lkml.org/lkml/2011/11/8/406
>> >> > > > > >
>> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
>> >> > > > > >
>> >> > > > > > and these two reverted:
>> >> > > > > >
>> >> > > > > >  "xen/pat: Disable PAT support for now."
>> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
>> >> > > > > >
>> >> > > > > > Which is to say do:
>> >> > > > > >
>> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
>> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
>> >> > > > >
>> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
>> >> reverted
>> >> > >
>> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the
>> >> HD
>> >> > > 7000
>> >> > > > > sluggishness and appears to have fixed the R600 crashes (although
>> >> it's
>> >> > >
>> >> > > > > only been running a few hours).
>> >> > > > >
>> >> > > > > How come that patch didn't get into mainline?  It looks pretty
>> >> > > innocuous
>> >> > > > > to me...
>> >> > > >
>> >> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
>> >> had
>> >> > > > the chance nor time to implement it.
>> >> > >
>> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
>> >> patch
>> >> > > to be usable.  If you do come up with a more mainline-able solution,
>> >> I'd
>> >> > > gladly test it for you.  ;-)
>> >> >
>> >> > Thank you!
>> >>
>> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
>> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
>> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
>> >> workstation, X has crashed a half dozen or so times so far this week. I've
>> >> been in Xen with 2 paravirtual linux guests running almost constantly for
>> >> this whole period.  I don't understand what's changed, but my system has
>> >> been entirely unstable now.  I did recompile my kernel... but all I did
>> >> was merge the v3.13.1 stable commit into my working tree and turn a few
>> >> things on (netfilter, wifi, a couple drivers turned on here and there).  I
>> >> just went and verified that those patches are still applied in my tree
>> >> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
>> >> staring at a TTY login).
>> >>
>> >> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
>> >> acceleration no longer functions unless I reboot.  If memory serves, the
>> >> unpatched behavior upon X crash was that the kernel continued to spew
>> >> these errors until the whole box locked up.  At least that's not happening
>> >> any more... ;-)
>> >>
>> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
>> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> >> (r:-12)!
>> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> >> (r:-12)!
>> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
>> >> GEM object (8192, 2, 4096, -12)
>> >>
>> >> and here's a slightly different variant that happened while I was typing
>> >> this email (on a different machine, luckily):
>> >>
>> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
>> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
>> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
>> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> >> (r:-12)!
>> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> >> (r:-12)!
>> >> [64348.297561] [TTM] Buffer eviction failed
>> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
>> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
>> >> (r:-12)!
>> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
>> >> GEM object (16384, 2, 4096, -12)
>> >>
>> >> Any ideas?
>> >
>> > yes. I believe you have a memory leak. As in, some driver (or X) is
>> > eating up the memory and not giving up enough. That means the TTM
>> > layer is hitting its ceiling of how much memory it can allocate.
>> >
>> > Now finding the culprit is going to be a bit hard.
>> >
>> > You could use:
>> >
>> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
>> >          pool      refills   pages freed    inuse available     name
>> >            wc          259           224      808        4 nouveau 0000:05:00.0
>> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
>> >        cached           25             0       96        4 nouveau 0000:05:00.0
>> >
>> > to figure out if my thinking is really true. You should have a huge
>> > 'inuse' count and almost no 'available'.
>>
>> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
>> always have the same contents.  Is that normal?
>
> Yes.
>>
>> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
>> metal... only in Xen.  Is that normal?
>
> It would show up on baremetal if you boot with 'iommu=soft'
>
>>
>>          pool      refills   pages freed    inuse available     name
>>        cached        15190         59551     1205        4 radeon 0000:01:00.0
>>
>> If I watch that file while creating xterms, moving them around, etc, I can
>> see the number available fluctuate between 3 and 6.  This is true, even on
>> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
>> message (yet?).
>
> OK, so let's see what happens when the error shows. Incidentally - what amount of
> memory does your initial domain have? And is it different than when you
> boot it on bare metal?

I've got the problem very reproducible on 3 boxes.  All three are
booting the dom0 with as much RAM as Xen will give them, then giving
up some of their RAM as needed when I create domUs.  The 3 boxes have
4G, 8G, and 16G if memory serves.

Does the amount of RAM on the actual video cards matter?  All the
older cards (that crash all the time) have 2G, whereas the R7 that
hasn't crashed yet only has 1G.

I've been reproducing the crash by just logging in and out of fluxbox
via XDM over and over again right after booting my dom0 in Xen w/ no
guests running.  That makes it happen within a few minutes.  Otherwise
it randomly crashes while I'm in the middle of trying to work... ;-)
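
A minimal sketch (Python, not from the thread) for keeping an eye on the pool counters quoted above while running the login/logout reproduction. It assumes the column layout shown in the pasted ttm_dma_page_pool output (pool, refills, pages freed, inuse, available, name); the PoolRow type and the parse_ttm_dma_page_pool helper are made-up names for illustration.

```python
from typing import List, NamedTuple


class PoolRow(NamedTuple):
    pool: str         # caching type, e.g. "cached" or "wc"
    refills: int
    pages_freed: int
    inuse: int
    available: int
    name: str         # driver plus PCI address, e.g. "radeon 0000:01:00.0"


def parse_ttm_dma_page_pool(text: str) -> List[PoolRow]:
    """Parse the table printed by the ttm_dma_page_pool debugfs file."""
    rows = []
    for line in text.splitlines():
        # Five whitespace-separated columns, then the name (which contains a space).
        parts = line.split(None, 5)
        if len(parts) != 6 or parts[0] == "pool":
            continue  # skip blank lines and the header row
        pool, refills, freed, inuse, available, name = parts
        rows.append(PoolRow(pool, int(refills), int(freed),
                            int(inuse), int(available), name.strip()))
    return rows


# Sample copied from the output pasted earlier in this thread.
SAMPLE = """\
         pool      refills   pages freed    inuse available     name
       cached        15190         59551     1205        4 radeon 0000:01:00.0
"""

if __name__ == "__main__":
    for row in parse_ttm_dma_page_pool(SAMPLE):
        print(f"{row.name}: pool={row.pool} inuse={row.inuse} available={row.available}")
```

On a live dom0 the same function could be fed the contents of /sys/kernel/debug/dri/0/ttm_dma_page_pool (readable as root) every few seconds, to see whether 'inuse' keeps climbing across login/logout cycles.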

>
> Thank you.
>
>>
>>
>> >
>> > But that will get us just to confirm that yes - you have a big usage
>> > of memory and it is hitting the ceiling.
>> >
>> > Now to actually figure out which application is hanging on these - that
>> > I am not sure about. I think there is some drm info tool to investigate
>> > how many pages each application is using. You can leave it running and
>> > see which app is gulping up the memory. But I am not sure which
>> > tool that is (if there was one).
>> >
>> > Well, lets do one step at a time - see if my theory is correct first.

-- 
Michael D Labriola
21 Rip Van Winkle Cir
Warwick, RI 02886
401-316-9844 (cell)
401-848-8871 (work)
401-234-1306 (home)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Radeon DRM dom0 issues
  2014-02-19 20:08                       ` Michael Labriola
@ 2014-02-19 20:30                         ` Konrad Rzeszutek Wilk
  2014-02-19 21:02                           ` Michael D Labriola
  0 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-02-19 20:30 UTC (permalink / raw)
  To: Michael Labriola
  Cc: Konrad Rzeszutek Wilk, xen-devel-bounces, Michael D Labriola, xen-devel

On Wed, Feb 19, 2014 at 03:08:08PM -0500, Michael Labriola wrote:
> On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> >> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> >> <konrad.wilk@oracle.com> wrote:
> >> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola wrote:
> >> >> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/24/2014
> >> >> 09:49:38 AM:
> >> >>
> >> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> >> > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> >> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> >> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, xen-devel-
> >> >> > bounces@lists.xen.org
> >> >> > Date: 01/24/2014 09:50 AM
> >> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >> >
> >> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola wrote:
> >> >> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 PM:
> >> >> > >
> >> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> >> > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> >> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> >> >> > > > Date: 01/21/2014 04:59 PM
> >> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >> > > > Sent by: xen-devel-bounces@lists.xen.org
> >> >> > > >
> >> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D Labriola wrote:
> >> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 01/20/2014
> >> >>
> >> >> > > > > 10:38:27 AM:
> >> >> > > > >
> >> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> >> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> >> >> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> >> >> > > > > > Date: 01/20/2014 10:38 AM
> >> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >> > > > > >
> >> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D Labriola
> >> >> wrote:
> >> >> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014
> >> >> > > 10:14:36
> >> >> > > > > AM:
> >> >> > > > > > >
> >> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> >> >> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> >> >> > > > > > > > Cc: xen-devel@lists.xen.org, michael.d.labriola@gmail.com
> >> >> > > > > > > > Date: 01/20/2014 10:14 AM
> >> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> >> >> > > > > > > >
> >> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, Michael D Labriola
> >> >>
> >> >> > > wrote:
> >> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm having
> >> >> > > consistent
> >> >> > > > > > > crashes
> >> >> > > > > > > > > with multiple older R600 series (HD 6470 and HD 6570) and
> >> >> > > unusably
> >> >> > > > >
> >> >> > > > > > > slow
> >> >> > > > > > > > > graphics with a newer HD7000 (can see each line refresh
> >> >> > > > > individually on
> >> >> > > > > > >
> >> >> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine bare
> >> >> metal.
> >> >> > > > > > > >
> >> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that what you
> >> >> mean?
> >> >> > > > > > >
> >> >> > > > > > > The R600 problems happen when in X, using OpenGL, on my dom0.
> >> >> The
> >> >> > >
> >> >> > > > > > > RadeonSI sluggishness is when using the KMS framebuffer device
> >> >> for
> >> >> > > a
> >> >> > > > > plain
> >> >> > > > > > > text console login.
> >> >> > > > > >
> >> >> > > > > > So sluggish is probably due to the PAT not being enabled. This
> >> >> patch
> >> >> > > > > > should be applied:
> >> >> > > > > >
> >> >> > > > > > lkml.org/lkml/2011/11/8/406
> >> >> > > > > >
> >> >> > > > > > (or http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> >> >> > > > > >
> >> >> > > > > > and these two reverted:
> >> >> > > > > >
> >> >> > > > > >  "xen/pat: Disable PAT support for now."
> >> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> >> >> > > > > >
> >> >> > > > > > Which is to say do:
> >> >> > > > > >
> >> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> >> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >> >> > > > >
> >> >> > > > > Thanks!  I cherry-picked that patch out of your testing tree,
> >> >> reverted
> >> >> > >
> >> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the
> >> >> HD
> >> >> > > 7000
> >> >> > > > > sluggishness and appears to have fixed the R600 crashes (although
> >> >> it's
> >> >> > >
> >> >> > > > > only been running a few hours).
> >> >> > > > >
> >> >> > > > > How come that patch didn't get into mainline?  It looks pretty
> >> >> > > innocuous
> >> >> > > > > to me...
> >> >> > > >
> >> >> > > > <Sigh> the x86 maintainers wanted a different route. And I hadn't
> >> >> had
> >> >> > > > the chance nor time to implement it.
> >> >> > >
> >> >> > > I see.  Well, I've got a handful of boxes in my lab that need that
> >> >> patch
> >> >> > > to be usable.  If you do come up with a more mainline-able solution,
> >> >> I'd
> >> >> > > gladly test it for you.  ;-)
> >> >> >
> >> >> > Thank you!
> >> >>
> >> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> >> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> >> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
> >> >> workstation, X has crashed a half dozen or so times so far this week. I've
> >> >> been in Xen with 2 paravirtual linux guests running almost constantly for
> >> >> this whole period.  I don't understand what's changed, but my system has
> >> >> been entirely unstable now.  I did recompile my kernel... but all I did
> >> >> was merge the v3.13.1 stable commit into my working tree and turn a few
> >> >> things on (netfilter, wifi, a couple drivers turned on here and there).  I
> >> >> just went and verified that those patches are still applied in my tree
> >> >> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
> >> >> staring at a TTY login).
> >> >>
> >> >> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
> >> >> acceleration no longer functions unless I reboot.  If memory serves, the
> >> >> unpatched behavior upon X crash was that the kernel continued to spew
> >> >> these errors until the whole box locked up.  At least that's not happening
> >> >> any more... ;-)
> >> >>
> >> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> >> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> >> (r:-12)!
> >> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> >> (r:-12)!
> >> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
> >> >> GEM object (8192, 2, 4096, -12)
> >> >>
> >> >> and here's a slightly different variant that happened while I was typing
> >> >> this email (on a different machine, luckily):
> >> >>
> >> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> >> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> >> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> >> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> >> (r:-12)!
> >> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> >> (r:-12)!
> >> >> [64348.297561] [TTM] Buffer eviction failed
> >> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> >> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool
> >> >> (r:-12)!
> >> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate
> >> >> GEM object (16384, 2, 4096, -12)
> >> >>
> >> >> Any ideas?
> >> >
> >> > yes. I believe you have a memory leak. As in, some driver (or X) is
> >> > eating up the memory and not giving up enough. That means the TTM
> >> > layer is hitting its ceiling of how much memory it can allocate.
> >> >
> >> > Now finding the culprit is going to be a bit hard.
> >> >
> >> > You could use:
> >> >
> >> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> >> >          pool      refills   pages freed    inuse available     name
> >> >            wc          259           224      808        4 nouveau 0000:05:00.0
> >> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> >> >        cached           25             0       96        4 nouveau 0000:05:00.0
> >> >
> >> > to figure out if my thinking is really true. You should have a huge
> >> > 'inuse' count and almost no 'available'.
> >>
> >> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
> >> always have the same contents.  Is that normal?
> >
> > Yes.
> >>
> >> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
> >> metal... only in Xen.  Is that normal?
> >
> > It would show up on baremetal if you boot with 'iommu=soft'
> >
> >>
> >>          pool      refills   pages freed    inuse available     name
> >>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> >>
> >> If I watch that file while creating xterms, moving them around, etc, I can
> >> see the number available fluctuate between 3 and 6.  This is true, even on
> >> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
> >> message (yet?).
> >
> > OK, so let's see what happens when the error shows. Incidentally - what amount of
> > memory does your initial domain have? And is it different than when you
> > boot it on bare metal?
> 
> I've got the problem very reproducible on 3 boxes.  All three are
> booting the dom0 with as much RAM as Xen will give them, then giving
> up some of their RAM as needed when I create domUs.  The 3 boxes have
> 4G, 8G, and 16G if memory serves.
> 
> Does the amount of RAM on the actual video cards matter?  All the
> older cards (that crash all the time) have 2G, whereas the R7 that
> hasn't crashed yet only has 1G.

The TTM pool has a limit (a hard one). It is pretty simple:


                pr_info("Zone %7s: Available graphics memory: %llu kiB\n",
                        zone->name, (unsigned long long)zone->max_mem >> 10);
        }
        ttm_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));
        ttm_dma_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));

so 1/4 of your memory. Which means that when you boot dom0 with as much
memory as possible and then balloon down, you might confuse it
(as the initial memory assumption is made during bootup).

If you boot the troubled dom0s with 'dom0_mem=<size>,max:<size>' set to
some good number - that might shed some light on this.
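
To put numbers on the "1/4 of your memory" remark, here is a small worked sketch in Python (not kernel code). It mirrors only the max_mem/(2*PAGE_SIZE) expression from the snippet above. Two things are assumptions: that the kernel zone's max_mem is half of the memory the kernel sees at boot (which is what makes the result 1/4), and a 4 KiB page size. On a 32-bit highmem kernel the kernel zone may well be sized differently, which this does not model. The function names are made up for illustration.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
GIB = 1 << 30


def ttm_pool_max_pages(zone_kernel_max_mem: int, page_size: int = PAGE_SIZE) -> int:
    """Page count handed to ttm_page_alloc_init() in the quoted snippet."""
    return zone_kernel_max_mem // (2 * page_size)


def pool_limit_bytes(boot_mem_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Pool ceiling in bytes for a given amount of memory seen at boot."""
    # Assumption: kernel zone max_mem is half of the boot-time memory.
    zone_kernel_max_mem = boot_mem_bytes // 2
    return ttm_pool_max_pages(zone_kernel_max_mem, page_size) * page_size


if __name__ == "__main__":
    for gib in (6, 8, 16):
        limit = pool_limit_bytes(gib * GIB)
        print(f"{gib} GiB at boot -> pool limit {limit / GIB:.2f} GiB")
```

The point of the exercise: the ceiling is fixed from whatever memory dom0 had at boot, so ballooning dom0 down afterwards leaves a limit sized for memory that is no longer there.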


> 
> I've been reproducing the crash by just logging in and out of fluxbox
> via XDM over and over again right after booting my dom0 in Xen w/ no
> guests running.  That makes it happen within a few minutes.  Otherwise
> it randomly crashes while I'm in the middle of trying to work... ;-)

HA!

Does fluxbox use a lot of graphics? I mean does it do a lot of fancy
things when it starts up and shuts down?

> 
> >
> > Thank you.
> >
> >>
> >>
> >> >
> >> > But that will get us just to confirm that yes - you have a big usage
> >> > of memory and it is hitting the ceiling.
> >> >
> >> > Now to actually figure out which application is hanging on these - that
> >> > I am not sure about. I think there is some drm info tool to investigate
> >> > how many pages each application is using. You can leave it running and
> >> > see which app is gulping up the memory. But I am not sure which
> >> > tool that is (if there was one).
> >> >
> >> > Well, lets do one step at a time - see if my theory is correct first.
> 
> -- 
> Michael D Labriola
> 21 Rip Van Winkle Cir
> Warwick, RI 02886
> 401-316-9844 (cell)
> 401-848-8871 (work)
> 401-234-1306 (home)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Radeon DRM dom0 issues
  2014-02-19 20:30                         ` Konrad Rzeszutek Wilk
@ 2014-02-19 21:02                           ` Michael D Labriola
  0 siblings, 0 replies; 15+ messages in thread
From: Michael D Labriola @ 2014-02-19 21:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Konrad Rzeszutek Wilk, Michael Labriola, xen-devel-bounces, xen-devel

 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 02/19/2014 
03:30:07 PM:

> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> To: Michael Labriola <michael.d.labriola@gmail.com>, 
> Cc: Michael D Labriola <mlabriol@gdeb.com>, Konrad Rzeszutek Wilk 
> <konrad@darnok.org>, xen-devel@lists.xen.org, 
xen-devel-bounces@lists.xen.org
> Date: 02/19/2014 03:30 PM
> Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> 
> On Wed, Feb 19, 2014 at 03:08:08PM -0500, Michael Labriola wrote:
> > On Wed, Feb 19, 2014 at 2:57 PM, Konrad Rzeszutek Wilk
> > <konrad.wilk@oracle.com> wrote:
> > > On Wed, Feb 19, 2014 at 02:33:26PM -0500, Michael Labriola wrote:
> > >> On Wed, Feb 19, 2014 at 12:04 PM, Konrad Rzeszutek Wilk
> > >> <konrad.wilk@oracle.com> wrote:
> > >> > On Tue, Feb 11, 2014 at 10:35:18AM -0500, Michael D Labriola 
wrote:
> > >> >> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote on 
01/24/2014
> > >> >> 09:49:38 AM:
> > >> >>
> > >> >> > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > >> >> > To: Michael D Labriola <mlabriol@gdeb.com>,
> > >> >> > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> > >> >> > michael.d.labriola@gmail.com, xen-devel@lists.xen.org, 
xen-devel-
> > >> >> > bounces@lists.xen.org
> > >> >> > Date: 01/24/2014 09:50 AM
> > >> >> > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> >
> > >> >> > On Thu, Jan 23, 2014 at 11:54:37AM -0500, Michael D Labriola 
wrote:
> > >> >> > > xen-devel-bounces@lists.xen.org wrote on 01/21/2014 04:59:05 
PM:
> > >> >> > >
> > >> >> > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > >> >> > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> > >> >> > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> > >> >> > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > >> >> > > > Date: 01/21/2014 04:59 PM
> > >> >> > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > Sent by: xen-devel-bounces@lists.xen.org
> > >> >> > > >
> > >> >> > > > On Mon, Jan 20, 2014 at 03:15:24PM -0500, Michael D 
> Labriola wrote:
> > >> >> > > > > Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote 
> on 01/20/2014
> > >> >>
> > >> >> > > > > 10:38:27 AM:
> > >> >> > > > >
> > >> >> > > > > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > >> >> > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> > >> >> > > > > > Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>,
> > >> >> > > > > > michael.d.labriola@gmail.com, xen-devel@lists.xen.org
> > >> >> > > > > > Date: 01/20/2014 10:38 AM
> > >> >> > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > > >
> > >> >> > > > > > On Mon, Jan 20, 2014 at 10:26:22AM -0500, Michael D 
Labriola
> > >> >> wrote:
> > >> >> > > > > > > Konrad Rzeszutek Wilk <konrad@darnok.org> wrote on 01/20/2014
> > >> >> > > 10:14:36
> > >> >> > > > > AM:
> > >> >> > > > > > >
> > >> >> > > > > > > > From: Konrad Rzeszutek Wilk <konrad@darnok.org>
> > >> >> > > > > > > > To: Michael D Labriola <mlabriol@gdeb.com>,
> > >> >> > > > > > > > Cc: xen-devel@lists.xen.org, 
michael.d.labriola@gmail.com
> > >> >> > > > > > > > Date: 01/20/2014 10:14 AM
> > >> >> > > > > > > > Subject: Re: [Xen-devel] Radeon DRM dom0 issues
> > >> >> > > > > > > >
> > >> >> > > > > > > > On Mon, Jan 20, 2014 at 09:58:32AM -0500, 
> Michael D Labriola
> > >> >>
> > >> >> > > wrote:
> > >> >> > > > > > > > > Anyone here running a dom0 w/ Radeon DRM?  I'm 
having
> > >> >> > > consistent
> > >> >> > > > > > > crashes
> > >> >> > > > > > > > > with multiple older R600 series (HD 6470 and 
> HD 6570) and
> > >> >> > > unusably
> > >> >> > > > >
> > >> >> > > > > > > slow
> > >> >> > > > > > > > > graphics with a newer HD7000 (can see each line 
refresh
> > >> >> > > > > individually on
> > >> >> > > > > > >
> > >> >> > > > > > > > > radeonfb tty).  All 3 systems seem to work fine 
bare
> > >> >> metal.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I hadn't been using DRM, just Xserver. Is that 
what you
> > >> >> mean?
> > >> >> > > > > > >
> > >> >> > > > > > > The R600 problems happen when in X, using OpenGL, 
> on my dom0.
> > >> >> The
> > >> >> > >
> > >> >> > > > > > > RadeonSI sluggishness is when using the KMS 
> framebuffer device
> > >> >> for
> > >> >> > > a
> > >> >> > > > > plain
> > >> >> > > > > > > text console login.
> > >> >> > > > > >
> > >> >> > > > > > So sluggish is probably due to the PAT not being 
enabled. This
> > >> >> patch
> > >> >> > > > > > should be applied:
> > >> >> > > > > >
> > >> >> > > > > > lkml.org/lkml/2011/11/8/406
> > >> >> > > > > >
> > >> >> > > > > > (or 
http://marc.info/?l=linux-kernel&m=132888833209874&w=2)
> > >> >> > > > > >
> > >> >> > > > > > and these two reverted:
> > >> >> > > > > >
> > >> >> > > > > >  "xen/pat: Disable PAT support for now."
> > >> >> > > > > >  "xen/pat: Disable PAT using pat_enabled value."
> > >> >> > > > > >
> > >> >> > > > > > Which is to say do:
> > >> >> > > > > >
> > >> >> > > > > > git revert c79c49826270b8b0061b2fca840fc3f013c8a78a
> > >> >> > > > > > git revert 8eaffa67b43e99ae581622c5133e20b0f48bcef1
> > >> >> > > > >
> > >> >> > > > > Thanks!  I cherry-picked that patch out of your testing 
tree,
> > >> >> reverted
> > >> >> > >
> > >> >> > > > > those 2 commits, recompiled and installed.  Definitely fixed the
> > >> >> HD
> > >> >> > > 7000
> > >> >> > > > > sluggishness and appears to have fixed the R600 
> crashes (although
> > >> >> it's
> > >> >> > >
> > >> >> > > > > only been running a few hours).
> > >> >> > > > >
> > >> >> > > > > How come that patch didn't get into mainline?  It looks 
pretty
> > >> >> > > innocuous
> > >> >> > > > > to me...
> > >> >> > > >
> > >> >> > > > <Sigh> the x86 maintainers wanted a different route. And I 
hadn't
> > >> >> had
> > >> >> > > > the chance nor time to implement it.
> > >> >> > >
> > >> >> > > I see.  Well, I've got a handful of boxes in my lab that 
need that
> > >> >> patch
> > >> >> > > to be usable.  If you do come up with a more mainline-able solution,
> > >> >> I'd
> > >> >> > > gladly test it for you.  ;-)
> > >> >> >
> > >> >> > Thank you!
> > >> >>
> > >> >> Uh, oh.  Looks like those reverts and patches didn't entirely fix my
> > >> >> problem.  My box with the HD5450 (r600 gallium3d) started going bonkers
> > >> >> again yesterday.  After being solid as a rock for 2 weeks as my primary
> > >> >> workstation, X has crashed a half dozen or so times so far this week. I've
> > >> >> been in Xen with 2 paravirtual linux guests running almost constantly for
> > >> >> this whole period.  I don't understand what's changed, but my system has
> > >> >> been entirely unstable now.  I did recompile my kernel... but all I did
> > >> >> was merge the v3.13.1 stable commit into my working tree and turn a few
> > >> >> things on (netfilter, wifi, a couple drivers turned on here and there).  I
> > >> >> just went and verified that those patches are still applied in my tree
> > >> >> (i.e., I didn't accidentally undo them).  I'm scratching my head (and
> > >> >> staring at a TTY login).
> > >> >>
> > >> >> When X crashes, my kernel log prints a couple dozen iterations of this. 3d
> > >> >> acceleration no longer functions unless I reboot.  If memory serves, the
> > >> >> unpatched behavior upon X crash was that the kernel continued to spew
> > >> >> these errors until the whole box locked up.  At least that's not happening
> > >> >> any more... ;-)
> > >> >>
> > >> >> [  702.070084] [TTM] radeon 0000:01:00.0: Unable to get page 2
> > >> >> [  702.075971] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [  704.720699] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [  704.726635] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [  704.733910] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (8192, 2, 4096, -12)
> > >> >>
> > >> >> and here's a slightly different variant that happened while I was typing
> > >> >> this email (on a different machine, luckily):
> > >> >>
> > >> >> [ 3107.713039] sdf: detected capacity change from 31625052160 to 0
> > >> >> [ 3114.491717] usb 9-1: USB disconnect, device number 2
> > >> >> [64348.271534] [TTM] radeon 0000:01:00.0: Unable to get page 3
> > >> >> [64348.277312] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64348.284470] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [64348.290257] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64348.297561] [TTM] Buffer eviction failed
> > >> >> [64349.550518] [TTM] radeon 0000:01:00.0: Unable to get page 0
> > >> >> [64349.556417] [TTM] radeon 0000:01:00.0: Failed to fill cached pool (r:-12)!
> > >> >> [64349.563714] [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (16384, 2, 4096, -12)
> > >> >>
> > >> >> Any ideas?
> > >> >
> > >> > yes. I believe you have a memory leak. As in, some driver (or X) is
> > >> > eating up the memory and not giving up enough. That means the TTM
> > >> > layer is hitting its ceiling of how much memory it can allocate.
> > >> >
> > >> > Now finding the culprit is going to be a bit hard.
> > >> >
> > >> > You could use:
> > >> >
> > >> > [root@phenom 1]# cat /sys/kernel/debug/dri/1/ttm_dma_page_pool
> > >> >          pool      refills   pages freed    inuse available     name
> > >> >            wc          259           224      808        4 nouveau 0000:05:00.0
> > >> >        cached      3403058      13561071    51158        3 radeon 0000:01:00.0
> > >> >        cached           25             0       96        4 nouveau 0000:05:00.0
> > >> >
> > >> > to figure out if my thinking is really true. You should have a huge
> > >> > 'inuse' count and almost no 'available'.
> > >>
> > >> My /sys/kernel/debug/dri directory has a 0 and a 64 entry, which appear to
> > >> always have the same contents.  Is that normal?
> > >
> > > Yes.
> > >>
> > >> My /sys/kernel/debug/dri/0/ttm_dma_page_pool file doesn't exist bare
> > >> metal... only in Xen.  Is that normal?
> > >
> > > It would show up on baremetal if you boot with 'iommu=soft'
> > >
> > >>
> > >>          pool      refills   pages freed    inuse available     name
> > >>        cached        15190         59551     1205        4 radeon 0000:01:00.0
> > >>
> > >> If I watch that file while creating xterms, moving them around, etc, I can
> > >> see the number available fluctuate between 3 and 6.  This is true, even on
> > >> my box w/ the newer R7 card in it, which hasn't gotten that GEM error
> > >> message (yet?).
> > >
> > > OK, so let's see what happens when the error shows. Incidentally - what amount of
> > > memory does your initial domain have? And is it different than when you
> > > boot it on bare metal?
> > 
> > I've got the problem very reproducible on 3 boxes.  All three are
> > booting the dom0 with as much RAM as Xen will give them, then giving
> > up some of their RAM as needed when I create domUs.  The 3 boxes have
> > 4G, 8G, and 16G if memory serves.

Actually, they're 6G, 8G, and 16G... and I've got a box that I can't 
reproduce the problem on even though it's got the same video card... and 
it only has 2G of RAM.  Could this be a PAE/HIGHMEM issue?  I'm running 
32bit with CONFIG_HIGHMEM64G on all my boxes.


> > 
> > Does the amount of RAM on the actual video cards matter?  All the
> > older cards (that crash all the time) have 2G, whereas the R7 that
> > hasn't crashed yet only has 1G.
> 
> The TTM pool has a limit (a hard one). It is pretty simple:
> 
> 
>         pr_info("Zone %7s: Available graphics memory: %llu kiB\n",
>                 zone->name, (unsigned long long)zone->max_mem >> 10);
>         }
>         ttm_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));
>         ttm_dma_page_alloc_init(glob, glob->zone_kernel->max_mem/(2*PAGE_SIZE));
> 
> so 1/4 of your memory. Which means that when you boot dom0 with as much
> memory as possible and then balloon down you might confuse it
> (as the initial memory assumption is done during bootup).
> 
> If you boot the troubled dom0s with 'dom0_mem_max' set to some good
> number - that might shed some light on this.

Ok, I've got one of the problematic boxes booted with dom0_mem=5G and it 
doesn't seem to be crashing.  Fingers crossed!


> 
> 
> > 
> > I've been reproducing the crash by just logging in and out of fluxbox
> > via XDM over and over again right after booting my dom0 in Xen w/ no
> > guests running.  That makes it happen within a few minutes.  Otherwise
> > it randomly crashes while I'm in the middle of trying to work... ;-)
> 
> HA!
> 
> Does fluxbox use a lot of graphics? I mean does it do a lot of fancy
> things when it starts and shuts itself down?

Negative.  It does next to nothing.  Super lightweight, pretty much just 
gets rid of the login box and puts a taskbar-type-thing on the bottom of 
the screen.  I'd say the majority of my crashes have happened in 
Enlightenment (with plenty of extra fancy things), but it HAS happened in 
fluxbox doing next to nothing.  Which was pretty surprising.


---
Michael D Labriola
Electric Boat
mlabriol@gdeb.com
401-848-8871 (desk)
401-848-8513 (lab)
401-316-9844 (cell)


end of thread, other threads:[~2014-02-19 21:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-20 14:58 Radeon DRM dom0 issues Michael D Labriola
2014-01-20 15:14 ` Konrad Rzeszutek Wilk
2014-01-20 15:26   ` Michael D Labriola
2014-01-20 15:38     ` Konrad Rzeszutek Wilk
2014-01-20 20:15       ` Michael D Labriola
2014-01-21 21:59         ` Konrad Rzeszutek Wilk
2014-01-23 16:54           ` Michael D Labriola
2014-01-24 14:49             ` Konrad Rzeszutek Wilk
2014-02-11 15:35               ` Michael D Labriola
2014-02-19 17:04                 ` Konrad Rzeszutek Wilk
2014-02-19 19:33                   ` Michael Labriola
2014-02-19 19:57                     ` Konrad Rzeszutek Wilk
2014-02-19 20:08                       ` Michael Labriola
2014-02-19 20:30                         ` Konrad Rzeszutek Wilk
2014-02-19 21:02                           ` Michael D Labriola
