* amdgpu crashes on OOM
From: Alex Xu (Hello71) @ 2020-10-26  4:29 UTC
  To: Nicholas Kazlauskas, alexander.deucher, Harry Wentland, Leo Li, amd-gfx
  Cc: linux-kernel

Hi,

I frequently encounter OOM on my system, mostly through my own fault. 
Recently, I noticed that not only does a swap storm happen and the OOM 
killer get invoked, but the graphics output also freezes permanently. 
Checking the kernel messages, I see:

kworker/u24:4: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: G        W         5.9.0-14732-g20b1adb60cf6 #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P4.20 06/18/2020
Workqueue: events_unbound commit_work
Call Trace:
 ? dump_stack+0x57/0x6a
 ? warn_alloc.cold+0x69/0xcd
 ? __alloc_pages_direct_compact+0xfb/0x116
 ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
 ? __alloc_pages_nodemask+0x143/0x167
 ? kmalloc_order+0x24/0x64
 ? dc_create_state+0x1a/0x4d
 ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d

followed by:

WARNING: CPU: 6 PID: 279469 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7511 amdgpu_dm_atomic_commit_tail+0x217c/0x227d

followed by:

BUG: unable to handle page fault for address: 0000000000012480
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
[ ... ]
RIP: 0010:dc_resource_state_copy_construct+0x10/0x455
[ ... ]
Call Trace:
 ? amdgpu_dm_atomic_commit_tail+0x2193/0x227

This area of code is quite odd:

dc_state_temp = dc_create_state(dm->dc);
ASSERT(dc_state_temp);
dc_state = dc_state_temp;
dc_resource_state_copy_construct_current(dm->dc, dc_state);

This ASSERT macro is misleading: unless CONFIG_DEBUG_KERNEL_DC is set, 
it is actually WARN_ON_ONCE(!(expr)). Therefore, this code fails to 
allocate memory (causing a warning to be printed), prints another 
warning that it failed, then proceeds to immediately dereference it, 
crashing the thread (and the kernel if panic_on_oops is set).
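
To make the pattern concrete, here is a minimal userspace sketch in C of 
checking an allocation and bailing out with an error instead of warning 
and then dereferencing NULL. It is not the driver's code: fake_state, 
fake_create_state and fake_commit are hypothetical names, and the 
simulate_oom flag only stands in for a failed allocation.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's state object. */
struct fake_state {
	int stream_count;
};

/* Hypothetical stand-in for an allocator that can fail under memory
 * pressure: returns NULL when simulate_oom is non-zero. */
static struct fake_state *fake_create_state(int simulate_oom)
{
	if (simulate_oom)
		return NULL;
	return calloc(1, sizeof(struct fake_state));
}

/* Checks the allocation result and returns -ENOMEM on failure,
 * rather than only warning and then using the NULL pointer. */
static int fake_commit(int simulate_oom)
{
	struct fake_state *state = fake_create_state(simulate_oom);

	if (!state) {
		fprintf(stderr, "state allocation failed, skipping commit\n");
		return -ENOMEM;
	}

	state->stream_count = 1;	/* safe: state is known to be non-NULL */
	free(state);
	return 0;
}
```

Calling fake_commit(1) returns -ENOMEM without touching the missing 
state. Whether the real commit path can propagate or recover from such 
an error at that point is a separate question that this sketch does not 
answer.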

While I am not by any means a graphics or kernel expert, it seems to me 
like there should be a better solution than crashing. If nothing else, 
the OOM killer should be invoked and the operation retried. We may lose 
some frames or see some corruption, but that's far better than totally 
breaking.

Thanks,
Alex.

* Re: amdgpu crashes on OOM
From: Michel Dänzer @ 2020-10-26 11:03 UTC
  To: Alex Xu (Hello71),
	Nicholas Kazlauskas, alexander.deucher, Harry Wentland, Leo Li,
	amd-gfx
  Cc: linux-kernel

On 2020-10-26 5:29 a.m., Alex Xu (Hello71) wrote:
> Hi,
> 
> I frequently encounter OOM on my system, mostly due to my own fault.
> Recently, I noticed that not only does a swap storm happen and OOM
> killer gets invoked, but the graphics output freezes permanently.
> Checking the kernel messages, I see:
> 
> kworker/u24:4: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
> CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: G        W         5.9.0-14732-g20b1adb60cf6 #2
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P4.20 06/18/2020
> Workqueue: events_unbound commit_work
> Call Trace:
>   ? dump_stack+0x57/0x6a
>   ? warn_alloc.cold+0x69/0xcd
>   ? __alloc_pages_direct_compact+0xfb/0x116
>   ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
>   ? __alloc_pages_nodemask+0x143/0x167
>   ? kmalloc_order+0x24/0x64
>   ? dc_create_state+0x1a/0x4d
>   ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d

Looks like dc_create_state should use kvzalloc instead of kzalloc 
(dc_state_free already uses kvfree).

order:5 means it's trying to allocate 32 physically contiguous pages, 
which can be hard to fulfill even with lower memory pressure.
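
The arithmetic behind "order:5" can be sketched in a few lines of C. An 
order-n request asks for 2^n contiguous pages. The 4096-byte page size 
used in the note below is an assumption (typical for x86-64), and the 
helper names are hypothetical.

```c
#include <stddef.h>

/* Number of contiguous pages requested by an allocation of this order. */
static size_t pages_for_order(unsigned int order)
{
	return (size_t)1 << order;
}

/* Total bytes for that order. page_size is supplied by the caller;
 * 4096 is assumed for typical x86-64 systems. */
static size_t bytes_for_order(unsigned int order, size_t page_size)
{
	return pages_for_order(order) * page_size;
}
```

With these helpers, pages_for_order(5) is 32 and bytes_for_order(5, 4096) 
is 131072, i.e. a single 128 KiB physically contiguous block.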


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

* RE: amdgpu crashes on OOM
From: Deucher, Alexander @ 2020-10-26 14:34 UTC
  To: Michel Dänzer, Alex Xu (Hello71),
	Kazlauskas, Nicholas, Wentland, Harry, Li, Sun peng (Leo),
	amd-gfx
  Cc: linux-kernel

[AMD Public Use]

> -----Original Message-----
> From: Michel Dänzer <michel@daenzer.net>
> Sent: Monday, October 26, 2020 7:04 AM
> To: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>; Kazlauskas, Nicholas
> <Nicholas.Kazlauskas@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Wentland, Harry
> <Harry.Wentland@amd.com>; Li, Sun peng (Leo) <Sunpeng.Li@amd.com>;
> amd-gfx@lists.freedesktop.org
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: amdgpu crashes on OOM
> 
> On 2020-10-26 5:29 a.m., Alex Xu (Hello71) wrote:
> > Hi,
> >
> > I frequently encounter OOM on my system, mostly due to my own fault.
> > Recently, I noticed that not only does a swap storm happen and OOM
> > killer gets invoked, but the graphics output freezes permanently.
> > Checking the kernel messages, I see:
> >
> > kworker/u24:4: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
> > CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: G        W         5.9.0-14732-g20b1adb60cf6 #2
> > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P4.20 06/18/2020
> > Workqueue: events_unbound commit_work
> > Call Trace:
> >   ? dump_stack+0x57/0x6a
> >   ? warn_alloc.cold+0x69/0xcd
> >   ? __alloc_pages_direct_compact+0xfb/0x116
> >   ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
> >   ? __alloc_pages_nodemask+0x143/0x167
> >   ? kmalloc_order+0x24/0x64
> >   ? dc_create_state+0x1a/0x4d
> >   ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d
> 
> Looks like dc_create_state should use kvzalloc instead of kzalloc
> (dc_state_free already uses kvfree).
> 
> order:5 means it's trying to allocate 32 physically contiguous pages, which can
> be hard to fulfill even with lower memory pressure.
> 

It was using kvzalloc, but was accidentally dropped when that code was refactored.  I just sent a patch to fix it.
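
For readers unfamiliar with the idea, the following userspace sketch in C 
shows the general "try contiguous first, then fall back" pattern that a 
kvzalloc-style allocator provides. It is only an illustration under 
stated assumptions: contig_zalloc, noncontig_zalloc, 
zalloc_with_fallback and the fragmented flag are all hypothetical, and 
both paths are backed by calloc() here, so this is not how the kernel 
implements it.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Result of the allocation plus which path served it. */
struct alloc_result {
	void *ptr;
	bool used_fallback;
};

/* Simulated contiguous allocator: fails when memory is "fragmented".
 * Plays the role that kmalloc() has in the kernel. */
static void *contig_zalloc(size_t size, bool fragmented)
{
	return fragmented ? NULL : calloc(1, size);
}

/* Simulated non-contiguous allocator. Plays the role that vmalloc()
 * has in the kernel. */
static void *noncontig_zalloc(size_t size)
{
	return calloc(1, size);
}

/* Try the contiguous path first; fall back only if it fails. */
static struct alloc_result zalloc_with_fallback(size_t size, bool fragmented)
{
	struct alloc_result r = { contig_zalloc(size, fragmented), false };

	if (!r.ptr) {
		r.ptr = noncontig_zalloc(size);
		r.used_fallback = (r.ptr != NULL);
	}
	return r;
}
```

In this sketch a 128 KiB request with fragmented set to true still 
succeeds via the fallback path, which is the behaviour the thread 
expects from restoring kvzalloc. The caller releases the memory with 
free() in either case, loosely mirroring how kvfree handles both kinds 
of allocation.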

Alex

> 
> --
> Earthling Michel Dänzer               |               https://redhat.com
> Libre software enthusiast             |             Mesa and X developer

* RE: amdgpu crashes on OOM
From: Alex Xu (Hello71) @ 2020-10-26 14:50 UTC
  To: Deucher, Alexander, amd-gfx, Wentland, Harry, Michel Dänzer,
	Kazlauskas, Nicholas, Li, Sun peng (Leo)
  Cc: linux-kernel

Excerpts from Deucher, Alexander's message of October 26, 2020 10:34 am:
> It was using kvzalloc, but was accidentally dropped when that code was refactored.  I just sent a patch to fix it.

Ah, that explains why I wasn't seeing it before. I was only looking at 
changes in amdgpu_dm_atomic_commit_tail, not dc_create_state.

Thanks,
Alex.
