From: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
To: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>,
	alexander.deucher@amd.com, Harry Wentland <harry.wentland@amd.com>,
	Leo Li <sunpeng.li@amd.com>, amd-gfx@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Subject: amdgpu crashes on OOM
Date: Mon, 26 Oct 2020 00:29:00 -0400
Message-ID: <1603684905.h43s1t0y05.none@localhost>
In-Reply-To: 1603684905.h43s1t0y05.none.ref@localhost

Hi,

I frequently encounter OOM on my system, mostly due to my own fault.
Recently, I noticed that not only does a swap storm happen and OOM
killer gets invoked, but the graphics output freezes permanently.
Checking the kernel messages, I see:

    kworker/u24:4: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
    CPU: 6 PID: 279469 Comm: kworker/u24:4 Tainted: G W 5.9.0-14732-g20b1adb60cf6 #2
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Pro4, BIOS P4.20 06/18/2020
    Workqueue: events_unbound commit_work
    Call Trace:
     ? dump_stack+0x57/0x6a
     ? warn_alloc.cold+0x69/0xcd
     ? __alloc_pages_direct_compact+0xfb/0x116
     ? __alloc_pages_slowpath.constprop.0+0x9c2/0xc14
     ? __alloc_pages_nodemask+0x143/0x167
     ? kmalloc_order+0x24/0x64
     ? dc_create_state+0x1a/0x4d
     ? amdgpu_dm_atomic_commit_tail+0x1b19/0x227d

followed by:

    WARNING: CPU: 6 PID: 279469 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7511 amdgpu_dm_atomic_commit_tail+0x217c/0x227d

followed by:

    BUG: unable to handle page fault for address: 0000000000012480
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    [ ... ]
    RIP: 0010:dc_resource_state_copy_construct+0x10/0x455
    [ ... ]
    Call Trace:
     ? amdgpu_dm_atomic_commit_tail+0x2193/0x227

This area of code is quite odd:

    dc_state_temp = dc_create_state(dm->dc);
    ASSERT(dc_state_temp);
    dc_state = dc_state_temp;
    dc_resource_state_copy_construct_current(dm->dc, dc_state);

This ASSERT macro is misleading: unless CONFIG_DEBUG_KERNEL_DC is set,
it is actually WARN_ON_ONCE(!(expr)). Therefore, this code fails to
allocate memory (causing a warning to be printed), prints another
warning that it failed, then proceeds to immediately dereference it,
crashing the thread (and the kernel if panic_on_oops is set).

While I am not by any means a graphics or kernel expert, it seems to me
like there should be a better solution than crashing. If nothing else,
the OOM killer should be invoked and the operation retried. We may lose
some frames or see some corruption, but that's far better than totally
breaking.

Thanks,
Alex.
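For illustration only, the pattern described above (warn on a failed allocation, then use the pointer anyway) can be contrasted with a checked variant in a small userspace model. This is a sketch, not the driver's code and not a proposed patch: every name below (demo_state, demo_warn_on, demo_create_state, demo_copy_state, demo_commit_tail) is hypothetical, the simulate_oom flag stands in for real memory pressure, and it assumes the caller is able to receive an error code, which may not be true of the real commit path.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large state object. */
struct demo_state {
    int stream_count;
    char payload[4096];
};

/*
 * Warn-and-continue check, modelled on how the message describes ASSERT
 * without CONFIG_DEBUG_KERNEL_DC: it prints a warning but does not stop
 * execution. Returns the condition so callers can branch on it.
 */
static int demo_warn_on(int cond, const char *what)
{
    if (cond)
        fprintf(stderr, "WARNING: %s\n", what);
    return cond;
}

/* Allocation that can be forced to fail to simulate memory pressure. */
static struct demo_state *demo_create_state(int simulate_oom)
{
    if (simulate_oom)
        return NULL;
    return calloc(1, sizeof(struct demo_state));
}

/* Copy step; both pointers must be valid. */
static void demo_copy_state(struct demo_state *dst,
                            const struct demo_state *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof(*dst));
}

/*
 * Checked variant: the warning is still printed, but a failed allocation
 * ends the operation with -ENOMEM instead of reaching the copy step.
 */
static int demo_commit_tail(const struct demo_state *current,
                            int simulate_oom)
{
    struct demo_state *tmp = demo_create_state(simulate_oom);
    int ok;

    if (demo_warn_on(tmp == NULL, "state allocation failed"))
        return -ENOMEM;

    demo_copy_state(tmp, current);
    /* A real commit would use the copied state here. */
    ok = (tmp->stream_count == current->stream_count);
    free(tmp);
    return ok ? 0 : -EINVAL;
}
```

Calling demo_commit_tail(&cur, 1) returns -ENOMEM after printing one warning, while demo_commit_tail(&cur, 0) copies the state and returns 0. Whether the real code can bail out this way, or would need the allocation moved to an earlier point where failure can still be reported, is a question for the driver maintainers.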