From: Youngmin Nam <youngmin.nam@samsung.com>
To: rafael@kernel.org, len.brown@intel.com, pavel@ucw.cz,
	gregkh@linuxfoundation.org
Cc: linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	d7271.choe@samsung.com, janghyuck.kim@samsung.com,
	hyesoo.yu@samsung.com, youngmin.nam@samsung.com
Subject: [BUG] mutex deadlock of dpm_resume() in low memory situation
Date: Wed, 27 Dec 2023 17:42:50 +0900	[thread overview]
Message-ID: <ZYvjiqX6EsL15moe@perf> (raw)
In-Reply-To: CGME20231227084252epcas2p3b063f7852f81f82cd0a31afd7f404db4@epcas2p3.samsung.com

Hi all.

I'm reporting an issue that looks like an upstream kernel bug.
We are running 6.1, but I think every kernel version is affected.

A mutex deadlock can occur in a low-memory situation when the device runs dpm_resume().
Here's the problematic situation.

#1. Currently, the device is in a low-memory situation, as shown below.
[4: binder:569_5:27109] SLUB: Unable to allocate memory on node -1, gfp=0xb20(GFP_ATOMIC|__GFP_ZERO)
[4: binder:569_5:27109] cache: kmalloc-128, object size: 128, buffer size: 128, default order: 0, min order: 0
[4: binder:569_5:27109] node 0: slabs: 1865, objs: 59680, free: 0
[4: binder:569_5:27109] binder:569_5: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[4: binder:569_5:27109] CPU: 4 PID: 27109 Comm: binder:569_5 Tainted: G S C E 6.1.43-android14-11-abS921BXXU1AWLB #1
[4: binder:569_5:27109] Hardware name: Samsung E1S EUR OPENX board based on S5E9945 (DT)
[4: binder:569_5:27109] Call trace:
[4: binder:569_5:27109] dump_backtrace+0xf4/0x118
[4: binder:569_5:27109] show_stack+0x18/0x24
[4: binder:569_5:27109] dump_stack_lvl+0x60/0x7c
[4: binder:569_5:27109] dump_stack+0x18/0x38
[4: binder:569_5:27109] warn_alloc+0xf4/0x190
[4: binder:569_5:27109] __alloc_pages_slowpath+0x10ec/0x12ac
[4: binder:569_5:27109] __alloc_pages+0x27c/0x2fc
[4: binder:569_5:27109] new_slab+0x17c/0x4e0
[4: binder:569_5:27109] ___slab_alloc+0x4e4/0x8a8
[4: binder:569_5:27109] __slab_alloc+0x34/0x6c
[4: binder:569_5:27109] __kmem_cache_alloc_node+0x1f4/0x260
[4: binder:569_5:27109] kmalloc_trace+0x4c/0x144
[4: binder:569_5:27109] async_schedule_node_domain+0x40/0x1ec
[4: binder:569_5:27109] async_schedule_node+0x18/0x28
[4: binder:569_5:27109] dpm_suspend+0xfc/0x48c

#2. Process A runs dpm_resume() and acquires the "dpm_list_mtx" lock, as below.
1000 void dpm_resume(pm_message_t state)
1001 {
1002         struct device *dev;
1003         ktime_t starttime = ktime_get();
1004
1005         trace_suspend_resume(TPS("dpm_resume"), state.event, true);
1006         might_sleep();
1007
1008         mutex_lock(&dpm_list_mtx);  <-------- process A acquired the lock
1009         pm_transition = state;
1010         async_error = 0;
1011
1012         list_for_each_entry(dev, &dpm_suspended_list, power.entry)
1013                 dpm_async_fn(dev, async_resume);

#3. Process A then proceeds through the following call chain (a simplified sketch of these helpers follows the chain):
dpm_async_fn()
      --> async_schedule_dev()
          --> async_schedule_node()
              --> async_schedule_node_domain()
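
For reference, here is a simplified sketch of these helpers as of v6.1, abridged from drivers/base/power/main.c and include/linux/async.h (details trimmed, so treat it as an approximation rather than an exact quote):

static bool dpm_async_fn(struct device *dev, async_func_t func)
{
        reinit_completion(&dev->power.completion);

        if (is_async(dev)) {
                get_device(dev);
                /* hand off to the async machinery */
                async_schedule_dev(func, dev);
                return true;
        }

        return false;
}

static inline async_cookie_t async_schedule_dev(async_func_t func,
                                                struct device *dev)
{
        return async_schedule_node(func, dev, dev_to_node(dev));
}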

#4. The kzalloc() call fails because of the low-memory situation.
165 async_cookie_t async_schedule_node_domain(async_func_t func, void *data,
166                                           int node, struct async_domain *domain)
167 {
168         struct async_entry *entry;
169         unsigned long flags;
170         async_cookie_t newcookie;
171
172         /* allow irq-off callers */
173         entry = kzalloc(sizeof(struct async_entry), GFP_ATOMIC); <--- will fail here
174
175         /*
176          * If we're out of memory or if there's too much work
177          * pending already, we execute synchronously.
178          */
179         if (!entry || atomic_read(&entry_count) > MAX_WORK) {
180                 kfree(entry);
181                 spin_lock_irqsave(&async_lock, flags);
182                 newcookie = next_cookie++;
183                 spin_unlock_irqrestore(&async_lock, flags);
184
185                 /* low on memory.. run synchronously */
186         func(data, newcookie); <--- process A runs func(), which is async_resume()
187                 return newcookie;
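
In other words, when the allocation fails, the async core executes the callback synchronously in the caller's context. Since kernel mutexes are not recursive, this is fatal whenever the caller already holds a lock that the callback also takes. Here is a minimal, purely hypothetical illustration of the pattern (made-up names, not kernel code):

static DEFINE_MUTEX(m);

static void work_fn(void *data, async_cookie_t cookie)
{
        mutex_lock(&m);         /* deadlocks if the current task owns m */
        /* do the work */
        mutex_unlock(&m);
}

static void caller(void)
{
        mutex_lock(&m);
        /* If kzalloc() fails inside the async core, work_fn() runs
         * synchronously right here, while m is still held. */
        async_schedule(work_fn, NULL);
        mutex_unlock(&m);
}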

#5. Process A then continues down the following call chain (again sketched below):
async_resume()
    --> device_resume()
        --> dpm_wait_for_superior()
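
Again, a simplified sketch of this path as of v6.1 (abridged; error handling and unrelated checks omitted):

static void async_resume(void *data, async_cookie_t cookie)
{
        struct device *dev = data;
        int error;

        error = device_resume(dev, pm_transition, true);
        if (error)
                pm_dev_err(dev, pm_transition, " async", error);
        put_device(dev);
}

static int device_resume(struct device *dev, pm_message_t state, bool async)
{
        int error = 0;

        /* ... syscore and direct_complete handling omitted ... */

        if (!dpm_wait_for_superior(dev, async))  /* acquires dpm_list_mtx */
                goto Complete;

        /* ... run the resume callbacks ... */
 Complete:
        return error;
}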

#6. Process A now tries to acquire the very lock it already holds:
278 static bool dpm_wait_for_superior(struct device *dev, bool async)
279 {
280         struct device *parent;
281
282         /*
283          * If the device is resumed asynchronously and the parent's callback
284          * deletes both the device and the parent itself, the parent object may
285          * be freed while this function is running, so avoid that by reference
286          * counting the parent once more unless the device has been deleted
287          * already (in which case return right away).
288          */
289         mutex_lock(&dpm_list_mtx);

So we think this situation can cause a mutex deadlock in the suspend/resume sequence.

Here's process A's call stack from the kernel log (binder:569_5):
I[4:      swapper/4:    0]      pid            uTime            sTime     last_arrival      last_queued   stat   cpu  task_struct           comm [wait channel]
I[4:      swapper/4:    0]    27109                0         1758074533    2396019774802             0    D(  2)  5   ffffff8044bc92c0      binder:569_5 [dpm_wait_for_superior]
I[4:      swapper/4:    0] Mutex: dpm_list_mtx+0x0/0x30: owner[0xffffff8044bc92c0 binder:569_5 :27109]
I[4:      swapper/4:    0] Call trace:
I[4:      swapper/4:    0]  __switch_to+0x174/0x338
I[4:      swapper/4:    0]  __schedule+0x5ec/0x9cc
I[4:      swapper/4:    0]  schedule+0x7c/0xe8
I[4:      swapper/4:    0]  schedule_preempt_disabled+0x24/0x40
I[4:      swapper/4:    0]  __mutex_lock+0x408/0xdac
I[4:      swapper/4:    0]  __mutex_lock_slowpath+0x14/0x24
I[4:      swapper/4:    0]  mutex_lock+0x40/0xec  <-------- trying to acquire dpm_list_mtx again
I[4:      swapper/4:    0]  dpm_wait_for_superior+0x30/0x148
I[4:      swapper/4:    0]  device_resume+0x38/0x1e4
I[4:      swapper/4:    0]  async_resume+0x24/0xf4
I[4:      swapper/4:    0]  async_schedule_node_domain+0xb0/0x1ec
I[4:      swapper/4:    0]  async_schedule_node+0x18/0x28
I[4:      swapper/4:    0]  dpm_resume+0xbc/0x578  <------- acquired dpm_list_mtx
I[4:      swapper/4:    0]  dpm_resume_end+0x1c/0x38
I[4:      swapper/4:    0]  suspend_devices_and_enter+0x83c/0xb2c
I[4:      swapper/4:    0]  pm_suspend+0x34c/0x618
I[4:      swapper/4:    0]  state_store+0x104/0x144
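
One conceivable direction, sketched here purely hypothetically (the patch series later in this thread takes essentially this route with async_schedule_dev_nocall()), is an async-scheduling variant that reports failure instead of silently falling back to a synchronous call, so the caller can drop dpm_list_mtx before running the function itself:

/* Hypothetical sketch: try to queue async work, but never execute
 * func() synchronously; return false on failure so the caller can
 * release its locks and then run func() itself. */
bool async_schedule_dev_nocall(async_func_t func, struct device *dev)
{
        struct async_entry *entry;

        entry = kzalloc(sizeof(struct async_entry), GFP_KERNEL);

        /* Give up if out of memory or too much work is pending. */
        if (!entry || atomic_read(&entry_count) > MAX_WORK) {
                kfree(entry);
                return false;
        }

        /* ... fill in entry and queue it exactly as
         * async_schedule_node_domain() does ... */
        return true;
}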

Could you look into this issue?

Thread overview: 25+ messages
     [not found] <CGME20231227084252epcas2p3b063f7852f81f82cd0a31afd7f404db4@epcas2p3.samsung.com>
2023-12-27  8:42 ` Youngmin Nam [this message]
2023-12-27 16:08   ` [BUG] mutex deadlock of dpm_resume() in low memory situation Greg KH
2023-12-27 17:44     ` Rafael J. Wysocki
2023-12-27 18:39     ` Rafael J. Wysocki
2023-12-27 18:58       ` Rafael J. Wysocki
2023-12-27 20:50         ` Rafael J. Wysocki
2023-12-28  6:40           ` Youngmin Nam
2023-12-27 20:35       ` [PATCH v1 0/3] PM: sleep: Fix possible device suspend-resume deadlocks Rafael J. Wysocki
2023-12-27 20:37         ` [PATCH v1 1/3] async: Split async_schedule_node_domain() Rafael J. Wysocki
2023-12-27 20:38         ` [PATCH v1 2/3] async: Introduce async_schedule_dev_nocall() Rafael J. Wysocki
2023-12-28 20:29           ` Stanislaw Gruszka
2023-12-29 13:37             ` Rafael J. Wysocki
2023-12-29  3:08               ` Stanislaw Gruszka
2023-12-29 16:36                 ` Rafael J. Wysocki
2024-01-02  7:09                   ` Stanislaw Gruszka
2024-01-02 13:15                     ` Rafael J. Wysocki
2023-12-27 20:41         ` [PATCH v1 3/3] PM: sleep: Fix possible deadlocks in core system-wide PM code Rafael J. Wysocki
2024-01-02 13:35           ` Ulf Hansson
2024-01-02 13:53             ` Rafael J. Wysocki
2024-01-03 10:17               ` Ulf Hansson
2024-01-03 10:27                 ` Rafael J. Wysocki
2024-01-03 10:33                 ` Greg KH
2024-01-02 13:18         ` [PATCH v1 0/3] PM: sleep: Fix possible device suspend-resume deadlocks Rafael J. Wysocki
2024-01-03  4:39           ` Youngmin Nam
2024-01-03 10:28             ` Rafael J. Wysocki
