"mm: move pcp and lru-pcp draining into single wq" broke resume from s2ram
diff mbox series

Message ID CAMuHMdUJSfrZ=2zy88_zojDek3CHEWKhv_qoJAVgDpPWz8V=Ew@mail.gmail.com
State New, archived
Headers show
Series
  • "mm: move pcp and lru-pcp draining into single wq" broke resume from s2ram
Related show

Commit Message

Geert Uytterhoeven April 18, 2017, 7:56 p.m. UTC
Hi all,

On Sat, Apr 8, 2017 at 10:48 AM, Linux Kernel Mailing List
<linux-kernel@vger.kernel.org> wrote:
> Web:        https://git.kernel.org/torvalds/c/ce612879ddc78ea7e4de4be80cba4ebf9caa07ee
> Commit:     ce612879ddc78ea7e4de4be80cba4ebf9caa07ee
> Parent:     cdcf4330d5660998d06fcd899b443693ab3d652f
> Refname:    refs/heads/master
> Author:     Michal Hocko <mhocko@suse.com>
> AuthorDate: Fri Apr 7 16:05:05 2017 -0700
> Committer:  Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Sat Apr 8 00:47:49 2017 -0700
>
>     mm: move pcp and lru-pcp draining into single wq
>
>     We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
>     vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
>     per cpu lru caches.  This seems more than necessary because both can run
>     on a single WQ.  Both do not block on locks requiring a memory
>     allocation nor perform any allocations themselves.  We will save one
>     rescuer thread this way.
>
>     On the other hand drain_all_pages() queues work on the system wq which
>     doesn't have rescuer and so this depend on memory allocation (when all
>     workers are stuck allocating and new ones cannot be created).
>
>     Initially we thought this would be more of a theoretical problem but
>     Hugh Dickins has reported:
>
>     : 4.11-rc has been giving me hangs after hours of swapping load.  At
>     : first they looked like memory leaks ("fork: Cannot allocate memory");
>     : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
>     : before looking at /proc/meminfo one time, and the stat_refresh stuck
>     : in D state, waiting for completion of flush_work like many kworkers.
>     : kthreadd waiting for completion of flush_work in drain_all_pages().
>
>     This worker should be using WQ_RECLAIM as well in order to guarantee a
>     forward progress.  We can reuse the same one as for lru draining and
>     vmstat.
>
>     Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
>     Signed-off-by: Michal Hocko <mhocko@suse.com>
>     Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>     Acked-by: Vlastimil Babka <vbabka@suse.cz>
>     Acked-by: Mel Gorman <mgorman@suse.de>
>     Tested-by: Yang Li <pku.leo@gmail.com>
>     Tested-by: Hugh Dickins <hughd@google.com>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

This commit broke resume from s2ram on some of my Renesas ARM boards.
On some boards the hang is 100% reproducible, on others it's intermittent
(which was a PITA, as I had to bisect another independent s2ram bug as well).

On r8a7791/koelsch:

Comments

Michal Hocko April 18, 2017, 8:19 p.m. UTC | #1
On Tue 18-04-17 21:56:56, Geert Uytterhoeven wrote:
> Hi all,
> 
> On Sat, Apr 8, 2017 at 10:48 AM, Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org> wrote:
> > Web:        https://git.kernel.org/torvalds/c/ce612879ddc78ea7e4de4be80cba4ebf9caa07ee
> > Commit:     ce612879ddc78ea7e4de4be80cba4ebf9caa07ee
> > Parent:     cdcf4330d5660998d06fcd899b443693ab3d652f
> > Refname:    refs/heads/master
> > Author:     Michal Hocko <mhocko@suse.com>
> > AuthorDate: Fri Apr 7 16:05:05 2017 -0700
> > Committer:  Linus Torvalds <torvalds@linux-foundation.org>
> > CommitDate: Sat Apr 8 00:47:49 2017 -0700
> >
> >     mm: move pcp and lru-pcp draining into single wq
> >
> >     We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
> >     vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
> >     per cpu lru caches.  This seems more than necessary because both can run
> >     on a single WQ.  Both do not block on locks requiring a memory
> >     allocation nor perform any allocations themselves.  We will save one
> >     rescuer thread this way.
> >
> >     On the other hand drain_all_pages() queues work on the system wq which
> >     doesn't have rescuer and so this depend on memory allocation (when all
> >     workers are stuck allocating and new ones cannot be created).
> >
> >     Initially we thought this would be more of a theoretical problem but
> >     Hugh Dickins has reported:
> >
> >     : 4.11-rc has been giving me hangs after hours of swapping load.  At
> >     : first they looked like memory leaks ("fork: Cannot allocate memory");
> >     : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
> >     : before looking at /proc/meminfo one time, and the stat_refresh stuck
> >     : in D state, waiting for completion of flush_work like many kworkers.
> >     : kthreadd waiting for completion of flush_work in drain_all_pages().
> >
> >     This worker should be using WQ_RECLAIM as well in order to guarantee a
> >     forward progress.  We can reuse the same one as for lru draining and
> >     vmstat.
> >
> >     Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
> >     Signed-off-by: Michal Hocko <mhocko@suse.com>
> >     Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >     Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >     Acked-by: Mel Gorman <mgorman@suse.de>
> >     Tested-by: Yang Li <pku.leo@gmail.com>
> >     Tested-by: Hugh Dickins <hughd@google.com>
> >     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> This commit broke resume from s2ram on some of my Renesas ARM boards.
> On some boards the hang is 100% reproducible, on others it's intermittent
> (which was a PITA, as I had to bisect another independent s2ram bug as well).

Hmm, I am rather confused, how the above commit could change anything
here. Your lockup detector is hitting
dpm_wait_for_superior
        dpm_wait(dev->parent, async);
	dpm_wait_for_suppliers(dev, async);

which in turn waits wait_for_completion(&dev->power.completion)

the above commit has reduced the load on the system WQ. It also removed
one WQ and reused the existing one. The work done on the mm_percpu_wq
doesn't block so I suspect that what you are seeing is just showing a
real bug somewhere else. I will have a look tomorrow. Let's add Tejun,
maybe I have introduced some subtle dependency, which is not clear to
me.

> 
> On r8a7791/koelsch:
> 
> --- /tmp/good   2017-04-18 21:47:04.457156167 +0200
> +++ /tmp/bad    2017-04-18 21:43:26.215240325 +0200
> @@ -13,11 +13,178 @@ Enabling non-boot CPUs ...
>  CPU1 is up
>  PM: noirq resume of devices complete after N.N msecs
>  PM: early resume of devices complete after N.N msecs
> -Micrel KSZ8041RNLI ee700000.ethernet-ffffffff:01: attached PHY driver
> [Micrel KSZ8041RNLI] (mii_bus:phy_addr=ee700000.ethernet-ffffffff:01,
> irq=-1)
> -PM: resume of devices complete after N.N msecs
> -PM: resume devices took N.N seconds
> -PM: Finishing wakeup.
> -Restarting tasks ... done.
> -ata1: link resume succeeded after 1 retries
> -ata1: SATA link down (SStatus 0 SControl 300)
> -sh-eth ee700000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
> +INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
> +      Not tainted 4.11.0-rc7-koelsch-00426-g70412b99f7936b37 #3470
> +"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> +kworker/u4:0    D    0     5      2 0x00000000
> +Workqueue: events_unbound async_run_entry_fn
> +[<c0712c58>] (__schedule) from [<c0712f7c>] (schedule+0xb0/0xcc)
> +[<c0712f7c>] (schedule) from [<c0717140>] (schedule_timeout+0x18/0x1f4)
> +[<c0717140>] (schedule_timeout) from [<c07139f8>] (wait_for_common+0x100/0x19c)
> +[<c07139f8>] (wait_for_common) from [<c04d8488>]
> (dpm_wait_for_superior+0x14/0x5c)
> +[<c04d8488>] (dpm_wait_for_superior) from [<c04d8aa4>]
> (device_resume+0x40/0x1a0)
> +[<c04d8aa4>] (device_resume) from [<c04d8c1c>] (async_resume+0x18/0x44)
> +[<c04d8c1c>] (async_resume) from [<c023db34>] (async_run_entry_fn+0x44/0x114)
> +[<c023db34>] (async_run_entry_fn) from [<c0236544>]
> (process_one_work+0x1cc/0x31c)
> +[<c0236544>] (process_one_work) from [<c0236ca0>] (worker_thread+0x2b8/0x3f0)
> +[<c0236ca0>] (worker_thread) from [<c023b240>] (kthread+0x120/0x140)
> +[<c023b240>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
> +INFO: task kworker/u4:1:125 blocked for more than 120 seconds.
> +      Not tainted 4.11.0-rc7-koelsch-00426-g70412b99f7936b37 #3470
> +"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> +kworker/u4:1    D    0   125      2 0x00000000
> +Workqueue: events_unbound async_run_entry_fn
> +[<c0712c58>] (__schedule) from [<c0712f7c>] (schedule+0xb0/0xcc)
> +[<c0712f7c>] (schedule) from [<c0717140>] (schedule_timeout+0x18/0x1f4)
> +[<c0717140>] (schedule_timeout) from [<c07139f8>] (wait_for_common+0x100/0x19c)
> +[<c07139f8>] (wait_for_common) from [<c04d8488>]
> (dpm_wait_for_superior+0x14/0x5c)
> +[<c04d8488>] (dpm_wait_for_superior) from [<c04d8aa4>]
> (device_resume+0x40/0x1a0)
> +[<c04d8aa4>] (device_resume) from [<c04d8c1c>] (async_resume+0x18/0x44)
> +[<c04d8c1c>] (async_resume) from [<c023db34>] (async_run_entry_fn+0x44/0x114)
> +[<c023db34>] (async_run_entry_fn) from [<c0236544>]
> (process_one_work+0x1cc/0x31c)
> +[<c0236544>] (process_one_work) from [<c0236ca0>] (worker_thread+0x2b8/0x3f0)
> +[<c0236ca0>] (worker_thread) from [<c023b240>] (kthread+0x120/0x140)
> +[<c023b240>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
> ...
> 
> On r8a7795/salvator-x, where I have working lockdep:
> 
> PM: noirq resume of devices complete after 131.415 msecs
> PM: early resume of devices complete after 8.894 msecs
> INFO: task kworker/u16:2:276 blocked for more than 120 seconds.
>       Not tainted 4.11.0-rc7-salvator-x-06706-g70412b99f7936b37 #1220
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u16:2   D    0   276      2 0x00000000
> Workqueue: events_unbound async_run_entry_fn
> Call trace:
> [<ffffff800808612c>] __switch_to+0xa0/0xac
> [<ffffff800866df14>] __schedule+0x70c/0xb88
> [<ffffff800866e414>] schedule+0x84/0xa4
> [<ffffff800867287c>] schedule_timeout+0x30/0x400
> [<ffffff800866ee94>] wait_for_common+0x164/0x1a8
> [<ffffff800866eeec>] wait_for_completion+0x14/0x1c
> [<ffffff80083d3f84>] dpm_wait+0x30/0x38
> [<ffffff80083d3ff8>] dpm_wait_for_superior+0x28/0x7c
> [<ffffff80083d490c>] device_resume+0x44/0x190
> [<ffffff80083d4a7c>] async_resume+0x24/0x54
> [<ffffff80080d5360>] async_run_entry_fn+0x4c/0x12c
> [<ffffff80080cb59c>] process_one_work+0x340/0x66c
> [<ffffff80080cc9ec>] worker_thread+0x274/0x39c
> [<ffffff80080d2004>] kthread+0x120/0x128
> [<ffffff8008083090>] ret_from_fork+0x10/0x40
> 
> Showing all locks held in the system:
> 2 locks held by khungtaskd/52:
>  #0:  (rcu_read_lock){......}, at: [<ffffff80081436cc>] watchdog+0xc0/0x618
>  #1:  (tasklist_lock){.+.+..}, at: [<ffffff80080fb5f0>]
> debug_show_all_locks+0x68/0x18c
> 2 locks held by kworker/u16:2/276:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:3/291:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 8 locks held by s2ram/1899:
>  #0:  (sb_writers#7){.+.+.+}, at: [<ffffff80081ca1a4>] vfs_write+0xa8/0x15c
>  #1:  (&of->mutex){+.+.+.}, at: [<ffffff8008245964>] kernfs_fop_write+0xf0/0x194
>  #2:  (s_active#48){.+.+.+}, at: [<ffffff800824596c>]
> kernfs_fop_write+0xf8/0x194
>  #3:  (pm_mutex){+.+.+.}, at: [<ffffff80081059a4>] pm_suspend+0x16c/0xabc
>  #4:  (&dev->mutex){......}, at: [<ffffff80083d4920>] device_resume+0x58/0x190
>  #5:  (cma_mutex){+.+...}, at: [<ffffff80081c516c>] cma_alloc+0x150/0x374
>  #6:  (lock){+.+...}, at: [<ffffff800818b8ec>] lru_add_drain_all+0x4c/0x1b4
>  #7:  (cpu_hotplug.dep_map){++++++}, at: [<ffffff80080ab8f4>]
> get_online_cpus+0x3c/0x9c
> 2 locks held by kworker/u16:1/1918:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:4/1919:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:5/1920:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:7/1922:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:9/1924:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:10/1925:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:11/1926:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:12/1927:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:13/1928:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:14/1929:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 2 locks held by kworker/u16:16/1931:
>  #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
>  #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
> process_one_work+0x1c8/0x66c
> 
> Thanks for your comments!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
Tetsuo Handa April 19, 2017, 5:41 a.m. UTC | #2
Geert Uytterhoeven wrote:
> 8 locks held by s2ram/1899:
>  #0:  (sb_writers#7){.+.+.+}, at: [<ffffff80081ca1a4>] vfs_write+0xa8/0x15c
>  #1:  (&of->mutex){+.+.+.}, at: [<ffffff8008245964>] kernfs_fop_write+0xf0/0x194
>  #2:  (s_active#48){.+.+.+}, at: [<ffffff800824596c>] kernfs_fop_write+0xf8/0x194
>  #3:  (pm_mutex){+.+.+.}, at: [<ffffff80081059a4>] pm_suspend+0x16c/0xabc
>  #4:  (&dev->mutex){......}, at: [<ffffff80083d4920>] device_resume+0x58/0x190
>  #5:  (cma_mutex){+.+...}, at: [<ffffff80081c516c>] cma_alloc+0x150/0x374
>  #6:  (lock){+.+...}, at: [<ffffff800818b8ec>] lru_add_drain_all+0x4c/0x1b4
>  #7:  (cpu_hotplug.dep_map){++++++}, at: [<ffffff80080ab8f4>] get_online_cpus+0x3c/0x9c

I think this situation suggests that

int pm_suspend(suspend_state_t state) {
  error = enter_state(state) {
    if (!mutex_trylock(&pm_mutex)) /* #3 */
      return -EBUSY;
    error = suspend_devices_and_enter(state) {
      error = suspend_enter(state, &wakeup) {
        enable_nonboot_cpus() {
          cpu_maps_update_begin() {
            mutex_lock(&cpu_add_remove_lock);
          }
          pr_info("Enabling non-boot CPUs ...\n");
          for_each_cpu(cpu, frozen_cpus) {
            error = _cpu_up(cpu, 1, CPUHP_ONLINE) {
              cpu_hotplug_begin() {
                mutex_lock(&cpu_hotplug.lock);
              }
              
              cpu_hotplug_done() {
                mutex_unlock(&cpu_hotplug.lock);
              }
            }
            if (!error) {
              pr_info("CPU%d is up\n", cpu);
              continue;
            }
          }
          cpu_maps_update_done() {
             mutex_unlock(&cpu_add_remove_lock);
          }
        }
      }
      dpm_resume_end(PMSG_RESUME) {
        dpm_resume(state) {
          mutex_lock(&dpm_list_mtx);
          while (!list_empty(&dpm_suspended_list)) {
            mutex_unlock(&dpm_list_mtx);
            error = device_resume(dev, state, false) {
              dpm_wait_for_superior(dev, async);
              dpm_watchdog_set(&wd, dev);
              device_lock(dev) {
                mutex_lock(&dev->mutex); /* #4 */
              }
              error = dpm_run_callback(callback, dev, state, info) {
                cma_alloc() {
                  mutex_lock(&cma_mutex); /* #5 */
                  alloc_contig_range() {
                    lru_add_drain_all() {
                      mutex_lock(&lock); /* #6 */
                      get_online_cpus() {
                        mutex_lock(&cpu_hotplug.lock); /* #7 hang? */
                        mutex_unlock(&cpu_hotplug.lock);
                      }
                      put_online_cpus();
                      mutex_unlock(&lock); /* #6 */
                    }
                  }
                  mutex_unlock(&cma_mutex); /* #5 */
                }
              }
              device_unlock(dev) {
                mutex_unlock(&dev->mutex); /* #4 */
              }
            }
            mutex_lock(&dpm_list_mtx);
          }
          mutex_unlock(&dpm_list_mtx);
        }
        dpm_complete(state) {
          mutex_lock(&dpm_list_mtx);
          while (!list_empty(&dpm_prepared_list)) {
            mutex_unlock(&dpm_list_mtx);
            device_complete(dev, state) {
            }
            mutex_lock(&dpm_list_mtx);
          }
          mutex_unlock(&dpm_list_mtx);
        }
      }
    }
    mutex_unlock(&pm_mutex); /* #3 */
  }
}

Somebody is waiting forever with cpu_hotplug.lock held?
I think that full dmesg with SysRq-t output is appreciated.
Michal Hocko April 19, 2017, 7:10 a.m. UTC | #3
On Wed 19-04-17 14:41:30, Tetsuo Handa wrote:
[...]
> Somebody is waiting forever with cpu_hotplug.lock held?

Why would that matter for drain_all_pages? It doesn't use
get_online_cpus since a459eeb7b852 ("mm, page_alloc: do not depend on
cpu hotplug locks inside the allocator") while ce612879ddc7 ("mm: move
pcp and lru-pcp draining into single wq") was merged later.
Geert Uytterhoeven April 19, 2017, 7:16 a.m. UTC | #4
Hi Tetsuo,

On Wed, Apr 19, 2017 at 7:41 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Geert Uytterhoeven wrote:
>> 8 locks held by s2ram/1899:
>>  #0:  (sb_writers#7){.+.+.+}, at: [<ffffff80081ca1a4>] vfs_write+0xa8/0x15c
>>  #1:  (&of->mutex){+.+.+.}, at: [<ffffff8008245964>] kernfs_fop_write+0xf0/0x194
>>  #2:  (s_active#48){.+.+.+}, at: [<ffffff800824596c>] kernfs_fop_write+0xf8/0x194
>>  #3:  (pm_mutex){+.+.+.}, at: [<ffffff80081059a4>] pm_suspend+0x16c/0xabc
>>  #4:  (&dev->mutex){......}, at: [<ffffff80083d4920>] device_resume+0x58/0x190
>>  #5:  (cma_mutex){+.+...}, at: [<ffffff80081c516c>] cma_alloc+0x150/0x374
>>  #6:  (lock){+.+...}, at: [<ffffff800818b8ec>] lru_add_drain_all+0x4c/0x1b4
>>  #7:  (cpu_hotplug.dep_map){++++++}, at: [<ffffff80080ab8f4>] get_online_cpus+0x3c/0x9c
>
> I think this situation suggests that

[...]

> Somebody is waiting forever with cpu_hotplug.lock held?
> I think that full dmesg with SysRq-t output is appreciated.

As SysRq doesn't work with my serial console, I added calls to show_state()
and show_workqueue_state() to check_hung_task().

Result with current linus/master attached.

Thanks!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Tetsuo Handa April 19, 2017, 7:26 a.m. UTC | #5
> On Wed 19-04-17 14:41:30, Tetsuo Handa wrote:
> [...]
> > Somebody is waiting forever with cpu_hotplug.lock held?
> 
> Why would that matter for drain_all_pages? It doesn't use
> get_online_cpus since a459eeb7b852 ("mm, page_alloc: do not depend on
> cpu hotplug locks inside the allocator") while ce612879ddc7 ("mm: move
> pcp and lru-pcp draining into single wq") was merged later.
> 

Looking at ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq"),
we merged "lru-add-drain" (!WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue and
"vmstat" (WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue into
"mm_percpu_wq" (WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue.

-       lru_add_drain_wq = alloc_workqueue("lru-add-drain", WQ_MEM_RECLAIM, 0);
-       vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+       mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
+                                      WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);

That means "lru-add-drain" became freezable, doesn't it? And this problem
occurs around resume operation where all freezable threads are frozen?
Then, lru_add_drain_per_cpu() cannot be performed due to mm_percpu_wq frozen?
Michal Hocko April 19, 2017, 7:45 a.m. UTC | #6
On Wed 19-04-17 09:16:42, Geert Uytterhoeven wrote:
> Hi Tetsuo,
> 
> On Wed, Apr 19, 2017 at 7:41 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
[...]
> > Somebody is waiting forever with cpu_hotplug.lock held?
> > I think that full dmesg with SysRq-t output is appreciated.
> 
> As SysRq doesn't work with my serial console, I added calls to show_state()
> and show_workqueue_state() to check_hung_task().
> 
> Result with current linus/master attached.
[   47.165412] Enabling non-boot CPUs ...
[   47.205615] CPU1 is up
[   47.382002] PM: noirq resume of devices complete after 174.017 msecs
[   47.390181] PM: early resume of devices complete after 1.468 msecs

OK, so this is still the early resume path AFAIU which means that the
userspace is still in the fridge... Is it possible that new workers
cannot be spawned?

[  243.691979] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
[  243.698684]       Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
[  243.706439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.714276] kworker/u4:0    D    0     5      2 0x00000000
[  243.719780] Workqueue: events_unbound async_run_entry_fn
[  243.725118] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
[  243.732181] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
[  243.739840] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
[  243.748207] [<c07019c0>] (wait_for_common) from [<c04d2008>] (dpm_wait_for_superior+0x14/0x5c)
[  243.756836] [<c04d2008>] (dpm_wait_for_superior) from [<c04d2624>] (device_resume+0x40/0x1a0)
[  243.765380] [<c04d2624>] (device_resume) from [<c04d279c>] (async_resume+0x18/0x44)
[  243.773055] [<c04d279c>] (async_resume) from [<c023db24>] (async_run_entry_fn+0x44/0x114)
[  243.781245] [<c023db24>] (async_run_entry_fn) from [<c0236534>] (process_one_work+0x1cc/0x31c)
[  243.789876] [<c0236534>] (process_one_work) from [<c0236c90>] (worker_thread+0x2b8/0x3f0)
[  243.798080] [<c0236c90>] (worker_thread) from [<c023b230>] (kthread+0x120/0x140)
[  243.805500] [<c023b230>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
[...]
[  249.441198] bash            D    0  1703   1694 0x00000000
[  249.446702] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
[  249.453764] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
[  249.461427] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
[  249.469797] [<c07019c0>] (wait_for_common) from [<c0234e44>] (flush_work+0x128/0x158)
[  249.477650] [<c0234e44>] (flush_work) from [<c02ab488>] (drain_all_pages+0x198/0x1f0)
[  249.485503] [<c02ab488>] (drain_all_pages) from [<c02e1a1c>] (start_isolate_page_range+0xd8/0x1ac)
[  249.494484] [<c02e1a1c>] (start_isolate_page_range) from [<c02ae464>] (alloc_contig_range+0xc4/0x304)
[  249.503724] [<c02ae464>] (alloc_contig_range) from [<c02e1e78>] (cma_alloc+0x134/0x1bc)
[  249.511739] [<c02e1e78>] (cma_alloc) from [<c021308c>] (__alloc_from_contiguous+0x30/0xa0)
[  249.520023] [<c021308c>] (__alloc_from_contiguous) from [<c021313c>] (cma_allocator_alloc+0x40/0x48)
[  249.529173] [<c021313c>] (cma_allocator_alloc) from [<c0213318>] (__dma_alloc+0x1d4/0x2e8)
[  249.537455] [<c0213318>] (__dma_alloc) from [<c02134a8>] (arm_dma_alloc+0x40/0x4c)
[  249.545047] [<c02134a8>] (arm_dma_alloc) from [<c0534548>] (sh_eth_ring_init+0xec/0x1b8)
[  249.553160] [<c0534548>] (sh_eth_ring_init) from [<c0536df0>] (sh_eth_open+0x88/0x1e0)
[  249.561086] [<c0536df0>] (sh_eth_open) from [<c0536fc4>] (sh_eth_resume+0x7c/0xc0)
[  249.568678] [<c0536fc4>] (sh_eth_resume) from [<c04d2240>] (dpm_run_callback+0x48/0xc8)
[  249.576702] [<c04d2240>] (dpm_run_callback) from [<c04d2740>] (device_resume+0x15c/0x1a0)
[  249.584898] [<c04d2740>] (device_resume) from [<c04d3644>] (dpm_resume+0xe4/0x244)
[  249.592485] [<c04d3644>] (dpm_resume) from [<c04d3968>] (dpm_resume_end+0xc/0x18)
[  249.599977] [<c04d3968>] (dpm_resume_end) from [<c0261010>] (suspend_devices_and_enter+0x3c8/0x490)
[  249.609042] [<c0261010>] (suspend_devices_and_enter) from [<c0261300>] (pm_suspend+0x228/0x280)
[  249.617759] [<c0261300>] (pm_suspend) from [<c025fecc>] (state_store+0xac/0xcc)
[  249.625089] [<c025fecc>] (state_store) from [<c0343b04>] (kernfs_fop_write+0x164/0x1a0)
[  249.633116] [<c0343b04>] (kernfs_fop_write) from [<c02e5838>] (__vfs_write+0x20/0x108)
[  249.641043] [<c02e5838>] (__vfs_write) from [<c02e6c08>] (vfs_write+0xb8/0x144)
[  249.648373] [<c02e6c08>] (vfs_write) from [<c02e788c>] (SyS_write+0x40/0x80)
[  249.655437] [<c02e788c>] (SyS_write) from [<c0206cc0>] (ret_fast_syscall+0x0/0x34)
[...]
[  254.753928] Showing busy workqueues and worker pools:
[...]
[  254.854225] workqueue mm_percpu_wq: flags=0xc
[  254.858583]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
[  254.864428]     delayed: drain_local_pages_wq, vmstat_update
[  254.870111]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[  254.875957]     delayed: drain_local_pages_wq BAR(1703), vmstat_update

I got lost in the indirection here. But is it possible that the
allocating context will wake up the workqeue context? Anyway the patch
you have bisected to doesn't change a lot in this scenario as I've said
before. If anything the change to using WQ for the draining rather than
smp_function_call would change the behavior. Does the below help by any
chance?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5fca73c7881a..a9a1ab7ea4c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2417,6 +2417,14 @@ void drain_all_pages(struct zone *zone)
 	static cpumask_t cpus_with_pcps;
 
 	/*
+	 * This is an uggly hack but let's back off in the early PM suspend/resume
+	 * paths because the whole infrastructure might not be available yet for
+	 * us - namely kworkers might be still frozen
+	 */
+	if (pm_suspended_storage())
+		return;
+
+	/*
 	 * Make sure nobody triggers this path before mm_percpu_wq is fully
 	 * initialized.
 	 */
Michal Hocko April 19, 2017, 7:57 a.m. UTC | #7
On Wed 19-04-17 16:26:10, Tetsuo Handa wrote:
> > On Wed 19-04-17 14:41:30, Tetsuo Handa wrote:
> > [...]
> > > Somebody is waiting forever with cpu_hotplug.lock held?
> > 
> > Why would that matter for drain_all_pages? It doesn't use
> > get_online_cpus since a459eeb7b852 ("mm, page_alloc: do not depend on
> > cpu hotplug locks inside the allocator") while ce612879ddc7 ("mm: move
> > pcp and lru-pcp draining into single wq") was merged later.
> > 
> 
> Looking at ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq"),
> we merged "lru-add-drain" (!WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue and
> "vmstat" (WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue into
> "mm_percpu_wq" (WQ_FREEZABLE && WQ_MEM_RECLAIM) workqueue.
> 
> -       lru_add_drain_wq = alloc_workqueue("lru-add-drain", WQ_MEM_RECLAIM, 0);
> -       vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
> +       mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
> +                                      WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
> 
> That means "lru-add-drain" became freezable, doesn't it? And this problem
> occurs around resume operation where all freezable threads are frozen?
> Then, lru_add_drain_per_cpu() cannot be performed due to mm_percpu_wq frozen?

Ohh, right you are! Very well spotted. I have completely missed
WQ_FREEZABLE there. The following should work
---
>From f3c6e287042259d6ae9916f1ff66392c46ce2a3c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 19 Apr 2017 09:52:46 +0200
Subject: [PATCH] mm: make mm_percpu_wq non freezable

Geert has reported a freeze during PM resume and some additional
debugging has shown that the device_resume worker cannot make a forward
progress because it waits for an event which is stuck waiting in
drain_all_pages:
[  243.691979] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
[  243.698684]       Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
[  243.706439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.714276] kworker/u4:0    D    0     5      2 0x00000000
[  243.719780] Workqueue: events_unbound async_run_entry_fn
[  243.725118] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
[  243.732181] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
[  243.739840] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
[  243.748207] [<c07019c0>] (wait_for_common) from [<c04d2008>] (dpm_wait_for_superior+0x14/0x5c)
[  243.756836] [<c04d2008>] (dpm_wait_for_superior) from [<c04d2624>] (device_resume+0x40/0x1a0)
[  243.765380] [<c04d2624>] (device_resume) from [<c04d279c>] (async_resume+0x18/0x44)
[  243.773055] [<c04d279c>] (async_resume) from [<c023db24>] (async_run_entry_fn+0x44/0x114)
[  243.781245] [<c023db24>] (async_run_entry_fn) from [<c0236534>] (process_one_work+0x1cc/0x31c)
[  243.789876] [<c0236534>] (process_one_work) from [<c0236c90>] (worker_thread+0x2b8/0x3f0)
[  243.798080] [<c0236c90>] (worker_thread) from [<c023b230>] (kthread+0x120/0x140)
[  243.805500] [<c023b230>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
[...]
[  249.441198] bash            D    0  1703   1694 0x00000000
[  249.446702] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
[  249.453764] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
[  249.461427] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
[  249.469797] [<c07019c0>] (wait_for_common) from [<c0234e44>] (flush_work+0x128/0x158)
[  249.477650] [<c0234e44>] (flush_work) from [<c02ab488>] (drain_all_pages+0x198/0x1f0)
[  249.485503] [<c02ab488>] (drain_all_pages) from [<c02e1a1c>] (start_isolate_page_range+0xd8/0x1ac)
[  249.494484] [<c02e1a1c>] (start_isolate_page_range) from [<c02ae464>] (alloc_contig_range+0xc4/0x304)
[  249.503724] [<c02ae464>] (alloc_contig_range) from [<c02e1e78>] (cma_alloc+0x134/0x1bc)
[  249.511739] [<c02e1e78>] (cma_alloc) from [<c021308c>] (__alloc_from_contiguous+0x30/0xa0)
[  249.520023] [<c021308c>] (__alloc_from_contiguous) from [<c021313c>] (cma_allocator_alloc+0x40/0x48)
[  249.529173] [<c021313c>] (cma_allocator_alloc) from [<c0213318>] (__dma_alloc+0x1d4/0x2e8)
[  249.537455] [<c0213318>] (__dma_alloc) from [<c02134a8>] (arm_dma_alloc+0x40/0x4c)
[  249.545047] [<c02134a8>] (arm_dma_alloc) from [<c0534548>] (sh_eth_ring_init+0xec/0x1b8)
[  249.553160] [<c0534548>] (sh_eth_ring_init) from [<c0536df0>] (sh_eth_open+0x88/0x1e0)
[  249.561086] [<c0536df0>] (sh_eth_open) from [<c0536fc4>] (sh_eth_resume+0x7c/0xc0)
[  249.568678] [<c0536fc4>] (sh_eth_resume) from [<c04d2240>] (dpm_run_callback+0x48/0xc8)
[  249.576702] [<c04d2240>] (dpm_run_callback) from [<c04d2740>] (device_resume+0x15c/0x1a0)
[  249.584898] [<c04d2740>] (device_resume) from [<c04d3644>] (dpm_resume+0xe4/0x244)
[  249.592485] [<c04d3644>] (dpm_resume) from [<c04d3968>] (dpm_resume_end+0xc/0x18)
[  249.599977] [<c04d3968>] (dpm_resume_end) from [<c0261010>] (suspend_devices_and_enter+0x3c8/0x490)
[  249.609042] [<c0261010>] (suspend_devices_and_enter) from [<c0261300>] (pm_suspend+0x228/0x280)
[  249.617759] [<c0261300>] (pm_suspend) from [<c025fecc>] (state_store+0xac/0xcc)
[  249.625089] [<c025fecc>] (state_store) from [<c0343b04>] (kernfs_fop_write+0x164/0x1a0)
[  249.633116] [<c0343b04>] (kernfs_fop_write) from [<c02e5838>] (__vfs_write+0x20/0x108)
[  249.641043] [<c02e5838>] (__vfs_write) from [<c02e6c08>] (vfs_write+0xb8/0x144)
[  249.648373] [<c02e6c08>] (vfs_write) from [<c02e788c>] (SyS_write+0x40/0x80)
[  249.655437] [<c02e788c>] (SyS_write) from [<c0206cc0>] (ret_fast_syscall+0x0/0x34)
[...]
[  254.753928] Showing busy workqueues and worker pools:
[...]
[  254.854225] workqueue mm_percpu_wq: flags=0xc
[  254.858583]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
[  254.864428]     delayed: drain_local_pages_wq, vmstat_update
[  254.870111]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[  254.875957]     delayed: drain_local_pages_wq BAR(1703), vmstat_update

Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
so it is frozen this early during resume so we are effectively deadlocked.
Fix this by dropping WQ_FREEZABLE when creating mm_percpu_wq. We really want to
have it operational all the time.

Fixes: ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmstat.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 757be8303aa0..f5fa1bd1eb16 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1766,8 +1766,7 @@ void __init init_mm_internals(void)
 {
 	int ret __maybe_unused;
 
-	mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
-				       WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
+	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
 
 #ifdef CONFIG_SMP
 	ret = cpuhp_setup_state_nocalls(CPUHP_MM_VMSTAT_DEAD, "mm/vmstat:dead",
Geert Uytterhoeven April 19, 2017, 8:09 a.m. UTC | #8
Hi Michal, Tetsuo,

On Wed, Apr 19, 2017 at 9:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From f3c6e287042259d6ae9916f1ff66392c46ce2a3c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 19 Apr 2017 09:52:46 +0200
> Subject: [PATCH] mm: make mm_percpu_wq non freezable
>
> Geert has reported a freeze during PM resume and some additional
> debugging has shown that the device_resume worker cannot make a forward
> progress because it waits for an event which is stuck waiting in
> drain_all_pages:
> [  243.691979] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
> [  243.698684]       Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
> [  243.706439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  243.714276] kworker/u4:0    D    0     5      2 0x00000000
> [  243.719780] Workqueue: events_unbound async_run_entry_fn
> [  243.725118] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
> [  243.732181] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
> [  243.739840] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
> [  243.748207] [<c07019c0>] (wait_for_common) from [<c04d2008>] (dpm_wait_for_superior+0x14/0x5c)
> [  243.756836] [<c04d2008>] (dpm_wait_for_superior) from [<c04d2624>] (device_resume+0x40/0x1a0)
> [  243.765380] [<c04d2624>] (device_resume) from [<c04d279c>] (async_resume+0x18/0x44)
> [  243.773055] [<c04d279c>] (async_resume) from [<c023db24>] (async_run_entry_fn+0x44/0x114)
> [  243.781245] [<c023db24>] (async_run_entry_fn) from [<c0236534>] (process_one_work+0x1cc/0x31c)
> [  243.789876] [<c0236534>] (process_one_work) from [<c0236c90>] (worker_thread+0x2b8/0x3f0)
> [  243.798080] [<c0236c90>] (worker_thread) from [<c023b230>] (kthread+0x120/0x140)
> [  243.805500] [<c023b230>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
> [...]
> [  249.441198] bash            D    0  1703   1694 0x00000000
> [  249.446702] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
> [  249.453764] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
> [  249.461427] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
> [  249.469797] [<c07019c0>] (wait_for_common) from [<c0234e44>] (flush_work+0x128/0x158)
> [  249.477650] [<c0234e44>] (flush_work) from [<c02ab488>] (drain_all_pages+0x198/0x1f0)
> [  249.485503] [<c02ab488>] (drain_all_pages) from [<c02e1a1c>] (start_isolate_page_range+0xd8/0x1ac)
> [  249.494484] [<c02e1a1c>] (start_isolate_page_range) from [<c02ae464>] (alloc_contig_range+0xc4/0x304)
> [  249.503724] [<c02ae464>] (alloc_contig_range) from [<c02e1e78>] (cma_alloc+0x134/0x1bc)
> [  249.511739] [<c02e1e78>] (cma_alloc) from [<c021308c>] (__alloc_from_contiguous+0x30/0xa0)
> [  249.520023] [<c021308c>] (__alloc_from_contiguous) from [<c021313c>] (cma_allocator_alloc+0x40/0x48)
> [  249.529173] [<c021313c>] (cma_allocator_alloc) from [<c0213318>] (__dma_alloc+0x1d4/0x2e8)
> [  249.537455] [<c0213318>] (__dma_alloc) from [<c02134a8>] (arm_dma_alloc+0x40/0x4c)
> [  249.545047] [<c02134a8>] (arm_dma_alloc) from [<c0534548>] (sh_eth_ring_init+0xec/0x1b8)
> [  249.553160] [<c0534548>] (sh_eth_ring_init) from [<c0536df0>] (sh_eth_open+0x88/0x1e0)
> [  249.561086] [<c0536df0>] (sh_eth_open) from [<c0536fc4>] (sh_eth_resume+0x7c/0xc0)
> [  249.568678] [<c0536fc4>] (sh_eth_resume) from [<c04d2240>] (dpm_run_callback+0x48/0xc8)
> [  249.576702] [<c04d2240>] (dpm_run_callback) from [<c04d2740>] (device_resume+0x15c/0x1a0)
> [  249.584898] [<c04d2740>] (device_resume) from [<c04d3644>] (dpm_resume+0xe4/0x244)
> [  249.592485] [<c04d3644>] (dpm_resume) from [<c04d3968>] (dpm_resume_end+0xc/0x18)
> [  249.599977] [<c04d3968>] (dpm_resume_end) from [<c0261010>] (suspend_devices_and_enter+0x3c8/0x490)
> [  249.609042] [<c0261010>] (suspend_devices_and_enter) from [<c0261300>] (pm_suspend+0x228/0x280)
> [  249.617759] [<c0261300>] (pm_suspend) from [<c025fecc>] (state_store+0xac/0xcc)
> [  249.625089] [<c025fecc>] (state_store) from [<c0343b04>] (kernfs_fop_write+0x164/0x1a0)
> [  249.633116] [<c0343b04>] (kernfs_fop_write) from [<c02e5838>] (__vfs_write+0x20/0x108)
> [  249.641043] [<c02e5838>] (__vfs_write) from [<c02e6c08>] (vfs_write+0xb8/0x144)
> [  249.648373] [<c02e6c08>] (vfs_write) from [<c02e788c>] (SyS_write+0x40/0x80)
> [  249.655437] [<c02e788c>] (SyS_write) from [<c0206cc0>] (ret_fast_syscall+0x0/0x34)
> [...]
> [  254.753928] Showing busy workqueues and worker pools:
> [...]
> [  254.854225] workqueue mm_percpu_wq: flags=0xc
> [  254.858583]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
> [  254.864428]     delayed: drain_local_pages_wq, vmstat_update
> [  254.870111]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
> [  254.875957]     delayed: drain_local_pages_wq BAR(1703), vmstat_update
>
> Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
> so it is frozen this early during resume so we are effectively deadlocked.
> Fix this by dropping WQ_FREEZABLE when creating mm_percpu_wq. We really want to
> have it operational all the time.
>
> Fixes: ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq")
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>

Thanks a lot to both of you!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Michal Hocko April 19, 2017, 8:17 a.m. UTC | #9
On Wed 19-04-17 10:09:12, Geert Uytterhoeven wrote:
> Hi Michal, Tetsuo,
> 
> On Wed, Apr 19, 2017 at 9:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > From f3c6e287042259d6ae9916f1ff66392c46ce2a3c Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Wed, 19 Apr 2017 09:52:46 +0200
> > Subject: [PATCH] mm: make mm_percpu_wq non freezable
> >
> > Geert has reported a freeze during PM resume and some additional
> > debugging has shown that the device_resume worker cannot make a forward
> > progress because it waits for an event which is stuck waiting in
> > drain_all_pages:
> > [  243.691979] INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
> > [  243.698684]       Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
> > [  243.706439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  243.714276] kworker/u4:0    D    0     5      2 0x00000000
> > [  243.719780] Workqueue: events_unbound async_run_entry_fn
> > [  243.725118] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
> > [  243.732181] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
> > [  243.739840] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
> > [  243.748207] [<c07019c0>] (wait_for_common) from [<c04d2008>] (dpm_wait_for_superior+0x14/0x5c)
> > [  243.756836] [<c04d2008>] (dpm_wait_for_superior) from [<c04d2624>] (device_resume+0x40/0x1a0)
> > [  243.765380] [<c04d2624>] (device_resume) from [<c04d279c>] (async_resume+0x18/0x44)
> > [  243.773055] [<c04d279c>] (async_resume) from [<c023db24>] (async_run_entry_fn+0x44/0x114)
> > [  243.781245] [<c023db24>] (async_run_entry_fn) from [<c0236534>] (process_one_work+0x1cc/0x31c)
> > [  243.789876] [<c0236534>] (process_one_work) from [<c0236c90>] (worker_thread+0x2b8/0x3f0)
> > [  243.798080] [<c0236c90>] (worker_thread) from [<c023b230>] (kthread+0x120/0x140)
> > [  243.805500] [<c023b230>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
> > [...]
> > [  249.441198] bash            D    0  1703   1694 0x00000000
> > [  249.446702] [<c0700c20>] (__schedule) from [<c0700f44>] (schedule+0xb0/0xcc)
> > [  249.453764] [<c0700f44>] (schedule) from [<c0705108>] (schedule_timeout+0x18/0x1f4)
> > [  249.461427] [<c0705108>] (schedule_timeout) from [<c07019c0>] (wait_for_common+0x100/0x19c)
> > [  249.469797] [<c07019c0>] (wait_for_common) from [<c0234e44>] (flush_work+0x128/0x158)
> > [  249.477650] [<c0234e44>] (flush_work) from [<c02ab488>] (drain_all_pages+0x198/0x1f0)
> > [  249.485503] [<c02ab488>] (drain_all_pages) from [<c02e1a1c>] (start_isolate_page_range+0xd8/0x1ac)
> > [  249.494484] [<c02e1a1c>] (start_isolate_page_range) from [<c02ae464>] (alloc_contig_range+0xc4/0x304)
> > [  249.503724] [<c02ae464>] (alloc_contig_range) from [<c02e1e78>] (cma_alloc+0x134/0x1bc)
> > [  249.511739] [<c02e1e78>] (cma_alloc) from [<c021308c>] (__alloc_from_contiguous+0x30/0xa0)
> > [  249.520023] [<c021308c>] (__alloc_from_contiguous) from [<c021313c>] (cma_allocator_alloc+0x40/0x48)
> > [  249.529173] [<c021313c>] (cma_allocator_alloc) from [<c0213318>] (__dma_alloc+0x1d4/0x2e8)
> > [  249.537455] [<c0213318>] (__dma_alloc) from [<c02134a8>] (arm_dma_alloc+0x40/0x4c)
> > [  249.545047] [<c02134a8>] (arm_dma_alloc) from [<c0534548>] (sh_eth_ring_init+0xec/0x1b8)
> > [  249.553160] [<c0534548>] (sh_eth_ring_init) from [<c0536df0>] (sh_eth_open+0x88/0x1e0)
> > [  249.561086] [<c0536df0>] (sh_eth_open) from [<c0536fc4>] (sh_eth_resume+0x7c/0xc0)
> > [  249.568678] [<c0536fc4>] (sh_eth_resume) from [<c04d2240>] (dpm_run_callback+0x48/0xc8)
> > [  249.576702] [<c04d2240>] (dpm_run_callback) from [<c04d2740>] (device_resume+0x15c/0x1a0)
> > [  249.584898] [<c04d2740>] (device_resume) from [<c04d3644>] (dpm_resume+0xe4/0x244)
> > [  249.592485] [<c04d3644>] (dpm_resume) from [<c04d3968>] (dpm_resume_end+0xc/0x18)
> > [  249.599977] [<c04d3968>] (dpm_resume_end) from [<c0261010>] (suspend_devices_and_enter+0x3c8/0x490)
> > [  249.609042] [<c0261010>] (suspend_devices_and_enter) from [<c0261300>] (pm_suspend+0x228/0x280)
> > [  249.617759] [<c0261300>] (pm_suspend) from [<c025fecc>] (state_store+0xac/0xcc)
> > [  249.625089] [<c025fecc>] (state_store) from [<c0343b04>] (kernfs_fop_write+0x164/0x1a0)
> > [  249.633116] [<c0343b04>] (kernfs_fop_write) from [<c02e5838>] (__vfs_write+0x20/0x108)
> > [  249.641043] [<c02e5838>] (__vfs_write) from [<c02e6c08>] (vfs_write+0xb8/0x144)
> > [  249.648373] [<c02e6c08>] (vfs_write) from [<c02e788c>] (SyS_write+0x40/0x80)
> > [  249.655437] [<c02e788c>] (SyS_write) from [<c0206cc0>] (ret_fast_syscall+0x0/0x34)
> > [...]
> > [  254.753928] Showing busy workqueues and worker pools:
> > [...]
> > [  254.854225] workqueue mm_percpu_wq: flags=0xc
> > [  254.858583]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
> > [  254.864428]     delayed: drain_local_pages_wq, vmstat_update
> > [  254.870111]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
> > [  254.875957]     delayed: drain_local_pages_wq BAR(1703), vmstat_update
> >
> > Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
> > so it is frozen this early during resume so we are effectively deadlocked.
> > Fix this by dropping WQ_FREEZABLE when creating mm_percpu_wq. We really want to
> > have it operational all the time.
> >
> > Fixes: ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq")
> > Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> > Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>

Thanks for the testing. Linus will you take the patch from this thread
or you prefer a resend?
Linus Torvalds April 19, 2017, 10:50 p.m. UTC | #10
On Wed, Apr 19, 2017 at 1:17 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> Thanks for the testing. Linus will you take the patch from this thread
> or you prefer a resend?

I'll take it from this branch since I'm looking at it now, but in
general I prefer resends just because finding patches deep in some
discussion is very iffy.

I get too much email, so it really helps to make the patches more
explicit than this...

              Linus
Michal Hocko April 20, 2017, 6:53 a.m. UTC | #11
On Wed 19-04-17 15:50:01, Linus Torvalds wrote:
> On Wed, Apr 19, 2017 at 1:17 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > Thanks for the testing. Linus will you take the patch from this thread
> > or you prefer a resend?
> 
> I'll take it from this branch since I'm looking at it now, but in
> general I prefer resends just because finding patches deep in some
> discussion is very iffy.

Yeah, I perfectly understand this and that's why I've asked. Thanks for
taking the patch!

Patch
diff mbox series

--- /tmp/good   2017-04-18 21:47:04.457156167 +0200
+++ /tmp/bad    2017-04-18 21:43:26.215240325 +0200
@@ -13,11 +13,178 @@  Enabling non-boot CPUs ...
 CPU1 is up
 PM: noirq resume of devices complete after N.N msecs
 PM: early resume of devices complete after N.N msecs
-Micrel KSZ8041RNLI ee700000.ethernet-ffffffff:01: attached PHY driver
[Micrel KSZ8041RNLI] (mii_bus:phy_addr=ee700000.ethernet-ffffffff:01,
irq=-1)
-PM: resume of devices complete after N.N msecs
-PM: resume devices took N.N seconds
-PM: Finishing wakeup.
-Restarting tasks ... done.
-ata1: link resume succeeded after 1 retries
-ata1: SATA link down (SStatus 0 SControl 300)
-sh-eth ee700000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx
+INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
+      Not tainted 4.11.0-rc7-koelsch-00426-g70412b99f7936b37 #3470
+"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
+kworker/u4:0    D    0     5      2 0x00000000
+Workqueue: events_unbound async_run_entry_fn
+[<c0712c58>] (__schedule) from [<c0712f7c>] (schedule+0xb0/0xcc)
+[<c0712f7c>] (schedule) from [<c0717140>] (schedule_timeout+0x18/0x1f4)
+[<c0717140>] (schedule_timeout) from [<c07139f8>] (wait_for_common+0x100/0x19c)
+[<c07139f8>] (wait_for_common) from [<c04d8488>]
(dpm_wait_for_superior+0x14/0x5c)
+[<c04d8488>] (dpm_wait_for_superior) from [<c04d8aa4>]
(device_resume+0x40/0x1a0)
+[<c04d8aa4>] (device_resume) from [<c04d8c1c>] (async_resume+0x18/0x44)
+[<c04d8c1c>] (async_resume) from [<c023db34>] (async_run_entry_fn+0x44/0x114)
+[<c023db34>] (async_run_entry_fn) from [<c0236544>]
(process_one_work+0x1cc/0x31c)
+[<c0236544>] (process_one_work) from [<c0236ca0>] (worker_thread+0x2b8/0x3f0)
+[<c0236ca0>] (worker_thread) from [<c023b240>] (kthread+0x120/0x140)
+[<c023b240>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
+INFO: task kworker/u4:1:125 blocked for more than 120 seconds.
+      Not tainted 4.11.0-rc7-koelsch-00426-g70412b99f7936b37 #3470
+"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
+kworker/u4:1    D    0   125      2 0x00000000
+Workqueue: events_unbound async_run_entry_fn
+[<c0712c58>] (__schedule) from [<c0712f7c>] (schedule+0xb0/0xcc)
+[<c0712f7c>] (schedule) from [<c0717140>] (schedule_timeout+0x18/0x1f4)
+[<c0717140>] (schedule_timeout) from [<c07139f8>] (wait_for_common+0x100/0x19c)
+[<c07139f8>] (wait_for_common) from [<c04d8488>]
(dpm_wait_for_superior+0x14/0x5c)
+[<c04d8488>] (dpm_wait_for_superior) from [<c04d8aa4>]
(device_resume+0x40/0x1a0)
+[<c04d8aa4>] (device_resume) from [<c04d8c1c>] (async_resume+0x18/0x44)
+[<c04d8c1c>] (async_resume) from [<c023db34>] (async_run_entry_fn+0x44/0x114)
+[<c023db34>] (async_run_entry_fn) from [<c0236544>]
(process_one_work+0x1cc/0x31c)
+[<c0236544>] (process_one_work) from [<c0236ca0>] (worker_thread+0x2b8/0x3f0)
+[<c0236ca0>] (worker_thread) from [<c023b240>] (kthread+0x120/0x140)
+[<c023b240>] (kthread) from [<c0206d68>] (ret_from_fork+0x14/0x2c)
...

On r8a7795/salvator-x, where I have working lockdep:

PM: noirq resume of devices complete after 131.415 msecs
PM: early resume of devices complete after 8.894 msecs
INFO: task kworker/u16:2:276 blocked for more than 120 seconds.
      Not tainted 4.11.0-rc7-salvator-x-06706-g70412b99f7936b37 #1220
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:2   D    0   276      2 0x00000000
Workqueue: events_unbound async_run_entry_fn
Call trace:
[<ffffff800808612c>] __switch_to+0xa0/0xac
[<ffffff800866df14>] __schedule+0x70c/0xb88
[<ffffff800866e414>] schedule+0x84/0xa4
[<ffffff800867287c>] schedule_timeout+0x30/0x400
[<ffffff800866ee94>] wait_for_common+0x164/0x1a8
[<ffffff800866eeec>] wait_for_completion+0x14/0x1c
[<ffffff80083d3f84>] dpm_wait+0x30/0x38
[<ffffff80083d3ff8>] dpm_wait_for_superior+0x28/0x7c
[<ffffff80083d490c>] device_resume+0x44/0x190
[<ffffff80083d4a7c>] async_resume+0x24/0x54
[<ffffff80080d5360>] async_run_entry_fn+0x4c/0x12c
[<ffffff80080cb59c>] process_one_work+0x340/0x66c
[<ffffff80080cc9ec>] worker_thread+0x274/0x39c
[<ffffff80080d2004>] kthread+0x120/0x128
[<ffffff8008083090>] ret_from_fork+0x10/0x40

Showing all locks held in the system:
2 locks held by khungtaskd/52:
 #0:  (rcu_read_lock){......}, at: [<ffffff80081436cc>] watchdog+0xc0/0x618
 #1:  (tasklist_lock){.+.+..}, at: [<ffffff80080fb5f0>]
debug_show_all_locks+0x68/0x18c
2 locks held by kworker/u16:2/276:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:3/291:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
8 locks held by s2ram/1899:
 #0:  (sb_writers#7){.+.+.+}, at: [<ffffff80081ca1a4>] vfs_write+0xa8/0x15c
 #1:  (&of->mutex){+.+.+.}, at: [<ffffff8008245964>] kernfs_fop_write+0xf0/0x194
 #2:  (s_active#48){.+.+.+}, at: [<ffffff800824596c>]
kernfs_fop_write+0xf8/0x194
 #3:  (pm_mutex){+.+.+.}, at: [<ffffff80081059a4>] pm_suspend+0x16c/0xabc
 #4:  (&dev->mutex){......}, at: [<ffffff80083d4920>] device_resume+0x58/0x190
 #5:  (cma_mutex){+.+...}, at: [<ffffff80081c516c>] cma_alloc+0x150/0x374
 #6:  (lock){+.+...}, at: [<ffffff800818b8ec>] lru_add_drain_all+0x4c/0x1b4
 #7:  (cpu_hotplug.dep_map){++++++}, at: [<ffffff80080ab8f4>]
get_online_cpus+0x3c/0x9c
2 locks held by kworker/u16:1/1918:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:4/1919:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:5/1920:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:7/1922:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:9/1924:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:10/1925:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:11/1926:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:12/1927:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:13/1928:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:14/1929:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
2 locks held by kworker/u16:16/1931:
 #0:  ("events_unbound"){.+.+.+}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c
 #1:  ((&entry->work)){+.+...}, at: [<ffffff80080cb424>]
process_one_work+0x1c8/0x66c

Thanks for your comments!

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds