From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	David Rientjes <rientjes@google.com>, Tejun Heo <tj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
Date: Sat, 28 Jul 2018 00:47:40 +0900	[thread overview]
Message-ID: <55c9da7f-e448-964a-5b50-47f89a24235b@i-love.sakura.ne.jp> (raw)
In-Reply-To: <20180726113958.GE28386@dhcp22.suse.cz>

On 2018/07/26 20:39, Michal Hocko wrote:
> On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
>> Before applying "an OOM lockup mitigation patch", I want to apply this
>> "another OOM lockup avoidance" patch.
> 
> I do not really see why. All these are borderline interesting as the
> system is basically dead by the time you reach this state.

I don't like your "borderline interesting". We still don't have a watchdog
which tells us that something went wrong. Thus, even "borderline interesting"
cases have to be examined and fixed.

> 
>> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
>> (which was captured with
>>
>>   --- a/mm/oom_kill.c
>>   +++ b/mm/oom_kill.c
>>   @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>>    {
>>    	unsigned long freed = 0;
>>    	bool delay = false; /* if set, delay next allocation attempt */
>>   +	static unsigned long last_warned;
>>   +	if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>>   +		pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>>   +			current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>>   +		last_warned = jiffies;
>>   +	}
>>    
>>    	oc->constraint = CONSTRAINT_NONE;
>>    	if (oom_killer_disabled)
>>
>> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
>> calling out_of_memory() rather than erroneously failing to give up direct reclaim).
>>
>> [  258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [  298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
> 
> Hmm, so there is no other memory allocation to trigger the oom or they
> all just back off on the oom_lock trylock? In other words, what is
> preventing the oom killer invocation?

All __GFP_FS allocating threads got stuck either in direct reclaim or waiting
on a workqueue, as the traces and workqueue dumps below show.

[  310.265293] systemd         D12240     1      0 0x00000000
[  310.268118] Call Trace:
[  310.269894]  ? __schedule+0x245/0x7f0
[  310.271901]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.274187]  schedule+0x23/0x80
[  310.275965]  schedule_preempt_disabled+0x5/0x10
[  310.278107]  __mutex_lock+0x3f5/0x9b0
[  310.280070]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.282451]  xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[  310.284730]  ? lock_acquire+0x51/0x70
[  310.286678]  ? xfs_ail_push_all+0x13/0x60 [xfs]
[  310.288844]  xfs_reclaim_inodes_nr+0x2c/0x40 [xfs]
[  310.291092]  super_cache_scan+0x14b/0x1a0
[  310.293131]  do_shrink_slab+0xfc/0x190
[  310.295082]  shrink_slab+0x240/0x2c0
[  310.297028]  shrink_node+0xe3/0x460
[  310.298899]  do_try_to_free_pages+0xcb/0x380
[  310.300945]  try_to_free_pages+0xbb/0xf0
[  310.302874]  __alloc_pages_slowpath+0x3c1/0xc50
[  310.304931]  ? lock_acquire+0x51/0x70
[  310.306753]  ? set_page_refcounted+0x40/0x40
[  310.308686]  __alloc_pages_nodemask+0x2a6/0x2c0
[  310.310663]  filemap_fault+0x437/0x8e0
[  310.312446]  ? lock_acquire+0x51/0x70
[  310.314150]  ? xfs_ilock+0x86/0x190 [xfs]
[  310.315915]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  310.318089]  __do_fault+0x13/0x126
[  310.319696]  __handle_mm_fault+0xc8d/0x11c0
[  310.321493]  handle_mm_fault+0x17a/0x390
[  310.323156]  __do_page_fault+0x24c/0x4d0
[  310.324782]  do_page_fault+0x2a/0x70
[  310.326289]  ? page_fault+0x8/0x30
[  310.327745]  page_fault+0x1e/0x30

[  313.714537] systemd-journal D12600   498      1 0x00000000
[  313.716538] Call Trace:
[  313.717683]  ? __schedule+0x245/0x7f0
[  313.719221]  schedule+0x23/0x80
[  313.720586]  schedule_timeout+0x21f/0x400
[  313.722163]  wait_for_completion+0xb2/0x130
[  313.723750]  ? wake_up_q+0x70/0x70
[  313.725134]  flush_work+0x1db/0x2b0
[  313.726535]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  313.728336]  ? page_alloc_cpu_dead+0x30/0x30
[  313.729936]  drain_all_pages+0x174/0x1e0
[  313.731446]  __alloc_pages_slowpath+0x428/0xc50
[  313.733120]  __alloc_pages_nodemask+0x2a6/0x2c0
[  313.734826]  filemap_fault+0x437/0x8e0
[  313.736296]  ? lock_acquire+0x51/0x70
[  313.737769]  ? xfs_ilock+0x86/0x190 [xfs]
[  313.739309]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  313.741291]  __do_fault+0x13/0x126
[  313.742667]  __handle_mm_fault+0xc8d/0x11c0
[  313.744245]  handle_mm_fault+0x17a/0x390
[  313.745755]  __do_page_fault+0x24c/0x4d0
[  313.747290]  do_page_fault+0x2a/0x70
[  313.748728]  ? page_fault+0x8/0x30
[  313.750148]  page_fault+0x1e/0x30

[  324.987240] workqueue events_freezable_power_: flags=0x84
[  324.989292]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  324.991482]     in-flight: 5:disk_events_workfn
[  324.993371] workqueue mm_percpu_wq: flags=0x8
[  324.995167]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  324.997363]     pending: vmstat_update, drain_local_pages_wq BAR(498)

[  378.984241] INFO: task systemd-journal:498 blocked for more than 120 seconds.
[  378.986657]       Not tainted 4.18.0-rc6-next-20180724+ #248
[  378.988741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  378.991343] systemd-journal D12600   498      1 0x00000000
[  378.993434] Call Trace:
[  378.994643]  ? __schedule+0x245/0x7f0
[  378.996158]  schedule+0x23/0x80
[  378.997528]  schedule_timeout+0x21f/0x400
[  378.999112]  wait_for_completion+0xb2/0x130
[  379.000728]  ? wake_up_q+0x70/0x70
[  379.002170]  flush_work+0x1db/0x2b0
[  379.003667]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  379.005507]  ? page_alloc_cpu_dead+0x30/0x30
[  379.007176]  drain_all_pages+0x174/0x1e0
[  379.008836]  __alloc_pages_slowpath+0x428/0xc50
[  379.010625]  __alloc_pages_nodemask+0x2a6/0x2c0
[  379.012364]  filemap_fault+0x437/0x8e0
[  379.013881]  ? lock_acquire+0x51/0x70
[  379.015373]  ? xfs_ilock+0x86/0x190 [xfs]
[  379.016950]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  379.019019]  __do_fault+0x13/0x126
[  379.020439]  __handle_mm_fault+0xc8d/0x11c0
[  379.022055]  handle_mm_fault+0x17a/0x390
[  379.023623]  __do_page_fault+0x24c/0x4d0
[  379.025167]  do_page_fault+0x2a/0x70
[  379.026621]  ? page_fault+0x8/0x30
[  379.028022]  page_fault+0x1e/0x30

[  432.759743] systemd-journal D12600   498      1 0x00000000
[  432.761748] Call Trace:
[  432.762892]  ? __schedule+0x245/0x7f0
[  432.764352]  schedule+0x23/0x80
[  432.765673]  schedule_timeout+0x21f/0x400
[  432.767428]  wait_for_completion+0xb2/0x130
[  432.769207]  ? wake_up_q+0x70/0x70
[  432.770764]  flush_work+0x1db/0x2b0
[  432.772223]  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[  432.774125]  ? page_alloc_cpu_dead+0x30/0x30
[  432.775873]  drain_all_pages+0x174/0x1e0
[  432.777528]  __alloc_pages_slowpath+0x428/0xc50
[  432.779306]  __alloc_pages_nodemask+0x2a6/0x2c0
[  432.780977]  filemap_fault+0x437/0x8e0
[  432.782503]  ? lock_acquire+0x51/0x70
[  432.784012]  ? xfs_ilock+0x86/0x190 [xfs]
[  432.785556]  __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[  432.787537]  __do_fault+0x13/0x126
[  432.788930]  __handle_mm_fault+0xc8d/0x11c0
[  432.790570]  handle_mm_fault+0x17a/0x390
[  432.792162]  __do_page_fault+0x24c/0x4d0
[  432.793773]  do_page_fault+0x2a/0x70
[  432.795196]  ? page_fault+0x8/0x30
[  432.796572]  page_fault+0x1e/0x30

[  444.234824] workqueue events_freezable_power_: flags=0x84
[  444.236937]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  444.239138]     in-flight: 5:disk_events_workfn
[  444.241022] workqueue mm_percpu_wq: flags=0x8
[  444.242829]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  444.245057]     pending: vmstat_update, drain_local_pages_wq BAR(498)
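
(Note: "BAR(498)" above means that PID=498 is blocked in flush_work() waiting
for the pending drain_local_pages_wq item. drain_all_pages() queues that work
on mm_percpu_wq and then waits for it, roughly as in the simplified sketch
below; this is paraphrased from mm/page_alloc.c, not the exact source.)

  static DEFINE_PER_CPU(struct work_struct, pcpu_drain);

  void drain_all_pages(struct zone *zone)
  {
  	int cpu;
  	static cpumask_t cpus_with_pcps;

  	/* ... select the CPUs which have per-cpu pages to drain ... */

  	for_each_cpu(cpu, &cpus_with_pcps) {
  		struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);

  		INIT_WORK(work, drain_local_pages_wq);
  		queue_work_on(cpu, mm_percpu_wq, work);
  	}
  	/*
  	 * The caller (PID=498 here) blocks here until the per-CPU work has
  	 * run.  That work cannot run while kworker/0:0 (PID=5) keeps looping
  	 * in the allocator without ever sleeping, because the per-CPU worker
  	 * pool only starts another work item when the running worker sleeps.
  	 */
  	for_each_cpu(cpu, &cpus_with_pcps)
  		flush_work(per_cpu_ptr(&pcpu_drain, cpu));
  }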

>  
> [...]
> 
>> Since the patch shown below was suggested by Michal Hocko at
>> https://marc.info/?l=linux-mm&m=152723708623015 , it is attributed to him.
>>
>> From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Thu, 26 Jul 2018 14:40:03 +0900
>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
>>
>> Tetsuo Handa has reported that it is possible to bypass the short sleep
>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
>> only when there are too many dirty and writeback pages") and lock up the
>> system if OOM.
>>
>> This is because we are implicitly counting on falling back to
>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
>> called due to __zone_watermark_ok() == false.
> 
> How do we rely on that?

Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible()
in __alloc_pages_may_oom(), the drain_local_pages_wq work which a thread doing a
__GFP_FS allocation (PID=498) is waiting for cannot start (see the sketch below).
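
For reference, the relevant part of __alloc_pages_may_oom() looks roughly like
this (condensed from mm/page_alloc.c of this kernel; the last-ditch
get_page_from_freelist() attempt and error handling are omitted, so treat it as
a sketch rather than the exact source):

  static inline struct page *
  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
  		      const struct alloc_context *ac,
  		      unsigned long *did_some_progress)
  {
  	struct oom_control oc = { /* ... */ };
  	struct page *page = NULL;

  	*did_some_progress = 0;

  	/*
  	 * If somebody else holds oom_lock, assume they are making progress
  	 * for us and take a short nap.  This is the only sleep on this path.
  	 */
  	if (!mutex_trylock(&oom_lock)) {
  		*did_some_progress = 1;
  		schedule_timeout_uninterruptible(1);
  		return NULL;
  	}

  	/*
  	 * When the trylock succeeds but out_of_memory() is effectively a
  	 * no-op (as it is for a !__GFP_FS request), we return without
  	 * sleeping at all.
  	 */
  	if (out_of_memory(&oc))
  		*did_some_progress = 1;

  	mutex_unlock(&oom_lock);
  	return page;
  }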

> 
>> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
>> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
>> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
>> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
>> thread is doing __GFP_FS allocation.
> 
> I have a really hard time parsing and understanding this.

You can word it however you like.

> 
>> Tetsuo is observing that GFP_NOIO
>> allocation request from disk_events_workfn() is preventing other pending
>> works from starting.
> 
> What about any other allocation from !PF_WQ_WORKER context? Why do those
> not jump in?

All __GFP_FS allocating threads got stuck either in direct reclaim or waiting
on a workqueue (see the traces above).

> 
>> Since should_reclaim_retry() should be a natural reschedule point,
>> let's do the short sleep for PF_WQ_WORKER threads unconditionally
>> in order to guarantee that other pending works are started.
> 
> OK, this finally makes some sense. But it doesn't explain why it
> resolves the livelock.

As explained above, if disk_events_workfn() (PID=5) explicitly sleeps, the
vmstat_update and drain_local_pages_wq items pending on the WQ_MEM_RECLAIM
mm_percpu_wq workqueue can start; that unblocks PID=498, which is waiting for
drain_local_pages_wq, and allows PID=498 to invoke the OOM killer (see the
sketch below).
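
Specifically, what the patch is meant to guarantee is roughly the following
exit path in should_reclaim_retry() (a sketch of the intent, not the exact
diff):

  	/* ... per-zone __zone_watermark_ok() checks above ... */
  out:
  	/*
  	 * Memory allocation/reclaim might be called from a WQ context and
  	 * the current implementation of the WQ concurrency control doesn't
  	 * recognize that a particular WQ is congested if the worker thread
  	 * is looping without ever sleeping.  Therefore do the short sleep
  	 * for PF_WQ_WORKER threads unconditionally, not only when
  	 * __zone_watermark_ok() succeeded.
  	 */
  	if (current->flags & PF_WQ_WORKER)
  		schedule_timeout_uninterruptible(1);
  	else
  		cond_resched();
  	return ret;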

> 
>> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Your s-o-b is missing again. I have already told you that previously
> when you were posting the patch.

I'm waiting for you to post this change with your wording.

> 
> I do not mind this change per se but I am not happy about _your_ changelog.
> It doesn't explain the underlying problem IMHO. Having a natural and
> unconditional scheduling point in should_reclaim_retry is a reasonable
> thing. But how the heck does it relate to the livelock you are seeing? So
> namely the changelog should explain
> 1) why nobody is able to make forward progress during direct reclaim

Because the GFP_NOIO allocation from one workqueue thread prevented the pending
WQ_MEM_RECLAIM work items from starting: that thread never called
schedule_timeout_*(), since its mutex_trylock(&oom_lock) never failed, since
nobody else was able to reach __alloc_pages_may_oom().

> 2) why nobody is able to trigger oom killer as the last resort

Because the only allocating thread that did not get stuck in direct reclaim was
doing a !__GFP_FS allocation, so it alone was able to call
__alloc_pages_may_oom(), and out_of_memory() bails out for such requests (see
the sketch below).
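
The check which makes out_of_memory() a no-op for that thread is roughly the
following (paraphrased from mm/oom_kill.c of this kernel):

  bool out_of_memory(struct oom_control *oc)
  {
  	/* ... */

  	/*
  	 * The OOM killer does not compensate for IO-less reclaim.
  	 * A !__GFP_FS request is reported as handled without killing
  	 * anything, so the GFP_NOIO allocation above keeps retrying
  	 * forever without ever invoking the OOM killer.
  	 */
  	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
  		return true;

  	/* ... select and kill a victim otherwise ... */
  	return !!oc->chosen;
  }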


