linux-fsdevel.vger.kernel.org archive mirror
From: Michal Hocko <mhocko@kernel.org>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Liu Bo <bo.liu@linux.alibaba.com>, Jan Kara <jack@suse.cz>,
	Dave Chinner <david@fromorbit.com>, Theodore Ts'o <tytso@mit.edu>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Stable tree <stable@vger.kernel.org>
Subject: Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
Date: Thu, 13 Dec 2018 13:39:52 +0100
Message-ID: <20181213123952.GZ1286@dhcp22.suse.cz>
In-Reply-To: <20181213104147.ud2lngxn5avri2zm@kshutemo-mobl1>

On Thu 13-12-18 13:41:47, Kirill A. Shutemov wrote:
> On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
> > : bit of the page which task1 has locked.
> > 
> > More precisely, task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim, and the memcg reclaim for the legacy controller is waiting
> > on the writeback, but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially an ABBA deadlock:
> >                                         lock_page(A)
> >                                         SetPageWriteback(A)
> >                                         unlock_page(A)
> > lock_page(B)
> >                                         lock_page(B)
> > pte_alloc_one
> >   shrink_page_list
> >     wait_on_page_writeback(A)
> >                                         SetPageWriteback(B)
> >                                         unlock_page(B)
> > 
> >                                         # flush A, B to clear the writeback
> > 
> > This accumulating of more pages to flush is used by several filesystems
> > to generate more optimal IO patterns.
> > 
> > Waiting for the writeback in the legacy memcg controller is a workaround
> > for premature OOM killer invocations because there is no dirty IO
> > throttling available for the controller. There is no easy way around
> > that unfortunately. Therefore fix this specific issue by pre-allocating
> > the page table outside of the page lock. We already have handy
> > infrastructure for that, so simply reuse the fault-around pattern
> > which already does this.
> > 
> > There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> > made under a locked fs page but they should be really rare. I am not
> > aware of a better solution unfortunately.
> > 
> > Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Cc: stable
> > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks!

> Will you take care of converting vmf_insert_* to use the pre-allocated
> page table?

I can try, but I would appreciate it if somebody more familiar with the
code could do that. I am busy as hell and I do not want to promise
something I will likely not get to soon.
-- 
Michal Hocko
SUSE Labs
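
For readers following the thread, the ordering change described in the
quoted changelog (allocate the page table before the page lock is taken,
then only consume it under the lock) can be illustrated with a small
user-space sketch. This is not the patch itself and uses no kernel API:
struct page, struct fault_ctx, charged_alloc(), lock_page_sim() and the
other names below are hypothetical stand-ins, and malloc() merely
represents an allocation that could enter memcg reclaim and wait on
writeback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; not kernel structures. */
struct page {
	bool locked;
};

struct fault_ctx {
	struct page *page;
	void *prealloc_pte;	/* page table allocated ahead of time */
};

/* Counts allocations that were made while the page was locked. */
static int allocs_under_lock;

/*
 * Stand-in for a charged allocation. In the scenario from the changelog
 * this is the step that may hit the memcg limit and wait on writeback,
 * so doing it with the page locked is what makes the deadlock possible.
 */
static void *charged_alloc(const struct fault_ctx *ctx)
{
	if (ctx->page && ctx->page->locked)
		allocs_under_lock++;
	return malloc(64);
}

static void lock_page_sim(struct page *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void unlock_page_sim(struct page *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Problematic ordering: the allocation happens under the page lock. */
static void *fault_alloc_under_lock(struct fault_ctx *ctx)
{
	void *pte;

	lock_page_sim(ctx->page);
	pte = charged_alloc(ctx);
	unlock_page_sim(ctx->page);
	return pte;
}

/* Fixed ordering: allocate first, then only consume it under the lock. */
static void *fault_prealloc(struct fault_ctx *ctx)
{
	void *pte;

	ctx->prealloc_pte = charged_alloc(ctx);
	if (!ctx->prealloc_pte)
		return NULL;

	lock_page_sim(ctx->page);
	pte = ctx->prealloc_pte;	/* no allocation while locked */
	ctx->prealloc_pte = NULL;
	unlock_page_sim(ctx->page);
	return pte;
}
```

Calling fault_prealloc() leaves allocs_under_lock untouched, while
fault_alloc_under_lock() increments it; the counter only makes the
ordering visible and does not model reclaim or writeback.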

Thread overview: 22+ messages
2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
2018-12-11 15:15 ` Kirill A. Shutemov
2018-12-11 16:21   ` Michal Hocko
2018-12-11 16:39     ` Jan Kara
2018-12-12  9:42 ` Kirill A. Shutemov
2018-12-12  9:48   ` Michal Hocko
2018-12-12 10:05   ` Jan Kara
2018-12-12 11:58 ` Kirill A. Shutemov
2018-12-12 12:13   ` Michal Hocko
2018-12-12 15:33 ` kbuild test robot
2018-12-12 15:57   ` Michal Hocko
2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
2018-12-12 17:24   ` Shakeel Butt
2018-12-12 17:49   ` Liu Bo
2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
2018-12-13 10:41     ` Kirill A. Shutemov
2018-12-13 12:39       ` Michal Hocko [this message]
2018-12-13 22:04     ` Johannes Weiner
2018-12-14  8:49       ` Michal Hocko
2018-12-13 22:43     ` Liu Bo
2018-12-14 17:13 ` [PATCH] " kbuild test robot
2018-12-14 17:31   ` Michal Hocko
