* FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
@ 2019-01-14 14:57 gregkh
  2019-01-15 15:34 ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: gregkh @ 2019-01-14 14:57 UTC (permalink / raw)
  To: mhocko, akpm, bo.liu, david, hannes, jack, kirill.shutemov,
	shakeelb, stable, torvalds, tytso, vdavydov.dev
  Cc: stable


The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 8 Jan 2019 15:23:07 -0800
Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and task2 is waiting for the
  LOCKED bit of the page which task1 has locked"

More precisely, task1 is handling a page fault and holds a page lock
while it charges a new page table to a memcg.  The charge hits the
memory limit and triggers memcg reclaim, which for the legacy controller
waits on writeback.  That writeback never finishes because the flusher
is itself waiting for the page locked in the #PF path.  So this is
essentially an ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_one
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

Several filesystems accumulate pages like this before flushing them in
order to generate better IO patterns.
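
The interleaving above can be checked with a small wait-for graph.  This
is a minimal userspace sketch, not kernel code: the task names, the
struct and both helpers are hypothetical, and it assumes exactly the two
tasks and two pages from the diagram.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code: the two tasks from the diagram. */
enum task { FAULT_TASK = 0, FLUSHER_TASK = 1, NR_TASKS = 2, NO_TASK = -1 };

struct wait_graph {
	/* waits_for[t] is the task that must make progress before t can. */
	int waits_for[NR_TASKS];
};

/*
 * Build the wait-for edges for the interleaving in the diagram.
 * alloc_under_page_lock == true models allocating the page table (and
 * entering memcg reclaim) while page B is locked.
 */
static struct wait_graph build_graph(bool alloc_under_page_lock)
{
	struct wait_graph g;

	/* Reclaim waits for writeback on A, which only the flusher can end. */
	g.waits_for[FAULT_TASK] = FLUSHER_TASK;

	/*
	 * The flusher needs lock_page(B) before it submits the IO.  It only
	 * blocks on the fault task if that task holds B while it waits.
	 */
	g.waits_for[FLUSHER_TASK] = alloc_under_page_lock ? FAULT_TASK : NO_TASK;

	return g;
}

/* Follow the edges from each task; returning to the start is a cycle. */
static bool has_deadlock(const struct wait_graph *g)
{
	for (int start = 0; start < NR_TASKS; start++) {
		int cur = g->waits_for[start];

		for (int steps = 0; steps < NR_TASKS && cur != NO_TASK; steps++) {
			if (cur == start)
				return true;
			cur = g->waits_for[cur];
		}
	}
	return false;
}
```

Calling has_deadlock() on build_graph(true) reports the cycle, while
build_graph(false), where the allocation happens before lock_page(B),
leaves the flusher free to finish.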

Waiting for writeback in the legacy memcg controller is a workaround for
premature OOM killer invocations, needed because the controller has no
dirty IO throttling.  Unfortunately there is no easy way around that.
Therefore fix this specific issue by preallocating the page table
outside of the page lock.  The infrastructure for that already exists,
so simply reuse the fault-around pattern, which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
made while a filesystem page is locked, but they should be really rare.
Unfortunately I am not aware of a better solution.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/memory.c b/mm/memory.c
index a52663c0612d..5e46836714dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2994,6 +2994,28 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_one
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
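
The pattern the hunk applies, allocating before the lock and consuming
the allocation under it, can be sketched in plain userspace C.  This is
only an illustration under stated assumptions: struct fault_ctx,
alloc_table() and handle_fault() are hypothetical names, and a bool
stands in for the page lock so the ordering rule can be asserted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
struct fault_ctx {
	void *prealloc_table;   /* plays the role of vmf->prealloc_pte */
	void *installed_table;  /* plays the role of a populated pmd */
	bool page_locked;       /* true while the modelled page lock is held */
};

/* Allocation may block in reclaim, so it must never run under the lock. */
static void *alloc_table(const struct fault_ctx *ctx)
{
	assert(!ctx->page_locked);
	return calloc(1, 4096);
}

/* Returns 0 on success, -1 on allocation failure (VM_FAULT_OOM in the patch). */
static int handle_fault(struct fault_ctx *ctx)
{
	/* Step 1: preallocate before locking, only if no table exists yet. */
	if (!ctx->installed_table && !ctx->prealloc_table) {
		ctx->prealloc_table = alloc_table(ctx);
		if (!ctx->prealloc_table)
			return -1;
	}

	/* Step 2: take the lock and consume the preallocated table. */
	ctx->page_locked = true;
	if (!ctx->installed_table) {
		ctx->installed_table = ctx->prealloc_table;
		ctx->prealloc_table = NULL;
	}
	ctx->page_locked = false;

	/* Step 3: drop a leftover preallocation that turned out to be unneeded. */
	free(ctx->prealloc_table);
	ctx->prealloc_table = NULL;
	return 0;
}
```

A second handle_fault() call on the same context allocates nothing,
which mirrors the pmd_none() check in the hunk.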


* Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
  2019-01-14 14:57 FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree gregkh
@ 2019-01-15 15:34 ` Michal Hocko
  2019-01-15 15:51   ` Greg KH
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-15 15:34 UTC (permalink / raw)
  To: gregkh
  Cc: akpm, bo.liu, david, hannes, jack, kirill.shutemov, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

I do not see a straightforward backport of this patch without pulling
more changes in. Has anybody actually hit the issue on those older
kernels? While the issue is possible in principle, I do not remember
anybody complaining.

On Mon 14-01-19 15:57:16, Greg KH wrote:
> [...]

-- 
Michal Hocko
SUSE Labs

* Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
  2019-01-15 15:34 ` Michal Hocko
@ 2019-01-15 15:51   ` Greg KH
  2019-01-15 17:40     ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Greg KH @ 2019-01-15 15:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, bo.liu, david, hannes, jack, kirill.shutemov, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

On Tue, Jan 15, 2019 at 04:34:44PM +0100, Michal Hocko wrote:
> I do not see a straightforward backport of this patch without pulling
> more changes in. Has anybody actually hit the issue on those older
> kernels? While the issue is possible in principle, I do not remember
> anybody complaining.

If no one is complaining, that's fine, you just got this message because
you put this in the commit:

> > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")

Which means any kernel newer than 4.2 (and some older stable releases)
has the issue that this patch is trying to fix.  If it doesn't need to
be backported that far, wonderful!

thanks,

greg k-h

* Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
  2019-01-15 15:51   ` Greg KH
@ 2019-01-15 17:40     ` Michal Hocko
  2019-01-15 18:09       ` Greg KH
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-15 17:40 UTC (permalink / raw)
  To: Greg KH
  Cc: akpm, bo.liu, david, hannes, jack, kirill.shutemov, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

On Tue 15-01-19 16:51:31, Greg KH wrote:
> On Tue, Jan 15, 2019 at 04:34:44PM +0100, Michal Hocko wrote:
> > I do not see a straightforward backport of this patch without pulling
> > more changes in. Has anybody actually hit the issue on those older
> > kernels? While the issue is possible in principle, I do not remember
> > anybody complaining.
> 
> If no one is complaining, that's fine, you just got this message because
> you put this in the commit:
> 
> > > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> 
> Which means any kernel newer than 4.2 (and some older stable releases)
> has the issue that this patch is trying to fix.  If it doesn't need to
> be backported that far, wonderful!

On second thought, we are not really affected all the way back to
c3b94f44fcb0. Page tables have been accounted to memcgs only since
3e79ec7ddc33 ("arch: x86: charge page tables to kmemcg"), i.e. 4.8+.
Without that there is really no memcg reclaim and thus no
wait_on_page_writeback.

So my Fixes tag is a bit misleading. Sorry about that.
-- 
Michal Hocko
SUSE Labs

* Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
  2019-01-15 17:40     ` Michal Hocko
@ 2019-01-15 18:09       ` Greg KH
  2019-01-15 19:57         ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Greg KH @ 2019-01-15 18:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, bo.liu, david, hannes, jack, kirill.shutemov, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

On Tue, Jan 15, 2019 at 06:40:36PM +0100, Michal Hocko wrote:
> On Tue 15-01-19 16:51:31, Greg KH wrote:
> > On Tue, Jan 15, 2019 at 04:34:44PM +0100, Michal Hocko wrote:
> > > I do not see a straightforward backport of this patch without pulling
> > > more changes in. Has anybody actually hit the issue on those older
> > > kernels? While the issue is possible in principle, I do not remember
> > > anybody complaining.
> > 
> > If no one is complaining, that's fine, you just got this message because
> > you put this in the commit:
> > 
> > > > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> > 
> > Which means any kernel newer than 4.2 (and some older stable releases)
> > has the issue that this patch is trying to fix.  If it doesn't need to
> > be backported that far, wonderful!
> 
> On second thought, we are not really affected all the way back to
> c3b94f44fcb0. Page tables have been accounted to memcgs only since
> 3e79ec7ddc33 ("arch: x86: charge page tables to kmemcg"), i.e. 4.8+.
> Without that there is really no memcg reclaim and thus no
> wait_on_page_writeback.
> 
> So my Fixes tag is a bit misleading. Sorry about that.

Not a problem, thanks for looking into it.  As you say 4.8+, this didn't
apply to 4.9 either, so do I need to look into doing a manual backport
there?

thanks,

greg k-h

* Re: FAILED: patch "[PATCH] mm, memcg: fix reclaim deadlock with writeback" failed to apply to 4.4-stable tree
  2019-01-15 18:09       ` Greg KH
@ 2019-01-15 19:57         ` Michal Hocko
  2019-01-16 10:48           ` [PATCH 4.9] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-15 19:57 UTC (permalink / raw)
  To: Greg KH
  Cc: akpm, bo.liu, david, hannes, jack, kirill.shutemov, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

On Tue 15-01-19 19:09:18, Greg KH wrote:
> On Tue, Jan 15, 2019 at 06:40:36PM +0100, Michal Hocko wrote:
> > On Tue 15-01-19 16:51:31, Greg KH wrote:
> > > On Tue, Jan 15, 2019 at 04:34:44PM +0100, Michal Hocko wrote:
> > > > I do not see a straightforward backport of this patch without pulling
> > > > more changes in. Has anybody actually hit the issue on those older
> > > > kernels? While the issue is possible in principle, I do not remember
> > > > anybody complaining.
> > > 
> > > If no one is complaining, that's fine, you just got this message because
> > > you put this in the commit:
> > > 
> > > > > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> > > 
> > > Which means any kernel newer than 4.2 (and some older stable releases)
> > > has the issue that this patch is trying to fix.  If it doesn't need to
> > > be backported that far, wonderful!
> > 
> > > On second thought, we are not really affected all the way back to
> > > c3b94f44fcb0. Page tables have been accounted to memcgs only since
> > > 3e79ec7ddc33 ("arch: x86: charge page tables to kmemcg"), i.e. 4.8+.
> > > Without that there is really no memcg reclaim and thus no
> > > wait_on_page_writeback.
> > > 
> > > So my Fixes tag is a bit misleading. Sorry about that.
> 
> Not a problem, thanks for looking into it.  As you say 4.8+, this didn't
> apply to 4.9 either, so do I need to look into doing a manual backport
> there?

Backporting there should be much easier because the prealloc
infrastructure has been there since 4.8. I will try to find some time
to look into this tomorrow, but I am leaving for a few days off so I
cannot promise anything.

-- 
Michal Hocko
SUSE Labs

* [PATCH 4.9] mm, memcg: fix reclaim deadlock with writeback
  2019-01-15 19:57         ` Michal Hocko
@ 2019-01-16 10:48           ` Michal Hocko
  2019-01-16 11:41             ` Kirill A. Shutemov
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2019-01-16 10:48 UTC (permalink / raw)
  To: Greg KH, kirill.shutemov
  Cc: akpm, bo.liu, david, hannes, jack, shakeelb, stable, torvalds,
	tytso, vdavydov.dev

On Tue 15-01-19 20:57:31, Michal Hocko wrote:
> On Tue 15-01-19 19:09:18, Greg KH wrote:
[...]
> > Not a problem, thanks for looking into it.  As you say 4.8+, this didn't
> > apply to 4.9 either, so do I need to look into doing a manual backport
> > there?
> 
> Backporting there should be much easier because the prealloc
> infrastructure has been there since 4.8. I will try to find some time
> to look into this tomorrow, but I am leaving for a few days off so I
> cannot promise anything.

So here we go. But this really needs another pair of eyes before
merging. Kirill, could you have a look please?
---
From 0e4688669b0a00dfb9c7ffa47064dbe099ef5785 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 8 Jan 2019 15:23:07 -0800
Subject: [PATCH] mm, memcg: fix reclaim deadlock with writeback

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and task2 is waiting for the
  LOCKED bit of the page which task1 has locked"

More precisely, task1 is handling a page fault and holds a page lock
while it charges a new page table to a memcg.  The charge hits the
memory limit and triggers memcg reclaim, which for the legacy controller
waits on writeback.  That writeback never finishes because the flusher
is itself waiting for the page locked in the #PF path.  So this is
essentially an ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_one
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

Several filesystems accumulate pages like this before flushing them in
order to generate better IO patterns.

Waiting for writeback in the legacy memcg controller is a workaround for
premature OOM killer invocations, needed because the controller has no
dirty IO throttling.  Unfortunately there is no easy way around that.
Therefore fix this specific issue by preallocating the page table
outside of the page lock.  The infrastructure for that already exists,
so simply reuse the fault-around pattern, which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
made while a filesystem page is locked, but they should be really rare.
Unfortunately I am not aware of a better solution.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index f3fef1df7402..35d8217bb046 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2823,6 +2823,28 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
 	struct vm_fault vmf;
 	int ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_one
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*fe->pmd) && !fe->prealloc_pte) {
+		fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
+		if (!fe->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = fe->flags;
-- 
2.20.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 4.9] mm, memcg: fix reclaim deadlock with writeback
  2019-01-16 10:48           ` [PATCH 4.9] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
@ 2019-01-16 11:41             ` Kirill A. Shutemov
  2019-01-21 12:21               ` Greg KH
  0 siblings, 1 reply; 9+ messages in thread
From: Kirill A. Shutemov @ 2019-01-16 11:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg KH, akpm, bo.liu, david, hannes, jack, shakeelb, stable,
	torvalds, tytso, vdavydov.dev

On Wed, Jan 16, 2019 at 10:48:54AM +0000, Michal Hocko wrote:
> On Tue 15-01-19 20:57:31, Michal Hocko wrote:
> > On Tue 15-01-19 19:09:18, Greg KH wrote:
> [...]
> > > Not a problem, thanks for looking into it.  As you say 4.8+, this didn't
> > > apply to 4.9 either, so do I need to look into doing a manual backport
> > > there?
> > 
> > > Backporting there should be much easier because the prealloc
> > > infrastructure has been there since 4.8. I will try to find some time
> > > to look into this tomorrow, but I am leaving for a few days off so I
> > > cannot promise anything.
> 
> So here we go. But this really needs another pair of eyes before
> merging. Kirill, could you have a look please?

Looks right to me.

-- 
 Kirill A. Shutemov

* Re: [PATCH 4.9] mm, memcg: fix reclaim deadlock with writeback
  2019-01-16 11:41             ` Kirill A. Shutemov
@ 2019-01-21 12:21               ` Greg KH
  0 siblings, 0 replies; 9+ messages in thread
From: Greg KH @ 2019-01-21 12:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michal Hocko, akpm, bo.liu, david, hannes, jack, shakeelb,
	stable, torvalds, tytso, vdavydov.dev

On Wed, Jan 16, 2019 at 02:41:07PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jan 16, 2019 at 10:48:54AM +0000, Michal Hocko wrote:
> > On Tue 15-01-19 20:57:31, Michal Hocko wrote:
> > > On Tue 15-01-19 19:09:18, Greg KH wrote:
> > [...]
> > > > Not a problem, thanks for looking into it.  As you say 4.8+, this didn't
> > > > apply to 4.9 either, so do I need to look into doing a manual backport
> > > > there?
> > > 
> > > > Backporting there should be much easier because the prealloc
> > > > infrastructure has been there since 4.8. I will try to find some time
> > > > to look into this tomorrow, but I am leaving for a few days off so I
> > > > cannot promise anything.
> > 
> > So here we go. But this really needs another pair of eyes before
> > merging. Kirill, could you have a look please?
> 
> Looks right to me.

Thanks for the review and the patch, now queued up!

greg k-h
