linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm, memcg: fix reclaim deadlock with writeback
@ 2018-12-11 13:26 Michal Hocko
  2018-12-11 15:15 ` Kirill A. Shutemov
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-11 13:26 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim and the memcg reclaim for the legacy controller is waiting on
the writeback, but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially an ABBA deadlock.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have handy
infrastructure for that, so simply reuse the fault-around pattern which
does this.
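
For reference, the fault path consumes such a preallocated table roughly
like this (a simplified sketch based on pte_alloc_one_map() in mm/memory.c;
error paths and the already-populated-pmd fallback are trimmed):

	if (vmf->prealloc_pte) {
		vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
		if (likely(pmd_none(*vmf->pmd))) {
			mm_inc_nr_ptes(vmf->vma->vm_mm);
			pmd_populate(vmf->vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
			vmf->prealloc_pte = NULL;
		}
		spin_unlock(vmf->ptl);
	} else if (unlikely(pte_alloc(vmf->vma->vm_mm, vmf->pmd, vmf->address))) {
		return VM_FAULT_OOM;
	}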

Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
this was originally reported here [1]. While it could be worked around in
the fs, catching the allocation early sounds like a preferable approach.
Liu Bo has noted that he is not able to reproduce this anymore because kmem
accounting has been disabled in their workload, but the change should be
quite straightforward to review.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
performed under a locked fs page, but they should be really rare. I am not
aware of a better solution unfortunately.

I would appreciate it if Kirill could have a look and double check that I
am not doing something stupid here.

Debugging lock_page deadlocks is an absolute PITA given the lack of lockdep
support, so I would mark this for stable.

[1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..1a73d2d4659e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
@ 2018-12-11 15:15 ` Kirill A. Shutemov
  2018-12-11 16:21   ` Michal Hocko
  2018-12-12  9:42 ` Kirill A. Shutemov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2018-12-11 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Michal Hocko

On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> this has been originally reported here [1]. While it could get worked
> around in the fs, catching the allocation early sounds like a preferable
> approach. Liu Bo has noted that he is not able to reproduce this anymore
> because kmem accounting has been disabled in their workload but this
> should be quite straightforward to review.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> I would appreciate if Kirril could have a look and double check I am not
> doing something stupid here.
> 
> Debugging lock_page deadlocks is an absolute PITA considering a lack of
> lockdep support so I would mark it for stable.
> 
> [1] http://lkml.kernel.org/r/1540858969-75803-1-git-send-email-bo.liu@linux.alibaba.com
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..1a73d2d4659e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))

Sorry, but I don't think it fixes anything. It just hides the problem a
level deeper.

The trick with ->prealloc_pte works for faultaround because we can rely on
->map_pages() not to sleep and we know how it will set up the page table
entry. Basically, the core controls most of the path.

It's not the case with ->fault(). It is free to sleep and allocate
whatever it wants.

For instance, a DAX page fault will set up the page table entry on its own
and return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to set up the page
table and ignores your pre-allocated page table.
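
Schematically, such a handler can end up doing just

	/* hypothetical driver/DAX-style ->fault() body; pfn resolved earlier */
	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);

and never looking at vmf->prealloc_pte at all.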

But it's just an example. The problem is that ->fault() is not bounded in
what it can do, unlike ->map_pages().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 15:15 ` Kirill A. Shutemov
@ 2018-12-11 16:21   ` Michal Hocko
  2018-12-11 16:39     ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-12-11 16:21 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML

On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> > +
> >  	ret = vma->vm_ops->fault(vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >  			    VM_FAULT_DONE_COW)))
> 
> Sorry, but I don't think it fixes anything. Just hides it a level deeper.
> 
> The trick with ->prealloc_pte works for faultaround because we can rely on
> ->map_pages() to not sleep and we know how it will setup page table entry.
> Basically, core controls most of the path.
> 
> It's not the case with ->fault(). It is free to sleep and allocate
> whatever it wants.

Yeah, but if the fault callback wants to allocate, then it has to consider
the usual allocation restrictions, e.g. NOFS if the allocation itself can
trip over fs locks.
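
E.g. something along these lines in a hypothetical ->fault() implementation
(not taken from any particular filesystem):

	unsigned int nofs_flags = memalloc_nofs_save();

	/* any GFP_KERNEL allocation in this scope behaves as GFP_NOFS */
	struct page *page = alloc_page(GFP_KERNEL);

	memalloc_nofs_restore(nofs_flags);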

> For instance, DAX page fault will setup page table entry on its own and
> return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> and ignores your pre-allocated page table.

Does this happen with a page locked and with a __GFP_ACCOUNT allocation? I
am not familiar with that code, but I do not see it from a quick look.
 
> But it's just an example. The problem is that ->fault() is not bounded on
> what it can do, unlike ->map_pages().

That is a fair point, but the primary issue here is that the generic #PF
code breaks the underlying assumption and performs a
__GFP_ACCOUNT|GFP_KERNEL allocation while holding an fs-owned page lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 16:21   ` Michal Hocko
@ 2018-12-11 16:39     ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2018-12-11 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrew Morton, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Johannes Weiner,
	Vladimir Davydov, linux-mm, linux-fsdevel, LKML

On Tue 11-12-18 17:21:49, Michal Hocko wrote:
> On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> > For instance, DAX page fault will setup page table entry on its own and
> > return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> > and ignores your pre-allocated page table.
> 
> Does this happen with a page locked and with __GFP_ACCOUNT allocation. I
> am not familiar with that code but I do not see it from a quick look.

DAX has no page to lock and also no writeback to do, so the deadlock isn't
really possible when DAX is in use...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
  2018-12-11 15:15 ` Kirill A. Shutemov
@ 2018-12-12  9:42 ` Kirill A. Shutemov
  2018-12-12  9:48   ` Michal Hocko
  2018-12-12 10:05   ` Jan Kara
  2018-12-12 11:58 ` Kirill A. Shutemov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Kirill A. Shutemov @ 2018-12-12  9:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Michal Hocko, Hugh Dickins

On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.

Side node:

Do we have the PG_writeback vs. PG_locked ordering documented somewhere?

IIUC, the trace from task2 suggests that we must not wait for writeback
on the locked page.

But that's not what I see for many wait_on_page_writeback() users: it is
usually called with the page locked. I see it for truncate, shmem,
swapfile, splice...
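
I.e. the common pattern in those callers is roughly

	lock_page(page);
	wait_on_page_writeback(page);	/* waiting on the very page we hold locked */
	/* ... operate on a page with no IO in flight ... */
	unlock_page(page);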

Maybe the problem is within task2 codepath after all?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12  9:42 ` Kirill A. Shutemov
@ 2018-12-12  9:48   ` Michal Hocko
  2018-12-12 10:05   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-12  9:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Hugh Dickins

On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> 
> Side node:
> 
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

I am not aware of any.

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
> 
> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
> 
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization multiple
filesystems do. They lock and set writeback on multiple pages and then send
a larger IO at once. So in this case we have the following pattern:

                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)
lock_page(A)
                                        lock_page(A)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12  9:42 ` Kirill A. Shutemov
  2018-12-12  9:48   ` Michal Hocko
@ 2018-12-12 10:05   ` Jan Kara
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Kara @ 2018-12-12 10:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Michal Hocko, Andrew Morton, Liu Bo, Jan Kara, Dave Chinner,
	Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-fsdevel, LKML, Michal Hocko, Hugh Dickins

On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> 
> Side node:
> 
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?
> 
> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.

Well, waiting on writeback of page A when A is locked has always been fine.
After all, that's the only easy way to make sure you really have a page for
which no IO is running, as the page lock protects you from a new writeback
attempt starting.

Waiting on writeback of page B while having page A locked *is* problematic
and prone to deadlocks due to code paths like in task2.

> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
> 
> Maybe the problem is within task2 codepath after all?

So ->writepages() methods in filesystems have the property that to complete
writeback on the page with index X, they may need the page lock of page
X+1. I agree that this is a bit hairy, but from the fs point of view it
makes a lot of sense and AFAIK nothing besides the memcg IO throttling can
create deadlocks with such a locking scheme.
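
Schematically the batching looks something like this (not actual ext4 code;
next_dirty_page() and add_page_to_bio() are made-up helpers):

	while ((page = next_dirty_page(mapping))) {
		lock_page(page);
		set_page_writeback(page);
		unlock_page(page);
		add_page_to_bio(bio, page);
		/*
		 * PG_writeback of the pages already collected in 'bio' cannot
		 * clear before the bio is submitted and completes, yet getting
		 * here may require lock_page() on the next page in the range.
		 */
	}
	submit_bio(bio);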

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
  2018-12-11 15:15 ` Kirill A. Shutemov
  2018-12-12  9:42 ` Kirill A. Shutemov
@ 2018-12-12 11:58 ` Kirill A. Shutemov
  2018-12-12 12:13   ` Michal Hocko
  2018-12-12 15:33 ` kbuild test robot
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2018-12-12 11:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Michal Hocko

On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> Hi,
> this has been originally reported here [1]. While it could get worked
> around in the fs, catching the allocation early sounds like a preferable
> approach. Liu Bo has noted that he is not able to reproduce this anymore
> because kmem accounting has been disabled in their workload but this
> should be quite straightforward to review.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.

Okay, I have spent some time on the issue and was not able to find a better
solution either. But I cannot say I like it.

I think we need to spend more time on making ->prealloc_pte useful: it
looks like it would help to convert the vmf_insert_* helpers to take struct
vm_fault * as input and propagate it down to the pmd population point.
Otherwise DAX and drivers would allocate the page table for nothing.

Have you considered whether we need anything similar for the anon path? Is
it possible to have a similar deadlock with swapping rather than writeback?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12 11:58 ` Kirill A. Shutemov
@ 2018-12-12 12:13   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-12 12:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML

On Wed 12-12-18 14:58:37, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> > 
> > Waiting for the writeback in legacy memcg controller is a workaround
> > for pre-mature OOM killer invocations because there is no dirty IO
> > throttling available for the controller. There is no easy way around
> > that unfortunately. Therefore fix this specific issue by pre-allocating
> > the page table outside of the page lock. We have that handy
> > infrastructure for that already so simply reuse the fault-around pattern
> > which already does this.
> > 
> > Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> > Hi,
> > this has been originally reported here [1]. While it could get worked
> > around in the fs, catching the allocation early sounds like a preferable
> > approach. Liu Bo has noted that he is not able to reproduce this anymore
> > because kmem accounting has been disabled in their workload but this
> > should be quite straightforward to review.
> > 
> > There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> > from under a fs page locked but they should be really rare. I am not
> > aware of a better solution unfortunately.
> 
> Okay, I have spent some time on the issue and was not able to find a
> better solution too. But I cannot say I like it.

Ohh, I do not like it either. I can make it more targeted by abstracting
sane_reclaim() and using it for the check, but I am not sure that would
really be more helpful.
 
> I think we need to spend more time on making ->prealloc_pte useful: looks
> like it would help to convert vmf_insert_* helpers to take struct vm_fault
> * as input and propagate it down to pmd population point. Otherwise DAX
> and drivers would alloacate the page table for nothing.

Yes, this would be an obvious enhancement.

> Have you considered if we need anything similar for anon path? Is it
> possible to have similar deadlock with swaping rather than writeback?

No, I do not see the anon path allocating under a page lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
                   ` (2 preceding siblings ...)
  2018-12-12 11:58 ` Kirill A. Shutemov
@ 2018-12-12 15:33 ` kbuild test robot
  2018-12-12 15:57   ` Michal Hocko
  2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
  2018-12-14 17:13 ` [PATCH] " kbuild test robot
  5 siblings, 1 reply; 22+ messages in thread
From: kbuild test robot @ 2018-12-12 15:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Johannes Weiner,
	Vladimir Davydov, linux-mm, linux-fsdevel, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]

Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: x86_64-randconfig-x000-201849 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
>> mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function); did you mean 'hmm'?
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                     ^~
                                                     hmm
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30629 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
                   ` (3 preceding siblings ...)
  2018-12-12 15:33 ` kbuild test robot
@ 2018-12-12 15:50 ` Michal Hocko
  2018-12-12 17:24   ` Shakeel Butt
                     ` (2 more replies)
  2018-12-14 17:13 ` [PATCH] " kbuild test robot
  5 siblings, 3 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-12 15:50 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback
task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg. That in turn hits a memory
limit reclaim and the memcg reclaim for the legacy controller is waiting on
the writeback, but that is never going to finish because the writeback
itself is waiting for the page locked in the #PF path. So this is
essentially an ABBA deadlock.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have handy
infrastructure for that, so simply reuse the fault-around pattern which
does this.

Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12 15:33 ` kbuild test robot
@ 2018-12-12 15:57   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-12 15:57 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Johannes Weiner,
	Vladimir Davydov, linux-mm, linux-fsdevel, LKML

On Wed 12-12-18 23:33:10, kbuild test robot wrote:
> Hi Michal,
> 
> I love your patch! Yet something to improve:

Well, I hate it ;) (like all obviously broken patches). Sorry, this is a
typo. v2 has been sent out.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
@ 2018-12-12 17:24   ` Shakeel Butt
  2018-12-12 17:49   ` Liu Bo
  2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
  2 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2018-12-12 17:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, bo.liu, Jan Kara, david,
	tytso, Johannes Weiner, Vladimir Davydov, Linux MM,
	linux-fsdevel, LKML, Michal Hocko

On Wed, Dec 12, 2018 at 7:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
>
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.
>
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.

Michal, can you please add the following para, which was in the first
version, to the commit message as well? This fact should be documented at
least in the commit message.

>
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
>

thanks,
Shakeel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
  2018-12-12 17:24   ` Shakeel Butt
@ 2018-12-12 17:49   ` Liu Bo
  2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
  2 siblings, 0 replies; 22+ messages in thread
From: Liu Bo @ 2018-12-12 17:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Jan Kara, Dave Chinner,
	Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-fsdevel, LKML, Michal Hocko

On Wed, Dec 12, 2018 at 04:50:55PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock.

Thanks for the patch, Michal.

Could you please append the following (quoted from your reply in the other
thread)? It would make it much easier for reviewers to see what was
happening.

-----------------------------------------------------------------
                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)
lock_page(A)
                                        lock_page(A)
pte_alloc_one
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback 
-----------------------------------------------------------------

thanks,
-liubo

> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..bb78e90a9b70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
> -- 
> 2.19.2

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
  2018-12-12 17:24   ` Shakeel Butt
  2018-12-12 17:49   ` Liu Bo
@ 2018-12-13  9:22   ` Michal Hocko
  2018-12-13 10:41     ` Kirill A. Shutemov
                       ` (2 more replies)
  2 siblings, 3 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-13  9:22 UTC (permalink / raw)
  To: Andrew Morton, Kirill A. Shutemov
  Cc: Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Shakeel Butt, Michal Hocko, Stable tree

From: Michal Hocko <mhocko@suse.com>

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
ext4 writeback:
task1:
[<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
[<ffffffff811c5777>] shrink_page_list+0x907/0x960
[<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
[<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
[<ffffffff811c70a8>] shrink_node+0xd8/0x300
[<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
[<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
[<ffffffff8122df2d>] try_charge+0x14d/0x720
[<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
[<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
[<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
[<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
[<ffffffff81074247>] pte_alloc_one+0x17/0x40
[<ffffffff811e34de>] __pte_alloc+0x1e/0x110
[<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
[<ffffffff811e5d93>] do_fault+0x103/0x970
[<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
[<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
[<ffffffff8106ecb0>] do_page_fault+0x30/0x80
[<ffffffff8171bce8>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

task2:
[<ffffffff811aadc6>] __lock_page+0x86/0xa0
[<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
[<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
[<ffffffff811bbede>] do_writepages+0x1e/0x30
[<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
[<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
[<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
[<ffffffff81273568>] wb_writeback+0x268/0x300
[<ffffffff81273d24>] wb_workfn+0xb4/0x390
[<ffffffff810a2f19>] process_one_work+0x189/0x420
[<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
[<ffffffff810a9786>] kthread+0xe6/0x100
[<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
[<ffffffffffffffff>] 0xffffffffffffffff

He adds:
: task1 is waiting for the PageWriteback bit of the page that task2 has
: collected in mpd->io_submit->io_bio, and task2 is waiting for the LOCKED
: bit of the page which task1 has locked.

More precisely, task1 is handling a page fault and has a page locked
while it charges a new page table to a memcg. That in turn hits the memory
limit and the memcg reclaim for the legacy controller ends up waiting on
writeback, which is never going to finish because the writeback itself is
waiting for the page locked in the #PF path. So this is essentially an
ABBA deadlock:
                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
lock_page(B)
                                        lock_page(B)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulation of more pages to flush is used by several filesystems
to generate more optimal IO patterns.

Waiting for the writeback in the legacy memcg controller is a workaround
for premature OOM killer invocations because there is no dirty IO
throttling available for the controller. There is no easy way around
that unfortunately. Therefore fix this specific issue by pre-allocating
the page table outside of the page lock. We already have the
infrastructure for that, so simply reuse the fault-around pattern which
already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page, but they should be really rare. I am not
aware of a better solution unfortunately.

Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: stable
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..bb78e90a9b70 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
-- 
2.19.2
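
[ Side note for readers unfamiliar with the fault-around infrastructure the
  patch reuses: the preallocated table is consumed later in the fault path,
  roughly along the lines of the simplified sketch below. This is an
  approximation for illustration only, not the verbatim mm/memory.c code of
  that kernel generation (the real logic lives around pte_alloc_one_map()/
  alloc_set_pte()); the helper name is made up, and locking corner cases,
  THP handling and freeing of an unused vmf->prealloc_pte are omitted. ]

/* Sketch only: how vmf->prealloc_pte is consumed after ->fault() returns. */
static vm_fault_t consume_prealloc_pte_sketch(struct vm_fault *vmf)
{
	struct mm_struct *mm = vmf->vma->vm_mm;

	if (!pmd_none(*vmf->pmd))
		return 0;	/* a pte table is already installed */

	if (vmf->prealloc_pte) {
		/* use the table allocated before the page lock was taken */
		vmf->ptl = pmd_lock(mm, vmf->pmd);
		if (likely(pmd_none(*vmf->pmd))) {
			mm_inc_nr_ptes(mm);
			pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
			vmf->prealloc_pte = NULL;
		}
		spin_unlock(vmf->ptl);
		return 0;
	}

	/*
	 * Without a preallocated table the allocation would happen right
	 * here, i.e. under the page lock - the memcg charged allocation
	 * that the patch moves out of the locked section.
	 */
	if (unlikely(pte_alloc(mm, vmf->pmd, vmf->address)))
		return VM_FAULT_OOM;
	return 0;
}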


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
@ 2018-12-13 10:41     ` Kirill A. Shutemov
  2018-12-13 12:39       ` Michal Hocko
  2018-12-13 22:04     ` Johannes Weiner
  2018-12-13 22:43     ` Liu Bo
  2 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2018-12-13 10:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Shakeel Butt, Michal Hocko, Stable tree

On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock:
>                                         lock_page(A)
>                                         SetPageWriteback(A)
>                                         unlock_page(A)
> lock_page(B)
>                                         lock_page(B)
> pte_alloc_pne
>   shrink_page_list
>     wait_on_page_writeback(A)
>                                         SetPageWriteback(B)
>                                         unlock_page(B)
> 
>                                         # flush A, B to clear the writeback
> 
> This accumulating of more pages to flush is used by several filesystems
> to generate a more optimal IO patterns.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Will you take care of converting vmf_insert_* to use the pre-allocated
page table?

-- 
 Kirill A. Shutemov
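
[ For context on the question above: the vmf_insert_* helpers, e.g.
  vmf_insert_pfn(), resolve the pte via get_locked_pte() and allocate the
  pte table themselves when the pmd entry is still empty, instead of
  consuming a table preallocated in vmf->prealloc_pte. A heavily simplified
  sketch with a hypothetical driver fault handler, for illustration only: ]

/*
 * Sketch only - hypothetical driver ->fault handler. The page table
 * allocation happens inside vmf_insert_pfn(), not in the generic fault
 * code, so it cannot (yet) use vmf->prealloc_pte.
 */
static vm_fault_t example_driver_fault(struct vm_fault *vmf)
{
	unsigned long pfn = 0;	/* device pfn lookup omitted */

	/* driver specific locking/setup would go here */
	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}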

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-13 10:41     ` Kirill A. Shutemov
@ 2018-12-13 12:39       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-13 12:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Liu Bo, Jan Kara, Dave Chinner, Theodore Ts'o,
	Johannes Weiner, Vladimir Davydov, linux-mm, linux-fsdevel, LKML,
	Shakeel Butt, Stable tree

On Thu 13-12-18 13:41:47, Kirill A. Shutemov wrote:
> On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock:
> >                                         lock_page(A)
> >                                         SetPageWriteback(A)
> >                                         unlock_page(A)
> > lock_page(B)
> >                                         lock_page(B)
> > pte_alloc_pne
> >   shrink_page_list
> >     wait_on_page_writeback(A)
> >                                         SetPageWriteback(B)
> >                                         unlock_page(B)
> > 
> >                                         # flush A, B to clear the writeback
> > 
> > This accumulating of more pages to flush is used by several filesystems
> > to generate a more optimal IO patterns.
> > 
> > Waiting for the writeback in legacy memcg controller is a workaround
> > for pre-mature OOM killer invocations because there is no dirty IO
> > throttling available for the controller. There is no easy way around
> > that unfortunately. Therefore fix this specific issue by pre-allocating
> > the page table outside of the page lock. We have that handy
> > infrastructure for that already so simply reuse the fault-around pattern
> > which already does this.
> > 
> > There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> > from under a fs page locked but they should be really rare. I am not
> > aware of a better solution unfortunately.
> > 
> > Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> > Cc: stable
> > Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks!

> Will you take care about converting vmf_insert_* to use the pre-allocated
> page table?

I can try, but I would appreciate it if somebody more familiar with the code
could do that. I am busy as hell and I do not want to promise something I
will likely not get to soon.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
  2018-12-13 10:41     ` Kirill A. Shutemov
@ 2018-12-13 22:04     ` Johannes Weiner
  2018-12-14  8:49       ` Michal Hocko
  2018-12-13 22:43     ` Liu Bo
  2 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2018-12-13 22:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Vladimir Davydov, linux-mm,
	linux-fsdevel, LKML, Shakeel Butt, Michal Hocko, Stable tree

On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock:
>                                         lock_page(A)
>                                         SetPageWriteback(A)
>                                         unlock_page(A)
> lock_page(B)
>                                         lock_page(B)
> pte_alloc_pne
>   shrink_page_list
>     wait_on_page_writeback(A)
>                                         SetPageWriteback(B)
>                                         unlock_page(B)
> 
>                                         # flush A, B to clear the writeback
> 
> This accumulating of more pages to flush is used by several filesystems
> to generate a more optimal IO patterns.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
> 
> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Just one nit:

> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}

Could you be more specific in the deadlock comment? git blame will
work fine for a while, but it becomes a pain to find corresponding
patches after stuff gets moved around for years.

In particular the race diagram between reclaim with a page lock held
and the fs doing SetPageWriteback batches before kicking off IO would
be useful directly in the code, IMO.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
  2018-12-13 10:41     ` Kirill A. Shutemov
  2018-12-13 22:04     ` Johannes Weiner
@ 2018-12-13 22:43     ` Liu Bo
  2 siblings, 0 replies; 22+ messages in thread
From: Liu Bo @ 2018-12-13 22:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Kirill A. Shutemov, Jan Kara, Dave Chinner,
	Theodore Ts'o, Johannes Weiner, Vladimir Davydov, linux-mm,
	linux-fsdevel, LKML, Shakeel Butt, Michal Hocko, Stable tree

On Thu, Dec 13, 2018 at 10:22:21AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> ext4 writeback
> task1:
> [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> [<ffffffff8122df2d>] try_charge+0x14d/0x720
> [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> [<ffffffff811e5d93>] do_fault+0x103/0x970
> [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> [<ffffffff8171bce8>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> task2:
> [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> [<ffffffff811bbede>] do_writepages+0x1e/0x30
> [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> [<ffffffff81273568>] wb_writeback+0x268/0x300
> [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> [<ffffffff810a2f19>] process_one_work+0x189/0x420
> [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> [<ffffffff810a9786>] kthread+0xe6/0x100
> [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> He adds
> : task1 is waiting for the PageWriteback bit of the page that task2 has
> : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> : bit the page which tasks1 has locked.
> 
> More precisely task1 is handling a page fault and it has a page locked
> while it charges a new page table to a memcg. That in turn hits a memory
> limit reclaim and the memcg reclaim for legacy controller is waiting on
> the writeback but that is never going to finish because the writeback
> itself is waiting for the page locked in the #PF path. So this is
> essentially ABBA deadlock:
>                                         lock_page(A)
>                                         SetPageWriteback(A)
>                                         unlock_page(A)
> lock_page(B)
>                                         lock_page(B)
> pte_alloc_pne
>   shrink_page_list
>     wait_on_page_writeback(A)
>                                         SetPageWriteback(B)
>                                         unlock_page(B)
> 
>                                         # flush A, B to clear the writeback
> 
> This accumulating of more pages to flush is used by several filesystems
> to generate a more optimal IO patterns.
> 
> Waiting for the writeback in legacy memcg controller is a workaround
> for pre-mature OOM killer invocations because there is no dirty IO
> throttling available for the controller. There is no easy way around
> that unfortunately. Therefore fix this specific issue by pre-allocating
> the page table outside of the page lock. We have that handy
> infrastructure for that already so simply reuse the fault-around pattern
> which already does this.
> 
> There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
> from under a fs page locked but they should be really rare. I am not
> aware of a better solution unfortunately.
> 

Thanks for the update.

Looks good to me.

Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>

thanks,
-liubo

> Reported-and-Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
> Cc: stable
> Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ad2d293ddc2..bb78e90a9b70 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret;
>  
> +	/*
> +	 * Preallocate pte before we take page_lock because this might lead to
> +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> +	 */
> +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> +		if (!vmf->prealloc_pte)
> +			return VM_FAULT_OOM;
> +		smp_wmb(); /* See comment in __pte_alloc() */
> +	}
> +
>  	ret = vma->vm_ops->fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
>  			    VM_FAULT_DONE_COW)))
> -- 
> 2.19.2

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] mm, memcg: fix reclaim deadlock with writeback
  2018-12-13 22:04     ` Johannes Weiner
@ 2018-12-14  8:49       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-14  8:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Vladimir Davydov, linux-mm,
	linux-fsdevel, LKML, Shakeel Butt, Stable tree

On Thu 13-12-18 17:04:00, Johannes Weiner wrote:
[...]
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

> Just one nit:
> 
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> 
> Could you be more specific in the deadlock comment? git blame will
> work fine for a while, but it becomes a pain to find corresponding
> patches after stuff gets moved around for years.
> 
> In particular the race diagram between reclaim with a page lock held
> and the fs doing SetPageWriteback batches before kicking off IO would
> be useful directly in the code, IMO.

This?

diff --git a/mm/memory.c b/mm/memory.c
index bb78e90a9b70..ece221e4da6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2995,7 +2995,18 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
 	/*
 	 * Preallocate pte before we take page_lock because this might lead to
-	 * deadlocks for memcg reclaim which waits for pages under writeback.
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 * 				lock_page(A)
+	 * 				SetPageWriteback(A)
+	 * 				unlock_page(A)
+	 * lock_page(B)
+	 * 				lock_page(B)
+	 * pte_alloc_pne
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *     				SetPageWriteback(B)
+	 *     				unlock_page(B)
+	 *     				# flush A, B to clear the writeback
 	 */
 	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
                   ` (4 preceding siblings ...)
  2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
@ 2018-12-14 17:13 ` kbuild test robot
  2018-12-14 17:31   ` Michal Hocko
  5 siblings, 1 reply; 22+ messages in thread
From: kbuild test robot @ 2018-12-14 17:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Johannes Weiner,
	Vladimir Davydov, linux-mm, linux-fsdevel, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2943 bytes --]

Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181214]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-memcg-fix-reclaim-deadlock-with-writeback/20181212-224633
config: nds32-allmodconfig (attached as .config)
compiler: nds32le-linux-gcc (GCC) 6.4.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=6.4.0 make.cross ARCH=nds32 

All errors (new ones prefixed by >>):

   mm/memory.c: In function '__do_fault':
   mm/memory.c:3001:45: error: 'struct vm_area_struct' has no member named 'vm'
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                ^~
>> mm/memory.c:3001:50: error: 'mm' undeclared (first use in this function)
      vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
                                                     ^~
   mm/memory.c:3001:50: note: each undeclared identifier is reported only once for each function it appears in

vim +/mm +3001 mm/memory.c

  2985	
  2986	/*
  2987	 * The mmap_sem must have been held on entry, and may have been
  2988	 * released depending on flags and vma->vm_ops->fault() return value.
  2989	 * See filemap_fault() and __lock_page_retry().
  2990	 */
  2991	static vm_fault_t __do_fault(struct vm_fault *vmf)
  2992	{
  2993		struct vm_area_struct *vma = vmf->vma;
  2994		vm_fault_t ret;
  2995	
  2996		/*
  2997		 * Preallocate pte before we take page_lock because this might lead to
  2998		 * deadlocks for memcg reclaim which waits for pages under writeback.
  2999		 */
  3000		if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> 3001			vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm>mm, vmf->address);
  3002			if (!vmf->prealloc_pte)
  3003				return VM_FAULT_OOM;
  3004			smp_wmb(); /* See comment in __pte_alloc() */
  3005		}
  3006	
  3007		ret = vma->vm_ops->fault(vmf);
  3008		if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
  3009				    VM_FAULT_DONE_COW)))
  3010			return ret;
  3011	
  3012		if (unlikely(PageHWPoison(vmf->page))) {
  3013			if (ret & VM_FAULT_LOCKED)
  3014				unlock_page(vmf->page);
  3015			put_page(vmf->page);
  3016			vmf->page = NULL;
  3017			return VM_FAULT_HWPOISON;
  3018		}
  3019	
  3020		if (unlikely(!(ret & VM_FAULT_LOCKED)))
  3021			lock_page(vmf->page);
  3022		else
  3023			VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);
  3024	
  3025		return ret;
  3026	}
  3027	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 48501 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
  2018-12-14 17:13 ` [PATCH] " kbuild test robot
@ 2018-12-14 17:31   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-12-14 17:31 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, Kirill A. Shutemov, Liu Bo, Jan Kara,
	Dave Chinner, Theodore Ts'o, Johannes Weiner,
	Vladimir Davydov, linux-mm, linux-fsdevel, LKML

On Sat 15-12-18 01:13:53, kbuild test robot wrote:
> Hi Michal,
> 
> I love your patch! Yet something to improve:

Could you point the bot to v3 please? http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-12-14 17:32 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-11 13:26 [PATCH] mm, memcg: fix reclaim deadlock with writeback Michal Hocko
2018-12-11 15:15 ` Kirill A. Shutemov
2018-12-11 16:21   ` Michal Hocko
2018-12-11 16:39     ` Jan Kara
2018-12-12  9:42 ` Kirill A. Shutemov
2018-12-12  9:48   ` Michal Hocko
2018-12-12 10:05   ` Jan Kara
2018-12-12 11:58 ` Kirill A. Shutemov
2018-12-12 12:13   ` Michal Hocko
2018-12-12 15:33 ` kbuild test robot
2018-12-12 15:57   ` Michal Hocko
2018-12-12 15:50 ` [PATCH v2] " Michal Hocko
2018-12-12 17:24   ` Shakeel Butt
2018-12-12 17:49   ` Liu Bo
2018-12-13  9:22   ` [PATCH v3] " Michal Hocko
2018-12-13 10:41     ` Kirill A. Shutemov
2018-12-13 12:39       ` Michal Hocko
2018-12-13 22:04     ` Johannes Weiner
2018-12-14  8:49       ` Michal Hocko
2018-12-13 22:43     ` Liu Bo
2018-12-14 17:13 ` [PATCH] " kbuild test robot
2018-12-14 17:31   ` Michal Hocko
