linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
	Liu Bo <bo.liu@linux.alibaba.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Dave Chinner <david@fromorbit.com>, Theodore Tso <tytso@mit.edu>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Shakeel Butt <shakeelb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.14 15/27] mm, memcg: fix reclaim deadlock with writeback
Date: Tue, 15 Jan 2019 17:36:04 +0100	[thread overview]
Message-ID: <20190115154902.135722496@linuxfoundation.org> (raw)
In-Reply-To: <20190115154901.189747728@linuxfoundation.org>

4.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko <mhocko@suse.com>

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
  LOCKED bit the page which tasks1 has locked"

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path.  So
this is essentially ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_pne
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller.  There is no easy way around
that unfortunately.  Therefore fix this specific issue by pre-allocating
the page table outside of the page lock.  We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare.  I am not
aware of a better solution unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/memory.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3191,6 +3191,29 @@ static int __do_fault(struct vm_fault *v
 	struct vm_area_struct *vma = vmf->vma;
 	int ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_pne
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
+						  vmf->address);
+		if (!vmf->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))



  parent reply	other threads:[~2019-01-15 16:40 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-15 16:35 [PATCH 4.14 00/27] 4.14.94-stable review Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 01/27] x86,kvm: move qemu/guest FPU switching out to vcpu_run Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 02/27] x86, modpost: Replace last remnants of RETPOLINE with CONFIG_RETPOLINE Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 03/27] ALSA: hda/realtek - Support Dell headset mode for New AIO platform Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 04/27] ALSA: hda/realtek - Add unplug function into unplug state of Headset Mode for ALC225 Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 05/27] ALSA: hda/realtek - Disable headset Mic VREF for headset mode of ALC225 Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 06/27] CIFS: Fix adjustment of credits for MTU requests Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 07/27] CIFS: Do not hide EINTR after sending network packets Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 08/27] cifs: Fix potential OOB access of lock element array Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 09/27] usb: cdc-acm: send ZLP for Telit 3G Intel based modems Greg Kroah-Hartman
2019-01-15 16:35 ` [PATCH 4.14 10/27] USB: storage: dont insert sane sense for SPC3+ when bad sense specified Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 11/27] USB: storage: add quirk for SMI SM3350 Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 12/27] USB: Add USB_QUIRK_DELAY_CTRL_MSG quirk for Corsair K70 RGB Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 13/27] slab: alien caches must not be initialized if the allocation of the alien cache failed Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 14/27] mm: page_mapped: dont assume compound page is huge or THP Greg Kroah-Hartman
2019-01-15 16:36 ` Greg Kroah-Hartman [this message]
2019-01-15 16:36 ` [PATCH 4.14 16/27] ACPI: power: Skip duplicate power resource references in _PRx Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 17/27] ACPI / PMIC: xpower: Fix TS-pin current-source handling Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 18/27] i2c: dev: prevent adapter retries and timeout being set as minus value Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 19/27] drm/fb-helper: Partially bring back workaround for bugs of SDL 1.2 Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 20/27] rbd: dont return 0 on unmap if RBD_DEV_FLAG_REMOVING is set Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 21/27] ext4: make sure enough credits are reserved for dioread_nolock writes Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 22/27] ext4: fix a potential fiemap/page fault deadlock w/ inline_data Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 23/27] ext4: avoid kernel warning when writing the superblock to a dead device Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 24/27] ext4: use ext4_write_inode() when fsyncing w/o a journal Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 25/27] ext4: track writeback errors using the generic tracking infrastructure Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 26/27] sunrpc: use-after-free in svc_process_common() Greg Kroah-Hartman
2019-01-15 16:36 ` [PATCH 4.14 27/27] KVM: arm/arm64: Fix VMID alloc race by reverting to lock-less Greg Kroah-Hartman
2019-01-16  1:37 ` [PATCH 4.14 00/27] 4.14.94-stable review shuah
2019-01-16  9:25 ` Jon Hunter
2019-01-16 16:02   ` Greg Kroah-Hartman
2019-01-16 16:56     ` Jon Hunter
2019-01-16 17:11       ` Greg Kroah-Hartman
2019-01-16 17:38         ` Jon Hunter
2019-01-16 17:47           ` Greg Kroah-Hartman
2019-01-16 11:51 ` Naresh Kamboju
2019-01-16 20:37 ` Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190115154902.135722496@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=bo.liu@linux.alibaba.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=shakeelb@google.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).