linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
	Liu Bo <bo.liu@linux.alibaba.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Johannes Weiner <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
	Dave Chinner <david@fromorbit.com>, Theodore Tso <tytso@mit.edu>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Shakeel Butt <shakeelb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.9 51/51] mm, memcg: fix reclaim deadlock with writeback
Date: Mon, 21 Jan 2019 14:44:47 +0100	[thread overview]
Message-ID: <20190121122458.832467342@linuxfoundation.org> (raw)
In-Reply-To: <20190121122453.700446926@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko <mhocko@suse.com>

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
  LOCKED bit the page which tasks1 has locked"

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path.  So
this is essentially ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_pne
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller.  There is no easy way around
that unfortunately.  Therefore fix this specific issue by pre-allocating
the page table outside of the page lock.  We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare.  I am not
aware of a better solution unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/memory.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2823,6 +2823,28 @@ static int __do_fault(struct fault_env *
 	struct vm_fault vmf;
 	int ret;
 
+	/*
+	 * Preallocate pte before we take page_lock because this might lead to
+	 * deadlocks for memcg reclaim which waits for pages under writeback:
+	 *				lock_page(A)
+	 *				SetPageWriteback(A)
+	 *				unlock_page(A)
+	 * lock_page(B)
+	 *				lock_page(B)
+	 * pte_alloc_pne
+	 *   shrink_page_list
+	 *     wait_on_page_writeback(A)
+	 *				SetPageWriteback(B)
+	 *				unlock_page(B)
+	 *				# flush A, B to clear the writeback
+	 */
+	if (pmd_none(*fe->pmd) && !fe->prealloc_pte) {
+		fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
+		if (!fe->prealloc_pte)
+			return VM_FAULT_OOM;
+		smp_wmb(); /* See comment in __pte_alloc() */
+	}
+
 	vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
 	vmf.pgoff = pgoff;
 	vmf.flags = fe->flags;



  parent reply	other threads:[~2019-01-21 13:56 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-21 13:43 [PATCH 4.9 00/51] 4.9.152-stable review Greg Kroah-Hartman
2019-01-21 13:43 ` [PATCH 4.9 01/51] tty/ldsem: Wake up readers after timed out down_write() Greg Kroah-Hartman
2019-01-21 13:43 ` [PATCH 4.9 02/51] tty: Hold tty_ldisc_lock() during tty_reopen() Greg Kroah-Hartman
2019-01-21 13:43 ` [PATCH 4.9 03/51] tty: Simplify tty->count math in tty_reopen() Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 04/51] tty: Dont hold ldisc lock in tty_reopen() if ldisc present Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 05/51] can: gw: ensure DLC boundaries after CAN frame modification Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 06/51] Revert "f2fs: do not recover from previous remained wrong dnodes" Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 07/51] media: em28xx: Fix misplaced reset of dev->v4l::field_count Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 08/51] proc: Remove empty line in /proc/self/status Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 09/51] Revert "scsi: target: iscsi: cxgbit: fix csk leak" Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 10/51] scsi: target: iscsi: cxgbit: fix csk leak Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 11/51] arm64/kvm: consistently handle host HCR_EL2 flags Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 12/51] arm64: Dont trap host pointer auth use to EL2 Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 13/51] ipv6: fix kernel-infoleak in ipv6_local_error() Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 14/51] net: bridge: fix a bug on using a neighbour cache entry without checking its state Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 15/51] packet: Do not leak dev refcounts on error exit Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 16/51] bonding: update nest level on unlink Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 17/51] ip: on queued skb use skb_header_pointer instead of pskb_may_pull Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 18/51] crypto: caam - fix zero-length buffer DMA mapping Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 19/51] crypto: authencesn - Avoid twice completion call in decrypt path Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 20/51] crypto: authenc - fix parsing key with misaligned rta_len Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 21/51] btrfs: wait on ordered extents on abort cleanup Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 22/51] Yama: Check for pid death before checking ancestry Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 23/51] scsi: core: Synchronize request queue PM status only on successful resume Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 24/51] scsi: sd: Fix cache_type_store() Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 25/51] crypto: talitos - reorder code in talitos_edesc_alloc() Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 26/51] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 27/51] mips: fix n32 compat_ipc_parse_version Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 28/51] mfd: tps6586x: Handle interrupts on suspend Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 29/51] arm64: kaslr: ensure randomized quantities are clean to the PoC Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 30/51] Disable MSI also when pcie-octeon.pcie_disable on Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 31/51] omap2fb: Fix stack memory disclosure Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 32/51] media: vivid: fix error handling of kthread_run Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 33/51] media: vivid: set min width/height to a value > 0 Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 34/51] LSM: Check for NULL cred-security on free Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 35/51] media: vb2: vb2_mmap: move lock up Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 36/51] sunrpc: handle ENOMEM in rpcb_getport_async Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 37/51] netfilter: ebtables: account ebt_table_info to kmemcg Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 38/51] selinux: fix GPF on invalid policy Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 39/51] blockdev: Fix livelocks on loop device Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 40/51] sctp: allocate sctp_sockaddr_entry with kzalloc Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 41/51] tipc: fix uninit-value in tipc_nl_compat_link_reset_stats Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 42/51] tipc: fix uninit-value in tipc_nl_compat_bearer_enable Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 43/51] tipc: fix uninit-value in tipc_nl_compat_link_set Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 44/51] tipc: fix uninit-value in tipc_nl_compat_name_table_dump Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 45/51] tipc: fix uninit-value in tipc_nl_compat_doit Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 46/51] block/loop: Use global lock for ioctl() operation Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 47/51] loop: Fold __loop_release into loop_release Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 48/51] loop: Get rid of loop_index_mutex Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 49/51] loop: Fix double mutex_unlock(&loop_ctl_mutex) in loop_control_ioctl() Greg Kroah-Hartman
2019-01-21 13:44 ` [PATCH 4.9 50/51] drm/fb-helper: Ignore the value of fb_var_screeninfo.pixclock Greg Kroah-Hartman
2019-01-21 13:44 ` Greg Kroah-Hartman [this message]
2019-01-22 13:40 ` [PATCH 4.9 00/51] 4.9.152-stable review Naresh Kamboju
2019-01-22 20:44 ` Guenter Roeck
2019-01-22 22:39 ` shuah
2019-01-23  9:06 ` Jon Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190121122458.832467342@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=bo.liu@linux.alibaba.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=jack@suse.cz \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=shakeelb@google.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).