From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S941303AbcJXUaS (ORCPT );
	Mon, 24 Oct 2016 16:30:18 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:60842 "EHLO
	gum.cmpxchg.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S935599AbcJXUaO (ORCPT );
	Mon, 24 Oct 2016 16:30:14 -0400
From: Johannes Weiner
To: Andrew Morton
Cc: Michal Hocko , Vladimir Davydov , Tejun Heo ,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH] mm: memcontrol: do not recurse in direct reclaim
Date: Mon, 24 Oct 2016 16:30:05 -0400
Message-Id: <20161024203005.5547-1-hannes@cmpxchg.org>
X-Mailer: git-send-email 2.10.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

  [...]
  [] btrfs_releasepage+0x2c/0x30
  [] try_to_release_page+0x32/0x50
  [] shrink_page_list+0x6da/0x7a0
  [] shrink_inactive_list+0x1e5/0x510
  [] shrink_lruvec+0x605/0x7f0
  [] shrink_zone+0xee/0x320
  [] do_try_to_free_pages+0x174/0x440
  [] try_to_free_mem_cgroup_pages+0xa7/0x130
  [] try_charge+0x17b/0x830
  [] memcg_charge_kmem+0x40/0x80
  [] new_slab+0x2d9/0x5a0
  [] __slab_alloc+0x2fd/0x44f
  [] kmem_cache_alloc+0x193/0x1e0
  [] alloc_extent_state+0x21/0xc0
  [] __clear_extent_bit+0x2b5/0x400
  [] try_release_extent_mapping+0x1a3/0x220
  [] __btrfs_releasepage+0x31/0x70
  [] btrfs_releasepage+0x2c/0x30
  [] try_to_release_page+0x32/0x50
  [] shrink_page_list+0x6da/0x7a0
  [] shrink_inactive_list+0x1e5/0x510
  [] shrink_lruvec+0x605/0x7f0
  [] shrink_zone+0xee/0x320
  [] do_try_to_free_pages+0x174/0x440
  [] try_to_free_mem_cgroup_pages+0xa7/0x130
  [] try_charge+0x17b/0x830
  [] mem_cgroup_try_charge+0x65/0x1c0
  [] handle_mm_fault+0x117f/0x1510
  [] __do_page_fault+0x177/0x420
  [] do_page_fault+0xc/0x10
  [] page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and
that particular kmem allocation in btrfs_releasepage() is no longer
being charged and won't recurse and overrun the stack anymore. But
it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash
many times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
to avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 9 +++++----
 mm/vmscan.c     | 2 ++
 2 files changed, 7 insertions(+), 4 deletions(-)

Hey guys, can anyone think of a reason why this might not be a good
idea? We've never really needed this in the past because page reclaim
doesn't recurse into instantiating another LRU page, especially with
GFP_NOFS.
But with a wider variety of tracked allocations, it's no longer that
obvious. It seems like a risky hole to leave around.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae052b5e3315..3dac6f4ba4cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1908,13 +1908,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	/*
 	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage. Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
+	 * memory shortage. Allow dying and OOM-killed tasks to bypass
+	 * the last charges so that they can exit quickly and free
+	 * their memory. The same applies for recursing reclaimers.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+		     current->flags & PF_EXITING ||
+		     current->flags & PF_MEMALLOC))
 		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..76fda2268148 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	current->flags &= ~PF_MEMALLOC;
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
-- 
2.10.0
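
For anyone who wants to poke at the idea outside the kernel, here is a
rough userspace model of the guard described in the changelog. It is a
sketch under stated assumptions, not kernel code: struct task, cur,
model_try_charge(), model_reclaim(), simulate() and DEPTH_CAP are all
made up for illustration, and the PF_MEMALLOC value below is a
placeholder rather than the kernel's definition. The model assumes the
cgroup is always at its limit and that reclaim always triggers one
accounted allocation.

```c
#include <stdbool.h>

/* Placeholder value for illustration; the kernel defines its own. */
#define PF_MEMALLOC 0x1u
/* Hypothetical cap so the unguarded run terminates instead of
 * overrunning the stack like the real recursion did. */
#define DEPTH_CAP 8

struct task { unsigned int flags; };	/* stand-in for task_struct */

static struct task cur;			/* stand-in for "current" */
static bool guard;			/* true: behave as with the patch */
static int depth, max_depth, forced;

static void model_reclaim(void);

/* Models try_charge() for a cgroup that is already at its limit. */
static void model_try_charge(void)
{
	if (guard && (cur.flags & PF_MEMALLOC)) {
		forced++;		/* "goto force": bypass the limit */
		return;
	}
	model_reclaim();
}

/* Models try_to_free_mem_cgroup_pages(). */
static void model_reclaim(void)
{
	if (depth >= DEPTH_CAP)
		return;
	if (++depth > max_depth)
		max_depth = depth;
	if (guard)
		cur.flags |= PF_MEMALLOC;
	model_try_charge();		/* accounted allocation during reclaim */
	if (guard)
		cur.flags &= ~PF_MEMALLOC;
	depth--;
}

/* Runs one charge and returns the deepest reclaim nesting reached. */
static int simulate(bool with_guard)
{
	cur.flags = 0;
	guard = with_guard;
	depth = max_depth = forced = 0;
	model_try_charge();
	return max_depth;
}
```

In this model simulate(false) nests until it hits DEPTH_CAP, while
simulate(true) enters reclaim once and satisfies the nested charge via
the forced path.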