All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-24 20:30 ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-24 20:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

[...]
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
[<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
[<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
[<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
[<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
[<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
[<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
[<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
[<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
[<ffffffff81041cf7>] __do_page_fault+0x177/0x420
[<ffffffff81041fac>] do_page_fault+0xc/0x10
[<ffffffff817c0182>] page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and
that particular kmem allocation in btrfs_releasepage() is no longer
being charged and won't recurse and overrun the stack anymore. But
it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash
many times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
to avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 9 +++++----
 mm/vmscan.c     | 2 ++
 2 files changed, 7 insertions(+), 4 deletions(-)

Hey guys, can anyone think of a reason why this might not be a good
idea? We've never really needed this in the past because page reclaim
doesn't recurse into instantiating another LRU page, especially with
GFP_NOFS. But with a wider variety of tracked allocations, it's no
longer that obvious. It seems like a risky hole to leave around.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae052b5e3315..3dac6f4ba4cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1908,13 +1908,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	/*
 	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
+	 * memory shortage. Allow dying and OOM-killed tasks to bypass
+	 * the last charges so that they can exit quickly and free
+	 * their memory. The same applies for recursing reclaimers.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+		     current->flags & PF_EXITING ||
+		     current->flags & PF_MEMALLOC))
 		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..76fda2268148 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	current->flags &= ~PF_MEMALLOC;
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-24 20:30 ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-24 20:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

[...]
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
[<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
[<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
[<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
[<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
[<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
[<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
[<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
[<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
[<ffffffff81041cf7>] __do_page_fault+0x177/0x420
[<ffffffff81041fac>] do_page_fault+0xc/0x10
[<ffffffff817c0182>] page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and
that particular kmem allocation in btrfs_releasepage() is no longer
being charged and won't recurse and overrun the stack anymore. But
it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash
many times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
to avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 9 +++++----
 mm/vmscan.c     | 2 ++
 2 files changed, 7 insertions(+), 4 deletions(-)

Hey guys, can anyone think of a reason why this might not be a good
idea? We've never really needed this in the past because page reclaim
doesn't recurse into instantiating another LRU page, especially with
GFP_NOFS. But with a wider variety of tracked allocations, it's no
longer that obvious. It seems like a risky hole to leave around.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae052b5e3315..3dac6f4ba4cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1908,13 +1908,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	/*
 	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
+	 * memory shortage. Allow dying and OOM-killed tasks to bypass
+	 * the last charges so that they can exit quickly and free
+	 * their memory. The same applies for recursing reclaimers.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+		     current->flags & PF_EXITING ||
+		     current->flags & PF_MEMALLOC))
 		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..76fda2268148 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	current->flags &= ~PF_MEMALLOC;
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
  2016-10-24 20:30 ` Johannes Weiner
@ 2016-10-25  9:07   ` Michal Hocko
  -1 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25  9:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Mon 24-10-16 16:30:05, Johannes Weiner wrote:
> On 4.0, we saw a stack corruption from a page fault entering direct
> memory cgroup reclaim, calling into btrfs_releasepage(), which then
> tried to allocate an extent and recursed back into a kmem charge ad
> nauseam:
> 
> [...]
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
> [<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
> [<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
> [<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
> [<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
> [<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
> [<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
> [<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
> [<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
> [<ffffffff81041cf7>] __do_page_fault+0x177/0x420
> [<ffffffff81041fac>] do_page_fault+0xc/0x10
> [<ffffffff817c0182>] page_fault+0x22/0x30
> 
> On later kernels, kmem charging is opt-in rather than opt-out, and
> that particular kmem allocation in btrfs_releasepage() is no longer
> being charged and won't recurse and overrun the stack anymore. But
> it's not impossible for an accounted allocation to happen from the
> memcg direct reclaim context, and we needed to reproduce this crash
> many times before we even got a useful stack trace out of it.

I agree that stack overruns are really nasty to debug. Been there done
that, hate it...

I would argue that we shouldn't account arbitrary objects and most of
them should be directly related to a user visible API which shouldn't
happen from the reclaim path. But the reality is that we reuse the code
and we can easily end up in the situation similat to the one above so I
agree that being more careful is definitely worth it.

> Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> to avoid recursing into any other form of direct reclaim. Then let
> recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Yes, as long as the allocation is still properly accounted then this is
the right way to go. We can breach the limit already. The outer reclaim
loop would push back to the limit anyway.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below

> ---
>  mm/memcontrol.c | 9 +++++----
>  mm/vmscan.c     | 2 ++
>  2 files changed, 7 insertions(+), 4 deletions(-)
> 
> Hey guys, can anyone think of a reason why this might not be a good
> idea? We've never really needed this in the past because page reclaim
> doesn't recurse into instantiating another LRU page, especially with
> GFP_NOFS. But with a wider variety of tracked allocations, it's no
> longer that obvious. It seems like a risky hole to leave around.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..3dac6f4ba4cf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1908,13 +1908,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	/*
>  	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> +	 * memory shortage. Allow dying and OOM-killed tasks to bypass
> +	 * the last charges so that they can exit quickly and free
> +	 * their memory. The same applies for recursing reclaimers.
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
>  		     fatal_signal_pending(current) ||
> -		     current->flags & PF_EXITING))
> +		     current->flags & PF_EXITING ||
> +		     current->flags & PF_MEMALLOC))
>  		goto force;

I would prefer to have the PF_MEMALLOC condition in a check on its own
with a short explanation that we really do not want to recurse to the
reclaim due to stack overflows.
  
>  	if (unlikely(task_in_memcg_oom(current)))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744f926af442..76fda2268148 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					    sc.gfp_mask,
>  					    sc.reclaim_idx);
>  
> +	current->flags |= PF_MEMALLOC;
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> +	current->flags &= ~PF_MEMALLOC;
>  
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  
> -- 
> 2.10.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25  9:07   ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25  9:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Mon 24-10-16 16:30:05, Johannes Weiner wrote:
> On 4.0, we saw a stack corruption from a page fault entering direct
> memory cgroup reclaim, calling into btrfs_releasepage(), which then
> tried to allocate an extent and recursed back into a kmem charge ad
> nauseam:
> 
> [...]
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
> [<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
> [<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
> [<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
> [<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
> [<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
> [<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
> [<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
> [<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
> [<ffffffff81041cf7>] __do_page_fault+0x177/0x420
> [<ffffffff81041fac>] do_page_fault+0xc/0x10
> [<ffffffff817c0182>] page_fault+0x22/0x30
> 
> On later kernels, kmem charging is opt-in rather than opt-out, and
> that particular kmem allocation in btrfs_releasepage() is no longer
> being charged and won't recurse and overrun the stack anymore. But
> it's not impossible for an accounted allocation to happen from the
> memcg direct reclaim context, and we needed to reproduce this crash
> many times before we even got a useful stack trace out of it.

I agree that stack overruns are really nasty to debug. Been there done
that, hate it...

I would argue that we shouldn't account arbitrary objects and most of
them should be directly related to a user visible API which shouldn't
happen from the reclaim path. But the reality is that we reuse the code
and we can easily end up in the situation similat to the one above so I
agree that being more careful is definitely worth it.

> Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> to avoid recursing into any other form of direct reclaim. Then let
> recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Yes, as long as the allocation is still properly accounted then this is
the right way to go. We can breach the limit already. The outer reclaim
loop would push back to the limit anyway.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below

> ---
>  mm/memcontrol.c | 9 +++++----
>  mm/vmscan.c     | 2 ++
>  2 files changed, 7 insertions(+), 4 deletions(-)
> 
> Hey guys, can anyone think of a reason why this might not be a good
> idea? We've never really needed this in the past because page reclaim
> doesn't recurse into instantiating another LRU page, especially with
> GFP_NOFS. But with a wider variety of tracked allocations, it's no
> longer that obvious. It seems like a risky hole to leave around.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..3dac6f4ba4cf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1908,13 +1908,14 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	/*
>  	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> +	 * memory shortage. Allow dying and OOM-killed tasks to bypass
> +	 * the last charges so that they can exit quickly and free
> +	 * their memory. The same applies for recursing reclaimers.
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
>  		     fatal_signal_pending(current) ||
> -		     current->flags & PF_EXITING))
> +		     current->flags & PF_EXITING ||
> +		     current->flags & PF_MEMALLOC))
>  		goto force;

I would prefer to have the PF_MEMALLOC condition in a check on its own
with a short explanation that we really do not want to recurse to the
reclaim due to stack overflows.
  
>  	if (unlikely(task_in_memcg_oom(current)))
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744f926af442..76fda2268148 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					    sc.gfp_mask,
>  					    sc.reclaim_idx);
>  
> +	current->flags |= PF_MEMALLOC;
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> +	current->flags &= ~PF_MEMALLOC;
>  
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  
> -- 
> 2.10.0

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
  2016-10-25  9:07   ` Michal Hocko
  (?)
@ 2016-10-25 14:10     ` Johannes Weiner
  -1 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-25 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Oct 25, 2016 at 11:07:47AM +0200, Michal Hocko wrote:
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you.

> I would prefer to have the PF_MEMALLOC condition in a check on its own
> with a short explanation that we really do not want to recurse to the
> reclaim due to stack overflows.

Okay, fair enough. I also added why we prefer temporarily breaching
the limit over failing the allocation. How is this?

>From 9034d2e6a21036774df9a8e021511720cf432c82 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 24 Oct 2016 16:01:55 -0400
Subject: [PATCH] mm: memcontrol: do not recurse in direct reclaim

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

[...]
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
[<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
[<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
[<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
[<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
[<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
[<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
[<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
[<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
[<ffffffff81041cf7>] __do_page_fault+0x177/0x420
[<ffffffff81041fac>] do_page_fault+0xc/0x10
[<ffffffff817c0182>] page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and
that particular kmem allocation in btrfs_releasepage() is no longer
being charged and won't recurse and overrun the stack anymore. But
it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash
many times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
to avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 9 +++++++++
 mm/vmscan.c     | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae052b5e3315..0f870ba43942 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1917,6 +1917,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     current->flags & PF_EXITING))
 		goto force;
 
+	/*
+	 * Prevent unbounded recursion when reclaim operations need to
+	 * allocate memory. This might exceed the limits temporarily,
+	 * but we prefer facilitating memory reclaim and getting back
+	 * under the limit over triggering OOM kills in these cases.
+	 */
+	if (unlikely(current->flags & PF_MEMALLOC))
+		goto force;
+
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..76fda2268148 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	current->flags &= ~PF_MEMALLOC;
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25 14:10     ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-25 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Oct 25, 2016 at 11:07:47AM +0200, Michal Hocko wrote:
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you.

> I would prefer to have the PF_MEMALLOC condition in a check on its own
> with a short explanation that we really do not want to recurse to the
> reclaim due to stack overflows.

Okay, fair enough. I also added why we prefer temporarily breaching
the limit over failing the allocation. How is this?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25 14:10     ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-25 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Oct 25, 2016 at 11:07:47AM +0200, Michal Hocko wrote:
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you.

> I would prefer to have the PF_MEMALLOC condition in a check on its own
> with a short explanation that we really do not want to recurse to the
> reclaim due to stack overflows.

Okay, fair enough. I also added why we prefer temporarily breaching
the limit over failing the allocation. How is this?

From 9034d2e6a21036774df9a8e021511720cf432c82 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 24 Oct 2016 16:01:55 -0400
Subject: [PATCH] mm: memcontrol: do not recurse in direct reclaim

On 4.0, we saw a stack corruption from a page fault entering direct
memory cgroup reclaim, calling into btrfs_releasepage(), which then
tried to allocate an extent and recursed back into a kmem charge ad
nauseam:

[...]
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
[<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
[<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
[<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
[<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
[<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
[<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
[<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
[<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
[<ffffffff811559a2>] try_to_release_page+0x32/0x50
[<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
[<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
[<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
[<ffffffff8116a37e>] shrink_zone+0xee/0x320
[<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
[<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
[<ffffffff811b738b>] try_charge+0x17b/0x830
[<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
[<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
[<ffffffff81041cf7>] __do_page_fault+0x177/0x420
[<ffffffff81041fac>] do_page_fault+0xc/0x10
[<ffffffff817c0182>] page_fault+0x22/0x30

On later kernels, kmem charging is opt-in rather than opt-out, and
that particular kmem allocation in btrfs_releasepage() is no longer
being charged and won't recurse and overrun the stack anymore. But
it's not impossible for an accounted allocation to happen from the
memcg direct reclaim context, and we needed to reproduce this crash
many times before we even got a useful stack trace out of it.

Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
to avoid recursing into any other form of direct reclaim. Then let
recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 9 +++++++++
 mm/vmscan.c     | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ae052b5e3315..0f870ba43942 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1917,6 +1917,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     current->flags & PF_EXITING))
 		goto force;
 
+	/*
+	 * Prevent unbounded recursion when reclaim operations need to
+	 * allocate memory. This might exceed the limits temporarily,
+	 * but we prefer facilitating memory reclaim and getting back
+	 * under the limit over triggering OOM kills in these cases.
+	 */
+	if (unlikely(current->flags & PF_MEMALLOC))
+		goto force;
+
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 744f926af442..76fda2268148 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.gfp_mask,
 					    sc.reclaim_idx);
 
+	current->flags |= PF_MEMALLOC;
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	current->flags &= ~PF_MEMALLOC;
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
-- 
2.10.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
  2016-10-25 14:10     ` Johannes Weiner
@ 2016-10-25 14:45       ` Michal Hocko
  -1 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> On Tue, Oct 25, 2016 at 11:07:47AM +0200, Michal Hocko wrote:
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thank you.
> 
> > I would prefer to have the PF_MEMALLOC condition in a check on its own
> > with a short explanation that we really do not want to recurse to the
> > reclaim due to stack overflows.
> 
> Okay, fair enough. I also added why we prefer temporarily breaching
> the limit over failing the allocation. How is this?
> 
> >From 9034d2e6a21036774df9a8e021511720cf432c82 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 24 Oct 2016 16:01:55 -0400
> Subject: [PATCH] mm: memcontrol: do not recurse in direct reclaim
> 
> On 4.0, we saw a stack corruption from a page fault entering direct
> memory cgroup reclaim, calling into btrfs_releasepage(), which then
> tried to allocate an extent and recursed back into a kmem charge ad
> nauseam:
> 
> [...]
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
> [<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
> [<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
> [<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
> [<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
> [<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
> [<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
> [<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
> [<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
> [<ffffffff81041cf7>] __do_page_fault+0x177/0x420
> [<ffffffff81041fac>] do_page_fault+0xc/0x10
> [<ffffffff817c0182>] page_fault+0x22/0x30
> 
> On later kernels, kmem charging is opt-in rather than opt-out, and
> that particular kmem allocation in btrfs_releasepage() is no longer
> being charged and won't recurse and overrun the stack anymore. But
> it's not impossible for an accounted allocation to happen from the
> memcg direct reclaim context, and we needed to reproduce this crash
> many times before we even got a useful stack trace out of it.
> 
> Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> to avoid recursing into any other form of direct reclaim. Then let
> recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> 

Should we mark this for stable (up to 4.5) which changed the out-out to
opt-in?

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  mm/vmscan.c     | 2 ++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..0f870ba43942 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1917,6 +1917,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		     current->flags & PF_EXITING))
>  		goto force;
>  
> +	/*
> +	 * Prevent unbounded recursion when reclaim operations need to
> +	 * allocate memory. This might exceed the limits temporarily,
> +	 * but we prefer facilitating memory reclaim and getting back
> +	 * under the limit over triggering OOM kills in these cases.
> +	 */
> +	if (unlikely(current->flags & PF_MEMALLOC))
> +		goto force;
> +

OK, sounds good to me. THanks!

>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744f926af442..76fda2268148 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					    sc.gfp_mask,
>  					    sc.reclaim_idx);
>  
> +	current->flags |= PF_MEMALLOC;
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> +	current->flags &= ~PF_MEMALLOC;
>  
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  
> -- 
> 2.10.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25 14:45       ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25 14:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> On Tue, Oct 25, 2016 at 11:07:47AM +0200, Michal Hocko wrote:
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thank you.
> 
> > I would prefer to have the PF_MEMALLOC condition in a check on its own
> > with a short explanation that we really do not want to recurse to the
> > reclaim due to stack overflows.
> 
> Okay, fair enough. I also added why we prefer temporarily breaching
> the limit over failing the allocation. How is this?
> 
> >From 9034d2e6a21036774df9a8e021511720cf432c82 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 24 Oct 2016 16:01:55 -0400
> Subject: [PATCH] mm: memcontrol: do not recurse in direct reclaim
> 
> On 4.0, we saw a stack corruption from a page fault entering direct
> memory cgroup reclaim, calling into btrfs_releasepage(), which then
> tried to allocate an extent and recursed back into a kmem charge ad
> nauseam:
> 
> [...]
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bb5b0>] memcg_charge_kmem+0x40/0x80
> [<ffffffff811a96a9>] new_slab+0x2d9/0x5a0
> [<ffffffff817b2547>] __slab_alloc+0x2fd/0x44f
> [<ffffffff811a9b03>] kmem_cache_alloc+0x193/0x1e0
> [<ffffffff813801e1>] alloc_extent_state+0x21/0xc0
> [<ffffffff813820c5>] __clear_extent_bit+0x2b5/0x400
> [<ffffffff81386d03>] try_release_extent_mapping+0x1a3/0x220
> [<ffffffff813658a1>] __btrfs_releasepage+0x31/0x70
> [<ffffffff8136590c>] btrfs_releasepage+0x2c/0x30
> [<ffffffff811559a2>] try_to_release_page+0x32/0x50
> [<ffffffff81168cea>] shrink_page_list+0x6da/0x7a0
> [<ffffffff811693b5>] shrink_inactive_list+0x1e5/0x510
> [<ffffffff8116a0a5>] shrink_lruvec+0x605/0x7f0
> [<ffffffff8116a37e>] shrink_zone+0xee/0x320
> [<ffffffff8116a934>] do_try_to_free_pages+0x174/0x440
> [<ffffffff8116adf7>] try_to_free_mem_cgroup_pages+0xa7/0x130
> [<ffffffff811b738b>] try_charge+0x17b/0x830
> [<ffffffff811bbfd5>] mem_cgroup_try_charge+0x65/0x1c0
> [<ffffffff8118338f>] handle_mm_fault+0x117f/0x1510
> [<ffffffff81041cf7>] __do_page_fault+0x177/0x420
> [<ffffffff81041fac>] do_page_fault+0xc/0x10
> [<ffffffff817c0182>] page_fault+0x22/0x30
> 
> On later kernels, kmem charging is opt-in rather than opt-out, and
> that particular kmem allocation in btrfs_releasepage() is no longer
> being charged and won't recurse and overrun the stack anymore. But
> it's not impossible for an accounted allocation to happen from the
> memcg direct reclaim context, and we needed to reproduce this crash
> many times before we even got a useful stack trace out of it.
> 
> Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> to avoid recursing into any other form of direct reclaim. Then let
> recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> 

Should we mark this for stable (up to 4.5) which changed the out-out to
opt-in?

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 9 +++++++++
>  mm/vmscan.c     | 2 ++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ae052b5e3315..0f870ba43942 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1917,6 +1917,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		     current->flags & PF_EXITING))
>  		goto force;
>  
> +	/*
> +	 * Prevent unbounded recursion when reclaim operations need to
> +	 * allocate memory. This might exceed the limits temporarily,
> +	 * but we prefer facilitating memory reclaim and getting back
> +	 * under the limit over triggering OOM kills in these cases.
> +	 */
> +	if (unlikely(current->flags & PF_MEMALLOC))
> +		goto force;
> +

OK, sounds good to me. THanks!

>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 744f926af442..76fda2268148 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3043,7 +3043,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  					    sc.gfp_mask,
>  					    sc.reclaim_idx);
>  
> +	current->flags |= PF_MEMALLOC;
>  	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
> +	current->flags &= ~PF_MEMALLOC;
>  
>  	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>  
> -- 
> 2.10.0

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
  2016-10-25 14:45       ` Michal Hocko
@ 2016-10-25 15:01         ` Johannes Weiner
  -1 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-25 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Oct 25, 2016 at 04:45:44PM +0200, Michal Hocko wrote:
> On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> > Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> > to avoid recursing into any other form of direct reclaim. Then let
> > recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> 
> Should we mark this for stable (up to 4.5) which changed the out-out to
> opt-in?

Yes, good point.

Internally, we're pulling it into our 4.6 tree as well. The commit
that fixes the particular bug we encountered in btrfs is a9bb7e620efd
("memcg: only account kmem allocations marked as __GFP_ACCOUNT") in
4.5+, so you could argue that we don't need the backport in kernels
with this commit. And I'm not aware of other manifestations of this
problem. But the unbounded recursion hole is still there, technically,
so we might just want to put it into all stable kernels and be safe.

So either

Cc: <stable@vger.kernel.org>	# up to and including 4.5

or, and I'm leaning toward that, simply

Cc: <stable@vger.kernel.org>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25 15:01         ` Johannes Weiner
  0 siblings, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2016-10-25 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue, Oct 25, 2016 at 04:45:44PM +0200, Michal Hocko wrote:
> On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> > Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> > to avoid recursing into any other form of direct reclaim. Then let
> > recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> 
> Should we mark this for stable (up to 4.5) which changed the out-out to
> opt-in?

Yes, good point.

Internally, we're pulling it into our 4.6 tree as well. The commit
that fixes the particular bug we encountered in btrfs is a9bb7e620efd
("memcg: only account kmem allocations marked as __GFP_ACCOUNT") in
4.5+, so you could argue that we don't need the backport in kernels
with this commit. And I'm not aware of other manifestations of this
problem. But the unbounded recursion hole is still there, technically,
so we might just want to put it into all stable kernels and be safe.

So either

Cc: <stable@vger.kernel.org>	# up to and including 4.5

or, and I'm leaning toward that, simply

Cc: <stable@vger.kernel.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
  2016-10-25 15:01         ` Johannes Weiner
@ 2016-10-25 15:07           ` Michal Hocko
  -1 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25 15:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 25-10-16 11:01:42, Johannes Weiner wrote:
> On Tue, Oct 25, 2016 at 04:45:44PM +0200, Michal Hocko wrote:
> > On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> > > Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> > > to avoid recursing into any other form of direct reclaim. Then let
> > > recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> > 
> > Should we mark this for stable (up to 4.5) which changed the out-out to
> > opt-in?
> 
> Yes, good point.
> 
> Internally, we're pulling it into our 4.6 tree as well. The commit
> that fixes the particular bug we encountered in btrfs is a9bb7e620efd
> ("memcg: only account kmem allocations marked as __GFP_ACCOUNT") in
> 4.5+, so you could argue that we don't need the backport in kernels
> with this commit. And I'm not aware of other manifestations of this
> problem. But the unbounded recursion hole is still there, technically,
> so we might just want to put it into all stable kernels and be safe.
> 
> So either
> 
> Cc: <stable@vger.kernel.org>	# up to and including 4.5

As the patch was released in 4.5 it shouldn't be needed in 4.5 stable
tree but

> or, and I'm leaning toward that, simply
> 
> Cc: <stable@vger.kernel.org>

this sounds less confusing I guess.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] mm: memcontrol: do not recurse in direct reclaim
@ 2016-10-25 15:07           ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2016-10-25 15:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Tejun Heo, linux-mm, cgroups,
	linux-kernel, kernel-team

On Tue 25-10-16 11:01:42, Johannes Weiner wrote:
> On Tue, Oct 25, 2016 at 04:45:44PM +0200, Michal Hocko wrote:
> > On Tue 25-10-16 10:10:50, Johannes Weiner wrote:
> > > Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC
> > > to avoid recursing into any other form of direct reclaim. Then let
> > > recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
> > 
> > Should we mark this for stable (up to 4.5) which changed the out-out to
> > opt-in?
> 
> Yes, good point.
> 
> Internally, we're pulling it into our 4.6 tree as well. The commit
> that fixes the particular bug we encountered in btrfs is a9bb7e620efd
> ("memcg: only account kmem allocations marked as __GFP_ACCOUNT") in
> 4.5+, so you could argue that we don't need the backport in kernels
> with this commit. And I'm not aware of other manifestations of this
> problem. But the unbounded recursion hole is still there, technically,
> so we might just want to put it into all stable kernels and be safe.
> 
> So either
> 
> Cc: <stable@vger.kernel.org>	# up to and including 4.5

As the patch was released in 4.5 it shouldn't be needed in 4.5 stable
tree but

> or, and I'm leaning toward that, simply
> 
> Cc: <stable@vger.kernel.org>

this sounds less confusing I guess.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-10-25 15:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-24 20:30 [PATCH] mm: memcontrol: do not recurse in direct reclaim Johannes Weiner
2016-10-24 20:30 ` Johannes Weiner
2016-10-25  9:07 ` Michal Hocko
2016-10-25  9:07   ` Michal Hocko
2016-10-25 14:10   ` Johannes Weiner
2016-10-25 14:10     ` Johannes Weiner
2016-10-25 14:10     ` Johannes Weiner
2016-10-25 14:45     ` Michal Hocko
2016-10-25 14:45       ` Michal Hocko
2016-10-25 15:01       ` Johannes Weiner
2016-10-25 15:01         ` Johannes Weiner
2016-10-25 15:07         ` Michal Hocko
2016-10-25 15:07           ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.