linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg
@ 2016-12-02  9:58 Michal Hocko
  2016-12-05 12:44 ` Balbir Singh
  2016-12-09 10:13 ` Donald Buczek
  0 siblings, 2 replies; 5+ messages in thread
From: Michal Hocko @ 2016-12-02  9:58 UTC
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko, Boris Zhmurov, Christopher S. Aker, Donald Buczek,
	Paul Menzel

From: Michal Hocko <mhocko@suse.com>

Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
[17511.573645] INFO: rcu_sched detected stalls on CPUs/tasks:
[17511.573699]  23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
[17511.573740]  (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
[17511.573776] Task dump for CPU 23:
[17511.573777] kswapd1         R  running task        0   148      2 0x00000008
[17511.573781]  0000000000000000 ffff8efe5f491400 ffff8efe44523e68 ffff8f16a7f49000
[17511.573782]  0000000000000000 ffffffffafb67482 0000000000000000 0000000000000000
[17511.573784]  0000000000000000 0000000000000000 ffff8efe44523e58 00000000016dbbee
[17511.573786] Call Trace:
[17511.573796]  [<ffffffffafb67482>] ? shrink_node+0xd2/0x2f0
[17511.573798]  [<ffffffffafb683ab>] ? kswapd+0x2cb/0x6a0
[17511.573800]  [<ffffffffafb680e0>] ? mem_cgroup_shrink_node+0x160/0x160
[17511.573806]  [<ffffffffafa8b63d>] ? kthread+0xbd/0xe0
[17511.573810]  [<ffffffffafa2967a>] ? __switch_to+0x1fa/0x5c0
[17511.573813]  [<ffffffffaff9095f>] ? ret_from_fork+0x1f/0x40
[17511.573815]  [<ffffffffafa8b580>] ? kthread_create_on_node+0x180/0x180

A closer code inspection has shown that we might indeed miss all the
scheduling points in the reclaim path if no pages can be isolated from
the LRU list. This is a pathological case, but other reports from Donald
Buczek have shown that we might indeed hit such a path:
        clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
         kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
         kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
[...]
         kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

This is a minute-long snapshot which didn't take a single page from the
LRU. It is not entirely clear why only 1303 pages have been scanned
during that time (maybe there was heavy IRQ activity interfering).
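
The 1303 figure appears to be the difference between the nr_scanned
values on the last and the first kswapd1 trace lines quoted above. A
minimal C sketch of that arithmetic follows; the struct and function
names are made up for illustration and are not part of the patch or the
kernel.

```c
#include <assert.h>

/* One mm_vmscan_lru_isolate trace line, reduced to the two fields used here. */
struct isolate_sample {
	double timestamp_s;       /* trace timestamp, in seconds */
	unsigned long nr_scanned; /* nr_scanned field as printed in the trace */
};

/* Values copied from the first and the last kswapd1 lines in the trace. */
const struct isolate_sample first_sample = { 118023.987475, 4239830UL };
const struct isolate_sample last_sample  = { 118084.274403, 4241133UL };

/* Difference between the nr_scanned values of two samples. */
unsigned long nr_scanned_delta(struct isolate_sample from,
			       struct isolate_sample to)
{
	assert(to.nr_scanned >= from.nr_scanned);
	return to.nr_scanned - from.nr_scanned;
}

/* Trace time elapsed between two samples, in seconds. */
double elapsed_seconds(struct isolate_sample from, struct isolate_sample to)
{
	return to.timestamp_s - from.timestamp_s;
}
```

With the quoted values this gives 4241133 - 4239830 = 1303 over roughly
60.3 seconds of trace time, which matches the "minute long snapshot"
description.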

In any case, it looks like we can really hit long periods without
scheduling on non-preemptive kernels, so an explicit cond_resched() in
shrink_node_memcg() which is independent of the reclaim operation is due.
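
The reasoning can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not kernel code: every name
below is hypothetical, and the model assumes that the only scheduling
points sit on the path that processes isolated pages, so a round that
isolates nothing reaches none of them.

```c
#include <assert.h>

/* Hypothetical counter: how many scheduling points the loop reached. */
unsigned long yield_points;

/* Hypothetical stand-in for cond_resched(); here it only counts calls. */
void model_cond_resched(void)
{
	yield_points++;
}

/*
 * Hypothetical stand-in for one LRU shrink attempt. When nothing was
 * isolated it returns early, before any scheduling point is reached.
 */
unsigned long model_shrink_list(unsigned long nr_taken)
{
	if (nr_taken == 0)
		return 0;
	model_cond_resched(); /* processing isolated pages may yield */
	return nr_taken;
}

/*
 * Run the reclaim loop for a number of rounds and return how many
 * scheduling points were reached. unconditional_yield models the
 * cond_resched() call added by the patch.
 */
unsigned long model_reclaim_loop(unsigned long rounds,
				 unsigned long taken_per_round,
				 int unconditional_yield)
{
	unsigned long i;

	assert(rounds > 0);
	yield_points = 0;
	for (i = 0; i < rounds; i++) {
		model_shrink_list(taken_per_round);
		if (unconditional_yield)
			model_cond_resched();
	}
	return yield_points;
}
```

In this model, model_reclaim_loop(1000, 0, 0) reaches no scheduling
point at all, while model_reclaim_loop(1000, 0, 1) reaches one per
round, which is the behavior the patch is after.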

Reported-and-tested-by: Boris Zhmurov <bb@kernelpanic.ru>
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Reported-by: "Christopher S. Aker" <caker@theshore.net>
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
there were multiple reports of similar RCU stalls. Only Boris has
confirmed that this patch helps in his workload. Others might see a
slightly different issue and that should be investigated if it is the
case. As pointed out by Paul [1], cond_resched might not be sufficient
to silence RCU stalls because that would require real scheduling.
This is a separate problem, though, and Paul is working with Peter [2]
to resolve it.

Anyway, I believe that this patch should be a good start because it
really seems that nr_taken=0 during the LRU isolation can be triggered
in real life. All reporters agree that they started seeing this issue
when moving to the 4.8 kernel, which might be just a coincidence or a
different behavior of some subsystem. Well, MM has moved from zone to
node reclaim but I couldn't find any direct relation to that
change.

[1] http://lkml.kernel.org/r/20161130142955.GS3924@linux.vnet.ibm.com
[2] http://lkml.kernel.org/r/20161201124024.GB3924@linux.vnet.ibm.com

 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c05f00042430..c4abf08861d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 			}
 		}
 
+		cond_resched();
+
 		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
 			continue;
 
-- 
2.10.2


* Re: [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg
  2016-12-02  9:58 [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg Michal Hocko
@ 2016-12-05 12:44 ` Balbir Singh
  2016-12-05 12:49   ` Michal Hocko
  2016-12-09 10:13 ` Donald Buczek
  1 sibling, 1 reply; 5+ messages in thread
From: Balbir Singh @ 2016-12-05 12:44 UTC
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko, Boris Zhmurov, Christopher S. Aker,
	Donald Buczek, Paul Menzel

>
> Hi,
> there were multiple reports of the similar RCU stalls. Only Boris has
> confirmed that this patch helps in his workload. Others might see a
> slightly different issue and that should be investigated if it is the
> case. As pointed out by Paul [1] cond_resched might be not sufficient
> to silence RCU stalls because that would require a real scheduling.
> This is a separate problem, though, and Paul is working with Peter [2]
> to resolve it.
>
> Anyway, I believe that this patch should be a good start because it
> really seems that nr_taken=0 during the LRU isolation can be triggered
> in the real life. All reporters are agreeing to start seeing this issue
> when moving on to 4.8 kernel which might be just a coincidence or a
> different behavior of some subsystem. Well, MM has moved from zone to
> node reclaim but I couldn't have found any direct relation to that
> change.
>
> [1] http://lkml.kernel.org/r/20161130142955.GS3924@linux.vnet.ibm.com
> [2] http://lkml.kernel.org/r/20161201124024.GB3924@linux.vnet.ibm.com
>
>  mm/vmscan.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05f00042430..c4abf08861d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>                         }
>                 }
>
> +               cond_resched();
> +

I see a cond_resched_rcu_qs() as a part of linux next inside the while
(nr[..]) loop.
Do we need this as well?

Balbir Singh.


* Re: [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg
  2016-12-05 12:44 ` Balbir Singh
@ 2016-12-05 12:49   ` Michal Hocko
  2016-12-05 16:16     ` Paul E. McKenney
  0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2016-12-05 12:49 UTC
  To: Balbir Singh
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	linux-mm, LKML, Boris Zhmurov, Christopher S. Aker,
	Donald Buczek, Paul Menzel, Paul E. McKenney

[CC Paul - sorry I've tried to save you from more emails...]

On Mon 05-12-16 23:44:27, Balbir Singh wrote:
> >
> > Hi,
> > there were multiple reports of the similar RCU stalls. Only Boris has
> > confirmed that this patch helps in his workload. Others might see a
> > slightly different issue and that should be investigated if it is the
> > case. As pointed out by Paul [1] cond_resched might be not sufficient
> > to silence RCU stalls because that would require a real scheduling.
> > This is a separate problem, though, and Paul is working with Peter [2]
> > to resolve it.
> >
> > Anyway, I believe that this patch should be a good start because it
> > really seems that nr_taken=0 during the LRU isolation can be triggered
> > in the real life. All reporters are agreeing to start seeing this issue
> > when moving on to 4.8 kernel which might be just a coincidence or a
> > different behavior of some subsystem. Well, MM has moved from zone to
> > node reclaim but I couldn't have found any direct relation to that
> > change.
> >
> > [1] http://lkml.kernel.org/r/20161130142955.GS3924@linux.vnet.ibm.com
> > [2] http://lkml.kernel.org/r/20161201124024.GB3924@linux.vnet.ibm.com
> >
> >  mm/vmscan.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c05f00042430..c4abf08861d2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> >                         }
> >                 }
> >
> > +               cond_resched();
> > +
> 
> I see a cond_resched_rcu_qs() as a part of linux next inside the while
> (nr[..]) loop.

This is a leftover from Paul's initial attempt to fix this issue. I
expect him to drop his patch from his tree. He has considered it
experimental anyway.

> Do we need this as well?

Paul is working with Peter to make cond_resched more general so that it
covers RCU stalls even when cond_resched doesn't schedule because there
is no runnable task.
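
The distinction can be sketched with a toy userspace model. It is only
an illustration under assumptions: all names below are hypothetical, a
"quiescent state report" is reduced to a counter, and the second helper
models the general idea behind cond_resched_rcu_qs() (tell RCU about a
quiescent state even when no task switch happened), not its actual
implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU state; every field is part of the model, not of the kernel. */
struct toy_cpu {
	bool need_resched;            /* another task is waiting to run */
	unsigned long ctx_switches;   /* context switches performed */
	unsigned long rcu_qs_reports; /* quiescent states RCU has seen */
};

/*
 * Toy cond_resched(): switch tasks only if a reschedule is pending.
 * In this model a context switch is what lets RCU see a quiescent state.
 * Returns true if a switch happened.
 */
bool toy_cond_resched(struct toy_cpu *cpu)
{
	if (!cpu->need_resched)
		return false;
	cpu->need_resched = false;
	cpu->ctx_switches++;
	cpu->rcu_qs_reports++;
	return true;
}

/*
 * Toy variant that reports a quiescent state explicitly when no
 * switch took place, so RCU makes progress on an otherwise idle CPU.
 */
void toy_cond_resched_rcu_qs(struct toy_cpu *cpu)
{
	if (!toy_cond_resched(cpu))
		cpu->rcu_qs_reports++;
}
```

In the model, a CPU with nothing else to run never reports a quiescent
state through toy_cond_resched() alone, which mirrors the concern that
a plain cond_resched() may not silence the stall warning.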

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg
  2016-12-05 12:49   ` Michal Hocko
@ 2016-12-05 16:16     ` Paul E. McKenney
  0 siblings, 0 replies; 5+ messages in thread
From: Paul E. McKenney @ 2016-12-05 16:16 UTC
  To: Michal Hocko
  Cc: Balbir Singh, Andrew Morton, Mel Gorman, Johannes Weiner,
	Vlastimil Babka, linux-mm, LKML, Boris Zhmurov,
	Christopher S. Aker, Donald Buczek, Paul Menzel

On Mon, Dec 05, 2016 at 01:49:55PM +0100, Michal Hocko wrote:
> [CC Paul - sorry I've tried to save you from more emails...]
> 
> On Mon 05-12-16 23:44:27, Balbir Singh wrote:
> > >
> > > Hi,
> > > there were multiple reports of the similar RCU stalls. Only Boris has
> > > confirmed that this patch helps in his workload. Others might see a
> > > slightly different issue and that should be investigated if it is the
> > > case. As pointed out by Paul [1] cond_resched might be not sufficient
> > > to silence RCU stalls because that would require a real scheduling.
> > > This is a separate problem, though, and Paul is working with Peter [2]
> > > to resolve it.
> > >
> > > Anyway, I believe that this patch should be a good start because it
> > > really seems that nr_taken=0 during the LRU isolation can be triggered
> > > in the real life. All reporters are agreeing to start seeing this issue
> > > when moving on to 4.8 kernel which might be just a coincidence or a
> > > different behavior of some subsystem. Well, MM has moved from zone to
> > > node reclaim but I couldn't have found any direct relation to that
> > > change.
> > >
> > > [1] http://lkml.kernel.org/r/20161130142955.GS3924@linux.vnet.ibm.com
> > > [2] http://lkml.kernel.org/r/20161201124024.GB3924@linux.vnet.ibm.com
> > >
> > >  mm/vmscan.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index c05f00042430..c4abf08861d2 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
> > >                         }
> > >                 }
> > >
> > > +               cond_resched();
> > > +
> > 
> > I see a cond_resched_rcu_qs() as a part of linux next inside the while
> > (nr[..]) loop.
> 
> This is a leftover from Paul's initial attempt to fix this issue. I
> expect him to drop his patch from his tree. He has considered it
> experimental anyway.

To prevent further confusion, I am dropping these patches from my tree:

80c099e11c19 ("mm: Prevent shrink_node() RCU CPU stall warnings")
34c53f5cd399 ("mm: Prevent shrink_node_memcg() RCU CPU stall warnings")

If you need them, please feel free to pull them in.

Given that I don't have those, I am dropping this one as well:

f2a471ffc8a8 ("rcu: Allow boot-time use of cond_resched_rcu_qs()")

If you need it, please let me know.

> > Do we need this as well?
> 
> Paul is working with Peter to make cond_resched general and cover RCU
> stalls even when cond_resched doesn't schedule because there is no
> runnable task.

And 0day just told me that my current attempt gets a 227% increase in
context switches on the unlink tests in LTP, so back to the drawing
board...

						Thanx, Paul


* Re: [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg
  2016-12-02  9:58 [PATCH] mm, vmscan: add cond_resched into shrink_node_memcg Michal Hocko
  2016-12-05 12:44 ` Balbir Singh
@ 2016-12-09 10:13 ` Donald Buczek
  1 sibling, 0 replies; 5+ messages in thread
From: Donald Buczek @ 2016-12-09 10:13 UTC
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko, Boris Zhmurov, Christopher S. Aker, Paul Menzel

On 12/02/16 10:58, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
> [17511.573645] INFO: rcu_sched detected stalls on CPUs/tasks:
> [17511.573699]  23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
> [17511.573740]  (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
> [17511.573776] Task dump for CPU 23:
> [17511.573777] kswapd1         R  running task        0   148      2 0x00000008
> [17511.573781]  0000000000000000 ffff8efe5f491400 ffff8efe44523e68 ffff8f16a7f49000
> [17511.573782]  0000000000000000 ffffffffafb67482 0000000000000000 0000000000000000
> [17511.573784]  0000000000000000 0000000000000000 ffff8efe44523e58 00000000016dbbee
> [17511.573786] Call Trace:
> [17511.573796]  [<ffffffffafb67482>] ? shrink_node+0xd2/0x2f0
> [17511.573798]  [<ffffffffafb683ab>] ? kswapd+0x2cb/0x6a0
> [17511.573800]  [<ffffffffafb680e0>] ? mem_cgroup_shrink_node+0x160/0x160
> [17511.573806]  [<ffffffffafa8b63d>] ? kthread+0xbd/0xe0
> [17511.573810]  [<ffffffffafa2967a>] ? __switch_to+0x1fa/0x5c0
> [17511.573813]  [<ffffffffaff9095f>] ? ret_from_fork+0x1f/0x40
> [17511.573815]  [<ffffffffafa8b580>] ? kthread_create_on_node+0x180/0x180
>
> a closer code inspection has shown that we might indeed miss all the
> scheduling points in the reclaim path if no pages can be isolated from
> the LRU list. This is a pathological case but other reports from Donald
> Buczek have shown that we might indeed hit such a path:
>          clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
>           kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
>           kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
> [...]
>           kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
>
> this is minute long snapshot which didn't take a single page from the
> LRU. It is not entirely clear why only 1303 pages have been scanned
> during that time (maybe there was a heavy IRQ activity interfering).
>
> In any case it looks like we can really hit long periods without
> scheduling on non preemptive kernels so an explicit cond_resched() in
> shrink_node_memcg which is independent on the reclaim operation is due.
>
> Reported-and-tested-by: Boris Zhmurov <bb@kernelpanic.ru>
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>
> Reported-by: "Christopher S. Aker" <caker@theshore.net>
> Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> there were multiple reports of the similar RCU stalls. Only Boris has
> confirmed that this patch helps in his workload. Others might see a
> slightly different issue and that should be investigated if it is the
> case. As pointed out by Paul [1] cond_resched might be not sufficient
> to silence RCU stalls because that would require a real scheduling.
> This is a separate problem, though, and Paul is working with Peter [2]
> to resolve it.
>
> Anyway, I believe that this patch should be a good start because it
> really seems that nr_taken=0 during the LRU isolation can be triggered
> in the real life. All reporters are agreeing to start seeing this issue
> when moving on to 4.8 kernel which might be just a coincidence or a
> different behavior of some subsystem. Well, MM has moved from zone to
> node reclaim but I couldn't have found any direct relation to that
> change.
>
> [1] http://lkml.kernel.org/r/20161130142955.GS3924@linux.vnet.ibm.com
> [2] http://lkml.kernel.org/r/20161201124024.GB3924@linux.vnet.ibm.com
>
>   mm/vmscan.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c05f00042430..c4abf08861d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2362,6 +2362,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>   			}
>   		}
>   
> +		cond_resched();
> +
>   		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
>   			continue;
>   

Our two backup servers, which have had RCU stall warnings since 4.8, have
been running with this patch on top of v4.8.12 for 3 1/2 days now and
haven't logged any RCU stalls since then. So this patch might be fixing
it for our environment, too.

The previous times between boots and first occurrences of rcu stall 
warnings were:

Server A ("void"): 1d14h 5h 1d4h 2d2h 21h 3d21h
Server B ("null"): 3d12h 2d3h 5d4h 4h 12h
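
For a rough feel of how the stall-free run compares with these
intervals, the list can be converted to hours and summarized. The C
sketch below is purely illustrative: the helper names are made up, and
the 84-hour threshold is an assumption standing in for "3 1/2 days".

```c
#include <assert.h>
#include <stddef.h>

/* Convert a "<days>d<hours>h" interval to hours. */
#define HOURS(days, hours) ((days) * 24 + (hours))

/* Time from boot to the first RCU stall warning, per boot, in hours. */
const int server_void_hours[] = {
	HOURS(1, 14), HOURS(0, 5), HOURS(1, 4),
	HOURS(2, 2), HOURS(0, 21), HOURS(3, 21),
};
const int server_null_hours[] = {
	HOURS(3, 12), HOURS(2, 3), HOURS(5, 4), HOURS(0, 4), HOURS(0, 12),
};

/* Sum of all intervals, in hours. */
int total_hours(const int *intervals, size_t count)
{
	int total = 0;
	size_t i;

	assert(intervals != NULL || count == 0);
	for (i = 0; i < count; i++)
		total += intervals[i];
	return total;
}

/* Number of intervals that lasted at least threshold_hours. */
size_t count_at_least(const int *intervals, size_t count, int threshold_hours)
{
	size_t hits = 0;
	size_t i;

	assert(intervals != NULL || count == 0);
	for (i = 0; i < count; i++)
		if (intervals[i] >= threshold_hours)
			hits++;
	return hits;
}
```

With the values listed above, the mean time to the first warning works
out to about 39 hours for "void" (235 h over 6 boots) and 55 hours for
"null" (275 h over 5 boots). Only 1 of 6 and 2 of 5 earlier boots,
respectively, went 84 hours or longer without a warning.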

(Yes, this contradicts a previous mail from me, where I wrongly stated
"37,0.2,1,2,0.8 hours" for the first server, because I messed up the
units. It's "37 hours, 0.2 days, 1 day, 2 days, 0.8 days", which fits the
first 5 numbers in the above list. Sorry.)

We should wait a few days longer for a better p-value but there is 
reason for hope.

Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

