* Cgroup memory barrier usage and call frequency from scheduler
@ 2020-04-09 15:44 Mel Gorman
  2020-04-09 16:49 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mel Gorman @ 2020-04-09 15:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Ingo Molnar, Davidlohr Bueso, LKML

Hi Tejun,

Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window") introduced two full memory
barriers to close a race. The one in cgroup_rstat_updated can be
called at a high frequency from the scheduler via update_curr ->
cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
not sure how closely this was looked at. cgroup_rstat_updated shows up
in profiles of netperf UDP_STREAM accounting for about 1% of overhead
which doesn't sound like a lot but is about the same weight as some of
the critical network paths. I have three questions about the patch:

1. Why were full barriers used?
2. Why was it important that the data race be closed when the inaccuracy
   is temporary?
3. Why is it called from the context of update_curr()?

For 1, the use of a full barrier seems unnecessary when it appears that
you could have used a read barrier and a write barrier. The following
patch drops the profile overhead to 0.1%

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ca19b4c8acf5..bc3125949b4b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -36,7 +36,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 	 * Paired with the one in cgroup_rstat_cpu_pop_upated().  Either we
 	 * see NULL updated_next or they see our updated stat.
 	 */
-	smp_mb();
+	smp_rmb();
 
 	/*
 	 * Because @parent's updated_children is terminated with @parent
@@ -139,7 +139,7 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 		 * Either they see NULL updated_next or we see their
 		 * updated stat.
 		 */
-		smp_mb();
+		smp_wmb();
 
 		return pos;
 	}

For 2, the changelog says the barriers are necessary because "we plan to use
rstat to track counters which need to be accurate". That is a bit vague.
Under what circumstances is a transient inaccuracy a serious enough
problem to justify additional barriers in the scheduler?

For 3, update_curr() is called from a lot of places, some of which are
quite hot -- e.g. task enqueue/dequeue. That is necessary because
information from the runqueue needs to be preserved. However, it's less
clear that the cpu accounting information needs to be up to date at this
granularity, although it might be related to question 2. Why was the
delta_exec not similarly accumulated in cpuacct_charge(), deferring the
hierarchical update to be called from somewhere like entity_tick()? It
would need to track the CPU time at the last update as delta_exec would
otherwise be lost, so it's not entirely trivial, but it does not look
like it would be overly complicated.
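
To make that concrete, something along the following lines is what I
have in mind. It is only a rough sketch to illustrate the shape of it --
not even compile-tested -- and the per-cpu accumulator and flush helper
are invented names rather than existing interfaces. In reality the
pending delta would have to be tracked per cgroup rather than globally,
which is where remembering the CPU time at the last update comes in:

	/* invented per-cpu accumulator; not an existing kernel structure */
	struct cpuacct_pending {
		u64 delta_exec;
	};
	static DEFINE_PER_CPU(struct cpuacct_pending, cpuacct_pending);

	/* hot path: called from update_curr(); no barriers, no tree walk */
	static inline void cpuacct_accumulate(u64 delta_exec)
	{
		__this_cpu_add(cpuacct_pending.delta_exec, delta_exec);
	}

	/* deferred path: called from somewhere like entity_tick() */
	static void cpuacct_flush(struct cgroup *cgrp)
	{
		u64 delta = __this_cpu_read(cpuacct_pending.delta_exec);

		if (!delta)
			return;

		__this_cpu_write(cpuacct_pending.delta_exec, 0);
		/* existing hierarchical update, now fed the batched delta */
		__cgroup_account_cputime(cgrp, delta);
	}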

Thanks

-- 
Mel Gorman
SUSE Labs


* Re: Cgroup memory barrier usage and call frequency from scheduler
  2020-04-09 15:44 Cgroup memory barrier usage and call frequency from scheduler Mel Gorman
@ 2020-04-09 16:49 ` Peter Zijlstra
  2020-04-09 17:21   ` Mel Gorman
  2020-04-09 17:56 ` Tejun Heo
  2020-04-09 19:08 ` [PATCH cgroup/for-5.7-fixes] Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window" Tejun Heo
  2 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2020-04-09 16:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Tejun Heo, Ingo Molnar, Davidlohr Bueso, LKML

On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:

> For 1, the use of a full barrier seems unnecessary when it appears that
> you could have used a read barrier and a write barrier. The following
> patch drops the profile overhead to 0.1%

Yikes. And why still 0.1%? The below should be a barrier() on x86. Is the
compiler so constrained by that?

> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index ca19b4c8acf5..bc3125949b4b 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -36,7 +36,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>  	 * Paired with the one in cgroup_rstat_cpu_pop_upated().  Either we
>  	 * see NULL updated_next or they see our updated stat.
>  	 */
> -	smp_mb();
> +	smp_rmb();
>  
>  	/*
>  	 * Because @parent's updated_children is terminated with @parent
> @@ -139,7 +139,7 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
>  		 * Either they see NULL updated_next or we see their
>  		 * updated stat.
>  		 */
> -		smp_mb();
> +		smp_wmb();
>  
>  		return pos;
>  	}


* Re: Cgroup memory barrier usage and call frequency from scheduler
  2020-04-09 16:49 ` Peter Zijlstra
@ 2020-04-09 17:21   ` Mel Gorman
  0 siblings, 0 replies; 7+ messages in thread
From: Mel Gorman @ 2020-04-09 17:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Tejun Heo, Ingo Molnar, Davidlohr Bueso, LKML

On Thu, Apr 09, 2020 at 06:49:19PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:
> 
> > For 1, the use of a full barrier seems unnecessary when it appears that
> > you could have used a read barrier and a write barrier. The following
> > patch drops the profile overhead to 0.1%
> 
> Yikes. And why still 0.1%? The below should be a barrier() on x86. Is the
> compiler so constrained by that?
> 

The 0.1% is from still doing all the work up until just after the barrier,
ending with this check:

	if (cgroup_rstat_cpu(cgrp, cpu)->updated_next)
		return;

That must often be true as samples were not gathered in the rest of the
function. As this function is called on every update_curr(), it gets
called a lot.
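
In other words, with the smp_rmb() conversion the common case is roughly
the following -- a paraphrase of the relevant part of the function, not
the literal code:

	void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
	{
		...
		smp_rmb();			/* the barrier, paid on every call */

		if (cgroup_rstat_cpu(cgrp, cpu)->updated_next)
			return;			/* common case: already queued */

		/* rare case: link @cgrp onto the per-cpu updated list */
		...
	}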

-- 
Mel Gorman
SUSE Labs


* Re: Cgroup memory barrier usage and call frequency from scheduler
  2020-04-09 15:44 Cgroup memory barrier usage and call frequency from scheduler Mel Gorman
  2020-04-09 16:49 ` Peter Zijlstra
@ 2020-04-09 17:56 ` Tejun Heo
  2020-04-09 18:20   ` Mel Gorman
  2020-04-09 19:08 ` [PATCH cgroup/for-5.7-fixes] Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window" Tejun Heo
  2 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2020-04-09 17:56 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Davidlohr Bueso, LKML

Hello, Mel.

On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:
> Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
> cgroup_rstat_updated() race window") introduced two full memory
> barriers to close a race. The one in cgroup_rstat_updated can be
> called at a high frequency from the scheduler via update_curr ->
> cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
> not sure how closely this was looked at. cgroup_rstat_updated shows up
> in profiles of netperf UDP_STREAM accounting for about 1% of overhead

Oops, that's pretty high.

> which doesn't sound like a lot but is about the same weight as some of
> the critical network paths. I have three questions about the patch:
> 
> 1. Why were full barriers used?

Given

   A    C
  ---  ---
   B    D

the code is trying to guarantee that either B sees C or D sees A, so it does
need full ordering.
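
Spelling the same thing out with simplified stand-in variables (just an
illustration of the ordering pattern, not the actual rstat code):

	/* updater side (cgroup_rstat_updated) */
	WRITE_ONCE(stat, v);			/* A: update the local stat */
	smp_mb();
	r0 = READ_ONCE(updated_next);		/* B: am I already queued? */

	/* reader side (cgroup_rstat_cpu_pop_updated) */
	WRITE_ONCE(updated_next, NULL);		/* C: take it off the list */
	smp_mb();
	r1 = READ_ONCE(stat);			/* D: read the stat */

The outcome that has to be forbidden is B missing C (updated_next still
set, so the updater skips requeueing itself) while D also misses A (the
reader returns the stale value). That's the classic store-buffering
pattern: smp_wmb()/smp_rmb() only order store->store and load->load
respectively, and neither orders the store before the later load on
either side, which is why full barriers are needed in both paths.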

> 2. Why was it important that the data race be closed when the inaccuracy
>    is temporary?

There was a pending patchset which converted memcg to use rstat and the
conversion included the event counters which needed to be synchronous (e.g.
for things like oom kill counts). The patchset didn't make it through due to
the percpu memory overhead at the time. The memory overhead issue can be
resolved now but in the meantime memcg got improved in a different way which
made the rstat conversion not immediately necessary, so it fell through the
cracks. In retrospect, this patch shouldn't have been committed on its own or
at least the synchronous and pure state update paths should have been
separate.

> 3. Why is it called from the context of update_curr()?

It's just being called from the path which updates sched statistics.

> For 1, the use of a full barrier seems unnecessary when it appears that
> you could have used a read barrier and a write barrier. The following
> patch drops the profile overhead to 0.1%

I'm not sure this is correct but that's kinda irrelevant.

> For 2, the changelog says the barriers are necessary because "we plan to use
> rstat to track counters which need to be accurate". That is a bit vague.
> Under what circumstances is a transient inaccuracy a serious enough
> problem to justify additional barriers in the scheduler?

Hope this is explained now.

> For 3, update_curr() is called from a lot of places, some of which are
> quite hot -- e.g. task enqueue/dequeue. That is necessary because
> information from the runqueue needs to be preserved. However, it's less
> clear that the cpu accounting information needs to be up to date at this
> granularity, although it might be related to question 2. Why was the
> delta_exec not similarly accumulated in cpuacct_charge(), deferring the
> hierarchical update to be called from somewhere like entity_tick()? It
> would need to track the CPU time at the last update as delta_exec would
> otherwise be lost, so it's not entirely trivial, but it does not look
> like it would be overly complicated.

Most likely historic. The code has been there for a long time and the only
recent changes were plumbing around them. Nothing in cpuacct needs to be
per-scheduling-event accurate, so yeah, for the longer term, it'd be a good
idea to move them out of the hot path.

For now, I'll revert the patch. Nothing in tree needs that right now. If the
need for synchronous counting comes back later, I'll make that a separate
path.

Thanks.

-- 
tejun


* Re: Cgroup memory barrier usage and call frequency from scheduler
  2020-04-09 17:56 ` Tejun Heo
@ 2020-04-09 18:20   ` Mel Gorman
  2020-04-09 19:27     ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Mel Gorman @ 2020-04-09 18:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, Ingo Molnar, Davidlohr Bueso, LKML

On Thu, Apr 09, 2020 at 01:56:21PM -0400, Tejun Heo wrote:
> Hello, Mel.
> 
> On Thu, Apr 09, 2020 at 04:44:13PM +0100, Mel Gorman wrote:
> > Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
> > cgroup_rstat_updated() race window") introduced two full memory
> > barriers to close a race. The one in cgroup_rstat_updated can be
> > called at a high frequency from the scheduler via update_curr ->
> > cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
> > not sure how closely this was looked at. cgroup_rstat_updated shows up
> > in profiles of netperf UDP_STREAM accounting for about 1% of overhead
> 
> Oops, that's pretty high.
> 
> > which doesn't sound like a lot but is about the same weight as some of
> > the critical network paths. I have three questions about the patch:
> > 
> > 1. Why were full barriers used?
> 
> Given
> 
>    A    C
>   ---  ---
>    B    D
> 
> the code is trying to guarantee that either B sees C or D sees A, so it does
> need full ordering.
> 

Ok, it's still not particularly clear given where they are used and how
they relate to updated_children, but as you say later it's "kinda
irrelevant" :)

> > 2. Why was it important that the data race be closed when the inaccuracy
> >    is temporary?
> 
> There was a pending patchset which converted memcg to use rstat and the
> conversion included the event counters which needed to be synchronous (e.g.
> for things like oom kill counts). The patchset didn't make it through due to
> the percpu memory overhead at the time. The memory overhead issue can be
> resolved now but in the meantime memcg got improved in a different way which
> made the rstat conversion not immediately necessary, so it fell through the
> cracks. In retrospect, this patch shouldn't have been committed on its own or
> at least the synchronous and pure state update paths should have been
> separate.
> 

Ah, thanks for that explanation.

> > 3. Why is it called from the context of update_curr()?
> 
> It's just being called from the path which updates sched statistics.
> 
> > For 1, the use of a full barrier seems unnecessary when it appears that
> > you could have used a read barrier and a write barrier. The following
> > patch drops the profile overhead to 0.1%
> 
> I'm not sure this is correct but that's kinda irrelevant.
> 
> > For 2, the changelog says the barriers are necessary because "we plan to use
> > rstat to track counters which need to be accurate". That is a bit vague.
> > Under what circumstances is a transient inaccuracy a serious enough
> > problem to justify additional barriers in the scheduler?
> 
> Hope this is explained now.
> 

It is.

> > For 3, update_curr() is called from a lot of places, some of which are
> > quite hot -- e.g. task enqueue/dequeue. That is necessary because
> > information from the runqueue needs to be preserved. However, it's less
> > clear that the cpu accounting information needs to be up to date at this
> > granularity, although it might be related to question 2. Why was the
> > delta_exec not similarly accumulated in cpuacct_charge(), deferring the
> > hierarchical update to be called from somewhere like entity_tick()? It
> > would need to track the CPU time at the last update as delta_exec would
> > otherwise be lost, so it's not entirely trivial, but it does not look
> > like it would be overly complicated.
> 
> Most likely historic. The code has been there for a long time and the only
> recent changes were plumbing around them. Nothing in cpuacct needs to be
> per-scheduling-event accurate, so yeah, for the longer term, it'd be a good
> idea to move them out of the hot path.
> 

Even if it's a future thing, it helps me to know that the accuracy does not
have to be perfect. It means if it bothers me enough, I can take a shot
at addressing it myself without having to worry that some controller is
broken as a side-effect.

> For now, I'll revert the patch. Nothing in tree needs that right now. If the
> need for synchronous counting comes back later, I'll make that a separate
> path.
> 

That's perfect, thanks!

-- 
Mel Gorman
SUSE Labs


* [PATCH cgroup/for-5.7-fixes] Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
  2020-04-09 15:44 Cgroup memory barrier usage and call frequency from scheduler Mel Gorman
  2020-04-09 16:49 ` Peter Zijlstra
  2020-04-09 17:56 ` Tejun Heo
@ 2020-04-09 19:08 ` Tejun Heo
  2 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2020-04-09 19:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Davidlohr Bueso, LKML, cgroups

From d8ef4b38cb69d907f9b0e889c44d05fc0f890977 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 9 Apr 2020 14:55:35 -0400

This reverts commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window").

The commit was added in anticipation of memcg rstat conversion which needed
synchronous accounting for the event counters (e.g. oom kill count). However,
the conversion didn't get merged due to percpu memory overhead concern which
couldn't be addressed at the time.

Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated()
meant that every scheduling event now had to go through an additional full
barrier, and Mel Gorman noticed it as a 1% regression in the netperf
UDP_STREAM test.

There's no need to have this barrier in tree now, and even if we need
synchronous accounting in the future, the right thing to do is to separate
that out into a separate function so that hot paths which don't care about
synchronous behavior don't have to pay the overhead of the full barrier.
Let's revert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net
Cc: v4.18+
---
Applying to cgroup/for-5.7-fixes.

Thanks!

 kernel/cgroup/rstat.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 6f87352f8219..41ca996568df 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -33,12 +33,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 		return;
 
 	/*
-	 * Paired with the one in cgroup_rstat_cpu_pop_updated().  Either we
-	 * see NULL updated_next or they see our updated stat.
-	 */
-	smp_mb();
-
-	/*
+	 * Speculative already-on-list test. This may race leading to
+	 * temporary inaccuracies, which is fine.
+	 *
 	 * Because @parent's updated_children is terminated with @parent
 	 * instead of NULL, we can tell whether @cgrp is on the list by
 	 * testing the next pointer for NULL.
@@ -134,13 +131,6 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 		*nextp = rstatc->updated_next;
 		rstatc->updated_next = NULL;
 
-		/*
-		 * Paired with the one in cgroup_rstat_cpu_updated().
-		 * Either they see NULL updated_next or we see their
-		 * updated stat.
-		 */
-		smp_mb();
-
 		return pos;
 	}
 
-- 
2.25.2



* Re: Cgroup memory barrier usage and call frequency from scheduler
  2020-04-09 18:20   ` Mel Gorman
@ 2020-04-09 19:27     ` Tejun Heo
  0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2020-04-09 19:27 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Davidlohr Bueso, LKML

Hello,

On Thu, Apr 09, 2020 at 07:20:55PM +0100, Mel Gorman wrote:
> > Given
> > 
> >    A    C
> >   ---  ---
> >    B    D
> > 
> > the code is trying to guarantee that either B sees C or D sees A, so it does
> > need full ordering.
> > 
> 
> Ok, still not particularly clear given where they are used and how
> it's related to updated_children but like you say later it's "kinda
> irrelevant" :)

Hahaha, yeah, just to make sure that I am understanding it correctly:

There are two parties - an "updater" updating the local stat and trying to
queue itself on the updated list, and a "reader" walking the updated list
trying to collect all the numbers which have changed since it last walked it.

There are two misbehaviors which can result from lack of interlocking:

1. If the updater puts itself on the list and the reader takes it off the list
   and then reads a stale value, the reader may end up reporting a stale value
   to userspace, possibly breaking the ordering of events. This can be
   addressed by adding matching (data-dependency) barriers - wmb in the
   updater, rmb in the reader.

2. If the updater checking whether it's already on the list races against the
   reader clearing it from the updated list, then while the current ongoing
   read is correct, the updater may incorrectly skip adding itself to the
   updated list and the next read may miss the event. This can be addressed by
   interlocking both directions - either the reader sees the new value or the
   updater sees itself cleared off the updated list - an mb between the stat
   update and the "am I already on the updated list?" check in the updater,
   and an mb between clearing the updater off the updated list and reading
   the stats in the reader.
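
For case 1, the pairing would be the usual publish pattern -- simplified
stand-ins again, the list helpers below are invented for illustration
and are not the actual rstat code:

	/* updater */
	WRITE_ONCE(stat, v);		/* update the per-cpu value	*/
	smp_wmb();			/* publish the value ...	*/
	add_to_updated_list(cgrp);	/* ... before the linkage	*/

	/* reader */
	cgrp = pop_updated_list();	/* observe the linkage		*/
	smp_rmb();			/* ... before reading the value	*/
	v = READ_ONCE(stat);		/* guaranteed to see the update	*/

Case 2 is the store-buffering pattern described earlier in the thread,
and that is the part which genuinely needs full barriers on both sides.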

Thanks.

-- 
tejun

