linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
@ 2019-10-04  0:12 Xuewei Zhang
  2019-10-04  0:54 ` Phil Auld
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Xuewei Zhang @ 2019-10-04  0:12 UTC (permalink / raw)
  To: Phil Auld, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman
  Cc: Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial, Xuewei Zhang

The quota/period ratio is used to ensure a child task group won't get more
bandwidth than the parent task group, and is calculated as:
normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio is changed during this scaling due to
precision loss, it will cause an inconsistency between parent and child
task groups. See the example below:
A userspace container manager (kubelet) does three operations:
1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
2) Create a few child cgroups.
3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3), quota and period of the parent cgroup
will be changed:
new_quota: 1148437ns, 1148us
new_period: 11484375ns, 11484us

When step 3) comes in, the ratio of the child cgroup will be 104857, which
is larger than the parent cgroup's ratio (104821), so the operation will
fail.
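
For reference, the numbers above can be reproduced with the small standalone
program below (a userspace sketch that only mirrors the old 147/128 scaling
and the normalized ratio formula; it is not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t quota_ns = 1000 * 1000ULL;	/* 1,000us */
	uint64_t period_ns = 10000 * 1000ULL;	/* 10,000us */

	/* old behaviour: grow the period by ~115% (147/128) */
	uint64_t new_period_ns = period_ns * 147 / 128;
	uint64_t new_quota_ns = quota_ns * new_period_ns / period_ns;

	/* normalized_cfs_quota() works on microsecond values */
	uint64_t parent_ratio =
		((new_quota_ns / 1000) << 20) / (new_period_ns / 1000);
	uint64_t child_ratio =
		((quota_ns / 1000) << 20) / (period_ns / 1000);

	/* prints new_quota: 1148437ns, new_period: 11484375ns */
	printf("new_quota: %lluns, new_period: %lluns\n",
	       (unsigned long long)new_quota_ns,
	       (unsigned long long)new_period_ns);
	/* prints parent ratio: 104821, child ratio: 104857 */
	printf("parent ratio: %llu, child ratio: %llu\n",
	       (unsigned long long)parent_ratio,
	       (unsigned long long)child_ratio);
	return 0;
}

Since the child ratio (104857) ends up larger than the scaled parent ratio
(104821), the write in step 3) is rejected.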

Scaling them by a factor of 2 will fix the problem.

Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e2374f..b3d3d0a231cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 		if (++count > 3) {
 			u64 new, old = ktime_to_ns(cfs_b->period);
 
-			new = (old * 147) / 128; /* ~115% */
-			new = min(new, max_cfs_quota_period);
-
-			cfs_b->period = ns_to_ktime(new);
-
-			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
-			cfs_b->quota *= new;
-			cfs_b->quota = div64_u64(cfs_b->quota, old);
-
-			pr_warn_ratelimited(
-	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
-				smp_processor_id(),
-				div_u64(new, NSEC_PER_USEC),
-				div_u64(cfs_b->quota, NSEC_PER_USEC));
+			/*
+			 * Grow period by a factor of 2 to avoid losing precision.
+			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
+			 * to fail.
+			 */
+			new = old * 2;
+			if (new < max_cfs_quota_period) {
+				cfs_b->period = ns_to_ktime(new);
+				cfs_b->quota *= 2;
+
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(new, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			} else {
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(old, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			}
 
 			/* reset count so we don't come right back in here */
 			count = 0;
-- 
2.23.0.581.g78d2f28ef7-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
@ 2019-10-04  0:54 ` Phil Auld
  2019-10-04  2:05   ` Xuewei Zhang
  2019-10-04  6:11 ` Greg KH
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Phil Auld @ 2019-10-04  0:54 UTC (permalink / raw)
  To: Xuewei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial

Hi,

On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> quota/period ratio is used to ensure a child task group won't get more
> bandwidth than the parent task group, and is calculated as:
> normalized_cfs_quota() = [(quota_us << 20) / period_us]
> 
> If the quota/period ratio was changed during this scaling due to
> precision loss, it will cause inconsistency between parent and child
> task groups. See below example:
> A userspace container manager (kubelet) does three operations:
> 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> 2) Create a few children cgroups.
> 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> 
> These operations are expected to succeed. However, if the scaling of
> 147/128 happens before step 3), quota and period of the parent cgroup
> will be changed:
> new_quota: 1148437ns, 1148us
> new_period: 11484375ns, 11484us
> 
> And when step 3) comes in, the ratio of the child cgroup will be 104857,
> which will be larger than the parent cgroup ratio (104821), and will
> fail.
> 
> Scaling them by a factor of 2 will fix the problem.

I have no issues with the concept. We went around a bit about the actual
numbers and made it an approximation. 

> 
> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> ---
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e2374f..b3d3d0a231cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  		if (++count > 3) {
>  			u64 new, old = ktime_to_ns(cfs_b->period);
>  
> -			new = (old * 147) / 128; /* ~115% */
> -			new = min(new, max_cfs_quota_period);
> -
> -			cfs_b->period = ns_to_ktime(new);
> -
> -			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> -			cfs_b->quota *= new;
> -			cfs_b->quota = div64_u64(cfs_b->quota, old);
> -
> -			pr_warn_ratelimited(
> -	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> -				smp_processor_id(),
> -				div_u64(new, NSEC_PER_USEC),
> -				div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			/*
> +			 * Grow period by a factor of 2 to avoid lossing precision.
> +			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
> +			 * to fail.
> +			 */
> +			new = old * 2;
> +			if (new < max_cfs_quota_period) {

I don't like this part as much. There may be a value between
max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
the loop. Possibly in practice it won't matter but here you trigger the
warning and take no action to keep it from continuing.

Also, if you are actually hitting this then you might want to just start at
a higher but proportional quota and period.


Cheers,
Phil

> +				cfs_b->period = ns_to_ktime(new);
> +				cfs_b->quota *= 2;
> +
> +				pr_warn_ratelimited(
> +	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> +					smp_processor_id(),
> +					div_u64(new, NSEC_PER_USEC),
> +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			} else {
> +				pr_warn_ratelimited(
> +	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> +					smp_processor_id(),
> +					div_u64(old, NSEC_PER_USEC),
> +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			}
>  
>  			/* reset count so we don't come right back in here */
>  			count = 0;
> -- 
> 2.23.0.581.g78d2f28ef7-goog
> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04  0:54 ` Phil Auld
@ 2019-10-04  2:05   ` Xuewei Zhang
  2019-10-04 13:14     ` Phil Auld
  0 siblings, 1 reply; 12+ messages in thread
From: Xuewei Zhang @ 2019-10-04  2:05 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial, Neel Natu, Hao Luo

+cc neelnatu@google.com and haoluo@google.com; they helped a lot
with this issue. Sorry I forgot to include them when sending out the patch.

On Thu, Oct 3, 2019 at 5:55 PM Phil Auld <pauld@redhat.com> wrote:
>
> Hi,
>
> On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > quota/period ratio is used to ensure a child task group won't get more
> > bandwidth than the parent task group, and is calculated as:
> > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> >
> > If the quota/period ratio was changed during this scaling due to
> > precision loss, it will cause inconsistency between parent and child
> > task groups. See below example:
> > A userspace container manager (kubelet) does three operations:
> > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > 2) Create a few children cgroups.
> > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> >
> > These operations are expected to succeed. However, if the scaling of
> > 147/128 happens before step 3), quota and period of the parent cgroup
> > will be changed:
> > new_quota: 1148437ns, 1148us
> > new_period: 11484375ns, 11484us
> >
> > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > which will be larger than the parent cgroup ratio (104821), and will
> > fail.
> >
> > Scaling them by a factor of 2 will fix the problem.
>
> I have no issues with the concept. We went around a bit about the actual
> numbers and made it an approximation.
>
> >
> > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> > ---
> >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 83ab35e2374f..b3d3d0a231cd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >               if (++count > 3) {
> >                       u64 new, old = ktime_to_ns(cfs_b->period);
> >
> > -                     new = (old * 147) / 128; /* ~115% */
> > -                     new = min(new, max_cfs_quota_period);
> > -
> > -                     cfs_b->period = ns_to_ktime(new);
> > -
> > -                     /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > -                     cfs_b->quota *= new;
> > -                     cfs_b->quota = div64_u64(cfs_b->quota, old);
> > -
> > -                     pr_warn_ratelimited(
> > -     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > -                             smp_processor_id(),
> > -                             div_u64(new, NSEC_PER_USEC),
> > -                             div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     /*
> > +                      * Grow period by a factor of 2 to avoid lossing precision.
> > +                      * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > +                      * to fail.
> > +                      */
> > +                     new = old * 2;
> > +                     if (new < max_cfs_quota_period) {
>
> I don't like this part as much. There may be a value between
> max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> the loop. Possibly in practice it won't matter but here you trigger the
> warning and take no action to keep it from continuing.
>
> Also, if you are actually hitting this then you might want to just start at
> a higher but proportional quota and period.

I'd like to do what you suggested. A quick idea would be to scale the period
to max_cfs_quota_period and scale the quota proportionally. However, the
naive implementation won't work in this edge case:
original:
quota: 500,000us  period: 570,000us
after scaling:
quota: 877,192us  period: 1,000,000us
original ratio: 919803
new ratio: 919802
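
For reference, those numbers come out of a quick standalone check of the
naive proportional scaling, using the ratio formula from the patch
description (just a sketch, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t quota_us = 500000, period_us = 570000;
	uint64_t max_period_us = 1000000;

	/* naive idea: clamp the period to the max, scale quota to match */
	uint64_t new_period_us = max_period_us;
	uint64_t new_quota_us = quota_us * new_period_us / period_us;

	uint64_t old_ratio = (quota_us << 20) / period_us;		/* 919803 */
	uint64_t new_ratio = (new_quota_us << 20) / new_period_us;	/* 919802 */

	printf("new quota: %lluus, old ratio: %llu, new ratio: %llu\n",
	       (unsigned long long)new_quota_us,
	       (unsigned long long)old_ratio,
	       (unsigned long long)new_ratio);
	return 0;
}

The new ratio comes out one lower than the original, so a child configured
with the original quota/period would then be rejected.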

To do this right, the code would have to keep an eye on the precision loss,
and occasionally increase the quota by 1us to cancel it out.

Also, I think this case is not that important, because hitting it suggests
the period is already >0.5s. And if we are still hitting timeouts with a
0.5s period, scaling it to 1s probably won't help much.
When this happens, I'd imagine the parent cgroup would have a LOT of child
cgroups. It might make sense for the userspace to create the parent cgroup
with a 1s period.

If you think automatically scaling a 0.5s+ period up to 1s is still
important, I'm happy to stash this patch and send in another one that
handles the 0.5s+ -> 1s scaling the right way. :) Thanks!

Best regards,
Xuewei

>
>
> Cheers,
> Phil
>
> > +                             cfs_b->period = ns_to_ktime(new);
> > +                             cfs_b->quota *= 2;
> > +
> > +                             pr_warn_ratelimited(
> > +     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +                                     smp_processor_id(),
> > +                                     div_u64(new, NSEC_PER_USEC),
> > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     } else {
> > +                             pr_warn_ratelimited(
> > +     "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +                                     smp_processor_id(),
> > +                                     div_u64(old, NSEC_PER_USEC),
> > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     }
> >
> >                       /* reset count so we don't come right back in here */
> >                       count = 0;
> > --
> > 2.23.0.581.g78d2f28ef7-goog
> >
>
> --

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
  2019-10-04  0:54 ` Phil Auld
@ 2019-10-04  6:11 ` Greg KH
  2019-10-07 15:14 ` Phil Auld
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Greg KH @ 2019-10-04  6:11 UTC (permalink / raw)
  To: Xuewei Zhang
  Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Anton Blanchard, Linus Torvalds, Thomas Gleixner,
	linux-kernel, stable, trivial

On Thu, Oct 03, 2019 at 05:12:43PM -0700, Xuewei Zhang wrote:
> quota/period ratio is used to ensure a child task group won't get more
> bandwidth than the parent task group, and is calculated as:
> normalized_cfs_quota() = [(quota_us << 20) / period_us]
> 
> If the quota/period ratio was changed during this scaling due to
> precision loss, it will cause inconsistency between parent and child
> task groups. See below example:
> A userspace container manager (kubelet) does three operations:
> 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> 2) Create a few children cgroups.
> 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> 
> These operations are expected to succeed. However, if the scaling of
> 147/128 happens before step 3), quota and period of the parent cgroup
> will be changed:
> new_quota: 1148437ns, 1148us
> new_period: 11484375ns, 11484us
> 
> And when step 3) comes in, the ratio of the child cgroup will be 104857,
> which will be larger than the parent cgroup ratio (104821), and will
> fail.
> 
> Scaling them by a factor of 2 will fix the problem.
> 
> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> ---
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04  2:05   ` Xuewei Zhang
@ 2019-10-04 13:14     ` Phil Auld
  2019-10-05  0:28       ` Xuewei Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Auld @ 2019-10-04 13:14 UTC (permalink / raw)
  To: Xuewei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial, Neel Natu, Hao Luo

On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> +cc neelnatu@google.com and haoluo@google.com, they helped a lot
> for this issue. Sorry I forgot to include them when sending out the patch.
> 
> On Thu, Oct 3, 2019 at 5:55 PM Phil Auld <pauld@redhat.com> wrote:
> >
> > Hi,
> >
> > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > quota/period ratio is used to ensure a child task group won't get more
> > > bandwidth than the parent task group, and is calculated as:
> > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > >
> > > If the quota/period ratio was changed during this scaling due to
> > > precision loss, it will cause inconsistency between parent and child
> > > task groups. See below example:
> > > A userspace container manager (kubelet) does three operations:
> > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > > 2) Create a few children cgroups.
> > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > >
> > > These operations are expected to succeed. However, if the scaling of
> > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > will be changed:
> > > new_quota: 1148437ns, 1148us
> > > new_period: 11484375ns, 11484us
> > >
> > > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > > which will be larger than the parent cgroup ratio (104821), and will
> > > fail.
> > >
> > > Scaling them by a factor of 2 will fix the problem.
> >
> > I have no issues with the concept. We went around a bit about the actual
> > numbers and made it an approximation.
> >
> > >
> > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> > > ---
> > >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > >               if (++count > 3) {
> > >                       u64 new, old = ktime_to_ns(cfs_b->period);
> > >
> > > -                     new = (old * 147) / 128; /* ~115% */
> > > -                     new = min(new, max_cfs_quota_period);
> > > -
> > > -                     cfs_b->period = ns_to_ktime(new);
> > > -
> > > -                     /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > > -                     cfs_b->quota *= new;
> > > -                     cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > -
> > > -                     pr_warn_ratelimited(
> > > -     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > -                             smp_processor_id(),
> > > -                             div_u64(new, NSEC_PER_USEC),
> > > -                             div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > +                     /*
> > > +                      * Grow period by a factor of 2 to avoid lossing precision.
> > > +                      * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > > +                      * to fail.
> > > +                      */
> > > +                     new = old * 2;
> > > +                     if (new < max_cfs_quota_period) {
> >
> > I don't like this part as much. There may be a value between
> > max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> > the loop. Possibly in practice it won't matter but here you trigger the
> > warning and take no action to keep it from continuing.
> >
> > Also, if you are actually hitting this then you might want to just start at
> > a higher but proportional quota and period.
> 
> I'd like to do what you suggested. A quick idea would be to scale period to
> max_cfs_quota_period, and scale quota proportionally. However the naive
> implementation won't work under this edge case:
> original:
> quota: 500,000us  period: 570,000us
> after scaling:
> quota: 877,192us  period: 1,000,000us
> original ratio: 919803
> new ratio: 919802
> 
> To do this right, the code would have to keep an eye out on the precision loss,
> and increase quota by 1us sometimes to cancel out the precision loss.
> 
> Also, I think this case is not that important. Because if we are
> hitting this case, that
> suggests the period is already >0.5s. And if we are still hitting
> timeouts with a 0.5s
> period, scaling it to 1s probably won't help much.
> When this happens, I'd imagine the parent cgroup would have a LOT of child
> cgroups. It might make sense for the userspace to create the parent cgroup with
> 1s period.
> 
> If you think automatically scaling 0.5s+ to 1s is still important, I'm
> happy to stash
> this patch, and send in another one that handles the 0.5+s -> 1s
> scaling the right
> way. :) Thanks!

First let me understand your use case better. I was thinking about this more last
night and it doesn't make sense.

You are setting a small quota and period on the parent cgroup and then setting
the same small quota and period on the child, as you say, to keep the child
from getting more quota than the parent. But that should already be the case
simply by setting it on the parent. The child can't get more quota than the
parent. All this does is make the kernel do more work handling more period
timers and such.

Setting the child quota/period only makes sense when setting it smaller than 
the parent. 

Also, in order to hit this problem you need to have many hundreds of children,
in my experience. In that case it makes even less sense to write the same
quota/period as the parent into each of the children.

Or there is something else causing the timer to take too long to run... 


I agree that if we are taking > 1/2s to run do_sched_cfs_period_timer() it may 
not matter, as I said above.  


Cheers,
Phil

> 
> Best regards,
> Xuewei
> 
> >
> >
> > Cheers,
> > Phil
> >
> > > +                             cfs_b->period = ns_to_ktime(new);
> > > +                             cfs_b->quota *= 2;
> > > +
> > > +                             pr_warn_ratelimited(
> > > +     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > +                                     smp_processor_id(),
> > > +                                     div_u64(new, NSEC_PER_USEC),
> > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > +                     } else {
> > > +                             pr_warn_ratelimited(
> > > +     "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > +                                     smp_processor_id(),
> > > +                                     div_u64(old, NSEC_PER_USEC),
> > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > +                     }
> > >
> > >                       /* reset count so we don't come right back in here */
> > >                       count = 0;
> > > --
> > > 2.23.0.581.g78d2f28ef7-goog
> > >
> >
> > --

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04 13:14     ` Phil Auld
@ 2019-10-05  0:28       ` Xuewei Zhang
  2019-10-07 13:02         ` Phil Auld
  0 siblings, 1 reply; 12+ messages in thread
From: Xuewei Zhang @ 2019-10-05  0:28 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial, Neel Natu, Hao Luo

On Fri, Oct 4, 2019 at 6:14 AM Phil Auld <pauld@redhat.com> wrote:
>
> On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> > +cc neelnatu@google.com and haoluo@google.com, they helped a lot
> > for this issue. Sorry I forgot to include them when sending out the patch.
> >
> > On Thu, Oct 3, 2019 at 5:55 PM Phil Auld <pauld@redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > > quota/period ratio is used to ensure a child task group won't get more
> > > > bandwidth than the parent task group, and is calculated as:
> > > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > > >
> > > > If the quota/period ratio was changed during this scaling due to
> > > > precision loss, it will cause inconsistency between parent and child
> > > > task groups. See below example:
> > > > A userspace container manager (kubelet) does three operations:
> > > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > > > 2) Create a few children cgroups.
> > > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > > >
> > > > These operations are expected to succeed. However, if the scaling of
> > > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > > will be changed:
> > > > new_quota: 1148437ns, 1148us
> > > > new_period: 11484375ns, 11484us
> > > >
> > > > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > > > which will be larger than the parent cgroup ratio (104821), and will
> > > > fail.
> > > >
> > > > Scaling them by a factor of 2 will fix the problem.
> > >
> > > I have no issues with the concept. We went around a bit about the actual
> > > numbers and made it an approximation.
> > >
> > > >
> > > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > > > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> > > > ---
> > > >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> > > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > > >               if (++count > 3) {
> > > >                       u64 new, old = ktime_to_ns(cfs_b->period);
> > > >
> > > > -                     new = (old * 147) / 128; /* ~115% */
> > > > -                     new = min(new, max_cfs_quota_period);
> > > > -
> > > > -                     cfs_b->period = ns_to_ktime(new);
> > > > -
> > > > -                     /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > > > -                     cfs_b->quota *= new;
> > > > -                     cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > > -
> > > > -                     pr_warn_ratelimited(
> > > > -     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > > -                             smp_processor_id(),
> > > > -                             div_u64(new, NSEC_PER_USEC),
> > > > -                             div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > +                     /*
> > > > +                      * Grow period by a factor of 2 to avoid lossing precision.
> > > > +                      * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > > > +                      * to fail.
> > > > +                      */
> > > > +                     new = old * 2;
> > > > +                     if (new < max_cfs_quota_period) {
> > >
> > > I don't like this part as much. There may be a value between
> > > max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> > > the loop. Possibly in practice it won't matter but here you trigger the
> > > warning and take no action to keep it from continuing.
> > >
> > > Also, if you are actually hitting this then you might want to just start at
> > > a higher but proportional quota and period.
> >
> > I'd like to do what you suggested. A quick idea would be to scale period to
> > max_cfs_quota_period, and scale quota proportionally. However the naive
> > implementation won't work under this edge case:
> > original:
> > quota: 500,000us  period: 570,000us
> > after scaling:
> > quota: 877,192us  period: 1,000,000us
> > original ratio: 919803
> > new ratio: 919802
> >
> > To do this right, the code would have to keep an eye out on the precision loss,
> > and increase quota by 1us sometimes to cancel out the precision loss.
> >
> > Also, I think this case is not that important. Because if we are
> > hitting this case, that
> > suggests the period is already >0.5s. And if we are still hitting
> > timeouts with a 0.5s
> > period, scaling it to 1s probably won't help much.
> > When this happens, I'd imagine the parent cgroup would have a LOT of child
> > cgroups. It might make sense for the userspace to create the parent cgroup with
> > 1s period.
> >
> > If you think automatically scaling 0.5s+ to 1s is still important, I'm
> > happy to stash
> > this patch, and send in another one that handles the 0.5+s -> 1s
> > scaling the right
> > way. :) Thanks!
>
> First let me understand your use case better. I was thinking about this more last
> night and it doesn't make sense.
>
> You are setting a small quota and period on the parent cgroup and then setting the
> same small quota and period on the child. As you say to keep the child from getting
> more quota than the parent. But that should already be the case simply by setting
> it on the parent. The child can't get more quota than the parent.   All this does
> is make the kernel do more work handling more period timers and such.

Sorry for not being clear enough. Let me provide a bit more context:

kubelet [1] is the userspace program setting the cfs quota and period.
kubelet is essentially a container manager for the end user. The end user
can specify any attainable configuration for a pod (which contains multiple
containers).

The user interface of kubelet allows the end user to specify the amount of
CPU granted to any pod or container (in the form of mCPU). kubelet then
converts that spec to the quota/period accepted by the cgroup fs, using this
rule:
the period of any pod/container is set to 100,000us
the quota of the pod/container is calculated from the allowed mCPU

kubelet then simply writes the calculated period and quota to the cgroup fs,
roughly as sketched below.
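
For illustration, here is a minimal userspace sketch of that conversion (the
formula is my paraphrase of kubelet's behaviour, not a copy of its source;
the 250 mCPU value is just an example):

#include <stdio.h>

int main(void)
{
	unsigned long period_us = 100000;	/* fixed 100,000us period */
	unsigned long mcpu = 250;		/* e.g. a container granted 250 mCPU */

	/* assumed rule: quota_us = mcpu * period_us / 1000 */
	unsigned long quota_us = mcpu * period_us / 1000;

	/* kubelet writes these values into the cgroup fs */
	printf("cpu.cfs_period_us = %lu\n", period_us);	/* 100000 */
	printf("cpu.cfs_quota_us  = %lu\n", quota_us);		/* 25000, i.e. 25% of one CPU */
	return 0;
}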

It's very common to specify a pod with multiple containers and to set
different quotas for the child containers: some are granted 5-50% of the
bandwidth available to the parent, while others are granted 100%. For
simplicity, kubelet writes quota/period to the cgroup fs for all pods and
containers.

----
Now back to our discussion. :)

You see, the reason that kubelet writes identical quota and period to parent
and child cgroups is not that it wants to enforce that a child doesn't get
more quota than its parent. It is simply that kubelet needs to manage the
quota for all containers and pods, and it's more convenient to just set the
quota and period for all of them (because in many cases, child cgroups
actually get less bandwidth than the parent and have to be set specifically).

I agree that your suggestion would work. If a child cgroup is set to the same
bandwidth as the parent cgroup, we could change the userspace program and ask
it to skip setting the child cgroup bandwidth.
However, this logic would be a special case, and would require significant
changes to the userspace container managers.


This issue is affecting many Kubernetes users; see this open issue:
https://github.com/kubernetes/kubernetes/issues/72878
kubelet on their machines is doing the three operations mentioned in the patch.
I also explained them in more detail in this doc:
https://docs.google.com/document/d/13KLD__6A935igLXpTFFomqfclATC89nAkhPOxsuKA0I/edit?usp=sharing

Basically, Kubernetes is operating on the following assumption about the
kernel today: setting the cpu quota/period of a child cgroup should not be
rejected unless the bandwidth exceeds the quota/period set for the parent
cgroup.

I think this assumption is fair. Please let me know if you think otherwise.
And if so, since the kernel breaks this assumption today, I don't think it's
the responsibility of userspace to deal with the kernel changing the
quota/period ratio at any time.

[1] https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet

>
> Setting the child quota/period only makes sense when setting it smaller than
> the parent.

As mentioned above, in the case of kubelet it's much easier to always set
the child quota/period than to only set it when it is different (i.e.
smaller) than the parent's.

>
> Also, in order to hit this problem you need to have many hundreds of children, in
> my experience. In that case it makes even less sense to write the same quota/preiod
> as the parent into each of the children.

Here is a problematic scenario:
The parent cgroup has 1000 children with a small quota/period, and after a
few minutes, kubelet wants to add one additional child with the same
quota/period. This bug could prevent kubelet from setting up that one
additional child successfully.


Thanks a lot for taking the time to review and respond to the patch, Phil!
Really appreciate it.

Best regards,
Xuewei

>
> Or there is something else causing the timer to take too long to run...
>
>
> I agree that if we are taking > 1/2s to run do_sched_cfs_period_timer() it may
> not matter, as I said above.
>
>
> Cheers,
> Phil
>
> >
> > Best regards,
> > Xuewei
> >
> > >
> > >
> > > Cheers,
> > > Phil
> > >
> > > > +                             cfs_b->period = ns_to_ktime(new);
> > > > +                             cfs_b->quota *= 2;
> > > > +
> > > > +                             pr_warn_ratelimited(
> > > > +     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > +                                     smp_processor_id(),
> > > > +                                     div_u64(new, NSEC_PER_USEC),
> > > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > +                     } else {
> > > > +                             pr_warn_ratelimited(
> > > > +     "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > +                                     smp_processor_id(),
> > > > +                                     div_u64(old, NSEC_PER_USEC),
> > > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > +                     }
> > > >
> > > >                       /* reset count so we don't come right back in here */
> > > >                       count = 0;
> > > > --
> > > > 2.23.0.581.g78d2f28ef7-goog
> > > >
> > >
> > > --
>
> --

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-05  0:28       ` Xuewei Zhang
@ 2019-10-07 13:02         ` Phil Auld
  0 siblings, 0 replies; 12+ messages in thread
From: Phil Auld @ 2019-10-07 13:02 UTC (permalink / raw)
  To: Xuewei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial, Neel Natu, Hao Luo

Hi Xuewei,

On Fri, Oct 04, 2019 at 05:28:15PM -0700 Xuewei Zhang wrote:
> On Fri, Oct 4, 2019 at 6:14 AM Phil Auld <pauld@redhat.com> wrote:
> >
> > On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> > > +cc neelnatu@google.com and haoluo@google.com, they helped a lot
> > > for this issue. Sorry I forgot to include them when sending out the patch.
> > >
> > > On Thu, Oct 3, 2019 at 5:55 PM Phil Auld <pauld@redhat.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > > > quota/period ratio is used to ensure a child task group won't get more
> > > > > bandwidth than the parent task group, and is calculated as:
> > > > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > > > >
> > > > > If the quota/period ratio was changed during this scaling due to
> > > > > precision loss, it will cause inconsistency between parent and child
> > > > > task groups. See below example:
> > > > > A userspace container manager (kubelet) does three operations:
> > > > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > > > > 2) Create a few children cgroups.
> > > > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > > > >
> > > > > These operations are expected to succeed. However, if the scaling of
> > > > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > > > will be changed:
> > > > > new_quota: 1148437ns, 1148us
> > > > > new_period: 11484375ns, 11484us
> > > > >
> > > > > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > > > > which will be larger than the parent cgroup ratio (104821), and will
> > > > > fail.
> > > > >
> > > > > Scaling them by a factor of 2 will fix the problem.
> > > >
> > > > I have no issues with the concept. We went around a bit about the actual
> > > > numbers and made it an approximation.
> > > >
> > > > >
> > > > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > > > > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> > > > > ---
> > > > >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> > > > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > > > >               if (++count > 3) {
> > > > >                       u64 new, old = ktime_to_ns(cfs_b->period);
> > > > >
> > > > > -                     new = (old * 147) / 128; /* ~115% */
> > > > > -                     new = min(new, max_cfs_quota_period);
> > > > > -
> > > > > -                     cfs_b->period = ns_to_ktime(new);
> > > > > -
> > > > > -                     /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > > > > -                     cfs_b->quota *= new;
> > > > > -                     cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > > > -
> > > > > -                     pr_warn_ratelimited(
> > > > > -     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > > > -                             smp_processor_id(),
> > > > > -                             div_u64(new, NSEC_PER_USEC),
> > > > > -                             div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > +                     /*
> > > > > +                      * Grow period by a factor of 2 to avoid lossing precision.
> > > > > +                      * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > > > > +                      * to fail.
> > > > > +                      */
> > > > > +                     new = old * 2;
> > > > > +                     if (new < max_cfs_quota_period) {
> > > >
> > > > I don't like this part as much. There may be a value between
> > > > max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> > > > the loop. Possibly in practice it won't matter but here you trigger the
> > > > warning and take no action to keep it from continuing.
> > > >
> > > > Also, if you are actually hitting this then you might want to just start at
> > > > a higher but proportional quota and period.
> > >
> > > I'd like to do what you suggested. A quick idea would be to scale period to
> > > max_cfs_quota_period, and scale quota proportionally. However the naive
> > > implementation won't work under this edge case:
> > > original:
> > > quota: 500,000us  period: 570,000us
> > > after scaling:
> > > quota: 877,192us  period: 1,000,000us
> > > original ratio: 919803
> > > new ratio: 919802
> > >
> > > To do this right, the code would have to keep an eye out on the precision loss,
> > > and increase quota by 1us sometimes to cancel out the precision loss.
> > >
> > > Also, I think this case is not that important. Because if we are
> > > hitting this case, that
> > > suggests the period is already >0.5s. And if we are still hitting
> > > timeouts with a 0.5s
> > > period, scaling it to 1s probably won't help much.
> > > When this happens, I'd imagine the parent cgroup would have a LOT of child
> > > cgroups. It might make sense for the userspace to create the parent cgroup with
> > > 1s period.
> > >
> > > If you think automatically scaling 0.5s+ to 1s is still important, I'm
> > > happy to stash
> > > this patch, and send in another one that handles the 0.5+s -> 1s
> > > scaling the right
> > > way. :) Thanks!
> >
> > First let me understand your use case better. I was thinking about this more last
> > night and it doesn't make sense.
> >
> > You are setting a small quota and period on the parent cgroup and then setting the
> > same small quota and period on the child. As you say to keep the child from getting
> > more quota than the parent. But that should already be the case simply by setting
> > it on the parent. The child can't get more quota than the parent.   All this does
> > is make the kernel do more work handling more period timers and such.
> 
> Sorry for not being clear enough. Let me provide a bit more additional context:
> 
> kubelet [1] is the userspace program setting the cfs quota and period.
> kubelet is essentially a container manager for the end user. The end user
> can specify any attainable configurations for a pod (which contains multiple
> containers).
> 
> The user interface of kubelet allows end user to specify the amount of CPU
> granted to any pod or container (in the form of mCPU). And then kubelet will
> convert the spec to quota/period accepted by cgroup fs, using this rule:
> the period of any pod/container will be set to 100000us
> the quota of the pod/container will be calculated using the allowed mCPU
> 
> And kubelet simply then writes the calculated period and quota to cgroup fs.
> 
> It's very common to specify a pod with multiple containers, and setting
> different quota for the child containers: some granted with 5-50% of the
> bandwidth available to the parent, while some other granted with 100%. For
> simplicity, kubelet writes quota/period to cgroup fs for all pods and
> containers.
> 

Thanks for the details :)


> ----
> Now back to our discussion. :)
> 
> You see, the reason that kubelet write identical quota and period to parent and
> child cgroup, is not because it want to enforce that child doesn't get more
> quota than parent. It is simply because kubelet needs to manage the quota for
> all containers and pods, and it's more convenenient to just set the quota and
> period for all of them (because in many cases, child cgroups actually gets less
> bandwidth than the parent, and has to be set specifically).
> 
> I agree that your suggestion would work. If a child cgroup is set to the same
> bandwidth of the parent cgroup, we could change the userspace program, and ask
> it to skip setting the child cgroup bandwidth.
> However, this logic would be a special case, and will require significant logic
> change to the userspace container managers.
> 
> 
> This issue is affecting many Kubernetes users, see this open issue:
> https://github.com/kubernetes/kubernetes/issues/72878
> kubelet on their machines are doing the three operations mentioned in the patch.
> I also explained them in more detail in this doc:
> https://docs.google.com/document/d/13KLD__6A935igLXpTFFomqfclATC89nAkhPOxsuKA0I/edit?usp=sharing
> 
> Basically, Kubernetes is operating on the below assumption of kernel today:
> Setting the cpu quota/period of a child cgroup should not be rejected unless
> the bandwidth is exceeding what the quota/period set for the parent cgroup.
> 
> I think this assumption is fair. Please let me know if you think otherwise. And
> if so, since the kernel broke this assumption today, I don't think it's the
> responsibility for the userspace to deal with the problem that kernel may change
> the quota/period ratio at any time.
> 
> [1] https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet
> 

Okay. I'm on board with this. At your starting values you'll get 100, 200, 400,
800ms before hitting the max. That should be enough. I'm a little surprised
you're hitting it even at 100ms, but it sounds like you have a lot of children.
And if they have their own settings, that could be taking longer. I suspect
contention on the cfs_b->lock could be adding to it.
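
(Quick standalone check of that sequence, assuming a 100ms starting period
and the 1s max_cfs_quota_period; just a userspace loop, not the kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned long long period_ns = 100 * 1000000ULL;	/* 100ms */
	unsigned long long max_ns = 1000000000ULL;		/* 1s */

	printf("%llums", period_ns / 1000000);
	while (period_ns * 2 < max_ns) {
		period_ns *= 2;
		printf(" -> %llums", period_ns / 1000000);
	}
	printf("\n");	/* 100ms -> 200ms -> 400ms -> 800ms */
	return 0;
}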

I do think that setup is wasting kernel cpu cycles, but that's a somewhat
orthogonal discussion :)


> >
> > Setting the child quota/period only makes sense when setting it smaller than
> > the parent.
> 
> As mentioned above, in the use case of kublet, it's much easier to always
> set the child quota/period, than to only set it when it is different
> (i.e. smaller)
> than the parent.
> 
> >
> > Also, in order to hit this problem you need to have many hundreds of children, in
> > my experience. In that case it makes even less sense to write the same quota/preiod
> > as the parent into each of the children.
> 
> Here is a problematic scenario:
> The parent cgroup have 1000 children with a small quota/period, and after a
> few minutes, kubelet wants to add one additional child with the same
> quota/period.
> This bug could prevent kubelet from setting that one additional child
> successfully.
> 
> 
> Thanks a lot for taking time reviewing and responding the patch Phil!
> Really appreciate it.
> 

Sure thing. Thanks for tracking it down. I'll try to test this on my original 
reproducer when I have a chance. I don't foresee any issues though, so for now:

Acked-by: Phil Auld <pauld@redhat.com>



Cheers,
Phil

> Best regards,
> Xuewei
> 
> >
> > Or there is something else causing the timer to take too long to run...
> >
> >
> > I agree that if we are taking > 1/2s to run do_sched_cfs_period_timer() it may
> > not matter, as I said above.
> >
> >
> > Cheers,
> > Phil
> >
> > >
> > > Best regards,
> > > Xuewei
> > >
> > > >
> > > >
> > > > Cheers,
> > > > Phil
> > > >
> > > > > +                             cfs_b->period = ns_to_ktime(new);
> > > > > +                             cfs_b->quota *= 2;
> > > > > +
> > > > > +                             pr_warn_ratelimited(
> > > > > +     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > > +                                     smp_processor_id(),
> > > > > +                                     div_u64(new, NSEC_PER_USEC),
> > > > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > +                     } else {
> > > > > +                             pr_warn_ratelimited(
> > > > > +     "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > > > > +                                     smp_processor_id(),
> > > > > +                                     div_u64(old, NSEC_PER_USEC),
> > > > > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > +                     }
> > > > >
> > > > >                       /* reset count so we don't come right back in here */
> > > > >                       count = 0;
> > > > > --
> > > > > 2.23.0.581.g78d2f28ef7-goog
> > > > >
> > > >
> > > > --
> >
> > --

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
  2019-10-04  0:54 ` Phil Auld
  2019-10-04  6:11 ` Greg KH
@ 2019-10-07 15:14 ` Phil Auld
  2019-10-07 23:29   ` Xuewei Zhang
  2019-10-08 11:26   ` Peter Zijlstra
  2019-10-09 12:59 ` [tip: sched/urgent] sched/fair: Scale bandwidth " tip-bot2 for Xuewei Zhang
  2019-10-09 12:59 ` [tip: sched/core] " tip-bot2 for Xuewei Zhang
  4 siblings, 2 replies; 12+ messages in thread
From: Phil Auld @ 2019-10-07 15:14 UTC (permalink / raw)
  To: Xuewei Zhang
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial

On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> quota/period ratio is used to ensure a child task group won't get more
> bandwidth than the parent task group, and is calculated as:
> normalized_cfs_quota() = [(quota_us << 20) / period_us]
> 
> If the quota/period ratio was changed during this scaling due to
> precision loss, it will cause inconsistency between parent and child
> task groups. See below example:
> A userspace container manager (kubelet) does three operations:
> 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> 2) Create a few children cgroups.
> 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> 
> These operations are expected to succeed. However, if the scaling of
> 147/128 happens before step 3), quota and period of the parent cgroup
> will be changed:
> new_quota: 1148437ns, 1148us
> new_period: 11484375ns, 11484us
> 
> And when step 3) comes in, the ratio of the child cgroup will be 104857,
> which will be larger than the parent cgroup ratio (104821), and will
> fail.
> 
> Scaling them by a factor of 2 will fix the problem.
> 
> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> Signed-off-by: Xuewei Zhang <xueweiz@google.com>


I managed to get it to trigger the second case. It took 50,000 children (20x my initial tests).

[ 1367.850630] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 4340, cfs_quota_us = 250000)
[ 1370.390832] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 8680, cfs_quota_us = 500000)
[ 1372.914689] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 17360, cfs_quota_us = 1000000)
[ 1375.447431] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 34720, cfs_quota_us = 2000000)
[ 1377.982785] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 69440, cfs_quota_us = 4000000)
[ 1380.481702] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 138880, cfs_quota_us = 8000000)
[ 1382.894692] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 277760, cfs_quota_us = 16000000)
[ 1385.264872] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 555520, cfs_quota_us = 32000000)
[ 1393.965140] cfs_period_timer[cpu11]: period too short, but cannot scale up without losing precision (cfs_period_us = 555520, cfs_quota_us = 32000000)
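
For reference, that sequence can be reproduced with a quick standalone loop,
assuming the pre-scaling period was 2170us (half of the first logged value)
and the 1s max_cfs_quota_period:

#include <stdio.h>

int main(void)
{
	unsigned long long period_us = 2170;	/* assumed pre-scaling period */
	unsigned long long max_us = 1000000;	/* max_cfs_quota_period, 1s */

	while (period_us * 2 < max_us) {
		period_us *= 2;
		printf("scaling up: new cfs_period_us = %llu\n", period_us);
	}
	printf("cannot scale past %lluus without exceeding the 1s max\n",
	       period_us);
	return 0;
}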

I suspect going higher could cause the original lockup, but that'd be the case with the old code as well. 
And this also gets us out of it faster.


Tested-by: Phil Auld <pauld@redhat.com>


Cheers,
Phil


> ---
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e2374f..b3d3d0a231cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  		if (++count > 3) {
>  			u64 new, old = ktime_to_ns(cfs_b->period);
>  
> -			new = (old * 147) / 128; /* ~115% */
> -			new = min(new, max_cfs_quota_period);
> -
> -			cfs_b->period = ns_to_ktime(new);
> -
> -			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> -			cfs_b->quota *= new;
> -			cfs_b->quota = div64_u64(cfs_b->quota, old);
> -
> -			pr_warn_ratelimited(
> -	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> -				smp_processor_id(),
> -				div_u64(new, NSEC_PER_USEC),
> -				div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			/*
> +			 * Grow period by a factor of 2 to avoid lossing precision.
> +			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
> +			 * to fail.
> +			 */
> +			new = old * 2;
> +			if (new < max_cfs_quota_period) {
> +				cfs_b->period = ns_to_ktime(new);
> +				cfs_b->quota *= 2;
> +
> +				pr_warn_ratelimited(
> +	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> +					smp_processor_id(),
> +					div_u64(new, NSEC_PER_USEC),
> +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			} else {
> +				pr_warn_ratelimited(
> +	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> +					smp_processor_id(),
> +					div_u64(old, NSEC_PER_USEC),
> +					div_u64(cfs_b->quota, NSEC_PER_USEC));
> +			}
>  
>  			/* reset count so we don't come right back in here */
>  			count = 0;
> -- 
> 2.23.0.581.g78d2f28ef7-goog
> 

-- 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-07 15:14 ` Phil Auld
@ 2019-10-07 23:29   ` Xuewei Zhang
  2019-10-08 11:26   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Xuewei Zhang @ 2019-10-07 23:29 UTC (permalink / raw)
  To: Phil Auld
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial

On Mon, Oct 7, 2019 at 8:14 AM Phil Auld <pauld@redhat.com> wrote:
>
> On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > quota/period ratio is used to ensure a child task group won't get more
> > bandwidth than the parent task group, and is calculated as:
> > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> >
> > If the quota/period ratio was changed during this scaling due to
> > precision loss, it will cause inconsistency between parent and child
> > task groups. See below example:
> > A userspace container manager (kubelet) does three operations:
> > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > 2) Create a few children cgroups.
> > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> >
> > These operations are expected to succeed. However, if the scaling of
> > 147/128 happens before step 3), quota and period of the parent cgroup
> > will be changed:
> > new_quota: 1148437ns, 1148us
> > new_period: 11484375ns, 11484us
> >
> > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > which will be larger than the parent cgroup ratio (104821), and will
> > fail.
> >
> > Scaling them by a factor of 2 will fix the problem.
> >
> > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
>
>
> I managed to get it to trigger the second case. It took 50,000 children (20x my initial tests).
>
> [ 1367.850630] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 4340, cfs_quota_us = 250000)
> [ 1370.390832] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 8680, cfs_quota_us = 500000)
> [ 1372.914689] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 17360, cfs_quota_us = 1000000)
> [ 1375.447431] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 34720, cfs_quota_us = 2000000)
> [ 1377.982785] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 69440, cfs_quota_us = 4000000)
> [ 1380.481702] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 138880, cfs_quota_us = 8000000)
> [ 1382.894692] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 277760, cfs_quota_us = 16000000)
> [ 1385.264872] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 555520, cfs_quota_us = 32000000)
> [ 1393.965140] cfs_period_timer[cpu11]: period too short, but cannot scale up without losing precision (cfs_period_us = 555520, cfs_quota_us = 32000000)
>
> I suspect going higher could cause the original lockup, but that'd be the case with the old code as well.
> And this also gets us out of it faster.
>
>
> Tested-by: Phil Auld <pauld@redhat.com>

Thanks a lot for the review and for the experiments and testing, Phil! Really appreciate it.

To other scheduler maintainers: Could someone help review and approve
the patch? I'm happy to fix any defect in it :)

Best regards,
Xuewei

>
>
> Cheers,
> Phil
>
>
> > ---
> >  kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 83ab35e2374f..b3d3d0a231cd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >               if (++count > 3) {
> >                       u64 new, old = ktime_to_ns(cfs_b->period);
> >
> > -                     new = (old * 147) / 128; /* ~115% */
> > -                     new = min(new, max_cfs_quota_period);
> > -
> > -                     cfs_b->period = ns_to_ktime(new);
> > -
> > -                     /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > -                     cfs_b->quota *= new;
> > -                     cfs_b->quota = div64_u64(cfs_b->quota, old);
> > -
> > -                     pr_warn_ratelimited(
> > -     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > -                             smp_processor_id(),
> > -                             div_u64(new, NSEC_PER_USEC),
> > -                             div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     /*
> > +                      * Grow period by a factor of 2 to avoid losing precision.
> > +                      * Precision loss in the quota/period ratio can cause __cfs_schedulable
> > +                      * to fail.
> > +                      */
> > +                     new = old * 2;
> > +                     if (new < max_cfs_quota_period) {
> > +                             cfs_b->period = ns_to_ktime(new);
> > +                             cfs_b->quota *= 2;
> > +
> > +                             pr_warn_ratelimited(
> > +     "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +                                     smp_processor_id(),
> > +                                     div_u64(new, NSEC_PER_USEC),
> > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     } else {
> > +                             pr_warn_ratelimited(
> > +     "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> > +                                     smp_processor_id(),
> > +                                     div_u64(old, NSEC_PER_USEC),
> > +                                     div_u64(cfs_b->quota, NSEC_PER_USEC));
> > +                     }
> >
> >                       /* reset count so we don't come right back in here */
> >                       count = 0;
> > --
> > 2.23.0.581.g78d2f28ef7-goog
> >
>
> --

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision
  2019-10-07 15:14 ` Phil Auld
  2019-10-07 23:29   ` Xuewei Zhang
@ 2019-10-08 11:26   ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-10-08 11:26 UTC (permalink / raw)
  To: Phil Auld
  Cc: Xuewei Zhang, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Anton Blanchard, Linus Torvalds, Thomas Gleixner, linux-kernel,
	stable, trivial

On Mon, Oct 07, 2019 at 11:14:25AM -0400, Phil Auld wrote:
> On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > quota/period ratio is used to ensure a child task group won't get more
> > bandwidth than the parent task group, and is calculated as:
> > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > 
> > If the quota/period ratio was changed during this scaling due to
> > precision loss, it will cause inconsistency between parent and child
> > task groups. See below example:
> > A userspace container manager (kubelet) does three operations:
> > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > 2) Create a few children cgroups.
> > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > 
> > These operations are expected to succeed. However, if the scaling of
> > 147/128 happens before step 3), quota and period of the parent cgroup
> > will be changed:
> > new_quota: 1148437ns, 1148us
> > new_period: 11484375ns, 11484us
> > 
> > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > which will be larger than the parent cgroup ratio (104821), and will
> > fail.
> > 
> > Scaling them by a factor of 2 will fix the problem.
> > 
> > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
> > Signed-off-by: Xuewei Zhang <xueweiz@google.com>
> 
> 
> I managed to get it to trigger the second case. It took 50,000 children (20x my initial tests).
> 
> [ 1367.850630] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 4340, cfs_quota_us = 250000)
> [ 1370.390832] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 8680, cfs_quota_us = 500000)
> [ 1372.914689] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 17360, cfs_quota_us = 1000000)
> [ 1375.447431] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 34720, cfs_quota_us = 2000000)
> [ 1377.982785] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 69440, cfs_quota_us = 4000000)
> [ 1380.481702] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 138880, cfs_quota_us = 8000000)
> [ 1382.894692] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 277760, cfs_quota_us = 16000000)
> [ 1385.264872] cfs_period_timer[cpu11]: period too short, scaling up (new cfs_period_us = 555520, cfs_quota_us = 32000000)
> [ 1393.965140] cfs_period_timer[cpu11]: period too short, but cannot scale up without losing precision (cfs_period_us = 555520, cfs_quota_us = 32000000)
> 
> I suspect going higher could cause the original lockup, but that'd be the case with the old code as well. 
> And this also gets us out of it faster.
> 
> 
> Tested-by: Phil Auld <pauld@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 12+ messages in thread
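
The progression in Phil's log can also be checked with a quick userspace simulation of the new doubling logic. The sketch below starts from the first period value shown in the log (4,340us with 250,000us of quota) and assumes max_cfs_quota_period is 1 second, as in the kernel; it illustrates the control flow only and is not kernel code.

    #include <stdio.h>

    #define NSEC_PER_USEC 1000ULL
    #define NSEC_PER_SEC  1000000000ULL

    int main(void)
    {
            /* starting values taken from the first line of the log above */
            unsigned long long period = 4340 * NSEC_PER_USEC;    /* ns */
            unsigned long long quota  = 250000 * NSEC_PER_USEC;  /* ns */
            unsigned long long max_cfs_quota_period = NSEC_PER_SEC; /* 1s cap */

            for (;;) {
                    unsigned long long new = period * 2;

                    if (new >= max_cfs_quota_period) {
                            printf("cannot scale past %llu us without losing precision\n",
                                   period / NSEC_PER_USEC);
                            break;
                    }
                    period = new;
                    quota *= 2;
                    printf("scaled up: period %llu us, quota %llu us\n",
                           period / NSEC_PER_USEC, quota / NSEC_PER_USEC);
            }
            return 0;
    }

The simulation stops at 555,520us because the next doubling (1,111,040us) would exceed the 1-second cap, matching the final "cannot scale up" line in the log.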

* [tip: sched/urgent] sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
  2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
                   ` (2 preceding siblings ...)
  2019-10-07 15:14 ` Phil Auld
@ 2019-10-09 12:59 ` tip-bot2 for Xuewei Zhang
  2019-10-09 12:59 ` [tip: sched/core] " tip-bot2 for Xuewei Zhang
  4 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Xuewei Zhang @ 2019-10-09 12:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Phil Auld, Xuewei Zhang, Peter Zijlstra (Intel),
	Anton Blanchard, Ben Segall, Dietmar Eggemann, Juri Lelli,
	Linus Torvalds, Mel Gorman, Steven Rostedt, Thomas Gleixner,
	Vincent Guittot, Ingo Molnar, Borislav Petkov, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     4929a4e6faa0f13289a67cae98139e727f0d4a97
Gitweb:        https://git.kernel.org/tip/4929a4e6faa0f13289a67cae98139e727f0d4a97
Author:        Xuewei Zhang <xueweiz@google.com>
AuthorDate:    Thu, 03 Oct 2019 17:12:43 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 09 Oct 2019 12:38:02 +02:00

sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e..682a754 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 		if (++count > 3) {
 			u64 new, old = ktime_to_ns(cfs_b->period);
 
-			new = (old * 147) / 128; /* ~115% */
-			new = min(new, max_cfs_quota_period);
-
-			cfs_b->period = ns_to_ktime(new);
-
-			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
-			cfs_b->quota *= new;
-			cfs_b->quota = div64_u64(cfs_b->quota, old);
-
-			pr_warn_ratelimited(
-	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
-				smp_processor_id(),
-				div_u64(new, NSEC_PER_USEC),
-				div_u64(cfs_b->quota, NSEC_PER_USEC));
+			/*
+			 * Grow period by a factor of 2 to avoid losing precision.
+			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
+			 * to fail.
+			 */
+			new = old * 2;
+			if (new < max_cfs_quota_period) {
+				cfs_b->period = ns_to_ktime(new);
+				cfs_b->quota *= 2;
+
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(new, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			} else {
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(old, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			}
 
 			/* reset count so we don't come right back in here */
 			count = 0;

^ permalink raw reply related	[flat|nested] 12+ messages in thread
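
For completeness, the user-visible failure from step 3) of the changelog can be reproduced from userspace. The sketch below assumes the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu and that the parent and child groups from steps 1) and 2) already exist, so the paths are illustrative rather than fixed.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <errno.h>

    /* write a decimal string to a cgroup control file, return 0 or -errno */
    static int write_val(const char *path, const char *val)
    {
            int fd = open(path, O_WRONLY);
            ssize_t ret;

            if (fd < 0)
                    return -errno;
            ret = write(fd, val, strlen(val));
            close(fd);
            return ret < 0 ? -errno : 0;
    }

    int main(void)
    {
            /* step 3): set quota to 1,000us and period to 10,000us on the child */
            int err;

            err = write_val("/sys/fs/cgroup/cpu/parent/child/cpu.cfs_period_us", "10000");
            if (!err)
                    err = write_val("/sys/fs/cgroup/cpu/parent/child/cpu.cfs_quota_us", "1000");
            if (err)
                    fprintf(stderr, "setting child bandwidth failed: %s\n", strerror(-err));
            return err ? 1 : 0;
    }

On an affected kernel, once the parent's period and quota have been rescaled by 147/128, the quota write on the child fails with EINVAL because the child's normalized ratio now exceeds the parent's.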

* [tip: sched/core] sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
  2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
                   ` (3 preceding siblings ...)
  2019-10-09 12:59 ` [tip: sched/urgent] sched/fair: Scale bandwidth " tip-bot2 for Xuewei Zhang
@ 2019-10-09 12:59 ` tip-bot2 for Xuewei Zhang
  4 siblings, 0 replies; 12+ messages in thread
From: tip-bot2 for Xuewei Zhang @ 2019-10-09 12:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Phil Auld, Xuewei Zhang, Peter Zijlstra (Intel),
	Anton Blanchard, Ben Segall, Dietmar Eggemann, Juri Lelli,
	Linus Torvalds, Mel Gorman, Steven Rostedt, Thomas Gleixner,
	Vincent Guittot, Ingo Molnar, Borislav Petkov, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4929a4e6faa0f13289a67cae98139e727f0d4a97
Gitweb:        https://git.kernel.org/tip/4929a4e6faa0f13289a67cae98139e727f0d4a97
Author:        Xuewei Zhang <xueweiz@google.com>
AuthorDate:    Thu, 03 Oct 2019 17:12:43 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 09 Oct 2019 12:38:02 +02:00

sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 83ab35e..682a754 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4926,20 +4926,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 		if (++count > 3) {
 			u64 new, old = ktime_to_ns(cfs_b->period);
 
-			new = (old * 147) / 128; /* ~115% */
-			new = min(new, max_cfs_quota_period);
-
-			cfs_b->period = ns_to_ktime(new);
-
-			/* since max is 1s, this is limited to 1e9^2, which fits in u64 */
-			cfs_b->quota *= new;
-			cfs_b->quota = div64_u64(cfs_b->quota, old);
-
-			pr_warn_ratelimited(
-	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
-				smp_processor_id(),
-				div_u64(new, NSEC_PER_USEC),
-				div_u64(cfs_b->quota, NSEC_PER_USEC));
+			/*
+			 * Grow period by a factor of 2 to avoid losing precision.
+			 * Precision loss in the quota/period ratio can cause __cfs_schedulable
+			 * to fail.
+			 */
+			new = old * 2;
+			if (new < max_cfs_quota_period) {
+				cfs_b->period = ns_to_ktime(new);
+				cfs_b->quota *= 2;
+
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(new, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			} else {
+				pr_warn_ratelimited(
+	"cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
+					smp_processor_id(),
+					div_u64(old, NSEC_PER_USEC),
+					div_u64(cfs_b->quota, NSEC_PER_USEC));
+			}
 
 			/* reset count so we don't come right back in here */
 			count = 0;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-10-09 13:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-04  0:12 [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision Xuewei Zhang
2019-10-04  0:54 ` Phil Auld
2019-10-04  2:05   ` Xuewei Zhang
2019-10-04 13:14     ` Phil Auld
2019-10-05  0:28       ` Xuewei Zhang
2019-10-07 13:02         ` Phil Auld
2019-10-04  6:11 ` Greg KH
2019-10-07 15:14 ` Phil Auld
2019-10-07 23:29   ` Xuewei Zhang
2019-10-08 11:26   ` Peter Zijlstra
2019-10-09 12:59 ` [tip: sched/urgent] sched/fair: Scale bandwidth " tip-bot2 for Xuewei Zhang
2019-10-09 12:59 ` [tip: sched/core] " tip-bot2 for Xuewei Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).