* [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices @ 2019-05-17 19:30 Dave Chiluk 2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk ` (5 more replies) 0 siblings, 6 replies; 40+ messages in thread From: Dave Chiluk @ 2019-05-17 19:30 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in Kubernetes or Mesos on multiple cpu cores. This has been root-caused to threads being allocated per-cpu bandwidth slices and then not fully using those slices within the period, at which point the slice and quota expire. This expiration of unused slices results in applications not being able to utilize the quota they were allocated. The expiration of per-cpu slices was recently fixed by commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition"). Prior to that, it appears this had been broken since at least commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That commit added the following conditional, which resulted in slices never being expired.
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, has recently been fixed, and is now being noticed by many users running Kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows only per-cpu slices to live longer than the period boundary: threads on runqueues that do not use much CPU can continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining per-cpu slice that was left unused in the previous period. For CPU-bound tasks this will change nothing, as they should theoretically fully utilize all of their quota and slices in each period. For user-interactive tasks as described above, this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This greatly improves performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase, an almost 30x performance improvement has been observed, while still maintaining correct cpu quota restrictions, albeit over longer time intervals than cpu.cfs_period_us. That testcase is available at https://github.com/indeedeng/fibtest.
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- Documentation/scheduler/sched-bwc.txt | 29 +++++++++++--- kernel/sched/fair.c | 71 +++-------------------------------- kernel/sched/sched.h | 4 -- 3 files changed, 29 insertions(+), 75 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index f6b1873..4ded8ae 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. When the CPU bandwidth consumption of a group +exceeds this limit (for that period), the tasks belonging to its hierarchy will +be throttled and are not allowed to run again until the next period. A group's unused runtime is globally tracked, being refreshed with quota units above at each period boundary. As threads consume this bandwidth it is transferred to cpu-local "silos" on a demand basis. The amount transferred -within each of these updates is tunable and described as the "slice". +within each of these updates is tunable and described as the "slice". Slices +that are allocated to cpu-local silos do not expire at the end of the period, +but unallocated quota does. 
This doesn't affect cpu-bound applications as they +by definition consume all of their bandwidth in each period. + +However, for highly-threaded, user-interactive/non-cpu-bound applications, this +non-expiration nuance allows applications to burst past their quota limits by +the amount of unused slice per cpu that the task group is running on. +This slight burst requires that quota went unused in previous periods. +Additionally, this burst amount is limited to the size of a slice for every cpu +a task group is run on. As a result, this mechanism still strictly limits the +task group to its quota as average usage over a longer time window. This provides +a better, more predictable user experience for highly-threaded applications with +small quota limits on high-core-count machines. It also eliminates the +propensity to throttle these applications while simultaneously using less than +quota amounts of cpu. Another way to say this is that by allowing the unused +portion of a slice to be used in following periods, we have decreased the +possibility of wasting unused quota on cpu-local silos that don't need much cpu +time.
Management ---------- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..a675c69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to account for allowed - * spread between our sched_clock and the one on which runtime was - * issued. - */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. 
- */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. - */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ if (cfs_rq->runtime_remaining > 0) @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no 
need to continue the timer with no bandwidth constraint */ @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. This can potentially race with an unthrottled group @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); 
raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..0c0ed23 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,8 +341,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; short idle; short period_active; @@ -562,8 +560,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk @ 2019-05-23 18:44 ` Dave Chiluk 2019-05-23 18:44 ` [PATCH v2 1/1] " Dave Chiluk 2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk ` (4 subsequent siblings) 5 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-05-23 18:44 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Changelog v2 - Fixed some checkpatch errors in the commit message. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk @ 2019-05-23 18:44 ` Dave Chiluk 2019-05-23 21:01 ` Peter Oskolkov 2019-05-24 8:55 ` Peter Zijlstra 0 siblings, 2 replies; 40+ messages in thread From: Dave Chiluk @ 2019-05-23 18:44 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in Kubernetes or Mesos on multiple cpu cores. This has been root-caused to threads being allocated per-cpu bandwidth slices and then not fully using those slices within the period, at which point the slice and quota expire. This expiration of unused slices results in applications not being able to utilize the quota they were allocated. The expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that, it appears this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That added the following conditional, which resulted in slices never being expired.
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, has recently been fixed, and is now being noticed by many users running Kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows only per-cpu slices to live longer than the period boundary: threads on runqueues that do not use much CPU can continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining per-cpu slice that was left unused in the previous period. For CPU-bound tasks this will change nothing, as they should theoretically fully utilize all of their quota and slices in each period. For user-interactive tasks as described above, this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This greatly improves performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase, an almost 30x performance improvement has been observed, while still maintaining correct cpu quota restrictions, albeit over longer time intervals than cpu.cfs_period_us. That testcase is available at https://github.com/indeedeng/fibtest.
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- Documentation/scheduler/sched-bwc.txt | 29 +++++++++++--- kernel/sched/fair.c | 71 +++-------------------------------- kernel/sched/sched.h | 4 -- 3 files changed, 29 insertions(+), 75 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index f6b1873..4ded8ae 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. When the CPU bandwidth consumption of a group +exceeds this limit (for that period), the tasks belonging to its hierarchy will +be throttled and are not allowed to run again until the next period. A group's unused runtime is globally tracked, being refreshed with quota units above at each period boundary. As threads consume this bandwidth it is transferred to cpu-local "silos" on a demand basis. The amount transferred -within each of these updates is tunable and described as the "slice". +within each of these updates is tunable and described as the "slice". Slices +that are allocated to cpu-local silos do not expire at the end of the period, +but unallocated quota does. 
This doesn't affect cpu-bound applications as they +by definition consume all of their bandwidth in each period. + +However, for highly-threaded, user-interactive/non-cpu-bound applications, this +non-expiration nuance allows applications to burst past their quota limits by +the amount of unused slice per cpu that the task group is running on. +This slight burst requires that quota went unused in previous periods. +Additionally, this burst amount is limited to the size of a slice for every cpu +a task group is run on. As a result, this mechanism still strictly limits the +task group to its quota as average usage over a longer time window. This provides +a better, more predictable user experience for highly-threaded applications with +small quota limits on high-core-count machines. It also eliminates the +propensity to throttle these applications while simultaneously using less than +quota amounts of cpu. Another way to say this is that by allowing the unused +portion of a slice to be used in following periods, we have decreased the +possibility of wasting unused quota on cpu-local silos that don't need much cpu +time.
Management ---------- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..a675c69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to account for allowed - * spread between our sched_clock and the one on which runtime was - * issued. - */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. 
- */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. - */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ if (cfs_rq->runtime_remaining > 0) @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no 
need to continue the timer with no bandwidth constraint */ @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. This can potentially race with an unthrottled group @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); 
raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..0c0ed23 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,8 +341,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; short idle; short period_active; @@ -562,8 +560,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-23 18:44 ` [PATCH v2 1/1] " Dave Chiluk @ 2019-05-23 21:01 ` Peter Oskolkov 2019-05-24 14:32 ` Phil Auld 2019-05-24 8:55 ` Peter Zijlstra 1 sibling, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2019-05-23 21:01 UTC (permalink / raw) To: Dave Chiluk Cc: Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <chiluk+linux@indeed.com> wrote: > > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. 
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows only per-cpu slices to live longer than the period boundary. > This allows threads on runqueues that do not use much CPU to continue to > use their remaining slice over a longer period of time than > cpu.cfs_period_us. However, this helps prevents the above condition of > hitting throttling while also not fully utilizing your cpu quota. > > This theoretically allows a machine to use slightly more than it's > allotted quota in some periods. This overflow would be bounded by the > remaining per-cpu slice that was left un-used in the previous period. > For CPU bound tasks this will change nothing, as they should > theoretically fully utilize all of their quota and slices in each > period. For user-interactive tasks as described above this provides a > much better user/application experience as their cpu utilization will > more closely match the amount they requested when they hit throttling. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase, this performance > discrepancy has been observed to be almost 30x performance improvement, > while still maintaining correct cpu quota restrictions albeit over > longer time intervals than cpu.cfs_period_us. If the machine runs at/close to capacity, won't the overallocation of the quota to bursty tasks necessarily negatively impact every other task? Should the "unused" quota be available only on idle CPUs? 
(Or maybe this is the behavior achieved here, and only the comment and the commit message should be fixed...) > That testcase is > available at https://github.com/indeedeng/fibtest. > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > --- > Documentation/scheduler/sched-bwc.txt | 29 +++++++++++--- > kernel/sched/fair.c | 71 +++-------------------------------- > kernel/sched/sched.h | 4 -- > 3 files changed, 29 insertions(+), 75 deletions(-) > > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt > index f6b1873..4ded8ae 100644 > --- a/Documentation/scheduler/sched-bwc.txt > +++ b/Documentation/scheduler/sched-bwc.txt > @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the > specification of the maximum CPU bandwidth available to a group or hierarchy. > > The bandwidth allowed for a group is specified using a quota and period. Within > -each given "period" (microseconds), a group is allowed to consume only up to > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a > -group exceeds this limit (for that period), the tasks belonging to its > -hierarchy will be throttled and are not allowed to run again until the next > -period. > +each given "period" (microseconds), a task group is allocated up to "quota" > +microseconds of CPU time. When the CPU bandwidth consumption of a group > +exceeds this limit (for that period), the tasks belonging to its hierarchy will > +be throttled and are not allowed to run again until the next period. > > A group's unused runtime is globally tracked, being refreshed with quota units > above at each period boundary. As threads consume this bandwidth it is > transferred to cpu-local "silos" on a demand basis. The amount transferred > -within each of these updates is tunable and described as the "slice". 
> +within each of these updates is tunable and described as the "slice". Slices > +that are allocated to cpu-local silos do not expire at the end of the period, > +but unallocated quota does. This doesn't affect cpu-bound applications as they > +by definition consume all of their bandwidth in each each period. > + > +However for highly-threaded user-interactive/non-cpu bound applications this > +non-expiration nuance allows applications to burst past their quota limits > +equal to the amount of unused slice per cpu that the task group is running on. > +This slight burst requires that quota had gone unused in previous periods. > +Additionally this burst amount is limited to the size of a slice for every cpu > +a task group is run on. As a result, this mechanism still strictly limits the > +task group to quota average usage over a longer time windows. This provides > +better more predictable user experience for highly threaded applications with > +small quota limits on high core count machines. It also eliminates the > +propensity to throttle these applications while simultanously using less than > +quota amounts of cpu. Another way to say this, is that by allowing the unused > +portion of a slice to be used in following periods we have decreased the > +possibility of wasting unused quota on cpu-local silos that don't need much cpu > +time. 
> > Management > ---------- > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f35930f..a675c69 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) > > now = sched_clock_cpu(smp_processor_id()); > cfs_b->runtime = cfs_b->quota; > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); > - cfs_b->expires_seq++; > } > > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) > @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > { > struct task_group *tg = cfs_rq->tg; > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); > - u64 amount = 0, min_amount, expires; > - int expires_seq; > + u64 amount = 0, min_amount; > > /* note: this is a positive sum as runtime_remaining <= 0 */ > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; > @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > cfs_b->idle = 0; > } > } > - expires_seq = cfs_b->expires_seq; > - expires = cfs_b->runtime_expires; > raw_spin_unlock(&cfs_b->lock); > > cfs_rq->runtime_remaining += amount; > - /* > - * we may have advanced our local expiration to account for allowed > - * spread between our sched_clock and the one on which runtime was > - * issued. > - */ > - if (cfs_rq->expires_seq != expires_seq) { > - cfs_rq->expires_seq = expires_seq; > - cfs_rq->runtime_expires = expires; > - } > > return cfs_rq->runtime_remaining > 0; > } > > -/* > - * Note: This depends on the synchronization provided by sched_clock and the > - * fact that rq->clock snapshots this value. 
> - */ > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > -{ > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > - > - /* if the deadline is ahead of our clock, nothing to do */ > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > - return; > - > - if (cfs_rq->runtime_remaining < 0) > - return; > - > - /* > - * If the local deadline has passed we have to consider the > - * possibility that our sched_clock is 'fast' and the global deadline > - * has not truly expired. > - * > - * Fortunately we can check determine whether this the case by checking > - * whether the global deadline(cfs_b->expires_seq) has advanced. > - */ > - if (cfs_rq->expires_seq == cfs_b->expires_seq) { > - /* extend local deadline, drift is bounded above by 2 ticks */ > - cfs_rq->runtime_expires += TICK_NSEC; > - } else { > - /* global deadline is ahead, expiration has passed */ > - cfs_rq->runtime_remaining = 0; > - } > -} > - > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > { > /* dock delta_exec before expiring quota (as it could span periods) */ > cfs_rq->runtime_remaining -= delta_exec; > - expire_cfs_rq_runtime(cfs_rq); > > if (likely(cfs_rq->runtime_remaining > 0)) > return; > @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > resched_curr(rq); > } > > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > - u64 remaining, u64 expires) > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) > { > struct cfs_rq *cfs_rq; > u64 runtime; > @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > remaining -= runtime; > > cfs_rq->runtime_remaining += runtime; > - cfs_rq->runtime_expires = expires; > > /* we check whether we're throttled above */ > if (cfs_rq->runtime_remaining > 0) > @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > */ > static int do_sched_cfs_period_timer(struct 
cfs_bandwidth *cfs_b, int overrun, unsigned long flags) > { > - u64 runtime, runtime_expires; > + u64 runtime; > int throttled; > > /* no need to continue the timer with no bandwidth constraint */ > @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > /* account preceding periods in which throttling occurred */ > cfs_b->nr_throttled += overrun; > > - runtime_expires = cfs_b->runtime_expires; > - > /* > * This check is repeated as we are holding onto the new bandwidth while > * we unthrottle. This can potentially race with an unthrottled group > @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > cfs_b->distribute_running = 1; > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > /* we can't nest cfs_b->lock while distributing bandwidth */ > - runtime = distribute_cfs_runtime(cfs_b, runtime, > - runtime_expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > cfs_b->distribute_running = 0; > @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > return; > > raw_spin_lock(&cfs_b->lock); > - if (cfs_b->quota != RUNTIME_INF && > - cfs_rq->runtime_expires == cfs_b->runtime_expires) { > + if (cfs_b->quota != RUNTIME_INF) { > cfs_b->runtime += slack_runtime; > > /* we are under rq->lock, defer unthrottling using a timer */ > @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > { > u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > unsigned long flags; > - u64 expires; > > /* confirm we're still not at a refresh boundary */ > raw_spin_lock_irqsave(&cfs_b->lock, flags); > @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) > runtime = cfs_b->runtime; > > - expires = cfs_b->runtime_expires; > if (runtime) > cfs_b->distribute_running = 1; > > @@ -4809,11 
+4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (!runtime) > return; > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > - if (expires == cfs_b->runtime_expires) > - lsub_positive(&cfs_b->runtime, runtime); > cfs_b->distribute_running = 0; > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > } > @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) > > cfs_b->period_active = 1; > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); > - cfs_b->expires_seq++; > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index b52ed1a..0c0ed23 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -341,8 +341,6 @@ struct cfs_bandwidth { > u64 quota; > u64 runtime; > s64 hierarchical_quota; > - u64 runtime_expires; > - int expires_seq; > > short idle; > short period_active; > @@ -562,8 +560,6 @@ struct cfs_rq { > > #ifdef CONFIG_CFS_BANDWIDTH > int runtime_enabled; > - int expires_seq; > - u64 runtime_expires; > s64 runtime_remaining; > > u64 throttled_clock; > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 40+ messages in thread
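[Editor's note: the effect of the conditional discussed in the patch description can be illustrated with a small simulation. This is illustrative Python, not kernel code; the function names, arguments, and tick constant are simplified stand-ins for the cfs_rq/cfs_b fields, and the real kernel paths also check the runqueue clock and remaining runtime before expiring.]

```python
TICK_NSEC = 1_000_000  # stand-in for the kernel tick length in nanoseconds

def expire_51f2176d74ac(rq_expires, b_expires, remaining):
    """Condition as introduced by 51f2176d74ac: when a new period begins,
    the global deadline (b_expires) moves ahead of the cpu-local copy, so
    the two values always differ, the 'extend' branch is always taken, and
    cpu-local runtime is never zeroed."""
    if rq_expires != b_expires:
        # "extend local deadline, drift is bounded above by 2 ticks"
        return rq_expires + TICK_NSEC, remaining
    return rq_expires, 0

def expire_512ac999d275(rq_seq, b_seq, rq_expires, remaining):
    """Condition after 512ac999d275: a sequence count distinguishes bounded
    sched_clock drift (same period) from a genuinely new period."""
    if rq_seq == b_seq:
        return rq_expires + TICK_NSEC, remaining  # drift: extend deadline
    return rq_expires, 0                          # new period: expire
```

Under the pre-512ac999d275 condition, a slice left over at a period boundary survives indefinitely — which is exactly the behavior this patch now makes intentional rather than accidental.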
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-23 21:01 ` Peter Oskolkov
@ 2019-05-24 14:32 ` Phil Auld
  2019-05-24 15:14 ` Dave Chiluk
  0 siblings, 1 reply; 40+ messages in thread
From: Phil Auld @ 2019-05-24 14:32 UTC (permalink / raw)
To: Peter Oskolkov
Cc: Dave Chiluk, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc

On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> On Thu, May 23, 2019 at 11:44 AM Dave Chiluk <chiluk+linux@indeed.com> wrote:
> >
> > [...]
>
> If the machine runs at/close to capacity, won't the overallocation
> of the quota to bursty tasks necessarily negatively impact every other
> task?
> Should the "unused" quota be available only on idle CPUs?
> (Or maybe this is the behavior achieved here, and only the comment and
> the commit message should be fixed...)
>

It's bounded by the amount left unused from the previous period. So
theoretically a process could use almost twice its quota. But then it
would have nothing left over in the next period. To repeat, it would
have to not use any quota in that next period. Over a longer number of
periods it's the same amount of CPU usage.

I think that is more fair than throttling a process that has never
used its full quota.

And it removes complexity.

Cheers,
Phil
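[Editor's note: the bounded-burst argument above — at most the previous period's leftover, so a worst case of almost twice the quota in one period, but a long-run average still equal to the quota — can be checked with a toy model. This is illustrative Python, not kernel code; a single "bank" capped at one period's quota stands in for the cpu-local silos.]

```python
def run_periods(quota, demand):
    """Toy model of the patched behavior: runtime left unused in one
    period is banked and may be spent later, but the bank never holds
    more than one period's worth of quota."""
    bank, usage = 0, []
    for want in demand:
        avail = quota + bank            # fresh quota plus banked leftover
        used = min(want, avail)
        bank = min(avail - used, quota) # leftover carries over, capped
        usage.append(used)
    return usage

# an idle period followed by heavy demand: the burst reaches 2x quota,
# but the bank is then empty, so the next period is capped at quota again
usage = run_periods(quota=100, demand=[0, 1000, 1000])
```

The burst has to be "paid for" by an earlier underutilized period, so over any longer window the group still averages exactly its quota.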
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-05-24 14:32 ` Phil Auld
@ 2019-05-24 15:14 ` Dave Chiluk
  2019-05-24 15:59 ` Phil Auld
  2019-05-24 16:28 ` Peter Oskolkov
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chiluk @ 2019-05-24 15:14 UTC (permalink / raw)
To: Phil Auld
Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall

On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote:
> On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> > If the machine runs at/close to capacity, won't the overallocation
> > of the quota to bursty tasks necessarily negatively impact every other
> > task? Should the "unused" quota be available only on idle CPUs?
>
> It's bounded by the amount left unused from the previous period. So
> theoretically a process could use almost twice its quota. [...]

Actually it's not even that bad. The overallocation of quota to a
bursty task in a period is limited to at most one slice per cpu, and
that slice must not have been used in the previous periods. The slice
size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
defaults to 5ms. If a bursty task goes from underutilizing quota to
using its entire quota, it will not be able to burst in the
subsequent periods.
Therefore in an absolute worst case contrived scenario, a bursty task
can add at most 5ms to the latency of other threads on the same CPU. I
think this worst case 5ms tradeoff is entirely worth it.

This does mean that, theoretically, a poorly written massively threaded
application on an 80 core box, that spreads itself onto 80 cpu run
queues, can overutilize its quota in a period by at most 5ms * 80 CPUs
in a single period (slice * number of runqueues the application is
running on). But that means that each of those threads would have had
to not use their quota in a previous period, and it also means that
the application would have to be carefully written to exacerbate this
behavior.

Additionally, if cpu bound threads underutilize a slice of their quota
in a period due to CFS choosing a bursty task to run, they should
theoretically be able to make it up in the following periods when the
bursty task is unable to "burst".

Please be careful here: quota and slice are being treated differently.
Quota does not roll over between periods, only slices of quota that
have already been allocated to per cpu run queues. If you allocate
100ms of quota per period to an application, but it only spreads onto
3 cpu run queues, that means it can in the worst case use 3 x slice
size = 15ms in periods following underutilization.

So why does this matter? Well, applications that use thread pools
*(*cough* java *cough*) with lots of tiny little worker threads tend
to spread themselves out onto a lot of run queues. These worker
threads grab quota slices in order to run, then rarely use all of
their slice (1 or 2ms out of the 5ms). This results in those worker
threads starving the main application of quota, and then expiring the
remainder of that quota slice on the per-cpu run queue.
By allowing these per-cpu run queues to use their remaining slice in
following periods, these worker threads do not need to be allocated
additional slice, and thereby the main threads are actually able to
use the allocated cpu quota.

Going back to my earlier 100ms quota / 80 cpu example: that means only
100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
period, so only 20 out of these 80 cpus ever get a slice allocated to
them.

This can be experienced by running fibtest, available at
https://github.com/indeedeng/fibtest/.
$ runfibtest 1
runs a single fast thread taskset to cpu 0.
$ runfibtest 8
Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
to cpus 1-7. This run is expected to show fewer iterations, but the
worse problem is that the cpu usage is far less than the 500ms that it
should have received.

Thanks for the engagement on this,
Dave Chiluk
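[Editor's note: the arithmetic in the reply above can be written out directly. Illustrative Python, not kernel code; 5 ms is the stated default of sched_cfs_bandwidth_slice_us, and "runqueues" here means the cpu run queues a task group's threads are spread across.]

```python
SLICE_US = 5_000  # default /proc/sys/kernel/sched_cfs_bandwidth_slice_us

def worst_case_burst_us(nr_runqueues, slice_us=SLICE_US):
    """At most one previously unused slice per cpu runqueue the task
    group is spread across can carry into the next period."""
    return nr_runqueues * slice_us

def slices_per_period(quota_us, slice_us=SLICE_US):
    """How many cpu-local silos one period's quota can feed at all."""
    return quota_us // slice_us

# 80-cpu worst case from the discussion: 5ms * 80 runqueues = 400ms burst
# 3 runqueues after underutilization:   3 * 5ms = 15ms
# 100ms of quota per period only ever reaches 20 of the 80 cpus
```

This also makes the thread-pool starvation visible: with 100 ms of quota and a 5 ms slice, only 20 silos can be funded per period, so on an 80-cpu box most worker-thread runqueues never receive a slice at all.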
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-24 15:14 ` Dave Chiluk @ 2019-05-24 15:59 ` Phil Auld 2019-05-24 16:28 ` Peter Oskolkov 1 sibling, 0 replies; 40+ messages in thread From: Phil Auld @ 2019-05-24 15:59 UTC (permalink / raw) To: Dave Chiluk Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall On Fri, May 24, 2019 at 10:14:36AM -0500 Dave Chiluk wrote: > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote: > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote: > > > > If the machine runs at/close to capacity, won't the overallocation > > > of the quota to bursty tasks necessarily negatively impact every other > > > task? Should the "unused" quota be available only on idle CPUs? > > > (Or maybe this is the behavior achieved here, and only the comment and > > > the commit message should be fixed...) > > > > > > > It's bounded by the amount left unused from the previous period. So > > theoretically a process could use almost twice its quota. But then it > > would have nothing left over in the next period. To repeat it would have > > to not use any that next period. Over a longer number of periods it's the > > same amount of CPU usage. > > > > I think that is more fair than throttling a process that has never used > > its full quota. > > > > And it removes complexity. > > > > Cheers, > > Phil > > Actually it's not even that bad. The overallocation of quota to a > bursty task in a period is limited to at most one slice per cpu, and > that slice must not have been used in the previous periods. The slice > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and > defaults to 5ms. 
If a bursty task goes from underutilizing quota to > using it's entire quota, it will not be able to burst in the > subsequent periods. Therefore in an absolute worst case contrived > scenario, a bursty task can add at most 5ms to the latency of other > threads on the same CPU. I think this worst case 5ms tradeoff is > entirely worth it. > > This does mean that a theoretically a poorly written massively > threaded application on an 80 core box, that spreads itself onto 80 > cpu run queues, can overutilize it's quota in a period by at most 5ms > * 80 CPUs in a sincle period (slice * number of runqueues the > application is running on). But that means that each of those threads > would have had to not be use their quota in a previous period, and it > also means that the application would have to be carefully written to > exacerbate this behavior. > > Additionally if cpu bound threads underutilize a slice of their quota > in a period due to the cfs choosing a bursty task to run, they should > theoretically be able to make it up in the following periods when the > bursty task is unable to "burst". > > Please be careful here quota and slice are being treated differently. > Quota does not roll-over between periods, only slices of quota that > has already been allocated to per cpu run queues. If you allocate > 100ms of quota per period to an application, but it only spreads onto > 3 cpu run queues that means it can in the worst case use 3 x slice > size = 15ms in periods following underutilization. > > So why does this matter. Well applications that use thread pools > *(*cough* java *cough*) with lots of tiny little worker threads, tend > to spread themselves out onto a lot of run queues. These worker > threads grab quota slices in order to run, then rarely use all of > their slice (1 or 2ms out of the 5ms). This results in those worker > threads starving the main application of quota, and then expiring the > remainder of that quota slice on the per-cpu. 
Going back to my > earlier 100ms quota / 80 cpu example. That means only > 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a > period. So only 20 out of these 80 cpus ever get a slice allocated to > them. By allowing these per-cpu run queues to use their remaining > slice in following periods, these worker threads do not need to be > allocated additional slices, and thereby the main threads are actually > able to use the allocated cpu quota. > > This can be experienced by running fibtest, available at > https://github.com/indeedeng/fibtest/. > $ runfibtest 1 > runs a single fast thread taskset to cpu 0 > $ runfibtest 8 > Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset > to cpus 1-7. This run is expected to show fewer iterations, but the > bigger problem is that the cpu usage is far less than the 500ms that it > should have received. > > Thanks for the engagement on this, > Dave Chiluk Thanks for the clarification. This is an even better explanation. Fwiw, I ran some of my cfs throttling tests with this, none of which are designed to measure or hit this particular issue. They are more focused on starvation and hard lockups that I've hit. But I did not see any issues or oddities with this patch. Cheers, Phil -- ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-24 15:14 ` Dave Chiluk 2019-05-24 15:59 ` Phil Auld @ 2019-05-24 16:28 ` Peter Oskolkov 2019-05-24 21:35 ` Dave Chiluk 1 sibling, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2019-05-24 16:28 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote: > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote: > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote: > > > > If the machine runs at/close to capacity, won't the overallocation > > > of the quota to bursty tasks necessarily negatively impact every other > > > task? Should the "unused" quota be available only on idle CPUs? > > > (Or maybe this is the behavior achieved here, and only the comment and > > > the commit message should be fixed...) > > > > > > > It's bounded by the amount left unused from the previous period. So > > theoretically a process could use almost twice its quota. But then it > > would have nothing left over in the next period. To repeat it would have > > to not use any that next period. Over a longer number of periods it's the > > same amount of CPU usage. > > > > I think that is more fair than throttling a process that has never used > > its full quota. > > > > And it removes complexity. > > > > Cheers, > > Phil > > Actually it's not even that bad. The overallocation of quota to a > bursty task in a period is limited to at most one slice per cpu, and > that slice must not have been used in the previous periods. The slice > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and > defaults to 5ms. 
If a bursty task goes from underutilizing quota to > using its entire quota, it will not be able to burst in the > subsequent periods. Therefore, in an absolute worst-case contrived > scenario, a bursty task can add at most 5ms to the latency of other > threads on the same CPU. I think this worst-case 5ms tradeoff is > entirely worth it. > > This does mean that, theoretically, a poorly written massively > threaded application on an 80-core box that spreads itself onto 80 > cpu run queues can overutilize its quota in a period by at most 5ms > * 80 CPUs in a single period (slice * number of runqueues the > application is running on). But that means that each of those threads > would have had to not use their quota in a previous period, and it > also means that the application would have to be carefully written to > exacerbate this behavior. > > Additionally, if cpu-bound threads underutilize a slice of their quota > in a period due to the cfs choosing a bursty task to run, they should > theoretically be able to make it up in the following periods when the > bursty task is unable to "burst". OK, so it is indeed possible that CPU bound threads will underutilize a slice of their quota in a period as a result of this patch. This should probably be clearly stated in the code comments and in the commit message. In addition, I believe that although many workloads will indeed be indifferent to getting their fair share "later", some latency-sensitive workloads will definitely be negatively affected by this temporary CPU quota stealing by bursty antagonists. So there should probably be a way to limit this behavior; for example, by making it tunable per cgroup. > > Please be careful here: quota and slices are treated differently. > Quota does not roll over between periods, only slices of quota that > have already been allocated to per-cpu run queues.
If you allocate > 100ms of quota per period to an application, but it only spreads onto > 3 cpu run queues, that means it can in the worst case use 3 x slice > size = 15ms in periods following underutilization. > > So why does this matter? Well, applications that use thread pools > *(*cough* java *cough*) with lots of tiny little worker threads tend > to spread themselves out onto a lot of run queues. These worker > threads grab quota slices in order to run, then rarely use all of > their slice (1 or 2ms out of the 5ms). This results in those worker > threads starving the main application of quota, and then expiring the > remainder of that quota slice on the per-cpu run queue. Going back to my > earlier 100ms quota / 80 cpu example. That means only > 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a > period. So only 20 out of these 80 cpus ever get a slice allocated to > them. By allowing these per-cpu run queues to use their remaining > slice in following periods, these worker threads do not need to be > allocated additional slices, and thereby the main threads are actually > able to use the allocated cpu quota. > > This can be experienced by running fibtest, available at > https://github.com/indeedeng/fibtest/. > $ runfibtest 1 > runs a single fast thread taskset to cpu 0 > $ runfibtest 8 > Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset > to cpus 1-7. This run is expected to show fewer iterations, but the > bigger problem is that the cpu usage is far less than the 500ms that it > should have received. > > Thanks for the engagement on this, > Dave Chiluk ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-24 16:28 ` Peter Oskolkov @ 2019-05-24 21:35 ` Dave Chiluk 2019-05-24 22:07 ` Peter Oskolkov 0 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-05-24 21:35 UTC (permalink / raw) To: Peter Oskolkov Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <posk@posk.io> wrote: > > On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote: > > > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote: > > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote: > > > > > > If the machine runs at/close to capacity, won't the overallocation > > > > of the quota to bursty tasks necessarily negatively impact every other > > > > task? Should the "unused" quota be available only on idle CPUs? > > > > (Or maybe this is the behavior achieved here, and only the comment and > > > > the commit message should be fixed...) > > > > > > > > > > It's bounded by the amount left unused from the previous period. So > > > theoretically a process could use almost twice its quota. But then it > > > would have nothing left over in the next period. To repeat it would have > > > to not use any that next period. Over a longer number of periods it's the > > > same amount of CPU usage. > > > > > > I think that is more fair than throttling a process that has never used > > > its full quota. > > > > > > And it removes complexity. > > > > > > Cheers, > > > Phil > > > > Actually it's not even that bad. The overallocation of quota to a > > bursty task in a period is limited to at most one slice per cpu, and > > that slice must not have been used in the previous periods. 
The slice > > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and > > defaults to 5ms. If a bursty task goes from underutilizing quota to > > using its entire quota, it will not be able to burst in the > > subsequent periods. Therefore, in an absolute worst-case contrived > > scenario, a bursty task can add at most 5ms to the latency of other > > threads on the same CPU. I think this worst-case 5ms tradeoff is > > entirely worth it. > > > > This does mean that, theoretically, a poorly written massively > > threaded application on an 80-core box that spreads itself onto 80 > > cpu run queues can overutilize its quota in a period by at most 5ms > > * 80 CPUs in a single period (slice * number of runqueues the > > application is running on). But that means that each of those threads > > would have had to not use their quota in a previous period, and it > > also means that the application would have to be carefully written to > > exacerbate this behavior. > > > > Additionally, if cpu-bound threads underutilize a slice of their quota > > in a period due to the cfs choosing a bursty task to run, they should > > theoretically be able to make it up in the following periods when the > > bursty task is unable to "burst". > > OK, so it is indeed possible that CPU bound threads will underutilize a slice > of their quota in a period as a result of this patch. This should probably > be clearly stated in the code comments and in the commit message. > > In addition, I believe that although many workloads will indeed be > indifferent to getting their fair share "later", some latency-sensitive > workloads will definitely be negatively affected by this temporary > CPU quota stealing by bursty antagonists. So there should probably be > a way to limit this behavior; for example, by making it tunable > per cgroup. > This patch restores the behavior that existed from at least v3.16..v4.18, and the current Red Hat 7 kernels.
So you are kind of championing a moot point, since no one has noticed this "bursting" behavior in over 5 years. By removing this slice expiration altogether we restore the behavior and also solve the root issue of 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Since 512ac999d275, many people are now noticing the slice expiration and are very displeased with the behavior change. see: https://github.com/kubernetes/kubernetes/issues/67577 I would love to hear if you know of a single complaint during that 5-year time window, where someone noticed this bursting and reported that it negatively affected their application. As for the documentation, I thought about documenting the possible adverse side-effect, but I didn't feel it was worthwhile since no one had noticed that corner case in those 5 years. I also could not figure out a concise way of describing the slight corner-case issue without overly complicating the documentation. I felt that adding that corner case would have detracted rather than added to the documentation's usefulness. As far as commenting in code, considering most of this commit removes lines, there isn't really a great place for it. That's why I did my best to describe the behavior in the documentation. Smart people can draw conclusions from there as you have done. Please keep in mind that this "bursting" is again limited to a single slice on each per-cpu run-queue. Thank you, Dave ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-24 21:35 ` Dave Chiluk @ 2019-05-24 22:07 ` Peter Oskolkov 2019-05-28 22:25 ` Dave Chiluk 0 siblings, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2019-05-24 22:07 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall On Fri, May 24, 2019 at 2:35 PM Dave Chiluk <chiluk+linux@indeed.com> wrote: > > On Fri, May 24, 2019 at 11:28 AM Peter Oskolkov <posk@posk.io> wrote: > > > > On Fri, May 24, 2019 at 8:15 AM Dave Chiluk <chiluk+linux@indeed.com> wrote: > > > > > > On Fri, May 24, 2019 at 9:32 AM Phil Auld <pauld@redhat.com> wrote: > > > > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote: > > > > > > > > If the machine runs at/close to capacity, won't the overallocation > > > > > of the quota to bursty tasks necessarily negatively impact every other > > > > > task? Should the "unused" quota be available only on idle CPUs? > > > > > (Or maybe this is the behavior achieved here, and only the comment and > > > > > the commit message should be fixed...) > > > > > > > > > > > > > It's bounded by the amount left unused from the previous period. So > > > > theoretically a process could use almost twice its quota. But then it > > > > would have nothing left over in the next period. To repeat it would have > > > > to not use any that next period. Over a longer number of periods it's the > > > > same amount of CPU usage. > > > > > > > > I think that is more fair than throttling a process that has never used > > > > its full quota. > > > > > > > > And it removes complexity. > > > > > > > > Cheers, > > > > Phil > > > > > > Actually it's not even that bad. 
The overallocation of quota to a > > > bursty task in a period is limited to at most one slice per cpu, and > > > that slice must not have been used in the previous periods. The slice > > > size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and > > > defaults to 5ms. If a bursty task goes from underutilizing quota to > > > using its entire quota, it will not be able to burst in the > > > subsequent periods. Therefore, in an absolute worst-case contrived > > > scenario, a bursty task can add at most 5ms to the latency of other > > > threads on the same CPU. I think this worst-case 5ms tradeoff is > > > entirely worth it. > > > > > > This does mean that, theoretically, a poorly written massively > > > threaded application on an 80-core box that spreads itself onto 80 > > > cpu run queues can overutilize its quota in a period by at most 5ms > > > * 80 CPUs in a single period (slice * number of runqueues the > > > application is running on). But that means that each of those threads > > > would have had to not use their quota in a previous period, and it > > > also means that the application would have to be carefully written to > > > exacerbate this behavior. > > > > > > Additionally, if cpu-bound threads underutilize a slice of their quota > > > in a period due to the cfs choosing a bursty task to run, they should > > > theoretically be able to make it up in the following periods when the > > > bursty task is unable to "burst". > > > > OK, so it is indeed possible that CPU bound threads will underutilize a slice > > of their quota in a period as a result of this patch. This should probably > > be clearly stated in the code comments and in the commit message. > > > > In addition, I believe that although many workloads will indeed be > > indifferent to getting their fair share "later", some latency-sensitive > > workloads will definitely be negatively affected by this temporary > > CPU quota stealing by bursty antagonists.
So there should probably be > > a way to limit this behavior; for example, by making it tunable > > per cgroup. > > > This patch restores the behavior that existed from at least > v3.16..v4.18, and the current Red Hat 7 kernels. So you are kind of > championing a moot point, since no one has noticed this "bursting" > behavior in over 5 years. By removing this slice expiration > altogether we restore the behavior and also solve the root issue of > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. > > Since 512ac999d275, many people are now noticing the slice expiration > and are very displeased with the behavior change. > see: https://github.com/kubernetes/kubernetes/issues/67577 > > I would love to hear if you know of a single complaint during that > 5-year time window, where someone noticed this bursting and reported > that it negatively affected their application. Linux CPU scheduling tail latency is a well-known issue and a major pain point in some workloads: https://www.google.com/search?q=linux+cpu+scheduling+tail+latency Even assuming that nobody noticed this particular cause of CPU scheduling latencies, it does not mean the problem should be waved away. At least it should be documented, if at this point it is decided that it is difficult to address it in a meaningful way. And, preferably, a way to address the issue later on should be discussed and hopefully agreed to. > > As for the documentation, I thought about documenting the possible > adverse side-effect, but I didn't feel it was worthwhile since no one > had noticed that corner case in those 5 years. I also could not figure > out a concise way of describing the slight corner-case issue without > overly complicating the documentation. I felt that adding that corner > case would have detracted rather than added to the documentation's > usefulness. As far as commenting in code, considering most of this > commit removes lines, there isn't really a great place for it.
That's > why I did my best to describe the behavior in the documentation. > Smart people can draw conclusions from there as you have done. Please > keep in mind that this "bursting" is again limited to a single slice > on each per-cpu run-queue. > > Thank you, > Dave ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-24 22:07 ` Peter Oskolkov @ 2019-05-28 22:25 ` Dave Chiluk 0 siblings, 0 replies; 40+ messages in thread From: Dave Chiluk @ 2019-05-28 22:25 UTC (permalink / raw) To: Peter Oskolkov Cc: Phil Auld, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall On Fri, May 24, 2019 at 5:07 PM Peter Oskolkov <posk@posk.io> wrote: > Linux CPU scheduling tail latency is a well-known issue and a major > pain point in some workloads: > https://www.google.com/search?q=linux+cpu+scheduling+tail+latency > > Even assuming that nobody noticed this particular cause > of CPU scheduling latencies, it does not mean the problem should be waved > away. At least it should be documented, if at this point it is decided that > it is difficult to address it in a meaningful way. And, preferably, a way > to address the issue later on should be discussed and hopefully agreed to. Reducing tail latencies for our web applications is the precise reason I created this patch set. Applications that previously responded in 20ms at the 95th percentile were now taking 220ms. Those were correctly sized applications prior to 512ac999. After that commit, they started seeing massive increases in their latencies due to hitting throttling while consuming less than their quota of cpu. I'll see if I can rework the documentation. Any specific suggestions for how that can be worded would be appreciated. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-23 18:44 ` [PATCH v2 1/1] " Dave Chiluk 2019-05-23 21:01 ` Peter Oskolkov @ 2019-05-24 8:55 ` Peter Zijlstra 1 sibling, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2019-05-24 8:55 UTC (permalink / raw) To: Dave Chiluk Cc: Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Ben Segall, Paul Turner (it always helps to Cc the people who actually wrote the code) Ben, can you have a look at this? On Thu, May 23, 2019 at 01:44:47PM -0500, Dave Chiluk wrote: > It has been observed that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period, at which > point the slice and quota expire. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > commit added the following conditional, which resulted in slices never > being expired.
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows only per-cpu slices to live longer than the period boundary. > This allows threads on runqueues that do not use much CPU to continue to > use their remaining slice over a longer period of time than > cpu.cfs_period_us. However, this helps prevent the above condition of > hitting throttling while also not fully utilizing your cpu quota. > > This theoretically allows a machine to use slightly more than its > allotted quota in some periods. This overflow would be bounded by the > remaining per-cpu slice that was left unused in the previous period. > For CPU bound tasks this will change nothing, as they should > theoretically fully utilize all of their quota and slices in each > period. For user-interactive tasks as described above this provides a > much better user/application experience as their cpu utilization will > more closely match the amount they requested when they hit throttling. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase, this performance > discrepancy has been observed to be an almost 30x performance improvement, > while still maintaining correct cpu quota restrictions albeit over > longer time intervals than cpu.cfs_period_us. That testcase is > available at https://github.com/indeedeng/fibtest.
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > --- > Documentation/scheduler/sched-bwc.txt | 29 +++++++++++--- > kernel/sched/fair.c | 71 +++-------------------------------- > kernel/sched/sched.h | 4 -- > 3 files changed, 29 insertions(+), 75 deletions(-) > > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt > index f6b1873..4ded8ae 100644 > --- a/Documentation/scheduler/sched-bwc.txt > +++ b/Documentation/scheduler/sched-bwc.txt > @@ -8,16 +8,33 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the > specification of the maximum CPU bandwidth available to a group or hierarchy. > > The bandwidth allowed for a group is specified using a quota and period. Within > -each given "period" (microseconds), a group is allowed to consume only up to > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a > -group exceeds this limit (for that period), the tasks belonging to its > -hierarchy will be throttled and are not allowed to run again until the next > -period. > +each given "period" (microseconds), a task group is allocated up to "quota" > +microseconds of CPU time. When the CPU bandwidth consumption of a group > +exceeds this limit (for that period), the tasks belonging to its hierarchy will > +be throttled and are not allowed to run again until the next period. > > A group's unused runtime is globally tracked, being refreshed with quota units > above at each period boundary. As threads consume this bandwidth it is > transferred to cpu-local "silos" on a demand basis. The amount transferred > -within each of these updates is tunable and described as the "slice". > +within each of these updates is tunable and described as the "slice". Slices > +that are allocated to cpu-local silos do not expire at the end of the period, > +but unallocated quota does. 
This doesn't affect cpu-bound applications as they > +by definition consume all of their bandwidth in each period. > + > +However for highly-threaded user-interactive/non-cpu bound applications this > +non-expiration nuance allows applications to burst past their quota limits > +equal to the amount of unused slice per cpu that the task group is running on. > +This slight burst requires that quota had gone unused in previous periods. > +Additionally this burst amount is limited to the size of a slice for every cpu > +a task group is run on. As a result, this mechanism still strictly limits the > +task group to average quota usage over longer time windows. This provides > +a better, more predictable user experience for highly threaded applications with > +small quota limits on high core count machines. It also eliminates the > +propensity to throttle these applications while simultaneously using less than > +quota amounts of cpu. Another way to say this is that by allowing the unused > +portion of a slice to be used in following periods we have decreased the > +possibility of wasting unused quota on cpu-local silos that don't need much cpu > +time.
> > Management > ---------- > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f35930f..a675c69 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) > > now = sched_clock_cpu(smp_processor_id()); > cfs_b->runtime = cfs_b->quota; > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); > - cfs_b->expires_seq++; > } > > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) > @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > { > struct task_group *tg = cfs_rq->tg; > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); > - u64 amount = 0, min_amount, expires; > - int expires_seq; > + u64 amount = 0, min_amount; > > /* note: this is a positive sum as runtime_remaining <= 0 */ > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; > @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > cfs_b->idle = 0; > } > } > - expires_seq = cfs_b->expires_seq; > - expires = cfs_b->runtime_expires; > raw_spin_unlock(&cfs_b->lock); > > cfs_rq->runtime_remaining += amount; > - /* > - * we may have advanced our local expiration to account for allowed > - * spread between our sched_clock and the one on which runtime was > - * issued. > - */ > - if (cfs_rq->expires_seq != expires_seq) { > - cfs_rq->expires_seq = expires_seq; > - cfs_rq->runtime_expires = expires; > - } > > return cfs_rq->runtime_remaining > 0; > } > > -/* > - * Note: This depends on the synchronization provided by sched_clock and the > - * fact that rq->clock snapshots this value. 
> - */ > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > -{ > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > - > - /* if the deadline is ahead of our clock, nothing to do */ > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > - return; > - > - if (cfs_rq->runtime_remaining < 0) > - return; > - > - /* > - * If the local deadline has passed we have to consider the > - * possibility that our sched_clock is 'fast' and the global deadline > - * has not truly expired. > - * > - * Fortunately we can check determine whether this the case by checking > - * whether the global deadline(cfs_b->expires_seq) has advanced. > - */ > - if (cfs_rq->expires_seq == cfs_b->expires_seq) { > - /* extend local deadline, drift is bounded above by 2 ticks */ > - cfs_rq->runtime_expires += TICK_NSEC; > - } else { > - /* global deadline is ahead, expiration has passed */ > - cfs_rq->runtime_remaining = 0; > - } > -} > - > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > { > /* dock delta_exec before expiring quota (as it could span periods) */ > cfs_rq->runtime_remaining -= delta_exec; > - expire_cfs_rq_runtime(cfs_rq); > > if (likely(cfs_rq->runtime_remaining > 0)) > return; > @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > resched_curr(rq); > } > > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > - u64 remaining, u64 expires) > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) > { > struct cfs_rq *cfs_rq; > u64 runtime; > @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > remaining -= runtime; > > cfs_rq->runtime_remaining += runtime; > - cfs_rq->runtime_expires = expires; > > /* we check whether we're throttled above */ > if (cfs_rq->runtime_remaining > 0) > @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > */ > static int do_sched_cfs_period_timer(struct 
cfs_bandwidth *cfs_b, int overrun, unsigned long flags) > { > - u64 runtime, runtime_expires; > + u64 runtime; > int throttled; > > /* no need to continue the timer with no bandwidth constraint */ > @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > /* account preceding periods in which throttling occurred */ > cfs_b->nr_throttled += overrun; > > - runtime_expires = cfs_b->runtime_expires; > - > /* > * This check is repeated as we are holding onto the new bandwidth while > * we unthrottle. This can potentially race with an unthrottled group > @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > cfs_b->distribute_running = 1; > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > /* we can't nest cfs_b->lock while distributing bandwidth */ > - runtime = distribute_cfs_runtime(cfs_b, runtime, > - runtime_expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > cfs_b->distribute_running = 0; > @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > return; > > raw_spin_lock(&cfs_b->lock); > - if (cfs_b->quota != RUNTIME_INF && > - cfs_rq->runtime_expires == cfs_b->runtime_expires) { > + if (cfs_b->quota != RUNTIME_INF) { > cfs_b->runtime += slack_runtime; > > /* we are under rq->lock, defer unthrottling using a timer */ > @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > { > u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > unsigned long flags; > - u64 expires; > > /* confirm we're still not at a refresh boundary */ > raw_spin_lock_irqsave(&cfs_b->lock, flags); > @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) > runtime = cfs_b->runtime; > > - expires = cfs_b->runtime_expires; > if (runtime) > cfs_b->distribute_running = 1; > > @@ -4809,11 
+4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (!runtime) > return; > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > - if (expires == cfs_b->runtime_expires) > - lsub_positive(&cfs_b->runtime, runtime); > cfs_b->distribute_running = 0; > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > } > @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) > > cfs_b->period_active = 1; > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); > - cfs_b->expires_seq++; > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); > } > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index b52ed1a..0c0ed23 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -341,8 +341,6 @@ struct cfs_bandwidth { > u64 quota; > u64 runtime; > s64 hierarchical_quota; > - u64 runtime_expires; > - int expires_seq; > > short idle; > short period_active; > @@ -562,8 +560,6 @@ struct cfs_rq { > > #ifdef CONFIG_CFS_BANDWIDTH > int runtime_enabled; > - int expires_seq; > - u64 runtime_expires; > s64 runtime_remaining; > > u64 throttled_clock; > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk 2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk @ 2019-05-29 19:08 ` Dave Chiluk 2019-05-29 19:08 ` [PATCH v3 1/1] " Dave Chiluk [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com> ` (3 subsequent siblings) 5 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-05-29 19:08 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Changelog v3 - Reworked documentation to better describe behavior of slice expiration per feedback from Peter Oskolkov Changelog v2 - Fixed some checkpatch errors in the commit message. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk @ 2019-05-29 19:08 ` Dave Chiluk 2019-05-29 19:28 ` Phil Auld ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Dave Chiluk @ 2019-05-29 19:08 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to threads being allocated per-cpu bandwidth slices, and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slices results in applications not being able to utilize the quota that they are allocated. The expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this has been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That commit added the following conditional, which resulted in slices never being expired. 
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether. This change allows only per-cpu slices to live longer than the period boundary. It allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us, and it helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining per-cpu slice that was left unused in the previous period. For CPU-bound tasks this will change nothing, as they should theoretically fully utilize all of their quota and slices in each period. For the user-interactive tasks described above this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This greatly improves performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase, this has been observed to yield almost a 30x performance improvement, while still maintaining correct cpu quota restrictions, albeit over longer time intervals than cpu.cfs_period_us. That testcase is available at https://github.com/indeedeng/fibtest. 
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- Documentation/scheduler/sched-bwc.txt | 56 ++++++++++++++++++++++----- kernel/sched/fair.c | 71 +++-------------------------------- kernel/sched/sched.h | 4 -- 3 files changed, 53 insertions(+), 78 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index f6b1873..260fd65 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. - -A group's unused runtime is globally tracked, being refreshed with quota units -above at each period boundary. As threads consume this bandwidth it is -transferred to cpu-local "silos" on a demand basis. The amount transferred +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. That quota is assigned to per cpu run queues in +slices as threads in the cgroup become runnable. Once all quota has been +assigned any additional requests for quota will result in those threads being +throttled. Throttled threads will not be able to run again until the next +period when the quota is replenished. + +A group's unassigned quota is globally tracked, being refreshed back to +cfs_quota units at each period boundary. As threads consume this bandwidth it +is transferred to cpu-local "silos" on a demand basis. 
The amount transferred within each of these updates is tunable and described as the "slice". Management @@ -90,6 +91,43 @@ There are two ways in which a group may become throttled: In case b) above, even though the child may have runtime remaining it will not be allowed to until the parent's runtime is refreshed. +Real-world behavior of slice non-expiration +------------------------------------------- +The fact that cpu-local slices do not expire results in some interesting corner +cases that should be understood. + +For cgroup cpu constrained applications that are cpu limited this is a +relatively moot point because they will naturally consume the entirety of their +quota as well as the entirety of each cpu-local slice in each period. As a +result it is expected that nr_periods roughly equal nr_throttled, and that +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. + +However, in a worst-case scenario for highly-threaded, interactive/non-cpu-bound +applications, this non-expiration nuance allows applications to briefly burst +past their quota limits by the amount of unused slice on each cpu that the task +group is running on. This slight burst requires that quota had been assigned +and then not fully used in previous periods. This burst amount will not be +transferred between cores. As a result, this mechanism still strictly limits +the task group to its quota on average, albeit over a longer time window than +the period. This provides a better, more predictable user experience for highly +threaded applications with small quota limits on high-core-count machines. It +also eliminates the propensity to throttle these applications while +simultaneously using less than quota amounts of cpu. Another way to say this +is that by allowing the unused portion of a slice to remain valid across +periods we have decreased the possibility of wasting quota on cpu-local silos +that don't need a full slice's amount of cpu time. 
+ +The interaction between cpu-bound and non-cpu-bound-interactive applications +should also be considered, especially when single core usage hits 100%. If you +gave each of these applications half of a cpu-core and they both got scheduled +on the same CPU it is theoretically possible that the non-cpu bound application +will use up to sched_cfs_bandwidth_slice_us additional quota in some periods, +thereby preventing the cpu-bound application from fully using its quota by +that same amount. In these instances it will be up to the CFS algorithm (see +sched-design-CFS.txt) to decide which application is chosen to run, as they +will both be runnable and have remaining quota. This runtime discrepancy +should be made up in the following periods when the interactive application idles. + Examples -------- 1. Limit a group to 1 CPU worth of runtime. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..a675c69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to 
account for allowed - * spread between our sched_clock and the one on which runtime was - * issued. - */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. - */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. 
- */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ if (cfs_rq->runtime_remaining > 0) @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no need to continue the timer with no bandwidth constraint */ @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. 
This can potentially race with an unthrottled group @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); - cfs_b->runtime_expires += 
(overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..0c0ed23 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,8 +341,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; short idle; short period_active; @@ -562,8 +560,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-29 19:08 ` [PATCH v3 1/1] " Dave Chiluk @ 2019-05-29 19:28 ` Phil Auld 2019-05-29 19:50 ` bsegall 2019-05-29 21:05 ` bsegall 2 siblings, 0 replies; 40+ messages in thread From: Phil Auld @ 2019-05-29 19:28 UTC (permalink / raw) To: Dave Chiluk Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc On Wed, May 29, 2019 at 02:08:46PM -0500 Dave Chiluk wrote: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. 
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows only per-cpu slices to live longer than the period boundary. > This allows threads on runqueues that do not use much CPU to continue to > use their remaining slice over a longer period of time than > cpu.cfs_period_us. However, this helps prevents the above condition of > hitting throttling while also not fully utilizing your cpu quota. > > This theoretically allows a machine to use slightly more than it's > allotted quota in some periods. This overflow would be bounded by the > remaining per-cpu slice that was left un-used in the previous period. > For CPU bound tasks this will change nothing, as they should > theoretically fully utilize all of their quota and slices in each > period. For user-interactive tasks as described above this provides a > much better user/application experience as their cpu utilization will > more closely match the amount they requested when they hit throttling. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase, this performance > discrepancy has been observed to be almost 30x performance improvement, > while still maintaining correct cpu quota restrictions albeit over > longer time intervals than cpu.cfs_period_us. That testcase is > available at https://github.com/indeedeng/fibtest. 
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > --- > Documentation/scheduler/sched-bwc.txt | 56 ++++++++++++++++++++++----- > kernel/sched/fair.c | 71 +++-------------------------------- > kernel/sched/sched.h | 4 -- > 3 files changed, 53 insertions(+), 78 deletions(-) > > diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt > index f6b1873..260fd65 100644 > --- a/Documentation/scheduler/sched-bwc.txt > +++ b/Documentation/scheduler/sched-bwc.txt > @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the > specification of the maximum CPU bandwidth available to a group or hierarchy. > > The bandwidth allowed for a group is specified using a quota and period. Within > -each given "period" (microseconds), a group is allowed to consume only up to > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a > -group exceeds this limit (for that period), the tasks belonging to its > -hierarchy will be throttled and are not allowed to run again until the next > -period. > - > -A group's unused runtime is globally tracked, being refreshed with quota units > -above at each period boundary. As threads consume this bandwidth it is > -transferred to cpu-local "silos" on a demand basis. The amount transferred > +each given "period" (microseconds), a task group is allocated up to "quota" > +microseconds of CPU time. That quota is assigned to per cpu run queues in > +slices as threads in the cgroup become runnable. Once all quota has been > +assigned any additional requests for quota will result in those threads being > +throttled. Throttled threads will not be able to run again until the next > +period when the quota is replenished. > + > +A group's unassigned quota is globally tracked, being refreshed back to > +cfs_quota units at each period boundary. 
As threads consume this bandwidth it > +is transferred to cpu-local "silos" on a demand basis. The amount transferred > within each of these updates is tunable and described as the "slice". > > Management > @@ -90,6 +91,43 @@ There are two ways in which a group may become throttled: > In case b) above, even though the child may have runtime remaining it will not > be allowed to until the parent's runtime is refreshed. > > +Real-world behavior of slice non-expiration > +------------------------------------------- > +The fact that cpu-local slices do not expire results in some interesting corner > +cases that should be understood. > + > +For cgroup cpu constrained applications that are cpu limited this is a > +relatively moot point because they will naturally consume the entirety of their > +quota as well as the entirety of each cpu-local slice in each period. As a > +result it is expected that nr_periods roughly equal nr_throttled, and that > +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. > + > +However in a worst-case scenario, highly-threaded, interactive/non-cpu bound > +applications this non-expiration nuance allows applications to briefly burst > +past their quota limits by the amount of unused slice on each cpu that the task > +group is running on. This slight burst requires that quota had been assigned > +and then not fully used in previous periods. This burst amount will not be > +transferred between cores. As a result, this mechanism still strictly limits > +the task group to quota average usage, albeit over a longer time window than > +period. This provides better more predictable user experience for highly > +threaded applications with small quota limits on high core count machines. It > +also eliminates the propensity to throttle these applications while > +simultanously using less than quota amounts of cpu. 
Another way to say this, > +is that by allowing the unused portion of a slice to remain valid across > +periods we have decreased the possibility of wasting quota on cpu-local silos > +that don't need a full slice's amount of cpu time. > + > +The interaction between cpu-bound and non-cpu-bound-interactive applications > +should also be considered, especially when single core usage hits 100%. If you > +gave each of these applications half of a cpu-core and they both got scheduled > +on the same CPU it is theoretically possible that the non-cpu bound application > +will use up to sched_cfs_bandwidth_slice_us additional quota in some periods, > +thereby preventing the cpu-bound application from fully using it's quota by "its quota" > +that same amount. In these instances it will be up to the CFS algorithm (see > +sched-design-CFS.txt) to decide which application is chosen to run, as they > +will both be runnable and have remaining quota. This runtime discrepancy will > +should made up in the following periods when the interactive application idles. > + "discrepancy will be made" or "descrepancy should be made" but not both :) Otherwise, fwiw, Acked-by: Phil Auld <pauld@redhat.com> Cheers, Phil -- ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-29 19:08 ` [PATCH v3 1/1] " Dave Chiluk 2019-05-29 19:28 ` Phil Auld @ 2019-05-29 19:50 ` bsegall 2019-05-29 21:05 ` bsegall 2 siblings, 0 replies; 40+ messages in thread From: bsegall @ 2019-05-29 19:50 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Dave Chiluk <chiluk+linux@indeed.com> writes: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > So most of the bandwidth is supposed to be returned, leaving only 1ms. I'm setting up to try your test now, but there was a bug with the unthrottle part of this where it could be continuously pushed into the future. You might try this patch instead, possibly along with lowering min_cfs_rq_runtime (currently not runtime tunable) if you see that cost more than the cost of extra locking to acquire runtime. 
From: Ben Segall <bsegall@google.com> Date: Tue, 28 May 2019 12:30:25 -0700 Subject: [PATCH] sched/fair: don't push cfs_bandwith slack timers forward When a cfs_rq sleeps and returns its quota, we delay for 5ms before waking any throttled cfs_rqs to coalesce with other cfs_rqs going to sleep, as this has to be done outside of the rq lock we hold. The current code waits for 5ms without any sleeps, instead of waiting for 5ms from the first sleep, which can delay the unthrottle more than we want. Switch this around. Change-Id: Ife536bb54af633863de486ba6ec0d20e55ada8c9 Signed-off-by: Ben Segall <bsegall@google.com> --- kernel/sched/fair.c | 7 +++++++ kernel/sched/sched.h | 1 + 2 files changed, 8 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8213ff6e365d..2ead252cfa32 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b) if (runtime_refresh_within(cfs_b, min_left)) return; + /* don't push forwards an existing deferred unthrottle */ + if (cfs_b->slack_started) + return; + cfs_b->slack_started = true; + hrtimer_start(&cfs_b->slack_timer, ns_to_ktime(cfs_bandwidth_slack_period), HRTIMER_MODE_REL); @@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); + cfs_b->slack_started = false; if (cfs_b->distribute_running) { raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return; } @@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); cfs_b->slack_timer.function = sched_cfs_slack_timer; cfs_b->distribute_running = 0; + cfs_b->slack_started = false; } static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index efa686eeff26..60219acda94b 100644 --- a/kernel/sched/sched.h +++ 
b/kernel/sched/sched.h @@ -356,6 +356,7 @@ struct cfs_bandwidth { u64 throttled_time; bool distribute_running; + bool slack_started; #endif }; -- 2.22.0.rc1.257.g3120a18244-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-29 19:08 ` [PATCH v3 1/1] " Dave Chiluk 2019-05-29 19:28 ` Phil Auld 2019-05-29 19:50 ` bsegall @ 2019-05-29 21:05 ` bsegall 2019-05-30 17:53 ` Dave Chiluk 2 siblings, 1 reply; 40+ messages in thread From: bsegall @ 2019-05-29 21:05 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt Dave Chiluk <chiluk+linux@indeed.com> writes: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. Yeah, having run the test, stranding only 1 ms per cpu rather than 5 doesn't help if you only have 10 ms of quota and even 10 threads/cpus. The slack timer isn't important in this test, though I think it probably should be changed. 
Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull quota more often / always. The worst case here I think is where you run/sleep for ~1ns, so you wind up taking the lock twice every min_cfs_rq_runtime: once for assign and once to return all but min, which you then use up doing short run/sleep. I suppose that determines how much we care about this overhead at all. Removing expiration means that in the worst case period and quota can be effectively twice what the user specified, but only on very particular workloads. I think we should at least think about instead lowering min_cfs_rq_runtime to some smaller value. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-29 21:05 ` bsegall @ 2019-05-30 17:53 ` Dave Chiluk 2019-05-30 20:44 ` bsegall 0 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-05-30 17:53 UTC (permalink / raw) To: bsegall Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@google.com wrote: > Dave Chiluk <chiluk+linux@indeed.com> writes: > > Yeah, having run the test, stranding only 1 ms per cpu rather than 5 > doesn't help if you only have 10 ms of quota and even 10 threads/cpus. > The slack timer isn't important in this test, though I think it probably > should be changed. My min_cfs_rq_runtime was already set to 1ms. Additionally, raising the amount of quota from 10ms to 50ms or even 100ms still results in throttling without full quota usage. > Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull > quota more often / always. The worst case here I think is where you > run/sleep for ~1ns, so you wind up taking the lock twice every > min_cfs_rq_runtime: once for assign and once to return all but min, > which you then use up doing short run/sleep. I suppose that determines > how much we care about this overhead at all. I'm not so concerned about how inefficiently the user-space application runs, as that's up to the individual developer. The fibtest testcase is purely my approximation of what a java application with lots of worker threads might do, as I didn't have a great deterministic java reproducer, and I feared posting java to LKML. I'm more concerned with the fact that the user requested 10ms/period or 100ms/period and they hit throttling while simultaneously not seeing that amount of cpu usage. i.e. 
on an 8 core machine if I $ ./runfibtest 1 Iterations Completed(M): 1886 Throttled for: 51 CPU Usage (msecs) = 507 $ ./runfibtest 8 Iterations Completed(M): 1274 Throttled for: 52 CPU Usage (msecs) = 380 You see that in the 8 core case where we have 7 do-nothing threads on cpus 1-7, we see only 380 ms of usage, and 52 periods of throttling when we should have received ~500ms of cpu usage. Looking more closely at the __return_cfs_rq_runtime logic I noticed if (cfs_b->quota != RUNTIME_INF && cfs_rq->runtime_expires == cfs_b->runtime_expires) { Which is awfully similar to the logic that was fixed by 512ac999. Is it possible that we are just not ever returning runtime back to the cfs_b because of the runtime_expires comparison here? Early on in my testing I created a patch that would report via sysfs the amount of quota that was expired at the end of a period in expire_cfs_rq_runtime, and it roughly equalled the difference in quota between the actual cpu usage and the allotted quota. That leads me to believe that something is not going quite right in the slack return logic __return_cfs_rq_runtime. I'll attach that patch at the bottom of this e-mail in case you'd like to reproduce my tests. > Removing expiration means that in the worst case period and quota can be > effectively twice what the user specified, but only on very particular > workloads. I'm only removing expiration of slices that have already been assigned to individual cfs_rq. My understanding is that there is at most one cfs_rq per cpu, and each of those can have at most one slice of available runtime. So the worst case burst is slice_ms * cpus. Please help me understand how you get to twice the user-specified quota and period as it's not obvious to me (I've only been looking at this for a few months). > I think we should at least think about instead lowering > min_cfs_rq_runtime to some smaller value Do you mean lower than 1ms? Thanks, Dave. 
From: Dave Chiluk <chiluk+linux@indeed.com> Date: Thu, 30 May 2019 12:47:12 -0500 Subject: [PATCH] Add expired_time sysfs entry Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- kernel/sched/core.c | 41 +++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 4 ++++ kernel/sched/sched.h | 1 + 3 files changed, 46 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 874c427..3c06df9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6655,6 +6655,30 @@ static long tg_get_cfs_period(struct task_group *tg) return cfs_period_us; } +static int tg_set_cfs_expired_runtime(struct task_group *tg, long slice_expiration) +{ + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; + + if (slice_expiration == 0) + { + raw_spin_lock_irq(&cfs_b->lock); + cfs_b->expired_runtime= 0; + raw_spin_unlock_irq(&cfs_b->lock); + return 0; + } + return 1; +} + +static u64 tg_get_cfs_expired_runtime(struct task_group *tg) +{ + u64 expired_runtime; + + expired_runtime = tg->cfs_bandwidth.expired_runtime; + do_div(expired_runtime, NSEC_PER_USEC); + + return expired_runtime; +} + static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) { @@ -6679,6 +6703,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css, return tg_set_cfs_period(css_tg(css), cfs_period_us); } +static u64 cpu_cfs_expired_runtime_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + return tg_get_cfs_expired_runtime(css_tg(css)); +} + +static int cpu_cfs_expired_runtime_write_s64(struct cgroup_subsys_state *css, + struct cftype *cftype, s64 cfs_slice_expiration_us) +{ + return tg_set_cfs_expired_runtime(css_tg(css), cfs_slice_expiration_us); +} + struct cfs_schedulable_data { struct task_group *tg; u64 period, quota; @@ -6832,6 +6868,11 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, .write_u64 = cpu_cfs_period_write_u64, }, { + .name = "cfs_expired_runtime", + .read_u64 = 
cpu_cfs_expired_runtime_read_u64, + .write_s64 = cpu_cfs_expired_runtime_write_s64, + }, + { .name = "stat", .seq_show = cpu_cfs_stat_show, }, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..bcbd4ef 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4382,6 +4382,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_rq->runtime_expires += TICK_NSEC; } else { /* global deadline is ahead, expiration has passed */ + raw_spin_lock_irq(&cfs_b->lock); + cfs_b->expired_runtime += cfs_rq->runtime_remaining; + raw_spin_unlock_irq(&cfs_b->lock); cfs_rq->runtime_remaining = 0; } } @@ -4943,6 +4946,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->runtime = 0; cfs_b->quota = RUNTIME_INF; cfs_b->period = ns_to_ktime(default_cfs_period()); + cfs_b->expired_runtime = 0; INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq); hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..499d2e2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -343,6 +343,7 @@ struct cfs_bandwidth { s64 hierarchical_quota; u64 runtime_expires; int expires_seq; + u64 expired_runtime; short idle; short period_active; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v3 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-30 17:53 ` Dave Chiluk @ 2019-05-30 20:44 ` bsegall 0 siblings, 0 replies; 40+ messages in thread From: bsegall @ 2019-05-30 20:44 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, pjt Dave Chiluk <chiluk+linux@indeed.com> writes: > On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@google.com wrote: >> Dave Chiluk <chiluk+linux@indeed.com> writes: >> >> Yeah, having run the test, stranding only 1 ms per cpu rather than 5 >> doesn't help if you only have 10 ms of quota and even 10 threads/cpus. >> The slack timer isn't important in this test, though I think it probably >> should be changed. > My min_cfs_rq_runtime was already set to 1ms. Yeah, I meant min_cfs_rq_runtime vs the 5ms if the slack stuff was broken. > > Additionally raising the amount of quota from 10ms to 50ms or even > 100ms, still results in throttling without full quota usage. > >> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull >> quota more often / always. The worst case here I think is where you >> run/sleep for ~1ns, so you wind up taking the lock twice every >> min_cfs_rq_runtime: once for assign and once to return all but min, >> which you then use up doing short run/sleep. I suppose that determines >> how much we care about this overhead at all. > I'm not so concerned about how inefficiently the user-space application > runs, as that's up to the invidual developer. Increasing scheduler overhead is what I was worried about; that's something we generally try to prevent. > The fibtest testcase, is > purely my approximation of what a java application with lots of worker > threads might do, as I didn't have a great deterministic java > reproducer, and I feared posting java to LKML. 
I'm more concerned with > the fact that the user requested 10ms/period or 100ms/period and they > hit throttling while simultaneously not seeing that amount of cpu usage. > i.e. on an 8 core machine if I > $ ./runfibtest 1 > Iterations Completed(M): 1886 > Throttled for: 51 > CPU Usage (msecs) = 507 > $ ./runfibtest 8 > Iterations Completed(M): 1274 > Throttled for: 52 > CPU Usage (msecs) = 380 > > You see that in the 8 core case where we have 7 do nothing threads on > cpu's 1-7, we see only 380 ms of usage, and 52 periods of throttling > when we should have received ~500ms of cpu usage. > > Looking more closely at the __return_cfs_rq_runtime logic I noticed > if (cfs_b->quota != RUNTIME_INF && > cfs_rq->runtime_expires == cfs_b->runtime_expires) { > > Which is awfully similar to the logic that was fixed by 512ac999. Is it > possible that we are just not ever returning runtime back to the cfs_b > because of the runtime_expires comparison here? The relevant issue that patch fixes is that the old conditional was backwards. Also lowering min_cfs_rq_runtime to 0 fixes your testcase, so it's working. > >> Removing expiration means that in the worst case period and quota can be >> effectively twice what the user specified, but only on very particular >> workloads. > I'm only removing expiration of slices that have already been assigned > to individual cfs_rq. My understanding is that there is at most one > cfs_rq per cpu, and each of those can have at most one slice of > available runtime. So the worst case burst is slice_ms * cpus. Please > help me understand how you get to twice user specified quota and period > as it's not obvious to me *(I've only been looking at this for a few > months). The reason that this effect is so significant is because slice_ms * cpus is roughly 100% of the quota. So yes, it's roughly the same thing. 
Unfortunately, if there are more spare cpus on the system, just doubling quota and period (keeping the same ratio) would not fix your issue, while removing expiration does, while also potentially having that effect. > >> I think we should at least think about instead lowering >> min_cfs_rq_runtime to some smaller value > Do you mean lower than 1ms? Yes ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>]
* [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining. [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com> @ 2019-06-24 15:50 ` Dave Chiluk 2019-06-24 17:33 ` bsegall 0 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-06-24 15:50 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang It has been observed that highly-threaded, user-interactive applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This impacts user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cores. This has been root caused to threads being allocated per cpu bandwidth slices, and then not fully using that slice within the period. This results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the end of the period this remaining quota goes unused and expires. This expiration of unused time on per-cpu runqueues results in applications under-utilizing their quota while simultaneously hitting throttling. The solution is to return all spare cfs_rq->runtime_remaining when cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the desire to prevent cfs_rq from always pulling quota with the desire to allow applications to fully utilize their quota. 
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- kernel/sched/fair.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..4894eda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u return 1; } -/* a cfs_rq won't donate quota below this amount */ +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little + * remaining runtime. + */ static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC; /* minimum remaining period time to redistribute slack quota */ static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC; @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b) static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime; + s64 slack_runtime = cfs_rq->runtime_remaining; + /* There is no runtime to return. */ if (slack_runtime <= 0) return; raw_spin_lock(&cfs_b->lock); if (cfs_b->quota != RUNTIME_INF && cfs_rq->runtime_expires == cfs_b->runtime_expires) { - cfs_b->runtime += slack_runtime; + /* As we near 0 quota remaining on cfs_b start returning all + * remaining runtime. This avoids stranding and then expiring + * runtime on per-cpu cfs_rq. + * + * If cfs_b has plenty of runtime, leave min_cfs_rq_runtime of + * runtime on this cfs_rq. 
+ */ + if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 && + slack_runtime > min_cfs_rq_runtime) + slack_runtime -= min_cfs_rq_runtime; + cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ if (cfs_b->runtime > sched_cfs_bandwidth_slice() && !list_empty(&cfs_b->throttled_cfs_rq)) -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining. 2019-06-24 15:50 ` [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining Dave Chiluk @ 2019-06-24 17:33 ` bsegall 2019-06-26 22:10 ` Dave Chiluk 0 siblings, 1 reply; 40+ messages in thread From: bsegall @ 2019-06-24 17:33 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang Dave Chiluk <chiluk+linux@indeed.com> writes: > It has been observed, that highly-threaded, user-interactive > applications running under cpu.cfs_quota_us constraints can hit a high > percentage of periods throttled while simultaneously not consuming the > allocated amount of quota. This impacts user-interactive non-cpu bound > applications, such as those running in kubernetes or mesos when run on > multiple cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. This > results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the > end of the period this remaining quota goes unused and expires. This > expiration of unused time on per-cpu runqueues results in applications > under-utilizing their quota while simultaneously hitting throttling. > > The solution is to return all spare cfs_rq->runtime_remaining when > cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the > desire to prevent cfs_rq from always pulling quota with the desire to > allow applications to fully utilize their quota. 
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > --- > kernel/sched/fair.c | 19 ++++++++++++++++--- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f35930f..4894eda 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > return 1; > } > > -/* a cfs_rq won't donate quota below this amount */ > +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little > + * remaining runtime. > + */ > static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC; > /* minimum remaining period time to redistribute slack quota */ > static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC; > @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b) > static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > { > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > - s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime; > + s64 slack_runtime = cfs_rq->runtime_remaining; > > + /* There is no runtime to return. */ > if (slack_runtime <= 0) > return; > > raw_spin_lock(&cfs_b->lock); > if (cfs_b->quota != RUNTIME_INF && > cfs_rq->runtime_expires == cfs_b->runtime_expires) { > - cfs_b->runtime += slack_runtime; > + /* As we near 0 quota remaining on cfs_b start returning all > + * remaining runtime. This avoids stranding and then expiring > + * runtime on per-cpu cfs_rq. > + * > + * cfs->b has plenty of runtime leave min_cfs_rq_runtime of > + * runtime on this cfs_rq. 
> + */ > + if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 && > + slack_runtime > min_cfs_rq_runtime) > + slack_runtime -= min_cfs_rq_runtime; > > + cfs_b->runtime += slack_runtime; > /* we are under rq->lock, defer unthrottling using a timer */ > if (cfs_b->runtime > sched_cfs_bandwidth_slice() && > !list_empty(&cfs_b->throttled_cfs_rq)) This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we now take a tg-global lock on every group se dequeue. Setting min=0 means that we have to take it on both enqueue and dequeue, while baseline means we take it once per min_cfs_rq_runtime in the worst case. In addition how much this helps is very dependent on the exact pattern of sleep/wake - you can still strand all but 15ms of runtime with a pretty reasonable pattern. If the cost of taking this global lock across all cpus without a ratelimit was somehow not a problem, I'd much prefer to just set min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie and sorta have 2x period 2x runtime" solution of removing expiration) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining. 2019-06-24 17:33 ` bsegall @ 2019-06-26 22:10 ` Dave Chiluk 2019-06-27 20:18 ` bsegall 0 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-06-26 22:10 UTC (permalink / raw) To: Ben Segall Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang On Mon, Jun 24, 2019 at 10:33:07AM -0700, bsegall@google.com wrote: > This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we > now take a tg-global lock on every group se dequeue. Setting min=0 means > that we have to take it on both enqueue and dequeue, while baseline > means we take it once per min_cfs_rq_runtime in the worst case. Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0. There's definitely a tradeoff of having to grab the lock every time. The other non-obvious benefit to this is that when the application starts hitting throttling the entire application starts hitting throttling closer to simultaneously. Previously, threads that don't do much could continue sipping on their 1ms of min_cfs_rq_runtime, while main threads were throttled. With this the application would more or less halt within 5ms of full quota usage. > In addition how much this helps is very dependent on the exact pattern > of sleep/wake - you can still strand all but 15ms of runtime with a > pretty reasonable pattern. I thought this change would have an upper bound stranded time of: NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice(). From the stranding of 1ms on each queue minus the amount that we'd return when we forcibly hit the 3 x bandwidth slice. Am I missing something here? Additionally that's worst case, and would require that threads be scheduled on distinct cpus mostly sequentially, and then never run again. 
> If the cost of taking this global lock across all cpus without a > ratelimit was somehow not a problem, I'd much prefer to just set > min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie > and sorta have 2x period 2x runtime" solution of removing expiration) Can I take this as a technical ack of the v3 patchset? Do you have any other comments for that patchset you'd like to see corrected before it gets integrated? If you are indeed good with that patchset, I'll clean up the comment and Documentation text and re-submit as a v5 patchset. I actually like that patchset best as well, for all the reasons already discussed, especially the fact that it's been running like this for 5 years prior to 512ac999. Also given my new-found understanding of min_cfs_rq_runtime, doesn't this mean that if we don't expire runtime we could potentially have a worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime? This is because min_cfs_rq_runtime could theoretically live on a per-cpu cfs_rq indefinitely? Still, this is actually much smaller than I previously had feared. Also it still holds that average runtime is still bounded by runtime/period on average. Thanks ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining. 2019-06-26 22:10 ` Dave Chiluk @ 2019-06-27 20:18 ` bsegall 0 siblings, 0 replies; 40+ messages in thread From: bsegall @ 2019-06-27 20:18 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang Dave Chiluk <chiluk+linux@indeed.com> writes: > On Mon, Jun 24, 2019 at 10:33:07AM -0700, bsegall@google.com wrote: >> This still has a similar cost as reducing min_cfs_rq_runtime to 0 - we >> now take a tg-global lock on every group se dequeue. Setting min=0 means >> that we have to take it on both enqueue and dequeue, while baseline >> means we take it once per min_cfs_rq_runtime in the worst case. > Yep, it's only slightly better than simply setting min_cfs_rq_runtime=0. > There's definitely a tradeoff of having to grab the lock every time. > > The other non-obvious benefit to this is that when the application > starts hitting throttling the entire application starts hitting > throttling closer to simultaneously. Previously, threads that don't do > much could continue sipping on their 1ms of min_cfs_rq_runtime, while > main threads were throttled. With this the application would more or > less halt within 5ms of full quota usage. > >> In addition how much this helps is very dependent on the exact pattern >> of sleep/wake - you can still strand all but 15ms of runtime with a >> pretty reasonable pattern. > > I thought this change would have an upper bound stranded time of: > NUMCPUs * min_cfs_rq_runtime - 3 * sched_cfs_bandwidth_slice(). > From the stranding of 1ms on each queue minues the amount that we'd > return when we forcibly hit the 3 x bandwidth slice Am I missing > something here? Additionally that's worst case, and would require that > threads be scheduled on distinct cpus mostly sequentially, and then > never run again. 
Yeah, to be exact it's that. It's just that this is only relevant at all when NUMCPUs*min_cfs_rq_runtime is at least a significant fraction of quota, and when "scheduled on distinct cpus, and then never run again" happens. How sequential it needs to be depends on the threads vs num cpus. > >> If the cost of taking this global lock across all cpus without a >> ratelimit was somehow not a problem, I'd much prefer to just set >> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie >> and sorta have 2x period 2x runtime" solution of removing expiration) > > Can I take this as an technical ack of the v3 patchset? Do you have any > other comments for that patchset you'd like to see corrected before it > gets integrated? If you are indeed good with that patchset, I'll clean > up the comment and Documentation text and re-submit as a v5 patchset. I > > actually like that patchset best as well, for all the reasons already > discussed. Especially the one where it's been running like this for 5 > years prior to 512ac999. > > Also given my new-found understanding of min_cfs_rq_runtime, doesn't > this mean that if we don't expire runtime we could potentially have a > worst case of 1ms * NUMCPUs of extra runtime instead of 2x runtime? > This is because min_cfs_rq_runtime could theoretically live on a per-cpu > indefinitely? Still this is actually much smaller than I previously had > feared. Also it still holds that average runtime is still bounded by > runtime/period on average. Yeah, I think I was misremembering or thinking only about one example or something. Meanwhile I'm thinking this is potentially /worse/ than just 2x runtime per 2x period, though obviously it depends on cpus vs quota. One thing I'm trying now is if we made the slack timer go and collect all the min_cfs_rq_runtime left behind, which would provide something similar to the min=0 case for fixing not only this but other cases of stranding bandwidth, but also some level of rate limiting. 
> > Thanks ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk ` (2 preceding siblings ...) [not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com> @ 2019-06-27 19:09 ` Dave Chiluk 2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk 2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk 5 siblings, 0 replies; 40+ messages in thread From: Dave Chiluk @ 2019-06-27 19:09 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to threads being allocated per cpu bandwidth slices, and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this has been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That commit added the following conditional which resulted in slices never being expired. 
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on an 80 CPU machine), this commit resulted in an almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. 
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++------- kernel/sched/fair.c | 71 +++------------------------------- kernel/sched/sched.h | 4 -- 3 files changed, 65 insertions(+), 83 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index f6b1873..4c19c94 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. - -A group's unused runtime is globally tracked, being refreshed with quota units -above at each period boundary. As threads consume this bandwidth it is -transferred to cpu-local "silos" on a demand basis. The amount transferred +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. That quota is assigned to per cpu run queues in +slices as threads in the cgroup become runnable. Once all quota has been +assigned any additional requests for quota will result in those threads being +throttled. Throttled threads will not be able to run again until the next +period when the quota is replenished. + +A group's unassigned quota is globally tracked, being refreshed back to +cfs_quota units at each period boundary. As threads consume this bandwidth it +is transferred to cpu-local "silos" on a demand basis. 
The amount transferred within each of these updates is tunable and described as the "slice". Management @@ -33,12 +34,12 @@ The default values are: A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained -bandwidth group. This represents the traditional work-conserving behavior for +bandwidth group. This represents the traditional work-conserving behavior for CFS. Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when +The minimum quota allowed for the quota or period is 1ms. There is also an +upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below. @@ -51,8 +52,8 @@ unthrottled if it is in a constrained state. System wide settings -------------------- For efficiency run-time is transferred between the global pool and CPU local -"silos" in a batch fashion. This greatly reduces global accounting pressure -on large systems. The amount transferred each time such an update is required +"silos" in a batch fashion. This greatly reduces global accounting pressure +on large systems. The amount transferred each time such an update is required is described as the "slice". This is tunable via procfs: @@ -90,6 +91,50 @@ There are two ways in which a group may become throttled: In case b) above, even though the child may have runtime remaining it will not be allowed to until the parent's runtime is refreshed. +Real-world behavior +------------------- +Once a slice is assigned to a cpu it does not expire, but all except 1ms of it +may be returned to the global pool if all threads on that cpu become +unrunnable. This is configured at compile time by the min_cfs_rq_runtime +variable. 
+ +The fact that cpu-local slices do not expire results in some interesting corner +cases that should be understood. + +For cgroup cpu constrained applications that are cpu limited this is a +relatively moot point because they will naturally consume the entirety of their +quota as well as the entirety of each cpu-local slice in each period. As a +result it is expected that nr_periods roughly equal nr_throttled, and that +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. + +However in a worst-case scenario, highly-threaded, interactive/non-cpu bound +applications this non-expiration nuance allows applications to briefly burst +past their quota limits by the amount of unused slice on each cpu that the task +group is running on (typically at most 1ms per cpu). This slight burst +only applies if quota had been assigned to a cpu and then not fully used or +returned in previous periods. This burst amount will not be transferred +between cores. As a result, this mechanism still strictly limits the task +group to quota average usage, albeit over a longer time window than period. +This also limits the burst ability to no more than 1ms per cpu. This provides +better more predictable user experience for highly threaded applications with +small quota limits on high core count machines. It also eliminates the +propensity to throttle these applications while simultanously using less than +quota amounts of cpu. Another way to say this, is that by allowing the unused +portion of a slice to remain valid across periods we have decreased the +possibility of wasting quota on cpu-local silos that don't need a full slice's +amount of cpu time. + +The interaction between cpu-bound and non-cpu-bound-interactive applications +should also be considered, especially when single core usage hits 100%. 
If you +gave each of these applications half of a cpu-core and they both got scheduled +on the same CPU it is theoretically possible that the non-cpu bound application +will use up to 1ms additional quota in some periods, thereby preventing the +cpu-bound application from fully using its quota by that same amount. In these +instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to +decide which application is chosen to run, as they will both be runnable and +have remaining quota. This runtime discrepancy will be made up in the following +periods when the interactive application idles. + Examples -------- 1. Limit a group to 1 CPU worth of runtime. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..a675c69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to account for allowed - * spread between our sched_clock and the one on which runtime was - * issued. 
- */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. - */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. - */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ 
if (cfs_rq->runtime_remaining > 0) @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no need to continue the timer with no bandwidth constraint */ @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. This can potentially race with an unthrottled group @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = 
cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..0c0ed23 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,8 +341,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; short idle; short period_active; @@ -562,8 +560,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v5 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk ` (3 preceding siblings ...) 2019-06-27 19:09 ` [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk @ 2019-06-27 19:49 ` Dave Chiluk 2019-06-27 19:49 ` [PATCH v5 1/1] " Dave Chiluk 2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk 5 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc

Changelog v5
- Based on this comment from Ben Segall on v4:
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
  I'm resubmitting my v3 patchset, with the requested changes.
- Updated commit log given review comments.
- Updated sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per feedback from Peter Oskolkov.

Changelog v2
- Fixed some checkpatch errors in the commit message.

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk @ 2019-06-27 19:49 ` Dave Chiluk 2019-07-01 20:15 ` bsegall 2019-07-16 19:58 ` bsegall 0 siblings, 2 replies; 40+ messages in thread From: Dave Chiluk @ 2019-06-27 19:49 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc

It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores.

This has been root caused to threads being allocated per-cpu bandwidth slices and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota they have been allocated.

The expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that, it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")', which was introduced in v3.16-rc1 in 2014. That commit added the following conditional, which resulted in slices never being expired.
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has only recently been fixed and noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary, so threads on runqueues that do not use much CPU can continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. This helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota.

This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue, typically no more than min_cfs_rq_runtime=1ms per cpu.

For CPU-bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu-bound applications, but they are still accurate over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on an 80 CPU machine), this commit resulted in an almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest.
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> --- Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++------- kernel/sched/fair.c | 71 +++------------------------------- kernel/sched/sched.h | 4 -- 3 files changed, 65 insertions(+), 83 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt index f6b1873..4c19c94 100644 --- a/Documentation/scheduler/sched-bwc.txt +++ b/Documentation/scheduler/sched-bwc.txt @@ -8,15 +8,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. - -A group's unused runtime is globally tracked, being refreshed with quota units -above at each period boundary. As threads consume this bandwidth it is -transferred to cpu-local "silos" on a demand basis. The amount transferred +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. That quota is assigned to per cpu run queues in +slices as threads in the cgroup become runnable. Once all quota has been +assigned any additional requests for quota will result in those threads being +throttled. Throttled threads will not be able to run again until the next +period when the quota is replenished. + +A group's unassigned quota is globally tracked, being refreshed back to +cfs_quota units at each period boundary. As threads consume this bandwidth it +is transferred to cpu-local "silos" on a demand basis. 
The amount transferred within each of these updates is tunable and described as the "slice". Management @@ -33,12 +34,12 @@ The default values are: A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained -bandwidth group. This represents the traditional work-conserving behavior for +bandwidth group. This represents the traditional work-conserving behavior for CFS. Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when +The minimum quota allowed for the quota or period is 1ms. There is also an +upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below. @@ -51,8 +52,8 @@ unthrottled if it is in a constrained state. System wide settings -------------------- For efficiency run-time is transferred between the global pool and CPU local -"silos" in a batch fashion. This greatly reduces global accounting pressure -on large systems. The amount transferred each time such an update is required +"silos" in a batch fashion. This greatly reduces global accounting pressure +on large systems. The amount transferred each time such an update is required is described as the "slice". This is tunable via procfs: @@ -90,6 +91,50 @@ There are two ways in which a group may become throttled: In case b) above, even though the child may have runtime remaining it will not be allowed to until the parent's runtime is refreshed. +Real-world behavior +------------------- +Once a slice is assigned to a cpu it does not expire, but all except 1ms of it +may be returned to the global pool if all threads on that cpu become +unrunnable. This is configured at compile time by the min_cfs_rq_runtime +variable. 
+ +The fact that cpu-local slices do not expire results in some interesting corner +cases that should be understood. + +For cgroup cpu constrained applications that are cpu limited this is a +relatively moot point because they will naturally consume the entirety of their +quota as well as the entirety of each cpu-local slice in each period. As a +result it is expected that nr_periods roughly equal nr_throttled, and that +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. + +However in a worst-case scenario, highly-threaded, interactive/non-cpu bound +applications this non-expiration nuance allows applications to briefly burst +past their quota limits by the amount of unused slice on each cpu that the task +group is running on (typically at most 1ms per cpu). This slight burst +only applies if quota had been assigned to a cpu and then not fully used or +returned in previous periods. This burst amount will not be transferred +between cores. As a result, this mechanism still strictly limits the task +group to quota average usage, albeit over a longer time window than period. +This also limits the burst ability to no more than 1ms per cpu. This provides +better more predictable user experience for highly threaded applications with +small quota limits on high core count machines. It also eliminates the +propensity to throttle these applications while simultanously using less than +quota amounts of cpu. Another way to say this, is that by allowing the unused +portion of a slice to remain valid across periods we have decreased the +possibility of wasting quota on cpu-local silos that don't need a full slice's +amount of cpu time. + +The interaction between cpu-bound and non-cpu-bound-interactive applications +should also be considered, especially when single core usage hits 100%. 
If you +gave each of these applications half of a cpu-core and they both got scheduled +on the same CPU it is theoretically possible that the non-cpu bound application +will use up to 1ms additional quota in some periods, thereby preventing the +cpu-bound application from fully using its quota by that same amount. In these +instances it will be up to the CFS algorithm (see sched-design-CFS.txt) to +decide which application is chosen to run, as they will both be runnable and +have remaining quota. This runtime discrepancy will be made up in the following +periods when the interactive application idles. + Examples -------- 1. Limit a group to 1 CPU worth of runtime. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f35930f..a675c69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4295,8 +4295,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4318,8 +4316,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4336,61 +4333,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to account for allowed - * spread between our sched_clock and the one on which runtime was - * issued. 
- */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. - */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. - */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4581,8 +4534,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4604,7 +4556,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ 
if (cfs_rq->runtime_remaining > 0) @@ -4629,7 +4580,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no need to continue the timer with no bandwidth constraint */ @@ -4657,8 +4608,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. This can potentially race with an unthrottled group @@ -4671,8 +4620,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4749,8 +4697,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4783,7 +4730,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4800,7 +4746,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = 
cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4969,8 +4912,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b52ed1a..0c0ed23 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -341,8 +341,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; short idle; short period_active; @@ -562,8 +560,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-06-27 19:49 ` [PATCH v5 1/1] " Dave Chiluk @ 2019-07-01 20:15 ` bsegall 2019-07-11 9:51 ` Peter Zijlstra 2019-07-16 19:58 ` bsegall 1 sibling, 1 reply; 40+ messages in thread From: bsegall @ 2019-07-01 20:15 UTC (permalink / raw) To: Dave Chiluk Cc: Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner

Alright, this prototype of "maybe we should just 100% avoid stranding runtime for a full period" appears to fix the fibtest synthetic example, and seems like a theoretically-reasonable approach. Things that may want improvement or at least thought (but it's a holiday week in the US and I wanted any opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually exclusive with throttled_list, but we have to iterate without locks, so I don't know that we can avoid it.
- I previously was using _rcu for the slack list, like throttled, but there is no list_for_each_entry_rcu_safe, so the list_del_init would be invalid and we'd have to use another flag or open-code the equivalent.
- (Actually, this just made me realize that distribute is sorta wrong if the unthrottled task immediately runs and rethrottles; this would just mean that we effectively restart the loop.)
- We unconditionally start the slack timer, even if nothing is throttled. We could instead have throttle start the timer in this case (setting the timeout to some definition of "appropriately"), but this bookkeeping would be a big hassle.
- We could try to do better about deciding which cfs_rqs are idle than "nr_running == 0", possibly requiring that to have been true for N<5 ms, and restarting the slack timer if we didn't clear everything.
- Taking up to every rq->lock is bad and expensive, and 5ms may be too short a delay for this.
I haven't tried microbenchmarks on the cost of this vs min_cfs_rq_runtime = 0 vs baseline. - runtime_expires vs expires_seq choice is basically rand(), much like the existing code. (I think the most consistent with existing code would be runtime_expires, since cfs_b lock is held; I think most consistent in general would change most of the existing ones as well to be seq) -- >8 -- Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent wake/sleep can't hit this cross-cpu lock more than once per ms. This however means that up to 1ms of runtime per cpu can be lost if no task does wake up on that cpu, which is leading to issues on cgroups with low quota, many available cpus, and a combination of threads that run for very little time and ones that want to run constantly. This was previously hidden by runtime expiration being broken, which allowed this stranded runtime to be kept indefinitely across period resets. Thus after an initial period or two any long-running tasks could use an appropriate portion of their group's quota. The issue was that the group could also potentially burst for 1ms * cpus more than their quota allowed, and in these situations this is a significant increase. Fix this by having the group's slack timer (which runs at most once per 5ms) remove all runtime from empty cfs_rqs, not just redistribute any runtime above that 1ms that was returned immediately. 
Signed-off-by: Ben Segall <bsegall@google.com> --- kernel/sched/fair.c | 66 +++++++++++++++++++++++++++++++++++--------- kernel/sched/sched.h | 2 ++ 2 files changed, 55 insertions(+), 13 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 300f2c54dea5..80b2198d9b29 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime; - if (slack_runtime <= 0) + if (cfs_rq->runtime_remaining <= 0) + return; + + if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list)) return; raw_spin_lock(&cfs_b->lock); if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { - cfs_b->runtime += slack_runtime; + cfs_rq->expires_seq == cfs_b->expires_seq) { + if (slack_runtime > 0) + cfs_b->runtime += slack_runtime; + if (list_empty(&cfs_rq->slack_list)) + list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq); - /* we are under rq->lock, defer unthrottling using a timer */ - if (cfs_b->runtime > sched_cfs_bandwidth_slice() && - !list_empty(&cfs_b->throttled_cfs_rq)) - start_cfs_slack_bandwidth(cfs_b); + /* + * After a timeout, gather our remaining runtime so it can't get + * stranded. We need a timer anyways to distribute any of the + * runtime due to locking issues. 
+ */ + start_cfs_slack_bandwidth(cfs_b); } raw_spin_unlock(&cfs_b->lock); - /* even if it's not valid for return we don't want to try again */ - cfs_rq->runtime_remaining -= slack_runtime; + if (slack_runtime > 0) + /* even if it's not valid for return we don't want to try again */ + cfs_rq->runtime_remaining -= slack_runtime; } static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) */ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { - u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); + u64 runtime = 0; unsigned long flags; u64 expires; + struct cfs_rq *cfs_rq, *temp; + LIST_HEAD(temp_head); + + local_irq_save(flags); + + raw_spin_lock(&cfs_b->lock); + cfs_b->slack_started = false; + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head); + raw_spin_unlock(&cfs_b->lock); + + + /* Gather all left over runtime from all rqs */ + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) { + struct rq *rq = rq_of(cfs_rq); + struct rq_flags rf; + + rq_lock(rq, &rf); + + raw_spin_lock(&cfs_b->lock); + list_del_init(&cfs_rq->slack_list); + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 && + cfs_rq->runtime_expires == cfs_b->runtime_expires) { + cfs_b->runtime += cfs_rq->runtime_remaining; + cfs_rq->runtime_remaining = 0; + } + raw_spin_unlock(&cfs_b->lock); + + rq_unlock(rq, &rf); + } /* confirm we're still not at a refresh boundary */ - raw_spin_lock_irqsave(&cfs_b->lock, flags); + raw_spin_lock(&cfs_b->lock); cfs_b->slack_started = false; if (cfs_b->distribute_running) { raw_spin_unlock_irqrestore(&cfs_b->lock, flags); @@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) return; } - if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) + if (cfs_b->quota != RUNTIME_INF) runtime = cfs_b->runtime; expires = cfs_b->runtime_expires; @@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct 
cfs_bandwidth *cfs_b) cfs_b->period = ns_to_ktime(default_cfs_period()); INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq); + INIT_LIST_HEAD(&cfs_b->slack_cfs_rq); hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); cfs_b->period_timer.function = sched_cfs_period_timer; hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); @@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) { cfs_rq->runtime_enabled = 0; INIT_LIST_HEAD(&cfs_rq->throttled_list); + INIT_LIST_HEAD(&cfs_rq->slack_list); } void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b08dee29ef5e..3b272ee894fb 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -345,6 +345,7 @@ struct cfs_bandwidth { struct hrtimer period_timer; struct hrtimer slack_timer; struct list_head throttled_cfs_rq; + struct list_head slack_cfs_rq; /* Statistics: */ int nr_periods; @@ -566,6 +567,7 @@ struct cfs_rq { int throttled; int throttle_count; struct list_head throttled_list; + struct list_head slack_list; #endif /* CONFIG_CFS_BANDWIDTH */ #endif /* CONFIG_FAIR_GROUP_SCHED */ }; -- 2.22.0.410.gd8fdbe21b5-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-01 20:15 ` bsegall @ 2019-07-11 9:51 ` Peter Zijlstra 2019-07-11 17:46 ` bsegall 0 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2019-07-11 9:51 UTC (permalink / raw) To: bsegall Cc: Dave Chiluk, Pqhil Auld, Peter Oskolkov, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner FWIW, good to see progress, still waiting for you guys to agree :-) On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote: > - Taking up-to-every rq->lock is bad and expensive and 5ms may be too > short a delay for this. I haven't tried microbenchmarks on the cost of > this vs min_cfs_rq_runtime = 0 vs baseline. Yes, that's tricky, SGI/HPE have definite ideas about that. > @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > */ > static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > { > - u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > + u64 runtime = 0; > unsigned long flags; > u64 expires; > + struct cfs_rq *cfs_rq, *temp; > + LIST_HEAD(temp_head); > + > + local_irq_save(flags); > + > + raw_spin_lock(&cfs_b->lock); > + cfs_b->slack_started = false; > + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head); > + raw_spin_unlock(&cfs_b->lock); > + > + > + /* Gather all left over runtime from all rqs */ > + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) { > + struct rq *rq = rq_of(cfs_rq); > + struct rq_flags rf; > + > + rq_lock(rq, &rf); > + > + raw_spin_lock(&cfs_b->lock); > + list_del_init(&cfs_rq->slack_list); > + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 && > + cfs_rq->runtime_expires == cfs_b->runtime_expires) { > + cfs_b->runtime += cfs_rq->runtime_remaining; > + cfs_rq->runtime_remaining = 0; > + } > + raw_spin_unlock(&cfs_b->lock); > + > + 
rq_unlock(rq, &rf); > + } But worse still, you take possibly every rq->lock without ever re-enabling IRQs. > > /* confirm we're still not at a refresh boundary */ > - raw_spin_lock_irqsave(&cfs_b->lock, flags); > + raw_spin_lock(&cfs_b->lock); > cfs_b->slack_started = false; > if (cfs_b->distribute_running) { > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-11 9:51 ` Peter Zijlstra @ 2019-07-11 17:46 ` bsegall [not found] ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com> 0 siblings, 1 reply; 40+ messages in thread From: bsegall @ 2019-07-11 17:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Chiluk, Pqhil Auld, Peter Oskolkov, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner Peter Zijlstra <peterz@infradead.org> writes: > FWIW, good to see progress, still waiting for you guys to agree :-) > > On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote: > >> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too >> short a delay for this. I haven't tried microbenchmarks on the cost of >> this vs min_cfs_rq_runtime = 0 vs baseline. > > Yes, that's tricky, SGI/HPE have definite ideas about that. 
> >> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) >> */ >> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) >> { >> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); >> + u64 runtime = 0; >> unsigned long flags; >> u64 expires; >> + struct cfs_rq *cfs_rq, *temp; >> + LIST_HEAD(temp_head); >> + >> + local_irq_save(flags); >> + >> + raw_spin_lock(&cfs_b->lock); >> + cfs_b->slack_started = false; >> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head); >> + raw_spin_unlock(&cfs_b->lock); >> + >> + >> + /* Gather all left over runtime from all rqs */ >> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) { >> + struct rq *rq = rq_of(cfs_rq); >> + struct rq_flags rf; >> + >> + rq_lock(rq, &rf); >> + >> + raw_spin_lock(&cfs_b->lock); >> + list_del_init(&cfs_rq->slack_list); >> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 && >> + cfs_rq->runtime_expires == cfs_b->runtime_expires) { >> + cfs_b->runtime += cfs_rq->runtime_remaining; >> + cfs_rq->runtime_remaining = 0; >> + } >> + raw_spin_unlock(&cfs_b->lock); >> + >> + rq_unlock(rq, &rf); >> + } > > But worse still, you take possibly every rq->lock without ever > re-enabling IRQs. > Yeah, I'm not sure why I did that, it isn't correctness. ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>]
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices [not found] ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com> @ 2019-07-12 18:01 ` bsegall 2019-07-12 22:09 ` bsegall 1 sibling, 0 replies; 40+ messages in thread From: bsegall @ 2019-07-12 18:01 UTC (permalink / raw) To: Dave Chiluk Cc: Peter Zijlstra, Phil Auld, Peter Oskolkov, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner Dave Chiluk <chiluk+linux@indeed.com> writes: > So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing. > > In regards to the quota usage overage I was seeing earlier: I have a > theory as to what might be happening here, and I'm pretty sure it's > related to the IRQs being disabled during the rq->lock walk. I think > that the main fast thread was able to use an excess amount of quota > because the timer interrupt meant to stop it wasn't being handled > timely due to the interrupts being disabled. On my 8 core machine this > resulted in what looked like simply improved usage of the quota, but > when I ran the test on an 80 core machine I saw a massive overage of > cpu usage when running fibtest. Specifically when running fibtest for > 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota > usage; I got 3731 ms of cpu usage which was an unexpected overage of > 1231ms. Is that a reasonable theory? That doesn't seem likely - taking 1ms would be way longer than I'd expect to begin with, and runtime_remaining can go negative for that sort of reason anyways assuming the irq time is even counted towards the task. Also I don't think that the enable-irqs version will help for the scheduler tick at least without rt patchsets. 
That is still also too much for what I was thinking of though. I'll have to look into this more. > > I'll try to get some time again tomorrow to test with IRQs disabled before the walk. Ben if you have a chance to fix and resend the patch that'd help. > > I'm really starting to think that simply removing the quota expiration > may be the best solution here. Mathematically it works out, it makes > the code simpler, it doesn't have any of the lock walk issues, it > doesn't add extra latency or overhead due to the slack timer, It works out _for the job that is supposed to be throttled_. If the job then gets a burst of actually-expensive work on many threads it can then use NCPUs extra ms, adding latency to any other job on the system. Given that it's still only 1ms on each runqueue, maybe this isn't the end of the world, but the fail case does exist. (We have to do exactly the same locking stuff on distribute, both more rarely on the period timer, and on the currently existing slack timer) > and that behavior is exactly what the kernel was doing for 5 years with few complaints about overage afaik. > > Either way, I'm very glad that we are getting to the end of this one, and all solutions appear to solve the core of the problem. I thank you all for the work you guys have put into this. > > On Thu, Jul 11, 2019 at 12:46 PM <bsegall@google.com> wrote: > > Peter Zijlstra <peterz@infradead.org> writes: > > > FWIW, good to see progress, still waiting for you guys to agree :-) > > > > On Mon, Jul 01, 2019 at 01:15:44PM -0700, bsegall@google.com wrote: > > > >> - Taking up-to-every rq->lock is bad and expensive and 5ms may be too > >> short a delay for this. I haven't tried microbenchmarks on the cost of > >> this vs min_cfs_rq_runtime = 0 vs baseline. > > > > Yes, that's tricky, SGI/HPE have definite ideas about that. 
> > > >> @@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > >> */ > >> static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > >> { > >> - u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > >> + u64 runtime = 0; > >> unsigned long flags; > >> u64 expires; > >> + struct cfs_rq *cfs_rq, *temp; > >> + LIST_HEAD(temp_head); > >> + > >> + local_irq_save(flags); > >> + > >> + raw_spin_lock(&cfs_b->lock); > >> + cfs_b->slack_started = false; > >> + list_splice_init(&cfs_b->slack_cfs_rq, &temp_head); > >> + raw_spin_unlock(&cfs_b->lock); > >> + > >> + > >> + /* Gather all left over runtime from all rqs */ > >> + list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) { > >> + struct rq *rq = rq_of(cfs_rq); > >> + struct rq_flags rf; > >> + > >> + rq_lock(rq, &rf); > >> + > >> + raw_spin_lock(&cfs_b->lock); > >> + list_del_init(&cfs_rq->slack_list); > >> + if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 && > >> + cfs_rq->runtime_expires == cfs_b->runtime_expires) { > >> + cfs_b->runtime += cfs_rq->runtime_remaining; > >> + cfs_rq->runtime_remaining = 0; > >> + } > >> + raw_spin_unlock(&cfs_b->lock); > >> + > >> + rq_unlock(rq, &rf); > >> + } > > > > But worse still, you take possibly every rq->lock without ever > > re-enabling IRQs. > > > > Yeah, I'm not sure why I did that, it isn't correctness. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices [not found] ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com> 2019-07-12 18:01 ` bsegall @ 2019-07-12 22:09 ` bsegall 2019-07-15 15:44 ` Dave Chiluk 1 sibling, 1 reply; 40+ messages in thread From: bsegall @ 2019-07-12 22:09 UTC (permalink / raw) To: Dave Chiluk Cc: Peter Zijlstra, Pqhil Auld, Peter Oskolkov, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner Dave Chiluk <chiluk+linux@indeed.com> writes: > So I spent some more time testing this new patch as is *(interrupts disabled). I know I probably should have fixed the patch, but it's hard to get time on big test hardware sometimes, and I was already well along my way with testing. > > In regards to the quota usage overage I was seeing earlier: I have a theory as to what might be happening here, and I'm pretty sure it's related to the IRQs being disabled during the rq->lock walk. I think that the main fast thread was able to use an excess amount > of quota because the timer interrupt meant to stop it wasn't being handled timely due to the interrupts being disabled. On my 8 core machine this resulted in a what looked like simply improved usage of the quota, but when I ran the test on an 80 core machine I > saw a massive overage of cpu usage when running fibtest. Specifically when running fibtest for 5 seconds with 50ms quota/100ms period expecting ~2500ms of quota usage; I got 3731 ms of cpu usage which was an unexpected overage of 1231ms. Is that a > reasonable theory? I think I've figured out what's going on here (and a related issue that gave me some inconsistency when trying to debug it): other "slow" threads can wake up while the slack timer is in distribute and double-spend some runtime. 
Since we lsub_positive rather than allow cfs_b->runtime to be negative this double-spending is permanent, and can go on indefinitely. In addition, if things fall out in a slightly different way, all the "slow" threads can wind up getting on cpu and claiming slices of runtime before the "fast" thread, and then it just has to wait another slack period to hope that the ordering winds up better that time. This just depends on things like IPI latency and maybe what order things happened to happen at the start of the period. Ugh. Maybe we /do/ just give up and say that most people don't seem to be using cfs_b in a way that expiration of the leftover 1ms matters. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-12 22:09 ` bsegall @ 2019-07-15 15:44 ` Dave Chiluk 0 siblings, 0 replies; 40+ messages in thread From: Dave Chiluk @ 2019-07-15 15:44 UTC (permalink / raw) To: Ben Segall Cc: Peter Zijlstra, Phil Auld, Peter Oskolkov, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc, Paul Turner On Fri, Jul 12, 2019 at 5:10 PM <bsegall@google.com> wrote: > Ugh. Maybe we /do/ just give up and say that most people don't seem to > be using cfs_b in a way that expiration of the leftover 1ms matters. That was my conclusion as well. Does this mean you want to proceed with my patch set? Do you have any changes you want made to the proposed documentation changes, or any other changes for that matter? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-06-27 19:49 ` [PATCH v5 1/1] " Dave Chiluk 2019-07-01 20:15 ` bsegall @ 2019-07-16 19:58 ` bsegall 1 sibling, 0 replies; 40+ messages in thread From: bsegall @ 2019-07-16 19:58 UTC (permalink / raw) To: Dave Chiluk Cc: Pqhil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Dave Chiluk <chiluk+linux@indeed.com> writes: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. 
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows quota already allocated to per-cpu run-queues to live longer > than the period boundary. This allows threads on runqueues that do not > use much CPU to continue to use their remaining slice over a longer > period of time than cpu.cfs_period_us. However, this helps prevents the > above condition of hitting throttling while also not fully utilizing > your cpu quota. > > This theoretically allows a machine to use slightly more than its > allotted quota in some periods. This overflow would be bounded by the > remaining quota left on each per-cpu runqueueu. This is typically no > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will > change nothing, as they should theoretically fully utilize all of their > quota in each period. For user-interactive tasks as described above this > provides a much better user/application experience as their cpu > utilization will more closely match the amount they requested when they > hit throttling. This means that cpu limits no longer strictly apply per > period for non-cpu bound applications, but that they are still accurate > over longer timeframes. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase (10ms/100ms of quota on > 80 CPU machine), this commit resulted in almost 30x performance > improvement, while still maintaining correct cpu quota restrictions. 
> That testcase is available at https://github.com/indeedeng/fibtest. > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > --- > Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++------- > kernel/sched/fair.c | 71 +++------------------------------- > kernel/sched/sched.h | 4 -- > 3 files changed, 65 insertions(+), 83 deletions(-) > > [...] > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f35930f..a675c69 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (!runtime) > return; > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > - if (expires == cfs_b->runtime_expires) > - lsub_positive(&cfs_b->runtime, runtime); The lsub_positive is still needed, just get rid of the if. Other than that, Reviewed-by: Ben Segall <bsegall@google.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk ` (4 preceding siblings ...) 2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk @ 2019-07-23 16:44 ` Dave Chiluk 2019-07-23 16:44 ` [PATCH v6 1/1] " Dave Chiluk 5 siblings, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Changelog v6 - Added back missing call to lsub_positive(&cfs_b->runtime, runtime); - Added Reviewed-by: Ben Segall <bsegall@google.com> - Fix some grammar in the Documentation, and change some wording. - Updated documentation due to the .rst change Changelog v5 - Based on this comment from Ben Segall's comment on v4 > If the cost of taking this global lock across all cpus without a > ratelimit was somehow not a problem, I'd much prefer to just set > min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie > and sorta have 2x period 2x runtime" solution of removing expiration) I'm resubmitting my v3 patchset, with the requested changes. - Updated Commit log given review comments - Update sched-bwc.txt give my new understanding of the slack timer. Changelog v4 - Rewrote patchset around the concept of returning all of runtime_remaining when cfs_b nears the end of available quota. Changelog v3 - Reworked documentation to better describe behavior of slice expiration per feedback from Peter Oskolkov Changelog v2 - Fixed some checkpatch errors in the commit message. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk @ 2019-07-23 16:44 ` Dave Chiluk 2019-07-23 17:13 ` Phil Auld 2019-08-08 10:53 ` [tip:sched/core] " tip-bot for Dave Chiluk 0 siblings, 2 replies; 40+ messages in thread From: Dave Chiluk @ 2019-07-23 16:44 UTC (permalink / raw) To: Ben Segall, Phil Auld, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to cpu-local run queues being allocated per-cpu bandwidth slices, and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired. 
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on an 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. 
Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> Reviewed-by: Ben Segall <bsegall@google.com> --- Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++------- kernel/sched/fair.c | 72 ++++------------------------------ kernel/sched/sched.h | 4 -- 3 files changed, 67 insertions(+), 83 deletions(-) diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst index 3a90642..9801d6b 100644 --- a/Documentation/scheduler/sched-bwc.rst +++ b/Documentation/scheduler/sched-bwc.rst @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the specification of the maximum CPU bandwidth available to a group or hierarchy. The bandwidth allowed for a group is specified using a quota and period. Within -each given "period" (microseconds), a group is allowed to consume only up to -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a -group exceeds this limit (for that period), the tasks belonging to its -hierarchy will be throttled and are not allowed to run again until the next -period. - -A group's unused runtime is globally tracked, being refreshed with quota units -above at each period boundary. As threads consume this bandwidth it is -transferred to cpu-local "silos" on a demand basis. The amount transferred +each given "period" (microseconds), a task group is allocated up to "quota" +microseconds of CPU time. That quota is assigned to per-cpu run queues in +slices as threads in the cgroup become runnable. Once all quota has been +assigned any additional requests for quota will result in those threads being +throttled. Throttled threads will not be able to run again until the next +period when the quota is replenished. + +A group's unassigned quota is globally tracked, being refreshed back to +cfs_quota units at each period boundary. 
As threads consume this bandwidth it +is transferred to cpu-local "silos" on a demand basis. The amount transferred within each of these updates is tunable and described as the "slice". Management @@ -35,12 +36,12 @@ The default values are:: A value of -1 for cpu.cfs_quota_us indicates that the group does not have any bandwidth restriction in place, such a group is described as an unconstrained -bandwidth group. This represents the traditional work-conserving behavior for +bandwidth group. This represents the traditional work-conserving behavior for CFS. Writing any (valid) positive value(s) will enact the specified bandwidth limit. -The minimum quota allowed for the quota or period is 1ms. There is also an -upper bound on the period length of 1s. Additional restrictions exist when +The minimum quota allowed for the quota or period is 1ms. There is also an +upper bound on the period length of 1s. Additional restrictions exist when bandwidth limits are used in a hierarchical fashion, these are explained in more detail below. @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state. System wide settings -------------------- For efficiency run-time is transferred between the global pool and CPU local -"silos" in a batch fashion. This greatly reduces global accounting pressure -on large systems. The amount transferred each time such an update is required +"silos" in a batch fashion. This greatly reduces global accounting pressure +on large systems. The amount transferred each time such an update is required is described as the "slice". This is tunable via procfs:: @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled: In case b) above, even though the child may have runtime remaining it will not be allowed to until the parent's runtime is refreshed. +CFS Bandwidth Quota Caveats +--------------------------- +Once a slice is assigned to a cpu it does not expire. 
However all but 1ms of +the slice may be returned to the global pool if all threads on that cpu become +unrunnable. This is configured at compile time by the min_cfs_rq_runtime +variable. This is a performance tweak that helps prevent added contention on +the global lock. + +The fact that cpu-local slices do not expire results in some interesting corner +cases that should be understood. + +For cgroup cpu constrained applications that are cpu limited this is a +relatively moot point because they will naturally consume the entirety of their +quota as well as the entirety of each cpu-local slice in each period. As a +result it is expected that nr_periods roughly equal nr_throttled, and that +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. + +For highly-threaded, non-cpu bound applications this non-expiration nuance +allows applications to briefly burst past their quota limits by the amount of +unused slice on each cpu that the task group is running on (typically at most +1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only +applies if quota had been assigned to a cpu and then not fully used or returned +in previous periods. This burst amount will not be transferred between cores. +As a result, this mechanism still strictly limits the task group to quota +average usage, albeit over a longer time window than a single period. This +also limits the burst ability to no more than 1ms per cpu. This provides +a better, more predictable user experience for highly threaded applications with +small quota limits on high core count machines. It also eliminates the +propensity to throttle these applications while simultaneously using less than +quota amounts of cpu. Another way to say this is that by allowing the unused +portion of a slice to remain valid across periods we have decreased the +possibility of wastefully expiring quota on cpu-local silos that don't need a +full slice's amount of cpu time. 
+ +The interaction between cpu-bound and non-cpu-bound-interactive applications +should also be considered, especially when single core usage hits 100%. If you +gave each of these applications half of a cpu-core and they both got scheduled +on the same CPU it is theoretically possible that the non-cpu bound application +will use up to 1ms additional quota in some periods, thereby preventing the +cpu-bound application from fully using its quota by that same amount. In these +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to +decide which application is chosen to run, as they will both be runnable and +have remaining quota. This runtime discrepancy will be made up in the following +periods when the interactive application idles. + Examples -------- 1. Limit a group to 1 CPU worth of runtime:: diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 036be95..00b68f0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) now = sched_clock_cpu(smp_processor_id()); cfs_b->runtime = cfs_b->quota; - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; } static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) { struct task_group *tg = cfs_rq->tg; struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); - u64 amount = 0, min_amount, expires; - int expires_seq; + u64 amount = 0, min_amount; /* note: this is a positive sum as runtime_remaining <= 0 */ min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) cfs_b->idle = 0; } } - expires_seq = cfs_b->expires_seq; - expires = cfs_b->runtime_expires; raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += amount; - /* - * we may have advanced our local expiration to account for allowed - * 
spread between our sched_clock and the one on which runtime was - * issued. - */ - if (cfs_rq->expires_seq != expires_seq) { - cfs_rq->expires_seq = expires_seq; - cfs_rq->runtime_expires = expires; - } return cfs_rq->runtime_remaining > 0; } -/* - * Note: This depends on the synchronization provided by sched_clock and the - * fact that rq->clock snapshots this value. - */ -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) -{ - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); - - /* if the deadline is ahead of our clock, nothing to do */ - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) - return; - - if (cfs_rq->runtime_remaining < 0) - return; - - /* - * If the local deadline has passed we have to consider the - * possibility that our sched_clock is 'fast' and the global deadline - * has not truly expired. - * - * Fortunately we can check determine whether this the case by checking - * whether the global deadline(cfs_b->expires_seq) has advanced. 
- */ - if (cfs_rq->expires_seq == cfs_b->expires_seq) { - /* extend local deadline, drift is bounded above by 2 ticks */ - cfs_rq->runtime_expires += TICK_NSEC; - } else { - /* global deadline is ahead, expiration has passed */ - cfs_rq->runtime_remaining = 0; - } -} - static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { /* dock delta_exec before expiring quota (as it could span periods) */ cfs_rq->runtime_remaining -= delta_exec; - expire_cfs_rq_runtime(cfs_rq); if (likely(cfs_rq->runtime_remaining > 0)) return; @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, - u64 remaining, u64 expires) +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) { struct cfs_rq *cfs_rq; u64 runtime; @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, remaining -= runtime; cfs_rq->runtime_remaining += runtime; - cfs_rq->runtime_expires = expires; /* we check whether we're throttled above */ if (cfs_rq->runtime_remaining > 0) @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime, runtime_expires; + u64 runtime; int throttled; /* no need to continue the timer with no bandwidth constraint */ @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u /* account preceding periods in which throttling occurred */ cfs_b->nr_throttled += overrun; - runtime_expires = cfs_b->runtime_expires; - /* * This check is repeated as we are holding onto the new bandwidth while * we unthrottle. 
This can potentially race with an unthrottled group @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime, - runtime_expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) return; raw_spin_lock(&cfs_b->lock); - if (cfs_b->quota != RUNTIME_INF && - cfs_rq->runtime_expires == cfs_b->runtime_expires) { + if (cfs_b->quota != RUNTIME_INF) { cfs_b->runtime += slack_runtime; /* we are under rq->lock, defer unthrottling using a timer */ @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) { u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); unsigned long flags; - u64 expires; /* confirm we're still not at a refresh boundary */ raw_spin_lock_irqsave(&cfs_b->lock, flags); @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) runtime = cfs_b->runtime; - expires = cfs_b->runtime_expires; if (runtime) cfs_b->distribute_running = 1; @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); + runtime = distribute_cfs_runtime(cfs_b, runtime); raw_spin_lock_irqsave(&cfs_b->lock, flags); - if (expires == cfs_b->runtime_expires) - lsub_positive(&cfs_b->runtime, runtime); + lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) cfs_b->period_active = 1; overrun = hrtimer_forward_now(&cfs_b->period_timer, 
cfs_b->period); - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); - cfs_b->expires_seq++; hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 802b1f3..28c16e9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -335,8 +335,6 @@ struct cfs_bandwidth { u64 quota; u64 runtime; s64 hierarchical_quota; - u64 runtime_expires; - int expires_seq; u8 idle; u8 period_active; @@ -556,8 +554,6 @@ struct cfs_rq { #ifdef CONFIG_CFS_BANDWIDTH int runtime_enabled; - int expires_seq; - u64 runtime_expires; s64 runtime_remaining; u64 throttled_clock; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
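[Editor's note: the behaviour the commit message describes can be illustrated with a toy model. This is NOT kernel code; it only mimics the mechanism under stated assumptions: a global pool refilled with 10ms of quota per 100ms period, 5ms slices pulled to cpu-local silos on demand, and 80 threads that each need only a brief burst of cpu per period, mirroring the artificial fibtest-style case. All names and numbers here are illustrative.]

```python
NCPUS = 80     # cpus the task group may run on (assumption)
QUOTA = 10.0   # ms of quota replenished into the global pool each period
SLICE = 5.0    # ms transferred from pool to a cpu-local silo per request
WORK = 0.2     # ms of cpu each thread actually needs per period

def run(periods, expire_slices):
    silo = [0.0] * NCPUS              # cpu-local runtime remaining
    used = throttled = 0.0
    for _ in range(periods):
        pool = QUOTA                  # quota replenished at period start
        for cpu in range(NCPUS):
            if silo[cpu] < WORK and pool > 0:
                grant = min(SLICE, pool)   # pull a slice on demand
                silo[cpu] += grant
                pool -= grant
            if silo[cpu] >= WORK:
                silo[cpu] -= WORK          # thread runs its short burst
                used += WORK
            else:
                throttled += 1             # runnable, but no local runtime
        if expire_slices:
            silo = [0.0] * NCPUS           # old behaviour: unused slice lost
    return used / periods, throttled / periods

old_used, old_thr = run(200, expire_slices=True)
new_used, new_thr = run(200, expire_slices=False)
```

In this toy model, expiring slices leaves the group using a small fraction of its quota while throttling most of its threads every period; with non-expiring slices, usage rises toward the quota and throttling falls, yet average usage per period still never exceeds the quota, matching the bound claimed in the commit message.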
* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-23 16:44 ` [PATCH v6 1/1] " Dave Chiluk @ 2019-07-23 17:13 ` Phil Auld 2019-07-23 22:12 ` Dave Chiluk 2019-07-26 18:14 ` Peter Zijlstra 2019-08-08 10:53 ` [tip:sched/core] " tip-bot for Dave Chiluk 1 sibling, 2 replies; 40+ messages in thread From: Phil Auld @ 2019-07-23 17:13 UTC (permalink / raw) To: Dave Chiluk Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc Hi Dave, On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to cpu-local run queue being allocated per cpu > bandwidth slices, and then not fully using that slice within the period. > At which point the slice and quota expires. This expiration of unused > slice results in applications not being able to utilize the quota for > which they are allocated. > > The non-expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this had been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. 
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows quota already allocated to per-cpu run-queues to live longer > than the period boundary. This allows threads on runqueues that do not > use much CPU to continue to use their remaining slice over a longer > period of time than cpu.cfs_period_us. However, this helps prevent the > above condition of hitting throttling while also not fully utilizing > your cpu quota. > > This theoretically allows a machine to use slightly more than its > allotted quota in some periods. This overflow would be bounded by the > remaining quota left on each per-cpu runqueueu. This is typically no > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will > change nothing, as they should theoretically fully utilize all of their > quota in each period. For user-interactive tasks as described above this > provides a much better user/application experience as their cpu > utilization will more closely match the amount they requested when they > hit throttling. This means that cpu limits no longer strictly apply per > period for non-cpu bound applications, but that they are still accurate > over longer timeframes. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase (10ms/100ms of quota on > 80 CPU machine), this commit resulted in almost 30x performance > improvement, while still maintaining correct cpu quota restrictions. 
> That testcase is available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> Reviewed-by: Ben Segall <bsegall@google.com>

This still works for me. The documentation reads pretty well, too. Good job.

Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.

I'll run it through some more tests when I have time. The code is the same as the earlier one I tested from what I can see.

Cheers,
Phil

> [snip: full v6 diff quoted in-line, identical to the patch above]

^ permalink raw reply [flat|nested] 40+ messages in thread
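[Editor's note: both messages take as given that users can observe "a high percentage of periods throttled". For completeness, a sketch of how that is measured from the cgroup-v1 cpu.stat file, which exposes nr_periods, nr_throttled and throttled_time (ns). The helper name and sample values are illustrative.]

```python
def throttle_stats(cpu_stat_text):
    """Return (fraction_of_periods_throttled, throttled_time_ns)."""
    fields = dict(line.split() for line in cpu_stat_text.splitlines() if line)
    nr_periods = int(fields["nr_periods"])
    nr_throttled = int(fields["nr_throttled"])
    throttled_ns = int(fields["throttled_time"])
    frac = nr_throttled / nr_periods if nr_periods else 0.0
    return frac, throttled_ns

# e.g. text read from /sys/fs/cgroup/cpu/<group>/cpu.stat
sample = "nr_periods 1000\nnr_throttled 900\nthrottled_time 45000000000\n"
frac, throttled_ns = throttle_stats(sample)
```

A throttle fraction near 1.0 combined with cpuacct.usage growing far slower than cfs_quota_us per period is exactly the symptom this patch addresses.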
* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-23 17:13 ` Phil Auld @ 2019-07-23 22:12 ` Dave Chiluk 2019-07-23 23:26 ` Phil Auld 2019-07-26 18:14 ` Peter Zijlstra 1 sibling, 1 reply; 40+ messages in thread From: Dave Chiluk @ 2019-07-23 22:12 UTC (permalink / raw) To: Phil Auld Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc

Thanks for all the help and testing you provided. It's good to know these changes have passed at least some scheduler regression tests. If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let Peter add it.

Will you be handling the backport into the RHEL 8 kernels? I'll submit this to Ubuntu and linux-stable once it gets accepted.

Thanks again,

On Tue, Jul 23, 2019 at 12:13 PM Phil Auld <pauld@redhat.com> wrote:
>
> [snip: Phil's ack, quoting the v6 commit message and patch in full, identical to the messages above]
> > + > > +For highly-threaded, non-cpu bound applications this non-expiration nuance > > +allows applications to briefly burst past their quota limits by the amount of > > +unused slice on each cpu that the task group is running on (typically at most > > +1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only > > +applies if quota had been assigned to a cpu and then not fully used or returned > > +in previous periods. This burst amount will not be transferred between cores. > > +As a result, this mechanism still strictly limits the task group to quota > > +average usage, albeit over a longer time window than a single period. This > > +also limits the burst ability to no more than 1ms per cpu. This provides > > +better, more predictable user experience for highly threaded applications with > > +small quota limits on high core count machines. It also eliminates the > > +propensity to throttle these applications while simultaneously using less than > > +quota amounts of cpu. Another way to say this is that by allowing the unused > > +portion of a slice to remain valid across periods we have decreased the > > +possibility of wastefully expiring quota on cpu-local silos that don't need a > > +full slice's amount of cpu time. > > + > > +The interaction between cpu-bound and non-cpu-bound-interactive applications > > +should also be considered, especially when single core usage hits 100%. If you > > +gave each of these applications half of a cpu-core and they both got scheduled > > +on the same CPU it is theoretically possible that the non-cpu bound application > > +will use up to 1ms additional quota in some periods, thereby preventing the > > +cpu-bound application from fully using its quota by that same amount. In these > > +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to > > +decide which application is chosen to run, as they will both be runnable and > > +have remaining quota. 
This runtime discrepancy will be made up in the following > > +periods when the interactive application idles. > > + > > Examples > > -------- > > 1. Limit a group to 1 CPU worth of runtime:: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 036be95..00b68f0 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) > > > > now = sched_clock_cpu(smp_processor_id()); > > cfs_b->runtime = cfs_b->quota; > > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); > > - cfs_b->expires_seq++; > > } > > > > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) > > @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > { > > struct task_group *tg = cfs_rq->tg; > > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); > > - u64 amount = 0, min_amount, expires; > > - int expires_seq; > > + u64 amount = 0, min_amount; > > > > /* note: this is a positive sum as runtime_remaining <= 0 */ > > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; > > @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > cfs_b->idle = 0; > > } > > } > > - expires_seq = cfs_b->expires_seq; > > - expires = cfs_b->runtime_expires; > > raw_spin_unlock(&cfs_b->lock); > > > > cfs_rq->runtime_remaining += amount; > > - /* > > - * we may have advanced our local expiration to account for allowed > > - * spread between our sched_clock and the one on which runtime was > > - * issued. > > - */ > > - if (cfs_rq->expires_seq != expires_seq) { > > - cfs_rq->expires_seq = expires_seq; > > - cfs_rq->runtime_expires = expires; > > - } > > > > return cfs_rq->runtime_remaining > 0; > > } > > > > -/* > > - * Note: This depends on the synchronization provided by sched_clock and the > > - * fact that rq->clock snapshots this value. 
> > - */ > > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > -{ > > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > > - > > - /* if the deadline is ahead of our clock, nothing to do */ > > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > > - return; > > - > > - if (cfs_rq->runtime_remaining < 0) > > - return; > > - > > - /* > > - * If the local deadline has passed we have to consider the > > - * possibility that our sched_clock is 'fast' and the global deadline > > - * has not truly expired. > > - * > > - * Fortunately we can check determine whether this the case by checking > > - * whether the global deadline(cfs_b->expires_seq) has advanced. > > - */ > > - if (cfs_rq->expires_seq == cfs_b->expires_seq) { > > - /* extend local deadline, drift is bounded above by 2 ticks */ > > - cfs_rq->runtime_expires += TICK_NSEC; > > - } else { > > - /* global deadline is ahead, expiration has passed */ > > - cfs_rq->runtime_remaining = 0; > > - } > > -} > > - > > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > > { > > /* dock delta_exec before expiring quota (as it could span periods) */ > > cfs_rq->runtime_remaining -= delta_exec; > > - expire_cfs_rq_runtime(cfs_rq); > > > > if (likely(cfs_rq->runtime_remaining > 0)) > > return; > > @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > resched_curr(rq); > > } > > > > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > - u64 remaining, u64 expires) > > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) > > { > > struct cfs_rq *cfs_rq; > > u64 runtime; > > @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > remaining -= runtime; > > > > cfs_rq->runtime_remaining += runtime; > > - cfs_rq->runtime_expires = expires; > > > > /* we check whether we're throttled above */ > > if (cfs_rq->runtime_remaining > 0) > > @@ -4650,7 +4601,7 @@ static u64 
distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > */ > > static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) > > { > > - u64 runtime, runtime_expires; > > + u64 runtime; > > int throttled; > > > > /* no need to continue the timer with no bandwidth constraint */ > > @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > > /* account preceding periods in which throttling occurred */ > > cfs_b->nr_throttled += overrun; > > > > - runtime_expires = cfs_b->runtime_expires; > > - > > /* > > * This check is repeated as we are holding onto the new bandwidth while > > * we unthrottle. This can potentially race with an unthrottled group > > @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > > cfs_b->distribute_running = 1; > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > > /* we can't nest cfs_b->lock while distributing bandwidth */ > > - runtime = distribute_cfs_runtime(cfs_b, runtime, > > - runtime_expires); > > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > > > cfs_b->distribute_running = 0; > > @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > return; > > > > raw_spin_lock(&cfs_b->lock); > > - if (cfs_b->quota != RUNTIME_INF && > > - cfs_rq->runtime_expires == cfs_b->runtime_expires) { > > + if (cfs_b->quota != RUNTIME_INF) { > > cfs_b->runtime += slack_runtime; > > > > /* we are under rq->lock, defer unthrottling using a timer */ > > @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > > { > > u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > > unsigned long flags; > > - u64 expires; > > > > /* confirm we're still not at a refresh boundary */ > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth 
*cfs_b) > > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) > > runtime = cfs_b->runtime; > > > > - expires = cfs_b->runtime_expires; > > if (runtime) > > cfs_b->distribute_running = 1; > > > > @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > > if (!runtime) > > return; > > > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > - if (expires == cfs_b->runtime_expires) > > - lsub_positive(&cfs_b->runtime, runtime); > > + lsub_positive(&cfs_b->runtime, runtime); > > cfs_b->distribute_running = 0; > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > > } > > @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) > > > > cfs_b->period_active = 1; > > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); > > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); > > - cfs_b->expires_seq++; > > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); > > } > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > index 802b1f3..28c16e9 100644 > > --- a/kernel/sched/sched.h > > +++ b/kernel/sched/sched.h > > @@ -335,8 +335,6 @@ struct cfs_bandwidth { > > u64 quota; > > u64 runtime; > > s64 hierarchical_quota; > > - u64 runtime_expires; > > - int expires_seq; > > > > u8 idle; > > u8 period_active; > > @@ -556,8 +554,6 @@ struct cfs_rq { > > > > #ifdef CONFIG_CFS_BANDWIDTH > > int runtime_enabled; > > - int expires_seq; > > - u64 runtime_expires; > > s64 runtime_remaining; > > > > u64 throttled_clock; > > -- > > 1.8.3.1 > > > > -- ^ permalink raw reply [flat|nested] 40+ messages in thread
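The throttling-despite-unused-quota behaviour that the patch above addresses can be sketched with a toy accounting model. Everything below is invented for illustration (a 10-thread group with one thread per cpu, a 10ms quota, 5ms slices, 1ms of work per thread per period, mimicking the fibtest-style workload); it is a cartoon of the bookkeeping described in the commit message, not kernel code:

```python
# Toy model of CFS bandwidth slices. All numbers are illustrative only.
QUOTA_MS = 10   # cpu.cfs_quota_us per period, expressed in milliseconds
SLICE_MS = 5    # amount moved from the global pool to a cpu-local silo
NEED_MS = 1     # each thread runs ~1ms per period, then sleeps

def run(periods, nthreads, expire_slices):
    """Simulate; return (total ms of work completed, throttle events)."""
    silo = [0] * nthreads            # cpu-local runtime, one thread per cpu
    done = throttled = 0
    for _ in range(periods):
        pool = QUOTA_MS              # global pool refills each period
        if expire_slices:
            silo = [0] * nthreads    # old behaviour: local slices expire too
        for t in range(nthreads):
            if silo[t] < NEED_MS and pool > 0:
                grab = min(SLICE_MS, pool)   # pull a slice on demand
                pool -= grab
                silo[t] += grab
            if silo[t] >= NEED_MS:
                silo[t] -= NEED_MS           # thread got to run this period
                done += NEED_MS
            else:
                throttled += 1               # runnable but out of runtime
    return done, throttled

# With expiring slices, 4ms is stranded and discarded on every active cpu
# each period, so most threads throttle while the group consumes far less
# than its quota. Without expiration the leftovers carry over and the group
# converges to its full 10ms-per-period allocation.
old = run(50, 10, expire_slices=True)
new = run(50, 10, expire_slices=False)
```

In this toy run the expiring variant completes only 2ms of work per period with 8 throttle events per period, while the non-expiring variant ramps up to the full quota after a few periods; the real mechanism (min_cfs_rq_runtime, the slack timer, clock drift handling) is considerably more subtle.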
* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-07-23 22:12 ` Dave Chiluk @ 2019-07-23 23:26 ` Phil Auld 0 siblings, 0 replies; 40+ messages in thread From: Phil Auld @ 2019-07-23 23:26 UTC (permalink / raw) To: Dave Chiluk Cc: Ben Segall, Peter Oskolkov, Peter Zijlstra, Ingo Molnar, cgroups, Linux Kernel Mailing List, Brendan Gregg, Kyle Anderson, Gabriel Munos, John Hammond, Cong Wang, Jonathan Corbet, linux-doc On Tue, Jul 23, 2019 at 05:12:18PM -0500 Dave Chiluk wrote: > Thanks for all the help and testing you provided. It's good to know > these changes have passed at least some scheduler regression tests. > If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let > Peter add it. > Sounds good. > Will you be handling the backport into the RHEL 8 kernels? I'll > submit this to Ubuntu and linux-stable once it gets accepted. > Yep. > Thanks again, Thank you :) Cheers, Phil > > > On Tue, Jul 23, 2019 at 12:13 PM Phil Auld <pauld@redhat.com> wrote: > > > > Hi Dave, > > > > On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote: > > > It has been observed, that highly-threaded, non-cpu-bound applications > > > running under cpu.cfs_quota_us constraints can hit a high percentage of > > > periods throttled while simultaneously not consuming the allocated > > > amount of quota. This use case is typical of user-interactive non-cpu > > > bound applications, such as those running in kubernetes or mesos when > > > run on multiple cpu cores. > > > > > > This has been root caused to cpu-local run queue being allocated per cpu > > > bandwidth slices, and then not fully using that slice within the period. > > > At which point the slice and quota expires. This expiration of unused > > > slice results in applications not being able to utilize the quota for > > > which they are allocated. 
> > > > > > The non-expiration of per-cpu slices was recently fixed by > > > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > > > condition")'. Prior to that it appears that this had been broken since > > > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > > > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > > > added the following conditional which resulted in slices never being > > > expired. > > > > > > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > > > /* extend local deadline, drift is bounded above by 2 ticks */ > > > cfs_rq->runtime_expires += TICK_NSEC; > > > > > > Because this was broken for nearly 5 years, and has recently been fixed > > > and is now being noticed by many users running kubernetes > > > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > > > that the mechanisms around expiring runtime should be removed > > > altogether. > > > > > > This allows quota already allocated to per-cpu run-queues to live longer > > > than the period boundary. This allows threads on runqueues that do not > > > use much CPU to continue to use their remaining slice over a longer > > > period of time than cpu.cfs_period_us. However, this helps prevent the > > > above condition of hitting throttling while also not fully utilizing > > > your cpu quota. > > > > > > This theoretically allows a machine to use slightly more than its > > > allotted quota in some periods. This overflow would be bounded by the > > > remaining quota left on each per-cpu runqueue. This is typically no > > > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will > > > change nothing, as they should theoretically fully utilize all of their > > > quota in each period. 
For user-interactive tasks as described above this > > > provides a much better user/application experience as their cpu > > > utilization will more closely match the amount they requested when they > > > hit throttling. This means that cpu limits no longer strictly apply per > > > period for non-cpu bound applications, but that they are still accurate > > > over longer timeframes. > > > > > > This greatly improves performance of high-thread-count, non-cpu bound > > > applications with low cfs_quota_us allocation on high-core-count > > > machines. In the case of an artificial testcase (10ms/100ms of quota on > > > 80 CPU machine), this commit resulted in almost 30x performance > > > improvement, while still maintaining correct cpu quota restrictions. > > > That testcase is available at https://github.com/indeedeng/fibtest. > > > > > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > > > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com> > > > Reviewed-by: Ben Segall <bsegall@google.com> > > > > This still works for me. The documentation reads pretty well, too. Good job. > > > > Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>. > > > > I'll run it through some more tests when I have time. The code is the same > > as the earlier one I tested from what I can see. 
> > > > Cheers, > > Phil > > > > > --- > > > Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++------- > > > kernel/sched/fair.c | 72 ++++------------------------------ > > > kernel/sched/sched.h | 4 -- > > > 3 files changed, 67 insertions(+), 83 deletions(-) > > > > > > diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst > > > index 3a90642..9801d6b 100644 > > > --- a/Documentation/scheduler/sched-bwc.rst > > > +++ b/Documentation/scheduler/sched-bwc.rst > > > @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the > > > specification of the maximum CPU bandwidth available to a group or hierarchy. > > > > > > The bandwidth allowed for a group is specified using a quota and period. Within > > > -each given "period" (microseconds), a group is allowed to consume only up to > > > -"quota" microseconds of CPU time. When the CPU bandwidth consumption of a > > > -group exceeds this limit (for that period), the tasks belonging to its > > > -hierarchy will be throttled and are not allowed to run again until the next > > > -period. > > > - > > > -A group's unused runtime is globally tracked, being refreshed with quota units > > > -above at each period boundary. As threads consume this bandwidth it is > > > -transferred to cpu-local "silos" on a demand basis. The amount transferred > > > +each given "period" (microseconds), a task group is allocated up to "quota" > > > +microseconds of CPU time. That quota is assigned to per-cpu run queues in > > > +slices as threads in the cgroup become runnable. Once all quota has been > > > +assigned any additional requests for quota will result in those threads being > > > +throttled. Throttled threads will not be able to run again until the next > > > +period when the quota is replenished. > > > + > > > +A group's unassigned quota is globally tracked, being refreshed back to > > > +cfs_quota units at each period boundary. 
As threads consume this bandwidth it > > > +is transferred to cpu-local "silos" on a demand basis. The amount transferred > > > within each of these updates is tunable and described as the "slice". > > > > > > Management > > > @@ -35,12 +36,12 @@ The default values are:: > > > > > > A value of -1 for cpu.cfs_quota_us indicates that the group does not have any > > > bandwidth restriction in place, such a group is described as an unconstrained > > > -bandwidth group. This represents the traditional work-conserving behavior for > > > +bandwidth group. This represents the traditional work-conserving behavior for > > > CFS. > > > > > > Writing any (valid) positive value(s) will enact the specified bandwidth limit. > > > -The minimum quota allowed for the quota or period is 1ms. There is also an > > > -upper bound on the period length of 1s. Additional restrictions exist when > > > +The minimum quota allowed for the quota or period is 1ms. There is also an > > > +upper bound on the period length of 1s. Additional restrictions exist when > > > bandwidth limits are used in a hierarchical fashion, these are explained in > > > more detail below. > > > > > > @@ -53,8 +54,8 @@ unthrottled if it is in a constrained state. > > > System wide settings > > > -------------------- > > > For efficiency run-time is transferred between the global pool and CPU local > > > -"silos" in a batch fashion. This greatly reduces global accounting pressure > > > -on large systems. The amount transferred each time such an update is required > > > +"silos" in a batch fashion. This greatly reduces global accounting pressure > > > +on large systems. The amount transferred each time such an update is required > > > is described as the "slice". 
> > > > > > This is tunable via procfs:: > > > @@ -97,6 +98,51 @@ There are two ways in which a group may become throttled: > > > In case b) above, even though the child may have runtime remaining it will not > > > be allowed to until the parent's runtime is refreshed. > > > > > > +CFS Bandwidth Quota Caveats > > > +--------------------------- > > > +Once a slice is assigned to a cpu it does not expire. However all but 1ms of > > > +the slice may be returned to the global pool if all threads on that cpu become > > > +unrunnable. This is configured at compile time by the min_cfs_rq_runtime > > > +variable. This is a performance tweak that helps prevent added contention on > > > +the global lock. > > > + > > > +The fact that cpu-local slices do not expire results in some interesting corner > > > +cases that should be understood. > > > + > > > +For cgroup cpu constrained applications that are cpu limited this is a > > > +relatively moot point because they will naturally consume the entirety of their > > > +quota as well as the entirety of each cpu-local slice in each period. As a > > > +result it is expected that nr_periods roughly equal nr_throttled, and that > > > +cpuacct.usage will increase roughly equal to cfs_quota_us in each period. > > > + > > > +For highly-threaded, non-cpu bound applications this non-expiration nuance > > > +allows applications to briefly burst past their quota limits by the amount of > > > +unused slice on each cpu that the task group is running on (typically at most > > > +1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only > > > +applies if quota had been assigned to a cpu and then not fully used or returned > > > +in previous periods. This burst amount will not be transferred between cores. > > > +As a result, this mechanism still strictly limits the task group to quota > > > +average usage, albeit over a longer time window than a single period. This > > > +also limits the burst ability to no more than 1ms per cpu. 
This provides > > > +better, more predictable user experience for highly threaded applications with > > > +small quota limits on high core count machines. It also eliminates the > > > +propensity to throttle these applications while simultaneously using less than > > > +quota amounts of cpu. Another way to say this is that by allowing the unused > > > +portion of a slice to remain valid across periods we have decreased the > > > +possibility of wastefully expiring quota on cpu-local silos that don't need a > > > +full slice's amount of cpu time. > > > + > > > +The interaction between cpu-bound and non-cpu-bound-interactive applications > > > +should also be considered, especially when single core usage hits 100%. If you > > > +gave each of these applications half of a cpu-core and they both got scheduled > > > +on the same CPU it is theoretically possible that the non-cpu bound application > > > +will use up to 1ms additional quota in some periods, thereby preventing the > > > +cpu-bound application from fully using its quota by that same amount. In these > > > +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to > > > +decide which application is chosen to run, as they will both be runnable and > > > +have remaining quota. This runtime discrepancy will be made up in the following > > > +periods when the interactive application idles. > > > + > > > Examples > > > -------- > > > 1. 
Limit a group to 1 CPU worth of runtime:: > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 036be95..00b68f0 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -4316,8 +4316,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) > > > > > > now = sched_clock_cpu(smp_processor_id()); > > > cfs_b->runtime = cfs_b->quota; > > > - cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period); > > > - cfs_b->expires_seq++; > > > } > > > > > > static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg) > > > @@ -4339,8 +4337,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > > { > > > struct task_group *tg = cfs_rq->tg; > > > struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg); > > > - u64 amount = 0, min_amount, expires; > > > - int expires_seq; > > > + u64 amount = 0, min_amount; > > > > > > /* note: this is a positive sum as runtime_remaining <= 0 */ > > > min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining; > > > @@ -4357,61 +4354,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > > cfs_b->idle = 0; > > > } > > > } > > > - expires_seq = cfs_b->expires_seq; > > > - expires = cfs_b->runtime_expires; > > > raw_spin_unlock(&cfs_b->lock); > > > > > > cfs_rq->runtime_remaining += amount; > > > - /* > > > - * we may have advanced our local expiration to account for allowed > > > - * spread between our sched_clock and the one on which runtime was > > > - * issued. > > > - */ > > > - if (cfs_rq->expires_seq != expires_seq) { > > > - cfs_rq->expires_seq = expires_seq; > > > - cfs_rq->runtime_expires = expires; > > > - } > > > > > > return cfs_rq->runtime_remaining > 0; > > > } > > > > > > -/* > > > - * Note: This depends on the synchronization provided by sched_clock and the > > > - * fact that rq->clock snapshots this value. 
> > > - */ > > > -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > > -{ > > > - struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); > > > - > > > - /* if the deadline is ahead of our clock, nothing to do */ > > > - if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0)) > > > - return; > > > - > > > - if (cfs_rq->runtime_remaining < 0) > > > - return; > > > - > > > - /* > > > - * If the local deadline has passed we have to consider the > > > - * possibility that our sched_clock is 'fast' and the global deadline > > > - * has not truly expired. > > > - * > > > - * Fortunately we can check determine whether this the case by checking > > > - * whether the global deadline(cfs_b->expires_seq) has advanced. > > > - */ > > > - if (cfs_rq->expires_seq == cfs_b->expires_seq) { > > > - /* extend local deadline, drift is bounded above by 2 ticks */ > > > - cfs_rq->runtime_expires += TICK_NSEC; > > > - } else { > > > - /* global deadline is ahead, expiration has passed */ > > > - cfs_rq->runtime_remaining = 0; > > > - } > > > -} > > > - > > > static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) > > > { > > > /* dock delta_exec before expiring quota (as it could span periods) */ > > > cfs_rq->runtime_remaining -= delta_exec; > > > - expire_cfs_rq_runtime(cfs_rq); > > > > > > if (likely(cfs_rq->runtime_remaining > 0)) > > > return; > > > @@ -4602,8 +4555,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) > > > resched_curr(rq); > > > } > > > > > > -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > > - u64 remaining, u64 expires) > > > +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) > > > { > > > struct cfs_rq *cfs_rq; > > > u64 runtime; > > > @@ -4625,7 +4577,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > > remaining -= runtime; > > > > > > cfs_rq->runtime_remaining += runtime; > > > - cfs_rq->runtime_expires = expires; > > > > > > /* we check 
whether we're throttled above */ > > > if (cfs_rq->runtime_remaining > 0) > > > @@ -4650,7 +4601,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, > > > */ > > > static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) > > > { > > > - u64 runtime, runtime_expires; > > > + u64 runtime; > > > int throttled; > > > > > > /* no need to continue the timer with no bandwidth constraint */ > > > @@ -4678,8 +4629,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > > > /* account preceding periods in which throttling occurred */ > > > cfs_b->nr_throttled += overrun; > > > > > > - runtime_expires = cfs_b->runtime_expires; > > > - > > > /* > > > * This check is repeated as we are holding onto the new bandwidth while > > > * we unthrottle. This can potentially race with an unthrottled group > > > @@ -4692,8 +4641,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u > > > cfs_b->distribute_running = 1; > > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > > > /* we can't nest cfs_b->lock while distributing bandwidth */ > > > - runtime = distribute_cfs_runtime(cfs_b, runtime, > > > - runtime_expires); > > > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > > > > > cfs_b->distribute_running = 0; > > > @@ -4775,8 +4723,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq) > > > return; > > > > > > raw_spin_lock(&cfs_b->lock); > > > - if (cfs_b->quota != RUNTIME_INF && > > > - cfs_rq->runtime_expires == cfs_b->runtime_expires) { > > > + if (cfs_b->quota != RUNTIME_INF) { > > > cfs_b->runtime += slack_runtime; > > > > > > /* we are under rq->lock, defer unthrottling using a timer */ > > > @@ -4809,7 +4756,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > > > { > > > u64 runtime = 0, slice = sched_cfs_bandwidth_slice(); > > > unsigned long flags; > > > - u64 expires; 
> > > > > > /* confirm we're still not at a refresh boundary */ > > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > > @@ -4827,7 +4773,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > > > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice) > > > runtime = cfs_b->runtime; > > > > > > - expires = cfs_b->runtime_expires; > > > if (runtime) > > > cfs_b->distribute_running = 1; > > > > > > @@ -4836,11 +4781,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > > > if (!runtime) > > > return; > > > > > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > > > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > > > > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > > > - if (expires == cfs_b->runtime_expires) > > > - lsub_positive(&cfs_b->runtime, runtime); > > > + lsub_positive(&cfs_b->runtime, runtime); > > > cfs_b->distribute_running = 0; > > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags); > > > } > > > @@ -4997,8 +4941,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) > > > > > > cfs_b->period_active = 1; > > > overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period); > > > - cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period); > > > - cfs_b->expires_seq++; > > > hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED); > > > } > > > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > > > index 802b1f3..28c16e9 100644 > > > --- a/kernel/sched/sched.h > > > +++ b/kernel/sched/sched.h > > > @@ -335,8 +335,6 @@ struct cfs_bandwidth { > > > u64 quota; > > > u64 runtime; > > > s64 hierarchical_quota; > > > - u64 runtime_expires; > > > - int expires_seq; > > > > > > u8 idle; > > > u8 period_active; > > > @@ -556,8 +554,6 @@ struct cfs_rq { > > > > > > #ifdef CONFIG_CFS_BANDWIDTH > > > int runtime_enabled; > > > - int expires_seq; > > > - u64 runtime_expires; > > > s64 runtime_remaining; > > > > > > u64 throttled_clock; > > > -- > > > 1.8.3.1 > > 
> >
> > --
--
^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 17:13   ` Phil Auld
  2019-07-23 22:12     ` Dave Chiluk
@ 2019-07-26 18:14   ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2019-07-26 18:14 UTC (permalink / raw)
To: Phil Auld
Cc: Dave Chiluk, Ben Segall, Peter Oskolkov, Ingo Molnar, cgroups,
	linux-kernel, Brendan Gregg, Kyle Anderson, Gabriel Munos,
	John Hammond, Cong Wang, Jonathan Corbet, linux-doc

On Tue, Jul 23, 2019 at 01:13:09PM -0400, Phil Auld wrote:
> Hi Dave,
>
> On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to cpu-local run queues being allocated per-cpu
> > bandwidth slices, and then not fully using that slice within the period.
> > At which point the slice and quota expires. This expiration of unused
> > slice results in applications not being able to utilize the quota for
> > which they are allocated.
> >
> > The non-expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this had been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > 	/* extend local deadline, drift is bounded above by 2 ticks */
> > 	cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows quota already allocated to per-cpu run-queues to live longer
> > than the period boundary. This allows threads on runqueues that do not
> > use much CPU to continue to use their remaining slice over a longer
> > period of time than cpu.cfs_period_us. However, this helps prevent the
> > above condition of hitting throttling while also not fully utilizing
> > your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than its
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining quota left on each per-cpu runqueue. This is typically no
> > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > change nothing, as they should theoretically fully utilize all of their
> > quota in each period. For user-interactive tasks as described above this
> > provides a much better user/application experience as their cpu
> > utilization will more closely match the amount they requested when they
> > hit throttling. This means that cpu limits no longer strictly apply per
> > period for non-cpu bound applications, but that they are still accurate
> > over longer timeframes.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > an 80 CPU machine), this commit resulted in almost 30x performance
> > improvement, while still maintaining correct cpu quota restrictions.
> > That testcase is available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> > Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
> > Reviewed-by: Ben Segall <bsegall@google.com>
>
> This still works for me. The documentation reads pretty well, too. Good job.
>
> Feel free to add my Acked-by: or Reviewed-by: Phil Auld <pauld@redhat.com>.

Thanks guys!

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [tip:sched/core] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
  2019-07-23 16:44 ` [PATCH v6 1/1] " Dave Chiluk
  2019-07-23 17:13   ` Phil Auld
@ 2019-08-08 10:53 ` tip-bot for Dave Chiluk
  1 sibling, 0 replies; 40+ messages in thread
From: tip-bot for Dave Chiluk @ 2019-08-08 10:53 UTC (permalink / raw)
To: linux-tip-commits
Cc: posk, hpa, jhammond, mingo, gmunoz, corbet, tglx, chiluk+linux,
	mingo, kwa, linux-kernel, pauld, peterz, xiyou.wangcong, bsegall,
	bgregg

Commit-ID:  de53fd7aedb100f03e5d2231cfce0e4993282425
Gitweb:     https://git.kernel.org/tip/de53fd7aedb100f03e5d2231cfce0e4993282425
Author:     Dave Chiluk <chiluk+linux@indeed.com>
AuthorDate: Tue, 23 Jul 2019 11:44:26 -0500
Committer:  Peter Zijlstra <peterz@infradead.org>
CommitDate: Thu, 8 Aug 2019 09:09:30 +0200

sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queues being allocated per-cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

	if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
		/* extend local deadline, drift is bounded above by 2 ticks */
		cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
an 80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
---
 Documentation/scheduler/sched-bwc.rst | 74 ++++++++++++++++++++++++++++-------
 kernel/sched/fair.c                   | 72 ++++------------------------
 kernel/sched/sched.h                  |  4 --
 3 files changed, 67 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 3a9064219656..9801d6b284b1 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.

 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time.  That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable.  Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled.  Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period boundary.  As threads consume this bandwidth it
+is transferred to cpu-local "silos" on a demand basis.  The amount transferred
 within each of these updates is tunable and described as the "slice".

 Management
@@ -35,12 +36,12 @@ The default values are::

 A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 bandwidth restriction in place, such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
+bandwidth group.  This represents the traditional work-conserving behavior for
 CFS.

 Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum quota allowed for the quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
+The minimum quota allowed for the quota or period is 1ms.  There is also an
+upper bound on the period length of 1s.  Additional restrictions exist when
 bandwidth limits are used in a hierarchical fashion, these are explained in
 more detail below.

@@ -53,8 +54,8 @@ unthrottled if it is in a constrained state.
 System wide settings
 --------------------
 For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
+"silos" in a batch fashion.  This greatly reduces global accounting pressure
+on large systems.  The amount transferred each time such an update is required
 is described as the "slice".

 This is tunable via procfs::
@@ -97,6 +98,51 @@ There are two ways in which a group may become throttled:
 In case b) above, even though the child may have runtime remaining it will not
 be allowed to until the parent's runtime is refreshed.

+CFS Bandwidth Quota Caveats
+---------------------------
+Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
+the slice may be returned to the global pool if all threads on that cpu become
+unrunnable.  This is configured at compile time by the min_cfs_rq_runtime
+variable.  This is a performance tweak that helps prevent added contention on
+the global lock.
+
+The fact that cpu-local slices do not expire results in some interesting corner
+cases that should be understood.
+
+For cgroup cpu constrained applications that are cpu limited this is a
+relatively moot point because they will naturally consume the entirety of their
+quota as well as the entirety of each cpu-local slice in each period.  As a
+result it is expected that nr_periods roughly equal nr_throttled, and that
+cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
+
+For highly-threaded, non-cpu bound applications this non-expiration nuance
+allows applications to briefly burst past their quota limits by the amount of
+unused slice on each cpu that the task group is running on (typically at most
+1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
+applies if quota had been assigned to a cpu and then not fully used or returned
+in previous periods.  This burst amount will not be transferred between cores.
+As a result, this mechanism still strictly limits the task group to quota
+average usage, albeit over a longer time window than a single period.  This
+also limits the burst ability to no more than 1ms per cpu.  This provides
+better, more predictable user experience for highly threaded applications with
+small quota limits on high core count machines.  It also eliminates the
+propensity to throttle these applications while simultaneously using less than
+quota amounts of cpu.  Another way to say this, is that by allowing the unused
+portion of a slice to remain valid across periods we have decreased the
+possibility of wastefully expiring quota on cpu-local silos that don't need a
+full slice's amount of cpu time.
+
+The interaction between cpu-bound and non-cpu-bound-interactive applications
+should also be considered, especially when single core usage hits 100%.  If you
+gave each of these applications half of a cpu-core and they both got scheduled
+on the same CPU it is theoretically possible that the non-cpu bound application
+will use up to 1ms additional quota in some periods, thereby preventing the
+cpu-bound application from fully using its quota by that same amount.  In these
+instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
+decide which application is chosen to run, as they will both be runnable and
+have remaining quota.  This runtime discrepancy will be made up in the following
+periods when the interactive application idles.
+
 Examples
 --------
 1. Limit a group to 1 CPU worth of runtime::
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb75c0bea80f..7d8043fc8317 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4371,8 +4371,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)

 	now = sched_clock_cpu(smp_processor_id());
 	cfs_b->runtime = cfs_b->quota;
-	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 }

 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@@ -4394,8 +4392,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct task_group *tg = cfs_rq->tg;
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
-	u64 amount = 0, min_amount, expires;
-	int expires_seq;
+	u64 amount = 0, min_amount;

 	/* note: this is a positive sum as runtime_remaining <= 0 */
 	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
@@ -4412,61 +4409,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 			cfs_b->idle = 0;
 		}
 	}
-	expires_seq = cfs_b->expires_seq;
-	expires = cfs_b->runtime_expires;
 	raw_spin_unlock(&cfs_b->lock);

 	cfs_rq->runtime_remaining += amount;
-	/*
-	 * we may have advanced our local expiration to account for allowed
-	 * spread between our sched_clock and the one on which runtime was
-	 * issued.
-	 */
-	if (cfs_rq->expires_seq != expires_seq) {
-		cfs_rq->expires_seq = expires_seq;
-		cfs_rq->runtime_expires = expires;
-	}

 	return cfs_rq->runtime_remaining > 0;
 }

-/*
- * Note: This depends on the synchronization provided by sched_clock and the
- * fact that rq->clock snapshots this value.
- */
-static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
-		return;
-
-	if (cfs_rq->runtime_remaining < 0)
-		return;
-
-	/*
-	 * If the local deadline has passed we have to consider the
-	 * possibility that our sched_clock is 'fast' and the global deadline
-	 * has not truly expired.
-	 *
-	 * Fortunately we can check determine whether this the case by checking
-	 * whether the global deadline(cfs_b->expires_seq) has advanced.
-	 */
-	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
-		/* extend local deadline, drift is bounded above by 2 ticks */
-		cfs_rq->runtime_expires += TICK_NSEC;
-	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
-	}
-}
-
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
-	expire_cfs_rq_runtime(cfs_rq);

 	if (likely(cfs_rq->runtime_remaining > 0))
 		return;
@@ -4661,8 +4614,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 		resched_curr(rq);
 }

-static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
-		u64 remaining, u64 expires)
+static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 {
 	struct cfs_rq *cfs_rq;
 	u64 runtime;
@@ -4684,7 +4636,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		remaining -= runtime;

 		cfs_rq->runtime_remaining += runtime;
-		cfs_rq->runtime_expires = expires;

 		/* we check whether we're throttled above  */
 		if (cfs_rq->runtime_remaining > 0)
@@ -4709,7 +4660,7 @@ next:
  */
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
-	u64 runtime, runtime_expires;
+	u64 runtime;
 	int throttled;

 	/* no need to continue the timer with no bandwidth constraint */
@@ -4737,8 +4688,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 	/* account preceding periods in which throttling occurred */
 	cfs_b->nr_throttled += overrun;

-	runtime_expires = cfs_b->runtime_expires;
-
 	/*
 	 * This check is repeated as we are holding onto the new bandwidth while
 	 * we unthrottle. This can potentially race with an unthrottled group
@@ -4751,8 +4700,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
 		cfs_b->distribute_running = 1;
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
-		runtime = distribute_cfs_runtime(cfs_b, runtime,
-						 runtime_expires);
+		runtime = distribute_cfs_runtime(cfs_b, runtime);
 		raw_spin_lock_irqsave(&cfs_b->lock, flags);

 		cfs_b->distribute_running = 0;
@@ -4834,8 +4782,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		return;

 	raw_spin_lock(&cfs_b->lock);
-	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;

 		/* we are under rq->lock, defer unthrottling using a timer */
@@ -4868,7 +4815,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 	unsigned long flags;
-	u64 expires;

 	/* confirm we're still not at a refresh boundary */
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
@@ -4886,7 +4832,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;

-	expires = cfs_b->runtime_expires;
 	if (runtime)
 		cfs_b->distribute_running = 1;

@@ -4895,11 +4840,10 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (!runtime)
 		return;

-	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
+	runtime = distribute_cfs_runtime(cfs_b, runtime);

 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	if (expires == cfs_b->runtime_expires)
-		lsub_positive(&cfs_b->runtime, runtime);
+	lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
 	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
@@ -5056,8 +5000,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)

 	cfs_b->period_active = 1;
 	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
-	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
-	cfs_b->expires_seq++;
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7583faddba33..ea48aa5daeee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -335,8 +335,6 @@ struct cfs_bandwidth {
 	u64			quota;
 	u64			runtime;
 	s64			hierarchical_quota;
-	u64			runtime_expires;
-	int			expires_seq;

 	u8			idle;
 	u8			period_active;
@@ -557,8 +555,6 @@ struct cfs_rq {

 #ifdef CONFIG_CFS_BANDWIDTH
 	int			runtime_enabled;
-	int			expires_seq;
-	u64			runtime_expires;
 	s64			runtime_remaining;

 	u64			throttled_clock;

^ permalink raw reply related	[flat|nested] 40+ messages in thread
end of thread, other threads:[~2019-08-08 10:54 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-17 19:30 [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu slices Dave Chiluk
2019-05-23 18:44 ` [PATCH v2 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-05-23 18:44   ` [PATCH v2 1/1] " Dave Chiluk
2019-05-23 21:01     ` Peter Oskolkov
2019-05-24 14:32       ` Phil Auld
2019-05-24 15:14         ` Dave Chiluk
2019-05-24 15:59           ` Phil Auld
2019-05-24 16:28     ` Peter Oskolkov
2019-05-24 21:35       ` Dave Chiluk
2019-05-24 22:07         ` Peter Oskolkov
2019-05-28 22:25           ` Dave Chiluk
2019-05-24  8:55     ` Peter Zijlstra
2019-05-29 19:08 ` [PATCH v3 0/1] " Dave Chiluk
2019-05-29 19:08   ` [PATCH v3 1/1] " Dave Chiluk
2019-05-29 19:28     ` Phil Auld
2019-05-29 19:50     ` bsegall
2019-05-29 21:05       ` bsegall
2019-05-30 17:53         ` Dave Chiluk
2019-05-30 20:44           ` bsegall
[not found] ` <1561391404-14450-1-git-send-email-chiluk+linux@indeed.com>
2019-06-24 15:50   ` [PATCH v4 1/1] sched/fair: Return all runtime when cfs_b has very little remaining Dave Chiluk
2019-06-24 17:33     ` bsegall
2019-06-26 22:10       ` Dave Chiluk
2019-06-27 20:18         ` bsegall
2019-06-27 19:09 ` [PATCH] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices Dave Chiluk
2019-06-27 19:49 ` [PATCH v5 0/1] " Dave Chiluk
2019-06-27 19:49   ` [PATCH v5 1/1] " Dave Chiluk
2019-07-01 20:15     ` bsegall
2019-07-11  9:51       ` Peter Zijlstra
2019-07-11 17:46         ` bsegall
[not found]           ` <CAC=E7cV4sO50NpYOZ06n_BkZTcBqf1KQp83prc+oave3ircBrw@mail.gmail.com>
2019-07-12 18:01             ` bsegall
2019-07-12 22:09               ` bsegall
2019-07-15 15:44                 ` Dave Chiluk
2019-07-16 19:58                   ` bsegall
2019-07-23 16:44 ` [PATCH v6 0/1] " Dave Chiluk
2019-07-23 16:44   ` [PATCH v6 1/1] " Dave Chiluk
2019-07-23 17:13     ` Phil Auld
2019-07-23 22:12       ` Dave Chiluk
2019-07-23 23:26         ` Phil Auld
2019-07-26 18:14     ` Peter Zijlstra
2019-08-08 10:53   ` [tip:sched/core] " tip-bot for Dave Chiluk
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).