* [PATCH] sched/fair: fix broken bandwidth control with nohz_full
@ 2022-03-28 11:07 Chengming Zhou
  2022-03-28 13:20 ` Peter Zijlstra
  2022-03-28 19:05 ` Benjamin Segall
  0 siblings, 2 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-03-28 11:07 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, bristot
  Cc: linux-kernel, duanxiongchun, songmuchun, Chengming Zhou

With nohz_full enabled on a CPU, scheduler_tick() will be stopped
when only one CFS task is left on the rq.

scheduler_tick()
  task_tick_fair()
    entity_tick()
      update_curr()
        account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped

So the running task can't account its runtime periodically, while
the cfs_bandwidth hrtimer still runs __refill_cfs_bandwidth_runtime()
every period. Later, in one period, the task accounts a very large
delta_exec all at once, which causes the cfs_rq to be throttled for
a long time.
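As a rough illustration (a standalone back-of-the-envelope model, not
kernel code; the helper name and numbers below are made up for this
sketch), the length of the resulting throttle can be estimated from the
debt charged in one period:

```c
/* Model of the failure: with the tick stopped, the task's whole run is
 * charged as a single delta_exec while quota refills continued every
 * period.  Estimate how many periods the cfs_rq then spends throttled,
 * repaying the debt one quota refill at a time. */
long long throttled_periods(long long ran_ms, long long quota_ms)
{
	long long debt_ms = ran_ms - quota_ms;	/* overdraft charged at once */

	if (debt_ms <= 0)
		return 0;			/* the budget covered the run */
	/* each period refills quota_ms, so round the repayment up */
	return (debt_ms + quota_ms - 1) / quota_ms;
}
```

For example, a task that ran 5 seconds tick-free under a 105ms/100ms
cpu.max would be charged a ~4.9s delta_exec and stay throttled for
roughly 47 periods.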

There are two possible solutions. The first is to check in
sched_can_stop_tick() whether the current task's cfs_rq has
runtime_enabled, and not stop the tick in that case. But that
would make nohz_full almost useless in cloud environments where
every container has a CPU bandwidth control setting.

The other, which this patch implements, is to have the cfs_bandwidth
hrtimer sync unaccounted runtime from all running cfs_rqs whose
tick is stopped, just before __refill_cfs_bandwidth_runtime().
The same is done in tg_set_cfs_bandwidth().

A testcase to reproduce:
```
cd /sys/fs/cgroup
echo "+cpu" > cgroup.subtree_control

mkdir test
echo "105000 100000" > test/cpu.max

echo $$ > test/cgroup.procs
taskset -c 1 bash -c "while true; do let i++; done"
```
Ctrl-C out of the loop, then cat test/cpu.stat to check whether nr_throttled > 0.

The above testcase uses a 100ms period and a 105ms quota, and
nr_throttled > 0 is only seen on a nohz_full system. With this
patch, the problem is gone in testing.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/sched/core.c  |  4 ++++
 kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d575b4914925..17b5e3d27401 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	 */
 	if (runtime_enabled && !runtime_was_enabled)
 		cfs_bandwidth_usage_inc();
+
+	if (runtime_was_enabled)
+		sync_cfs_bandwidth_runtime(cfs_b);
+
 	raw_spin_lock_irq(&cfs_b->lock);
 	cfs_b->period = ns_to_ktime(period);
 	cfs_b->quota = quota;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee0664c9d291..ebda70a0e3a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
+#ifdef CONFIG_NO_HZ_FULL
+void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
+{
+	unsigned int cpu;
+	struct rq *rq;
+	struct rq_flags rf;
+	struct cfs_rq *cfs_rq;
+	struct task_group *tg;
+
+	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
+
+	for_each_online_cpu(cpu) {
+		if (!tick_nohz_tick_stopped_cpu(cpu))
+			continue;
+
+		rq = cpu_rq(cpu);
+		cfs_rq = tg->cfs_rq[cpu];
+
+		rq_lock_irqsave(rq, &rf);
+		if (cfs_rq->curr) {
+			update_rq_clock(rq);
+			update_curr(cfs_rq);
+		}
+		rq_unlock_irqrestore(rq, &rf);
+	}
+}
+#endif
+
 extern const u64 max_cfs_quota_period;
 
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
@@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 	int idle = 0;
 	int count = 0;
 
+	sync_cfs_bandwidth_runtime(cfs_b);
+
 	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 58263f90c559..57f9da9c50c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
 	else
 		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
 }
+
+extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 #else
 static inline int sched_tick_offload_init(void) { return 0; }
 static inline void sched_update_tick_dependency(struct rq *rq) { }
+static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
 #endif
 
 static inline void add_nr_running(struct rq *rq, unsigned count)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 11:07 [PATCH] sched/fair: fix broken bandwidth control with nohz_full Chengming Zhou
@ 2022-03-28 13:20 ` Peter Zijlstra
  2022-03-28 13:50   ` [External] " Chengming Zhou
  2022-03-28 19:05 ` Benjamin Segall
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2022-03-28 13:20 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun

On Mon, Mar 28, 2022 at 07:07:51PM +0800, Chengming Zhou wrote:
> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
> when only one CFS task left on rq.
> 
> scheduler_tick()
>   task_tick_fair()
>     entity_tick()
>       update_curr()
>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
> 
> So that running task can't account its runtime periodically, but
> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
> periodically. Later in one period, the task would account very
> big delta_exec, which cause the cfs_rq to be throttled for a
> long time.
> 
> There are two solutions for the problem, the first is that we
> can check in sched_can_stop_tick() if current task's cfs_rq
> have runtime_enabled, in which case we don't stop tick. But
> it will make nohz_full almost useless in cloud environment
> that every container has the cpu bandwidth control setting.

How is NOHZ_FULL useful in that environment to begin with? If you set
bandwidth crap, the expectation is that there is overcommit, which more
or less assumes lots of scheduling, presumably VMs or somesuch crud.

So how does NOHZ_FULL make sense?


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 13:20 ` Peter Zijlstra
@ 2022-03-28 13:50   ` Chengming Zhou
  2022-03-28 15:17     ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-03-28 13:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/28 21:20, Peter Zijlstra wrote:
> On Mon, Mar 28, 2022 at 07:07:51PM +0800, Chengming Zhou wrote:
>> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
>> when only one CFS task left on rq.
>>
>> scheduler_tick()
>>   task_tick_fair()
>>     entity_tick()
>>       update_curr()
>>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>>
>> So that running task can't account its runtime periodically, but
>> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
>> periodically. Later in one period, the task would account very
>> big delta_exec, which cause the cfs_rq to be throttled for a
>> long time.
>>
>> There are two solutions for the problem, the first is that we
>> can check in sched_can_stop_tick() if current task's cfs_rq
>> have runtime_enabled, in which case we don't stop tick. But
>> it will make nohz_full almost useless in cloud environment
>> that every container has the cpu bandwidth control setting.
> 
> How is NOHZ_FULL useful in that environment to begin with? If you set
> bandwidth crap, the expectation is that there is overcommit, which more
> or less assumes lots of scheduling, presumably VMs or somesuch crud.
> 
> So how does NOHZ_FULL make sense?

Yes, we schedule some VMs in cgroups on hosts with NOHZ_FULL
enabled, to reduce the tick's interference with a vCPU task when
it's the only task running on the CPU.

This problem, however, throttles it wrongly, even though it hasn't
used up its quota.

Are you suggesting that we shouldn't stop the tick when the current
task's cfs_rq has runtime_enabled?

Thanks.



* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 13:50   ` [External] " Chengming Zhou
@ 2022-03-28 15:17     ` Peter Zijlstra
  2022-03-28 15:40       ` Chengming Zhou
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2022-03-28 15:17 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun, Frederic Weisbecker

On Mon, Mar 28, 2022 at 09:50:05PM +0800, Chengming Zhou wrote:
> On 2022/3/28 21:20, Peter Zijlstra wrote:
> > On Mon, Mar 28, 2022 at 07:07:51PM +0800, Chengming Zhou wrote:
> >> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
> >> when only one CFS task left on rq.
> >>
> >> scheduler_tick()
> >>   task_tick_fair()
> >>     entity_tick()
> >>       update_curr()
> >>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
> >>
> >> So that running task can't account its runtime periodically, but
> >> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
> >> periodically. Later in one period, the task would account very
> >> big delta_exec, which cause the cfs_rq to be throttled for a
> >> long time.
> >>
> >> There are two solutions for the problem, the first is that we
> >> can check in sched_can_stop_tick() if current task's cfs_rq
> >> have runtime_enabled, in which case we don't stop tick. But
> >> it will make nohz_full almost useless in cloud environment
> >> that every container has the cpu bandwidth control setting.
> > 
> > How is NOHZ_FULL useful in that environment to begin with? If you set
> > bandwidth crap, the expectation is that there is overcommit, which more
> > or less assumes lots of scheduling, presumably VMs or somesuch crud.
> > 
> > So how does NOHZ_FULL make sense?
> 
> Yes, we have scheduled some VMs in cgroups on the host, which
> enabled NOHZ_FULL to reduce the interference of tick to vcpu task
> if it's the only task running on cpu.
> 
> This problem will however throttle it wrongly, even if it hasn't
> used up its quota.
> 
> Do you suggest that we shouldn't stop tick when the current task's
> cfs_rq has runtime_enabled ?

I'm not suggesting anything just yet as I'm not sure I understand things
well enough. I'm just wondering if NOHZ_FULL makes sense for you since
NOHZ_FULL makes system entry/exit so much more expensive.

NOHZ_FULL is for use-cases that 'never' intend to go into the kernel,
your use-case actively relies on going into the kernel. Hence the
confusion.


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 15:17     ` Peter Zijlstra
@ 2022-03-28 15:40       ` Chengming Zhou
  2022-03-28 15:56         ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chengming Zhou @ 2022-03-28 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun, Frederic Weisbecker

On 2022/3/28 23:17, Peter Zijlstra wrote:
> On Mon, Mar 28, 2022 at 09:50:05PM +0800, Chengming Zhou wrote:
>> On 2022/3/28 21:20, Peter Zijlstra wrote:
>>> On Mon, Mar 28, 2022 at 07:07:51PM +0800, Chengming Zhou wrote:
>>>> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
>>>> when only one CFS task left on rq.
>>>>
>>>> scheduler_tick()
>>>>   task_tick_fair()
>>>>     entity_tick()
>>>>       update_curr()
>>>>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>>>>
>>>> So that running task can't account its runtime periodically, but
>>>> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
>>>> periodically. Later in one period, the task would account very
>>>> big delta_exec, which cause the cfs_rq to be throttled for a
>>>> long time.
>>>>
>>>> There are two solutions for the problem, the first is that we
>>>> can check in sched_can_stop_tick() if current task's cfs_rq
>>>> have runtime_enabled, in which case we don't stop tick. But
>>>> it will make nohz_full almost useless in cloud environment
>>>> that every container has the cpu bandwidth control setting.
>>>
>>> How is NOHZ_FULL useful in that environment to begin with? If you set
>>> bandwidth crap, the expectation is that there is overcommit, which more
>>> or less assumes lots of scheduling, presumably VMs or somesuch crud.
>>>
>>> So how does NOHZ_FULL make sense?
>>
>> Yes, we have scheduled some VMs in cgroups on the host, which
>> enabled NOHZ_FULL to reduce the interference of tick to vcpu task
>> if it's the only task running on cpu.
>>
>> This problem will however throttle it wrongly, even if it hasn't
>> used up its quota.
>>
>> Do you suggest that we shouldn't stop tick when the current task's
>> cfs_rq has runtime_enabled ?
> 
> I'm not suggesting anything just yet as I'm not sure I understand things
> well enough. I'm just wondering if NOHZ_FULL makes sense for you since
> NOHZ_FULL makes system entry/exit so much more expensive.

OK, I see. Running VMs with bandwidth set for overcommit on a system
with NOHZ_FULL enabled seems like a normal use-case to me. Under low
load, the CPU will sometimes stop the tick.

> 
> NOHZ_FULL is for use-cases that 'never' intend to go into the kernel,
> your use-case actively relies on going into the kernel. Hence the
> confusion.

In fact, I put a testcase at the end of the git message, which only
runs a userspace loop workload:

cd /sys/fs/cgroup
echo "+cpu" > cgroup.subtree_control

mkdir test
echo "105000 100000" > test/cpu.max

echo $$ > test/cgroup.procs
taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled

Thanks.



* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 15:40       ` Chengming Zhou
@ 2022-03-28 15:56         ` Peter Zijlstra
  2022-03-28 16:35           ` Chengming Zhou
  2022-03-28 16:44           ` Steven Rostedt
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2022-03-28 15:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun, Frederic Weisbecker

On Mon, Mar 28, 2022 at 11:40:25PM +0800, Chengming Zhou wrote:

> > NOHZ_FULL is for use-cases that 'never' intend to go into the kernel,
> > your use-case actively relies on going into the kernel. Hence the
> > confusion.
> 
> In fact, I put a testcase at the end of git message, in which only run
> a userspace loop workload:
> 
> cd /sys/fs/cgroup
> echo "+cpu" > cgroup.subtree_control
> 
> mkdir test
> echo "105000 100000" > test/cpu.max
> 
> echo $$ > test/cgroup.procs
> taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled

Of course.. I'm arguing that bandwidth control and NOHZ_FULL are somewhat
mutually exclusive, use-case wise. So I really don't get why you'd want
them both.

NOHZ_FULL says, "I 'never' intend to go to the kernel"

bandwidth control says: "I expect to be sharing the system and must be
interrupted to not consume too much time", which very much implies: "I
will go into the kernel".

The trade-off we make to get NOHZ_FULL to work makes system enter/exit
*far* more expensive. There are also people asking to outright kill a
task that causes an entry under NOHZ_FULL.

So yes, you can configure it, but why does it make sense?


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 15:56         ` Peter Zijlstra
@ 2022-03-28 16:35           ` Chengming Zhou
  2022-03-28 16:44           ` Steven Rostedt
  1 sibling, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-03-28 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun, Frederic Weisbecker

On 2022/3/28 23:56, Peter Zijlstra wrote:
> On Mon, Mar 28, 2022 at 11:40:25PM +0800, Chengming Zhou wrote:
> 
>>> NOHZ_FULL is for use-cases that 'never' intend to go into the kernel,
>>> your use-case actively relies on going into the kernel. Hence the
>>> confusion.
>>
>> In fact, I put a testcase at the end of git message, in which only run
>> a userspace loop workload:
>>
>> cd /sys/fs/cgroup
>> echo "+cpu" > cgroup.subtree_control
>>
>> mkdir test
>> echo "105000 100000" > test/cpu.max
>>
>> echo $$ > test/cgroup.procs
>> taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled
> 
> Ofcourse.. I'm arguing that bandiwdth control and NOHZ_FULL are somewhat
> mutually exclusive, use-case wise. So I really don't get why you'd want
> them both.

This problem was found by our VM team; they use bandwidth control for
overcommit, sharing CPUs between two VMs.

> 
> NOHZ_FULL says, "I 'never' intend to go to the kernel"

Since a vCPU seldom does a kvm_exit into the kernel, stopping the tick
helps performance, as kvm_exit is more expensive.

> 
> bandwidth control says: "I expect to be sharing the system and must be
> interrupted to not consume too much time", which very much implies: "I
> will go into the kernel".

Yes, agreed. If the tasks in the task_group use up their quota, they
have to go into the kernel to be rescheduled out.

> 
> The trade-off we make to make NOHZ_FULL work, makes system enter/exit
> *far* more expensive. There's also people asking to outright kill a task
> that causes entry under NOHZ_FULL.

It's true that a task under NOHZ_FULL shouldn't enter/exit often.

> 
> So yes, you can configure it, but why does it make sense?

I don't know whether other people have the same use-case, or whether
there is a better way to do VM overcommit with bandwidth control...

Thanks.



* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 15:56         ` Peter Zijlstra
  2022-03-28 16:35           ` Chengming Zhou
@ 2022-03-28 16:44           ` Steven Rostedt
  2022-03-29  2:58             ` Chengming Zhou
  2022-03-30 18:23             ` Peter Zijlstra
  1 sibling, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2022-03-28 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, linux-kernel,
	duanxiongchun, songmuchun, Frederic Weisbecker

On Mon, 28 Mar 2022 17:56:07 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > echo $$ > test/cgroup.procs
> > taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled  
> 
> Ofcourse.. I'm arguing that bandiwdth control and NOHZ_FULL are somewhat
> mutually exclusive, use-case wise. So I really don't get why you'd want
> them both.

Is it?

One use case I can see for having both is a deadline task that needs
to get something done within a tight deadline. NOHZ_FULL means "do not
interrupt this task when it is the top priority task on the CPU and is
running in user space".

Why is it mutually exclusive to have a deadline task that does not want to
be interrupted by timer interrupts?

Just because the biggest pushers of NOHZ_FULL are those running RT
tasks completely in user space, who even want a fault if the task ever
goes into the kernel, doesn't mean that's the only use case.

Chengming brought up VMs. That's a case where you want to control the
bandwidth, but also not hit the VMs with timer interrupts while they
run as the top-priority task on a CPU.

-- Steve


* Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 11:07 [PATCH] sched/fair: fix broken bandwidth control with nohz_full Chengming Zhou
  2022-03-28 13:20 ` Peter Zijlstra
@ 2022-03-28 19:05 ` Benjamin Segall
  2022-03-29  3:36   ` [External] " Chengming Zhou
  1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Segall @ 2022-03-28 19:05 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun

Chengming Zhou <zhouchengming@bytedance.com> writes:

> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
> when only one CFS task left on rq.
>
> scheduler_tick()
>   task_tick_fair()
>     entity_tick()
>       update_curr()
>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>
> So that running task can't account its runtime periodically, but
> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
> periodically. Later in one period, the task would account very
> big delta_exec, which cause the cfs_rq to be throttled for a
> long time.
>
> There are two solutions for the problem, the first is that we
> can check in sched_can_stop_tick() if current task's cfs_rq
> have runtime_enabled, in which case we don't stop tick. But
> it will make nohz_full almost useless in cloud environment
> that every container has the cpu bandwidth control setting.
>
> The other is what this patch implemented, cfs_bandwidth hrtimer
> would sync unaccounted runtime from all running cfs_rqs with
> tick stopped, just before __refill_cfs_bandwidth_runtime().
> Also do the same thing in tg_set_cfs_bandwidth().

>
> A testcase to reproduce:
> ```
> cd /sys/fs/cgroup
> echo "+cpu" > cgroup.subtree_control
>
> mkdir test
> echo "105000 100000" > test/cpu.max


The other half of this problem, I think, is that it also won't
throttle when it /is/ out of bandwidth. That is, 'echo "50000 100000" >
test/cpu.max' would not stop the task after 50ms of runtime is spent;
it would only stop it (with this patch) after 100ms. It would then
correctly need to repay that debt, so the (very) long-run ratio should
be correct, but a misbehaving task could use practically arbitrarily
more than its supposed quota over the hundreds-of-milliseconds
timescales cfs bandwidth is generally supposed to work on.
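This worst case can be sketched with a toy model (standalone C, a
hypothetical helper rather than kernel code): charging usage only when
the period timer fires lets the task burn a whole period, after which it
sits out while the debt is repaid, so the long-run ratio still converges
to quota/period:

```c
/* Toy model of tick-stopped bandwidth accounting with no banking: the
 * task runs whenever it starts a period with budget left, and is only
 * charged for the whole period when the period timer fires. */
long long total_ran_ms(int periods, long long period_ms, long long quota_ms)
{
	long long budget = quota_ms;	/* runtime available at period start */
	long long total = 0;

	for (int p = 0; p < periods; p++) {
		long long ran = budget > 0 ? period_ms : 0;

		total += ran;
		budget -= ran;		/* usage charged all at once */
		budget += quota_ms;	/* refill repays any debt first */
		if (budget > quota_ms)
			budget = quota_ms;	/* unused quota is not banked */
	}
	return total;
}
```

With the 50000/100000 example, the task alternates 100ms running and
100ms throttled: total_ran_ms(4, 100, 50) gives 200ms over 400ms, the
expected 50% long-run ratio, but with double the quota in any single
period.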

However, I don't really see a way for cfsb's current design to avoid
that on nohz_full (without vetoing sched_can_stop_tick). It might be
possible to try and guess when quota will run out by tracking the number
of running or runnable threads, but that would be a nontrivial
undertaking and involve a bunch of hrtimer reprogramming (particularly
in cases when some cgroups actually have tasks that wake/sleep frequently).


>
> echo $$ > test/cgroup.procs
> taskset -c 1 bash -c "while true; do let i++; done"
> ```
> Ctrl-C and cat test/cpu.stat to see if nr_throttled > 0.
>
> The above testcase uses period 100ms and quota 105ms, would
> only see nr_throttled > 0 on nohz_full system. The problem
> is gone in test with this patch.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/sched/core.c  |  4 ++++
>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 37 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d575b4914925..17b5e3d27401 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	 */
>  	if (runtime_enabled && !runtime_was_enabled)
>  		cfs_bandwidth_usage_inc();
> +
> +	if (runtime_was_enabled)
> +		sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irq(&cfs_b->lock);
>  	cfs_b->period = ns_to_ktime(period);
>  	cfs_b->quota = quota;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee0664c9d291..ebda70a0e3a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +#ifdef CONFIG_NO_HZ_FULL
> +void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> +{
> +	unsigned int cpu;
> +	struct rq *rq;
> +	struct rq_flags rf;
> +	struct cfs_rq *cfs_rq;
> +	struct task_group *tg;
> +
> +	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
> +
> +	for_each_online_cpu(cpu) {
> +		if (!tick_nohz_tick_stopped_cpu(cpu))
> +			continue;
> +
> +		rq = cpu_rq(cpu);
> +		cfs_rq = tg->cfs_rq[cpu];
> +
> +		rq_lock_irqsave(rq, &rf);

A for-every-nohz-cpu rq lock is definitely not great when the cgroup is
likely to be running on far fewer CPUs. I'm not sure what the current
opinion would be on doing some sort of optimistic locking like
if (!READ_ONCE(cfs_rq->curr)) continue;
rq_lock_irqsave(...);
if (cfs_rq->curr) { ... }

but solutions to the other problem I mentioned earlier might provide
help here anyways.
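The check-then-recheck shape sketched above can be illustrated in
userspace (a hypothetical pthread analogue with made-up names like
fake_rq, not the kernel code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analogue of the optimistic check: peek at 'curr' without
 * the lock, skip the CPU entirely if nothing is running, and re-check
 * under the lock before doing the update, since 'curr' may have
 * changed before the lock was taken. */
struct fake_rq {
	pthread_mutex_t lock;
	_Atomic(void *) curr;	/* running entity, or NULL (like cfs_rq->curr) */
	long updates;		/* counts work done under the lock */
};

void maybe_update(struct fake_rq *rq)
{
	/* cheap lockless peek, like the READ_ONCE(cfs_rq->curr) above */
	if (atomic_load_explicit(&rq->curr, memory_order_relaxed) == NULL)
		return;

	pthread_mutex_lock(&rq->lock);
	/* re-check: 'curr' may have gone NULL before we got the lock */
	if (atomic_load_explicit(&rq->curr, memory_order_relaxed) != NULL)
		rq->updates++;	/* stand-in for update_rq_clock()/update_curr() */
	pthread_mutex_unlock(&rq->lock);
}
```

The unlocked read can race, but the worst outcomes are a wasted lock
acquisition or a skipped update that the next timer pass picks up,
which is the trade-off being weighed above.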


> +		if (cfs_rq->curr) {
> +			update_rq_clock(rq);
> +			update_curr(cfs_rq);
> +		}
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +}
> +#endif
> +
>  extern const u64 max_cfs_quota_period;
>  
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> @@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  	int idle = 0;
>  	int count = 0;
>  
> +	sync_cfs_bandwidth_runtime(cfs_b);
> +
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 58263f90c559..57f9da9c50c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>  	else
>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>  }
> +
> +extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  #else
>  static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
> +static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
>  #endif
>  
>  static inline void add_nr_running(struct rq *rq, unsigned count)


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 16:44           ` Steven Rostedt
@ 2022-03-29  2:58             ` Chengming Zhou
  2022-03-30 18:23             ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-03-29  2:58 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, bsegall,
	mgorman, bristot, linux-kernel, duanxiongchun, songmuchun,
	Frederic Weisbecker

On 2022/3/29 00:44, Steven Rostedt wrote:
> On Mon, 28 Mar 2022 17:56:07 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> echo $$ > test/cgroup.procs
>>> taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled  
>>
>> Ofcourse.. I'm arguing that bandiwdth control and NOHZ_FULL are somewhat
>> mutually exclusive, use-case wise. So I really don't get why you'd want
>> them both.
> 
> Is it?
> 
> One use case I can see for having both is for having a deadline task that
> needs to get something done in a tight deadline. NOHZ_FULL means "do not
> interrupt this task when it is the top priority task on the CPU and is
> running in user space".

Yes, this is similar to our use-case.

> 
> Why is it mutually exclusive to have a deadline task that does not want to
> be interrupted by timer interrupts?
> 
> Just because the biggest pushers of NOHZ_FULL is for those that are running
> RT tasks completely in user space and event want to fault if it ever goes
> into the kernel, doesn't mean that's the only use case.
> 
> Chengming brought up VMs. That's a case to want to control the bandwidth,
> but also not interrupt them with timer interrupts when they are running as
> the top priority task on a CPU.

Agreed. NOHZ_FULL means we don't want timer interrupts when running
mostly in user space or guest mode; bandwidth control just means the
task needs to go into the kernel to be scheduled out, and only when its
quota is used up. We shouldn't make them mutually exclusive.

Thanks.

> 
> -- Steve


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 19:05 ` Benjamin Segall
@ 2022-03-29  3:36   ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-03-29  3:36 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, bristot, linux-kernel, duanxiongchun,
	songmuchun

Hi,

On 2022/3/29 03:05, Benjamin Segall wrote:
> Chengming Zhou <zhouchengming@bytedance.com> writes:
> 
>> With nohz_full enabled on cpu, the scheduler_tick() will be stopped
>> when only one CFS task left on rq.
>>
>> scheduler_tick()
>>   task_tick_fair()
>>     entity_tick()
>>       update_curr()
>>         account_cfs_rq_runtime(cfs_rq, delta_exec) --> stopped
>>
>> So that running task can't account its runtime periodically, but
>> the cfs_bandwidth hrtimer still __refill_cfs_bandwidth_runtime()
>> periodically. Later in one period, the task would account very
>> big delta_exec, which cause the cfs_rq to be throttled for a
>> long time.
>>
>> There are two solutions for the problem, the first is that we
>> can check in sched_can_stop_tick() if current task's cfs_rq
>> have runtime_enabled, in which case we don't stop tick. But
>> it will make nohz_full almost useless in cloud environment
>> that every container has the cpu bandwidth control setting.
>>
>> The other is what this patch implemented, cfs_bandwidth hrtimer
>> would sync unaccounted runtime from all running cfs_rqs with
>> tick stopped, just before __refill_cfs_bandwidth_runtime().
>> Also do the same thing in tg_set_cfs_bandwidth().
> 
>>
>> A testcase to reproduce:
>> ```
>> cd /sys/fs/cgroup
>> echo "+cpu" > cgroup.subtree_control
>>
>> mkdir test
>> echo "105000 100000" > test/cpu.max
> 
> 
> The other half of this problem I think would be that it also won't
> throttle when it /is/ out of bandwidth. That is, 'echo "50000 100000" >
> test/cpu.max' would not stop after 50ms of runtime is spent, it would
> only stop (with this patch) after 100ms. It would then correctly need to
> repay that debt, so the (very) long-run ratio should be correct, but a
> misbehaving task could be use practically arbitrarily more than their
> supposed quota over the ~100s of milliseconds timescales cfsb is
> generally supposed to work on.

Right, that's a problem with this patch's solution. I can't think of a
better way to make NOHZ_FULL work with bandwidth control. But it should
be no problem for most use-cases? If the task used more than its quota,
it would need to repay that debt before it can run again, so it only
misbehaves within that period.

The main problem this patch solves is that a task could accumulate a
very large delta_exec accounted in one period, and then be throttled
for a very long time.

> 
> However, I don't really see a way for cfsb's current design to avoid
> that on nohz_full (without vetoing sched_can_stop_tick). It might be
> possible to try and guess when quota will run out by tracking the number
> of running or runnable threads, but that would be a nontrivial
> undertaking and involve a bunch of hrtimer reprogramming (particularly
> in cases when some cgroups actually have tasks that wake/sleep frequently).
> 
> 
>>
>> echo $$ > test/cgroup.procs
>> taskset -c 1 bash -c "while true; do let i++; done"
>> ```
>> Ctrl-C and cat test/cpu.stat to see if nr_throttled > 0.
>>
>> The above testcase uses period 100ms and quota 105ms; you would
>> only see nr_throttled > 0 on a nohz_full system. The problem
>> is gone in tests with this patch.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/sched/core.c  |  4 ++++
>>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++++++++++
>>  kernel/sched/sched.h |  3 +++
>>  3 files changed, 37 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d575b4914925..17b5e3d27401 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -10443,6 +10443,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>>  	 */
>>  	if (runtime_enabled && !runtime_was_enabled)
>>  		cfs_bandwidth_usage_inc();
>> +
>> +	if (runtime_was_enabled)
>> +		sync_cfs_bandwidth_runtime(cfs_b);
>> +
>>  	raw_spin_lock_irq(&cfs_b->lock);
>>  	cfs_b->period = ns_to_ktime(period);
>>  	cfs_b->quota = quota;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ee0664c9d291..ebda70a0e3a8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5339,6 +5339,34 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>>  	return HRTIMER_NORESTART;
>>  }
>>  
>> +#ifdef CONFIG_NO_HZ_FULL
>> +void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>> +{
>> +	unsigned int cpu;
>> +	struct rq *rq;
>> +	struct rq_flags rf;
>> +	struct cfs_rq *cfs_rq;
>> +	struct task_group *tg;
>> +
>> +	tg = container_of(cfs_b, struct task_group, cfs_bandwidth);
>> +
>> +	for_each_online_cpu(cpu) {
>> +		if (!tick_nohz_tick_stopped_cpu(cpu))
>> +			continue;
>> +
>> +		rq = cpu_rq(cpu);
>> +		cfs_rq = tg->cfs_rq[cpu];
>> +
>> +		rq_lock_irqsave(rq, &rf);
> 
> A for-every-nohz-cpu rqlock is definitely not great when the cgroup is
> likely to be running on far fewer cpus. I'm not sure what the current
> opinion would be on doing some sort of optimistic locking like
> 
> if (!READ_ONCE(cfs_rq->curr)) continue;
> rq_lock_irqsave(...);
> if (cfs_rq->curr) { ... }

I think that's better; I will change to this if Peter is OK with the patch.

Thanks.

> 
> but solutions to the other problem I mentioned earlier might provide
> help here anyways.
> 
> 
>> +		if (cfs_rq->curr) {
>> +			update_rq_clock(rq);
>> +			update_curr(cfs_rq);
>> +		}
>> +		rq_unlock_irqrestore(rq, &rf);
>> +	}
>> +}
>> +#endif
>> +
>>  extern const u64 max_cfs_quota_period;
>>  
>>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> @@ -5350,6 +5378,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>>  	int idle = 0;
>>  	int count = 0;
>>  
>> +	sync_cfs_bandwidth_runtime(cfs_b);
>> +
>>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>>  	for (;;) {
>>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 58263f90c559..57f9da9c50c1 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2351,9 +2351,12 @@ static inline void sched_update_tick_dependency(struct rq *rq)
>>  	else
>>  		tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED);
>>  }
>> +
>> +extern void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>>  #else
>>  static inline int sched_tick_offload_init(void) { return 0; }
>>  static inline void sched_update_tick_dependency(struct rq *rq) { }
>> +static inline void sync_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {}
>>  #endif
>>  
>>  static inline void add_nr_running(struct rq *rq, unsigned count)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-28 16:44           ` Steven Rostedt
  2022-03-29  2:58             ` Chengming Zhou
@ 2022-03-30 18:23             ` Peter Zijlstra
  2022-03-30 18:37               ` Steven Rostedt
  2022-03-30 19:14               ` Phil Auld
  1 sibling, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2022-03-30 18:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, linux-kernel,
	duanxiongchun, songmuchun, Frederic Weisbecker

On Mon, Mar 28, 2022 at 12:44:54PM -0400, Steven Rostedt wrote:
> On Mon, 28 Mar 2022 17:56:07 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > echo $$ > test/cgroup.procs
> > > taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled  
> > 
> > Of course.. I'm arguing that bandwidth control and NOHZ_FULL are somewhat
> > mutually exclusive, use-case wise. So I really don't get why you'd want
> > them both.
> 
> Is it?
> 
> One use case I can see for having both is for having a deadline task that
> needs to get something done in a tight deadline. NOHZ_FULL means "do not
> interrupt this task when it is the top priority task on the CPU and is
> running in user space".

This is absolute batshit.. It means no such thing. We'll happily wake
another task to this CPU and re-enable the tick any instant.

Worse; the use-case at hand pertains to cfs bandwidth control, which
pretty much guarantees there *will* be an interrupt.

> Why is it mutually exclusive to have a deadline task that does not want to
> be interrupted by timer interrupts?

This has absolutely nothing to do with deadline tasks, nada, noppes.

> Just because the biggest pushers of NOHZ_FULL are those that are running
> RT tasks completely in user space and even want to fault if it ever goes
> into the kernel, doesn't mean that's the only use case.

> Because there are costs associated with the whole thing: system entry/exit
> gets far more expensive. It just doesn't make much sense to use NOHZ_FULL
> if you're not absolutely limiting system entry.

> Chengming brought up VMs. That's a case to want to control the bandwidth,
> but also not interrupt them with timer interrupts when they are running as
> the top priority task on a CPU.

It's CFS, there is nothing top priority about that.


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-30 18:23             ` Peter Zijlstra
@ 2022-03-30 18:37               ` Steven Rostedt
  2022-03-30 19:14               ` Phil Auld
  1 sibling, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2022-03-30 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chengming Zhou, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, linux-kernel,
	duanxiongchun, songmuchun, Frederic Weisbecker

On Wed, 30 Mar 2022 20:23:27 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Chengming brought up VMs. That's a case to want to control the bandwidth,
> > but also not interrupt them with timer interrupts when they are running as
> > the top priority task on a CPU.  
> 
> It's CFS, there is nothing top priority about that.

If there's only one task running on a CPU, even with CFS, why do we need a
tick?

Look, we have a host that is doing nothing but running VMs. Could be a
single VM. The host doesn't even have disk, it just runs on from initrd,
and the user is actually just logging into a guest.

Why should the host be triggering a tick when running this VM, which is the
only thing that is running on the host? Yeah, it's CFS; but does CFS have to
have a tick even when there's nothing else on the CPU?

-- Steve


* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-30 18:23             ` Peter Zijlstra
  2022-03-30 18:37               ` Steven Rostedt
@ 2022-03-30 19:14               ` Phil Auld
  2022-04-01  7:05                 ` Chengming Zhou
  1 sibling, 1 reply; 15+ messages in thread
From: Phil Auld @ 2022-03-30 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Chengming Zhou, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, bsegall, mgorman, bristot,
	linux-kernel, duanxiongchun, songmuchun, Frederic Weisbecker

Hi Peter,

On Wed, Mar 30, 2022 at 08:23:27PM +0200 Peter Zijlstra wrote:
> On Mon, Mar 28, 2022 at 12:44:54PM -0400, Steven Rostedt wrote:
> > On Mon, 28 Mar 2022 17:56:07 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > > echo $$ > test/cgroup.procs
> > > > taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled  
> > > 
> > > Of course.. I'm arguing that bandwidth control and NOHZ_FULL are somewhat
> > > mutually exclusive, use-case wise. So I really don't get why you'd want
> > > them both.
> > 
> > Is it?
> > 
> > One use case I can see for having both is for having a deadline task that
> > needs to get something done in a tight deadline. NOHZ_FULL means "do not
> > interrupt this task when it is the top priority task on the CPU and is
> > running in user space".
> 
> This is absolute batshit.. It means no such thing. We'll happily wake
> another task to this CPU and re-enable the tick any instant.
> 
> Worse; the use-case at hand pertains to cfs bandwidth control, which
> pretty much guarantees there *will* be an interrupt.

The problem is (at least in some cases) that container orchestration userspace
code allocates a whole CPU by setting quota == period, or three CPUs as quota == 3 * period, etc.

In cases where an isolated task is expected to run uninterrupted (the only task
in the system affined to that CPU, nohz_full, nocbs, etc.), you can end up with
it getting throttled even though it theoretically has enough bandwidth for the
full CPU and therefore should never be throttled.

There are radio network setups where the packet processing is isolated
like this but the system as a whole is managed by container orchestration so 
everything has cfs bandwidth quotas set.

I don't think the bandwidth controls in these cases are generally used for
CPU sharing (quota < period). I agree that doesn't make much sense with
NOHZ_FULL and won't work right.

It's doled out as full cpu(s) in these cases.

That's not a VM case, so it's likely different from the one that started this
thread, but I thought I should mention it.


Cheers,
Phil

> 
> > Why is it mutually exclusive to have a deadline task that does not want to
> > be interrupted by timer interrupts?
> 
> This has absolutely nothing to do with deadline tasks, nada, noppes.
> 
> > Just because the biggest pushers of NOHZ_FULL are those that are running
> > RT tasks completely in user space and even want to fault if it ever goes
> > into the kernel, doesn't mean that's the only use case.
> 
> > Because there are costs associated with the whole thing: system entry/exit
> > gets far more expensive. It just doesn't make much sense to use NOHZ_FULL
> > if you're not absolutely limiting system entry.
> 
> > Chengming brought up VMs. That's a case to want to control the bandwidth,
> > but also not interrupt them with timer interrupts when they are running as
> > the top priority task on a CPU.
> 
> It's CFS, there is nothing top priority about that.
> 

-- 



* Re: [External] Re: [PATCH] sched/fair: fix broken bandwidth control with nohz_full
  2022-03-30 19:14               ` Phil Auld
@ 2022-04-01  7:05                 ` Chengming Zhou
  0 siblings, 0 replies; 15+ messages in thread
From: Chengming Zhou @ 2022-04-01  7:05 UTC (permalink / raw)
  To: Phil Auld, Peter Zijlstra
  Cc: Steven Rostedt, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, bsegall, mgorman, bristot, linux-kernel,
	duanxiongchun, songmuchun, Frederic Weisbecker

Hi,

On 2022/3/31 03:14, Phil Auld wrote:
> Hi Peter,
> 
> On Wed, Mar 30, 2022 at 08:23:27PM +0200 Peter Zijlstra wrote:
>> On Mon, Mar 28, 2022 at 12:44:54PM -0400, Steven Rostedt wrote:
>>> On Mon, 28 Mar 2022 17:56:07 +0200
>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>>> echo $$ > test/cgroup.procs
>>>>> taskset -c 1 bash -c "while true; do let i++; done"  --> will be throttled  
>>>>
>>>> Of course.. I'm arguing that bandwidth control and NOHZ_FULL are somewhat
>>>> mutually exclusive, use-case wise. So I really don't get why you'd want
>>>> them both.
>>>
>>> Is it?
>>>
>>> One use case I can see for having both is for having a deadline task that
>>> needs to get something done in a tight deadline. NOHZ_FULL means "do not
>>> interrupt this task when it is the top priority task on the CPU and is
>>> running in user space".
>>
>> This is absolute batshit.. It means no such thing. We'll happily wake
>> another task to this CPU and re-enable the tick any instant.
>>
>> Worse; the use-case at hand pertains to cfs bandwidth control, which
>> pretty much guarantees there *will* be an interrupt.
> 
> The problem is (at least in some cases) that container orchestration userspace
> code allocates a whole CPU by setting quota == period, or three CPUs as quota == 3 * period, etc.
> 
> In cases where an isolated task is expected to run uninterrupted (the only task
> in the system affined to that CPU, nohz_full, nocbs, etc.), you can end up with
> it getting throttled even though it theoretically has enough bandwidth for the
> full CPU and therefore should never be throttled.
> 
> There are radio network setups where the packet processing is isolated
> like this but the system as a whole is managed by container orchestration so 
> everything has cfs bandwidth quotas set.
> 
> I don't think the bandwidth controls in these cases are generally used for
> CPU sharing (quota < period). I agree that doesn't make much sense with
> NOHZ_FULL and won't work right.
> 
> It's doled out as full cpu(s) in these cases.
> 
> That's not a VM case, so it's likely different from the one that started this
> thread, but I thought I should mention it.

Yes, it's a different use case from ours. Thanks for sharing it with us. I
should mention these cases in the patch log and send an updated version.

Thanks.

> 
> 
> Cheers,
> Phil
> 
>>
>>> Why is it mutually exclusive to have a deadline task that does not want to
>>> be interrupted by timer interrupts?
>>
>> This has absolutely nothing to do with deadline tasks, nada, noppes.
>>
>>> Just because the biggest pushers of NOHZ_FULL are those that are running
>>> RT tasks completely in user space and even want to fault if it ever goes
>>> into the kernel, doesn't mean that's the only use case.
>>
>> Because there are costs associated with the whole thing: system entry/exit
>> gets far more expensive. It just doesn't make much sense to use NOHZ_FULL
>> if you're not absolutely limiting system entry.
>>
>>> Chengming brought up VMs. That's a case to want to control the bandwidth,
>>> but also not interrupt them with timer interrupts when they are running as
>>> the top priority task on a CPU.
>>
>> It's CFS, there is nothing top priority about that.
>>
> 


end of thread, other threads:[~2022-04-01  7:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-28 11:07 [PATCH] sched/fair: fix broken bandwidth control with nohz_full Chengming Zhou
2022-03-28 13:20 ` Peter Zijlstra
2022-03-28 13:50   ` [External] " Chengming Zhou
2022-03-28 15:17     ` Peter Zijlstra
2022-03-28 15:40       ` Chengming Zhou
2022-03-28 15:56         ` Peter Zijlstra
2022-03-28 16:35           ` Chengming Zhou
2022-03-28 16:44           ` Steven Rostedt
2022-03-29  2:58             ` Chengming Zhou
2022-03-30 18:23             ` Peter Zijlstra
2022-03-30 18:37               ` Steven Rostedt
2022-03-30 19:14               ` Phil Auld
2022-04-01  7:05                 ` Chengming Zhou
2022-03-28 19:05 ` Benjamin Segall
2022-03-29  3:36   ` [External] " Chengming Zhou
