From: Peter Zijlstra <peterz@infradead.org>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Wanpeng Li <wanpengli@tencent.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Yauheni Kaliuta <yauheni.kaliuta@redhat.com>,
	Ingo Molnar <mingo@kernel.org>, Rik van Riel <riel@redhat.com>
Subject: Re: [PATCH 19/25] sched/vite: Handle nice updates under vtime
Date: Tue, 20 Nov 2018 15:17:54 +0100	[thread overview]
Message-ID: <20181120141754.GW2131@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <1542163569-20047-20-git-send-email-frederic@kernel.org>

On Wed, Nov 14, 2018 at 03:46:03AM +0100, Frederic Weisbecker wrote:
> On the vtime level, nice updates are currently handled on context
> switches. When a task's nice value gets updated while it is sleeping,
> the context switch takes into account the new nice value in order to
> later record the vtime delta to the appropriate kcpustat index.
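
(For reference, a paraphrased sketch of the context-switch side added by the
previous patch, as I read it; the helper name below is made up:)

/*
 * Sketch only: on switch-in, latch the incoming task's nice level into
 * its vtime state under the seqcount, so that later flushes land in
 * CPUTIME_NICE vs CPUTIME_USER.
 */
static void vtime_switch_in_nice(struct task_struct *next)
{
	struct vtime *vtime = &next->vtime;

	write_seqcount_begin(&vtime->seqcount);
	vtime->nice = (task_nice(next) > 0) ? 1 : 0;
	write_seqcount_end(&vtime->seqcount);
}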

Urgh, so this patch should be folded into the previous one. On their own
neither really makes sense.

> We have yet to handle live updates: when set_user_nice() is called
> while the target is running. We'll handle that on two sides:
> 
> * If the caller of set_user_nice() is the current task, we update the
>   vtime state in place.
> 
> * If the target runs on a different CPU, we interrupt it with an IPI to
>   update the vtime state in place.

*groan*... So what are the rules for vtime updates? Who can do that
when?

So when we change nice, we'll have the respective rq locked and task
effectively unqueued. It cannot schedule at such a point. Can
'concurrent' vtime updates still happen?
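
For reference, the bracket around the nice change looks roughly like this
(abridged from set_user_nice(), not verbatim):

	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);

	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
	if (running)
		put_prev_task(rq, p);

	p->static_prio = NICE_TO_PRIO(nice);	/* the actual nice change */
	set_load_weight(p, true);

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
	if (running)
		set_curr_task(rq, p);

	task_rq_unlock(rq, p, &rf);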

> The vtime update in question consists of flushing the pending vtime
> delta to the task/kcpustat and resuming the accounting on top of the new
> nice value.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f12225f..e8f0437 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3868,6 +3868,7 @@ void set_user_nice(struct task_struct *p, long nice)
>  	int old_prio, delta;
>  	struct rq_flags rf;
>  	struct rq *rq;
> +	long old_nice;
>  
>  	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
>  		return;
> @@ -3878,6 +3879,8 @@ void set_user_nice(struct task_struct *p, long nice)
>  	rq = task_rq_lock(p, &rf);
>  	update_rq_clock(rq);
>  
> +	old_nice = task_nice(p);
> +
>  	/*
>  	 * The RT priorities are set via sched_setscheduler(), but we still
>  	 * allow the 'normal' nice value to be set - but as expected
> @@ -3913,6 +3916,7 @@ void set_user_nice(struct task_struct *p, long nice)
>  	if (running)
>  		set_curr_task(rq, p);
>  out_unlock:
> +	vtime_set_nice(rq, p, old_nice);
>  	task_rq_unlock(rq, p, &rf);
>  }

That's not sufficient; I think you want to hook set_load_weight() or
something. Things like sys_sched_setattr() can also change the nice
value.
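
I.e. something along these lines (sketch only, not a patch; the hook
placement below is hypothetical):

	/*
	 * sched_setattr() ends up in __setscheduler_params() ->
	 * set_load_weight() without ever going through set_user_nice(),
	 * so hooking the common path (or its callers) is what covers
	 * both.  Sampling the old value before static_prio is rewritten
	 * is the fiddly part.
	 */
	long old_nice = task_nice(p);			/* sample before the update */

	p->static_prio = NICE_TO_PRIO(attr->sched_nice);
	set_load_weight(p, true);
	vtime_set_nice(task_rq(p), p, old_nice);	/* hypothetical hook */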

>  EXPORT_SYMBOL(set_user_nice);
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 07c2e7f..2b35132 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c

> @@ -937,6 +937,33 @@ void vtime_exit_task(struct task_struct *t)
>  	local_irq_restore(flags);
>  }
>  
> +void vtime_set_nice_local(struct task_struct *t)
> +{
> +	struct vtime *vtime = &t->vtime;
> +
> +	write_seqcount_begin(&vtime->seqcount);
> +	if (vtime->state == VTIME_USER)
> +		vtime_account_user(t, vtime, true);
> +	else if (vtime->state == VTIME_GUEST)
> +		vtime_account_guest(t, vtime, true);
> +	vtime->nice = (task_nice(t) > 0) ? 1 : 0;
> +	write_seqcount_end(&vtime->seqcount);
> +}
> +
> +static void vtime_set_nice_func(struct irq_work *work)
> +{
> +	vtime_set_nice_local(current);
> +}
> +
> +static DEFINE_PER_CPU(struct irq_work, vtime_set_nice_work) = {
> +	.func = vtime_set_nice_func,
> +};
> +
> +void vtime_set_nice_remote(int cpu)
> +{
> +	irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);

What happens if you already had one pending? Do we lose updates?
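
(irq_work_queue_on() does tell you, FWIW; a sketch of at least propagating
that, purely illustrative:)

/*
 * Sketch only: irq_work_queue_on() returns false when the work was
 * already claimed and is not re-queued.  The handler re-reads
 * task_nice(current) when it eventually runs, but propagating the
 * return value at least makes the coalescing explicit to the caller.
 */
bool vtime_set_nice_remote(int cpu)
{
	return irq_work_queue_on(&per_cpu(vtime_set_nice_work, cpu), cpu);
}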

> +}
> +
>  u64 task_gtime(struct task_struct *t)
>  {
>  	struct vtime *vtime = &t->vtime;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 618577f..c7846ca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1790,6 +1790,45 @@ static inline int sched_tick_offload_init(void) { return 0; }
>  static inline void sched_update_tick_dependency(struct rq *rq) { }
>  #endif
>  
> +static inline void vtime_set_nice(struct rq *rq,
> +				  struct task_struct *p, long old_nice)
> +{
> +#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> +	long nice;
> +	int cpu;
> +
> +	if (!vtime_accounting_enabled())
> +		return;
> +
> +	cpu = cpu_of(rq);
> +
> +	if (!vtime_accounting_enabled_cpu(cpu))
> +		return;
> +
> +	/*
> +	 * Task not running, nice update will be seen by vtime on its
> +	 * next context switch.
> +	 */
> +	if (!task_current(rq, p))
> +		return;
> +
> +	nice = task_nice(p);
> +
> +	/* Task stays nice, still accounted as nice in kcpustat */
> +	if (old_nice > 0 && nice > 0)
> +		return;
> +
> +	/* Task stays rude, still accounted as non-nice in kcpustat */
> +	if (old_nice <= 0 && nice <= 0)
> +		return;
> +
> +	if (p == current)
> +		vtime_set_nice_local(p);
> +	else
> +		vtime_set_nice_remote(cpu);
> +#endif
> +}

That's _far_ too large for an inline, I'm thinking. Also, changing nice
really isn't a fast path or anything.
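
Something like the below, perhaps: keep only the cheap check inline and move
the rest next to the other vtime code (sketch, untested; __vtime_set_nice is
a made-up name):

/* kernel/sched/sched.h: only the cheap early-out stays inline */
void __vtime_set_nice(struct rq *rq, struct task_struct *p, long old_nice);

static inline void vtime_set_nice(struct rq *rq, struct task_struct *p,
				  long old_nice)
{
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
	if (vtime_accounting_enabled())
		__vtime_set_nice(rq, p, old_nice);
#endif
}

/* kernel/sched/cputime.c: the rest, out of line -- this is a slow path */
void __vtime_set_nice(struct rq *rq, struct task_struct *p, long old_nice)
{
	int cpu = cpu_of(rq);

	if (!vtime_accounting_enabled_cpu(cpu))
		return;

	/* Not running: the next context switch picks up the new nice. */
	if (!task_current(rq, p))
		return;

	/* Only a nice <-> non-nice transition needs a flush. */
	if ((old_nice > 0) == (task_nice(p) > 0))
		return;

	if (p == current)
		vtime_set_nice_local(p);
	else
		vtime_set_nice_remote(cpu);
}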
