Re: [RFC v5 2/9] sched/deadline: improve the tracking of active utilization

From: Luca Abeni <luca.abeni@santannapisa.it>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@arm.com>,
	Claudio Scordino <claudio@evidence.eu.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Tommaso Cucinotta <tommaso.cucinotta@sssup.it>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Joel Fernandes <joelaf@google.com>,
	Mathieu Poirier <mathieu.poirier@linaro.org>
Subject: Re: [RFC v5 2/9] sched/deadline: improve the tracking of active utilization
Date: Mon, 24 Jul 2017 10:06:09 +0200	[thread overview]
Message-ID: <20170724100609.4ec38473@luca> (raw)
In-Reply-To: <20170324224715.4098dbfb@nowhere>

Hi Peter,

On Fri, 24 Mar 2017 22:47:15 +0100
luca abeni <luca.abeni@santannapisa.it> wrote:

> Hi Peter,
> 
> On Fri, 24 Mar 2017 14:20:41 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> >   
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index d67eee8..952cac8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -445,16 +445,33 @@ struct sched_dl_entity {
> > >  	 *
> > >  	 * @dl_yielded tells if task gave up the CPU before
> > > consuming
> > >  	 * all its available runtime during the last job.
> > > +	 *
> > > +	 * @dl_non_contending tells if task is inactive while still
> > > +	 * contributing to the active utilization. In other words,
> > > it
> > > +	 * indicates if the inactive timer has been armed and its
> > > handler
> > > +	 * has not been executed yet. This flag is useful to avoid
> > > race
> > > +	 * conditions between the inactive timer handler and the
> > > wakeup
> > > +	 * code.
> > >  	 */
> > >  	int				dl_throttled;
> > >  	int				dl_boosted;
> > >  	int				dl_yielded;
> > > +	int				dl_non_contending;    
> > 
> > We should maybe look into making those bits :1, not something for this
> > patch though;  
> 
> Yes, grouping all the flags in a single field was my intention too... I
> planned to submit a patch to do this after merging the reclaiming
> patches... But maybe it is better to do this first :)

I implemented this change, but before submitting the patch I have a
small question.
I implemented some helpers to access the various
{throttled,boosted,yielded,non_contending} flags. I have some
"dl_{throttled,boosted,...}()" inline functions for reading the values
of the flags, and some inline functions for setting / clearing the
flags. For these, I have two possibilities:
- using two separate "dl_set_{throttled,...}()" and
  "dl_clear_{throttled,..}()" functions for each flag
- using one single "dl_set_{throttled,...}(dl, value)" function per
  flag, in which the flag's value is specified.

I have no preferences (with the first proposal, I introduce more inline
functions, but I think the functions can be made more efficient /
optimized). Which one of the two proposals is preferred? (or, there is
a third, better, idea that I overlooked?)

			Thanks,
				Luca

> 
> 
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index cef9adb..86aed82 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq
> > > *dl_rq) dl_rq->running_bw = 0;
> > >  }
> > >  
> > > +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> > > +{
> > > +	if (!task_on_rq_queued(p)) {
> > > +		struct rq *rq = task_rq(p);
> > > +
> > > +		if (p->dl.dl_non_contending) {
> > > +			sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > +			p->dl.dl_non_contending = 0;
> > > +			/*
> > > +			 * If the timer handler is currently
> > > running and the
> > > +			 * timer cannot be cancelled,
> > > inactive_task_timer()
> > > +			 * will see that dl_not_contending is not
> > > set, and
> > > +			 * will not touch the rq's active
> > > utilization,
> > > +			 * so we are still safe.
> > > +			 */
> > > +			if
> > > (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > > +				put_task_struct(p);
> > > +		}
> > > +	}
> > > +}    
> > 
> > If we rearrange that slightly we can avoid the double indent:  
> [...]
> Ok; I'll rework the code in this way on Monday.
> 
> [...]
> > > +		if (hrtimer_try_to_cancel(&dl_se->inactive_timer)
> > > == 1)
> > > +			put_task_struct(dl_task_of(dl_se));
> > > +		dl_se->dl_non_contending = 0;    
> > 
> > This had me worried for a little while as being a use-after-free, but
> > this put_task_struct() cannot be the last in this case. Might be
> > worth a comment.  
> 
> Or maybe it is better to move "dl_se->dl_non_contending = 0;" before
> hrtimer_try_to_cancel()?
> 
> >   
> > > +	} else {
> > > +		/*
> > > +		 * Since "dl_non_contending" is not set, the
> > > +		 * task's utilization has already been removed from
> > > +		 * active utilization (either when the task
> > > blocked,
> > > +		 * when the "inactive timer" fired).
> > > +		 * So, add it back.
> > > +		 */
> > > +		add_running_bw(dl_se->dl_bw, dl_rq);
> > > +	}
> > > +}    
> > 
> > In general I feel it would be nice to have a state diagram included
> > somewhere near these two functions. It would be nice to not have to
> > dig out the PDF every time.  
> 
> Ok... Since I am not good at ascii art, would it be ok to add a textual
> description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active utilization
> when the task becomes active (is enqueued in the runqueue), and is
> removed when the task becomes inactive. A task does not become
> immediately inactive when it blocks, but becomes inactive at the so
> called "0 lag time"; so, we setup the "inactive timer" to fire at the
> "0 lag time". When the "inactive timer" fires, the task utilization is
> removed from the runqueue's active utilization. If the task wakes up
> again on the same runqueue before the "0 lag time", the active
> utilization must not be changed and the "inactive timer" must be
> cancelled. If the task wakes up again on a different runqueue before
> the "0 lag time", then the task's utilization must be removed from the
> previous runqueue's active utilization and must be added to the new
> runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while the
> "inactive timer" is running on a different CPU, the "dl_non_contending"
> flag is used to indicate that a task is not on a runqueue but is active
> (so, the flag is set when the task blocks and is cleared when the
> "inactive timer" fires or when the task  wakes up).
> "
> (if this is ok, where can I add this comment?)
> 
> 
> > > +static void migrate_task_rq_dl(struct task_struct *p)
> > > +{
> > > +	if ((p->state == TASK_WAKING) &&
> > > (p->dl.dl_non_contending)) {
> > > +		struct rq *rq = task_rq(p);
> > > +
> > > +		raw_spin_lock(&rq->lock);
> > > +		sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > +		p->dl.dl_non_contending = 0;
> > > +		/*
> > > +		 * If the timer handler is currently running and
> > > the
> > > +		 * timer cannot be cancelled, inactive_task_timer()
> > > +		 * will see that dl_not_contending is not set, and
> > > +		 * will not touch the rq's active utilization,
> > > +		 * so we are still safe.
> > > +		 */
> > > +		if (hrtimer_try_to_cancel(&p->dl.inactive_timer)
> > > == 1)
> > > +			put_task_struct(p);
> > > +
> > > +		raw_spin_unlock(&rq->lock);
> > > +	}
> > > +}    
> > 
> > This can similarly be reduced in indent, albeit this is only a single
> > indent level, so not as onerous as the other one.  
> 
> Ok; I'll do this on Monday.
> 
> 
> > What had me puzzled for a while here is taking the lock; because some
> > callers of set_task_cpu() must in fact hold this lock already. Worth a
> > comment I feel.  
> 
> Ok; I'll add a comment mentioning that since state is TASK_WAKING we do
> not have the lock.
> 
> 
> > Once I figured out the exact locking; I realized you do this on
> > cross-cpu wakeups. We spend a fair amount of effort not to take both
> > rq locks. But I suppose you have to here.  
> 
> The problem is that when a task wakes up before the "0 lag time" on a
> different runqueue, we must "migrate" its utilization from the old
> runqueue to the new one (see comment above). And I need the lock to
> avoid races... The only alternative I can think about is to ad a new
> lock protecting the active utilization, and to take it instead of the
> runqueue lock (I do not know if this is enough, I need to check...).
> I'll have a look on Monday.
> 
> 
> 
> 			Thanks,
> 				Luca