From mboxrd@z Thu Jan  1 00:00:00 1970
From: George Dunlap <george.dunlap@eu.citrix.com>
Subject: Re: [PATCH v3 1/4] sched: credit2: respect per-vcpu
	hard affinity
Date: Tue, 31 Mar 2015 15:37:28 +0100
Message-ID: <551AB128.8090905@eu.citrix.com>
References: <1427363314-25430-1-git-send-email-jtweaver@hawaii.edu>
	<1427363314-25430-2-git-send-email-jtweaver@hawaii.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1427363314-25430-2-git-send-email-jtweaver@hawaii.edu>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: "Justin T. Weaver" <jtweaver@hawaii.edu>, xen-devel@lists.xen.org
Cc: dario.faggioli@citrix.com, henric@hawaii.edu
List-Id: xen-devel@lists.xenproject.org

On 03/26/2015 09:48 AM, Justin T. Weaver wrote:
> by making sure that vcpus only run on the pcpu(s) they are allowed to
> run on based on their hard affinity cpu masks.
> 
> Signed-off-by: Justin T. Weaver <jtweaver@hawaii.edu>

Hey Justin!  Getting close.  A couple of comments:

> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 7581731..af716e4 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -176,6 +176,7 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define VCPU2ONLINE(_v)     cpupool_online_cpumask((_v)->domain->cpupool)
>  
>  /*
>   * Shifts for load average.
> @@ -194,6 +195,12 @@ int opt_overload_balance_tolerance=-3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Use this to avoid having too many cpumask_t structs on the stack
> + */
> +static cpumask_t **scratch_mask = NULL;
> +#define csched2_cpumask scratch_mask[smp_processor_id()]

Since I'm asking for changes below:  It's not clear to me when I'm
scanning the code that csched2_cpumask is a scratch variable.  What
about calling the actual variable _scratch_cpumask and havind the
#define be scratch_cpumask?

>      /* Find the runqueue with the lowest instantaneous load */
>      for_each_cpu(i, &prv->active_queues)
>      {
>          struct csched2_runqueue_data *rqd;
> -        s_time_t rqd_avgload;
> +        s_time_t rqd_avgload = MAX_LOAD;
>  
>          rqd = prv->rqd + i;
>  
>          /* If checking a different runqueue, grab the lock,
> -         * read the avg, and then release the lock.
> +         * check hard affinity, read the avg, and then release the lock.
>           *
>           * If on our own runqueue, don't grab or release the lock;
>           * but subtract our own load from the runqueue load to simulate
> -         * impartiality */
> +         * impartiality.
> +         *
> +         * svc's hard affinity may have changed; this function is the
> +         * credit 2 scheduler's first opportunity to react to the change,
> +         * so it is possible here that svc does not have hard affinity
> +         * with any of the pcpus of svc's currently assigned run queue.
> +         */
>          if ( rqd == svc->rqd )
>          {
> -            rqd_avgload = rqd->b_avgload - svc->avgload;
> +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                rqd_avgload = rqd->b_avgload - svc->avgload;
>          }
>          else if ( spin_trylock(&rqd->lock) )
>          {
> -            rqd_avgload = rqd->b_avgload;
> +            if ( cpumask_intersects(vc->cpu_hard_affinity, &rqd->active) )
> +                rqd_avgload = rqd->b_avgload;
> +
>              spin_unlock(&rqd->lock);
>          }
>          else

Since we're already potentially falling through and doing the comparison
below with an unchanged rqd_avgload (which has now been set to MAX_LOAD
above), I wonder if it makes more sense to remove this "else continue"
here, just to avoid confusing people.

> @@ -2024,6 +2096,13 @@ csched2_alloc_pdata(const struct scheduler *ops, int cpu)
>          printk("%s: cpu %d not online yet, deferring initializatgion\n",
>                 __func__, cpu);
>  
> +    /*
> +     * For each new pcpu, allocate a cpumask_t for use throughout the
> +     * scheduler to avoid putting any cpumask_t structs on the stack.
> +     */
> +    if ( !zalloc_cpumask_var(&scratch_mask[cpu]) )

Any reason not to use "scratch_mask + cpu" here rather than
"&scratch_mask[cpu]"?

It might not be a bad idea to ad ASSERT(scratch_mask[cpu] == NULL)
before this, just to be paranoid...

> +        return NULL;
> +
>      return (void *)1;
>  }
>  
> @@ -2072,6 +2151,8 @@ csched2_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
>  
>      spin_unlock_irqrestore(&prv->lock, flags);
>  
> +    free_cpumask_var(scratch_mask[cpu]);

We should definitely set this to NULL after freeing it (see below)

> +
>      return;
>  }
>  
> @@ -2159,6 +2240,10 @@ csched2_init(struct scheduler *ops)
>  
>      prv->load_window_shift = opt_load_window_shift;
>  
> +    scratch_mask = xmalloc_array(cpumask_t *, nr_cpu_ids);

I realize Dario recommended using xmalloc_array() instead of
xzalloc_array(), but I don't understand why he thinks that's OK.  His
mail says "(see below about why I actually don't
think we need)", but I don't actually see that addressed in that e-mail.

I think it's just dangerous to leave uninitialized pointers around.  The
invariant should be that if the array entry is invalid it's NULL, and if
it's non-null then it's valid.

(* marc.info/?i=<1426266674.32500.25.camel@citrix.com>)

Also -- I think this allocation wants to happen in global_init(), right?
 Otherwise if you make a second cpupool with the credit2 scheduler this
will be clobbered.  (I think nr_cpu_ids should be defined at that point.)

 -George