Re: [PATCH 13/16] xen: sched: allow for choosing credit2 runqueues configuration at boot

From: Juergen Gross <jgross@suse.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
	xen-devel@lists.xenproject.org
Cc: George Dunlap <george.dunlap@eu.citrix.com>,
	Uma Sharma <uma.sharma523@gmail.com>
Subject: Re: [PATCH 13/16] xen: sched: allow for choosing credit2 runqueues configuration at boot
Date: Tue, 22 Mar 2016 08:46:14 +0100	[thread overview]
Message-ID: <56F0F846.3020804@suse.com> (raw)
In-Reply-To: <20160318190538.8117.96025.stgit@Solace.station>

On 18/03/16 20:05, Dario Faggioli wrote:
> In fact, credit2 uses CPU topology to decide how to arrange
> its internal runqueues. Before this change, only 'one runqueue
> per socket' was allowed. However, experiments have shown that,
> for instance, having one runqueue per physical core improves
> performance, especially in case hyperthreading is available.
> 
> In general, it makes sense to allow users to pick one runqueue
> arrangement at boot time, so that:
>  - more experiments can be easily performed to even better
>    assess and improve performance;
>  - one can select the best configuration for his specific
>    use case and/or hardware.
> 
> This patch enables the above.
> 
> Note that, for correctly arranging runqueues to be per-core,
> just checking cpu_to_core() on the host CPUs is not enough.
> In fact, cores (and hyperthreads) on different sockets, can
> have the same core (and thread) IDs! We, therefore, need to
> check whether the full topology of two CPUs matches, for
> them to be put in the same runqueue.
> 
> Note also that the default (although not functional) for
> credit2, since now, has been per-socket runqueue. This patch
> leaves things that way, to avoid mixing policy and technical
> changes.
> 
> Finally, it would be a nice feature to be able to select
> a particular runqueue arrangement, even when creating a
> Credit2 cpupool. This is left as future work.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Uma Sharma <uma.sharma523@gmail.com>
> Cc: Juergen Gross <jgross@suse.com>
> ---
> Cahnges from v1:
>  * added 'node' and 'global' runqueue arrangements, as
>    suggested during review;
> ---
>  docs/misc/xen-command-line.markdown |   19 +++++++++
>  xen/common/sched_credit2.c          |   76 +++++++++++++++++++++++++++++++++--
>  2 files changed, 90 insertions(+), 5 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index ca77e3b..0047f94 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -469,6 +469,25 @@ combination with the `low_crashinfo` command line option.
>  ### credit2\_load\_window\_shift
>  > `= <integer>`
>  
> +### credit2\_runqueue
> +> `= core | socket | node | all`
> +
> +> Default: `socket`
> +
> +Specify how host CPUs are arranged in runqueues. Runqueues are kept
> +balanced with respect to the load generated by the vCPUs running on
> +them. Smaller runqueues (as in with `core`) means more accurate load
> +balancing (for instance, it will deal better with hyperthreading),
> +but also more overhead.
> +
> +Available alternatives, with their meaning, are:
> +* `core`: one runqueue per each physical core of the host;
> +* `socket`: one runqueue per each physical socket (which often,
> +            but not always, matches a NUMA node) of the host;
> +* `node`: one runqueue per each NUMA node of the host;
> +* `all`: just one runqueue shared by all the logical pCPUs of
> +         the host
> +
>  ### dbgp
>  > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
>  
> diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
> index 456b9ea..c242dc4 100644
> --- a/xen/common/sched_credit2.c
> +++ b/xen/common/sched_credit2.c
> @@ -81,10 +81,6 @@
>   * Credits are "reset" when the next vcpu in the runqueue is less than
>   * or equal to zero.  At that point, everyone's credits are "clipped"
>   * to a small value, and a fixed credit is added to everyone.
> - *
> - * The plan is for all cores that share an L2 will share the same
> - * runqueue.  At the moment, there is one global runqueue for all
> - * cores.
>   */
>  
>  /*
> @@ -193,6 +189,55 @@ static int __read_mostly opt_overload_balance_tolerance = -3;
>  integer_param("credit2_balance_over", opt_overload_balance_tolerance);
>  
>  /*
> + * Runqueue organization.
> + *
> + * The various cpus are to be assigned each one to a runqueue, and we
> + * want that to happen basing on topology. At the moment, it is possible
> + * to choose to arrange runqueues to be:
> + *
> + * - per-core: meaning that there will be one runqueue per each physical
> + *             core of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'core';
> + *
> + * - per-node: meaning that there will be one runqueue per each physical
> + *             NUMA node of the host. This will happen if the opt_runqueue
> + *             parameter is set to 'node';
> + *
> + * - per-socket: meaning that there will be one runqueue per each physical
> + *               socket (AKA package, which often, but not always, also
> + *               matches a NUMA node) of the host; This will happen if
> + *               the opt_runqueue parameter is set to 'socket';
> + *
> + * - global: meaning that there will be only one runqueue to which all the
> + *           (logical) processors of the host belongs. This will happen if
> + *           the opt_runqueue parameter is set to 'all'.
> + *
> + * Depending on the value of opt_runqueue, therefore, cpus that are part of
> + * either the same physical core, or of the same physical socket, will be
> + * put together to form runqueues.
> + */
> +#define OPT_RUNQUEUE_CORE   1
> +#define OPT_RUNQUEUE_SOCKET 2
> +#define OPT_RUNQUEUE_NODE   3
> +#define OPT_RUNQUEUE_ALL    4
> +static int __read_mostly opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +
> +static void parse_credit2_runqueue(const char *s)
> +{
> +    if ( !strncmp(s, "core", 4) && !s[4] )
> +        opt_runqueue = OPT_RUNQUEUE_CORE;
> +    else if ( !strncmp(s, "socket", 6) && !s[6] )
> +        opt_runqueue = OPT_RUNQUEUE_SOCKET;
> +    else if ( !strncmp(s, "node", 4) && !s[4] )
> +        opt_runqueue = OPT_RUNQUEUE_NODE;
> +    else if ( !strncmp(s, "all", 6) && !s[6] )

The length is wrong. Should be 3 instead of 6 here.

Which poses the question: why don't you use strcmp() here? I don't see
any advantage using strncmp() in this case, especially as you've just
proven it is more error prone here.

> +        opt_runqueue = OPT_RUNQUEUE_ALL;
> +    else
> +        printk("WARNING, unrecognized value of credit2_runqueue option!\n");
> +}
> +custom_param("credit2_runqueue", parse_credit2_runqueue);
> +
> +/*
>   * Per-runqueue data
>   */
>  struct csched2_runqueue_data {
> @@ -1971,6 +2016,22 @@ static void deactivate_runqueue(struct csched2_private *prv, int rqi)
>      cpumask_clear_cpu(rqi, &prv->active_queues);
>  }
>  
> +static inline bool_t same_node(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_node(cpua) == cpu_to_node(cpub);
> +}
> +
> +static inline bool_t same_socket(unsigned int cpua, unsigned int cpub)
> +{
> +    return cpu_to_socket(cpua) == cpu_to_socket(cpub);
> +}
> +
> +static inline bool_t same_core(unsigned int cpua, unsigned int cpub)
> +{
> +    return same_socket(cpua, cpub) &&
> +           cpu_to_core(cpua) == cpu_to_core(cpub);
> +}
> +
>  static unsigned int
>  cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>  {
> @@ -2003,7 +2064,10 @@ cpu_to_runqueue(struct csched2_private *prv, unsigned int cpu)
>          BUG_ON(cpu_to_socket(cpu) == XEN_INVALID_SOCKET_ID ||
>                 cpu_to_socket(peer_cpu) == XEN_INVALID_SOCKET_ID);
>  
> -        if ( cpu_to_socket(cpumask_first(&rqd->active)) == cpu_to_socket(cpu) )
> +        if ( opt_runqueue == OPT_RUNQUEUE_ALL ||
> +             (opt_runqueue == OPT_RUNQUEUE_CORE && same_core(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_SOCKET && same_socket(peer_cpu, cpu)) ||
> +             (opt_runqueue == OPT_RUNQUEUE_NODE && same_node(peer_cpu, cpu)) )
>              break;
>      }
>  
> @@ -2157,6 +2221,8 @@ csched2_init(struct scheduler *ops)
>      printk(" load_window_shift: %d\n", opt_load_window_shift);
>      printk(" underload_balance_tolerance: %d\n", opt_underload_balance_tolerance);
>      printk(" overload_balance_tolerance: %d\n", opt_overload_balance_tolerance);
> +    printk(" runqueues arrangement: per-%s\n",
> +           opt_runqueue == OPT_RUNQUEUE_CORE ? "core" : "socket");

node? all?

>  
>      if ( opt_load_window_shift < LOADAVG_WINDOW_SHIFT_MIN )
>      {

Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel