All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Dongli Zhang <dongli.zhang@oracle.com>,
	xen-devel@lists.xenproject.org, linux-kernel@vger.kernel.org
Cc: jgross@suse.com, joao.m.martins@oracle.com
Subject: Re: [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen
Date: Mon, 30 Oct 2017 09:34:44 -0400	[thread overview]
Message-ID: <ce923518-73c1-e4eb-e618-668c74fc7fc0@oracle.com> (raw)
In-Reply-To: <1509350613-15356-1-git-send-email-dongli.zhang@oracle.com>

On 10/30/2017 04:03 AM, Dongli Zhang wrote:
> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
> derived from previous return value of xen_steal_clock().
>
> For instance, steal time of each vcpu is 335 before live migration.
>
> cpu  198 0 368 200064 1962 0 0 1340 0 0
> cpu0 38 0 81 50063 492 0 0 335 0 0
> cpu1 65 0 97 49763 634 0 0 335 0 0
> cpu2 38 0 81 50098 462 0 0 335 0 0
> cpu3 56 0 107 50138 374 0 0 335 0 0
>
> After live migration, steal time is reduced to 312.
>
> cpu  200 0 370 200330 1971 0 0 1248 0 0
> cpu0 38 0 82 50123 500 0 0 312 0 0
> cpu1 65 0 97 49832 634 0 0 312 0 0
> cpu2 39 0 82 50167 462 0 0 312 0 0
> cpu3 56 0 107 50207 374 0 0 312 0 0
>
> Since runstate times are cumulative and cleared during xen live migration
> by xen hypervisor, the idea of this patch is to accumulate runstate times
> to global percpu variables before live migration suspend. Once guest VM is
> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
> runstate times and previously accumulated times stored in global percpu
> variables.
>
> Similar and more severe issue would impact prior linux 4.8-4.10 as
> discussed by Michael Las at
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
> which would overflow steal time and lead to 100% st usage in top command
> for linux 4.8-4.10. A backport of this patch would fix that issue.
>
> References: https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>
> ---
> Changed since v1:
>   * relocate modification to xen_get_runstate_snapshot_cpu
>
> Changed since v2:
>   * accumulate runstate times before live migration
>
> Changed since v3:
>   * do not accumulate times in the case of guest checkpointing
>
> Changed since v4:
>   * allocate array of vcpu_runstate_info to reduce number of memory allocation
>
> ---
>  drivers/xen/manage.c         |  2 ++
>  drivers/xen/time.c           | 68 ++++++++++++++++++++++++++++++++++++++++++--
>  include/xen/interface/vcpu.h |  2 ++
>  include/xen/xen-ops.h        |  1 +
>  4 files changed, 71 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
> index c425d03..3dc085d 100644
> --- a/drivers/xen/manage.c
> +++ b/drivers/xen/manage.c
> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>  	}
>  
>  	gnttab_suspend();
> +	xen_accumulate_runstate_time(-1);
>  	xen_arch_pre_suspend();
>  
>  	/*
> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>                                             : 0);
>  
>  	xen_arch_post_suspend(si->cancelled);
> +	xen_accumulate_runstate_time(si->cancelled);

I am not convinced that the comment above HYPERVISOR_suspend() is
correct. The call can return an error code and so if it returns -EPERM
(which AFAICS it can't now but might in the future) then
xen_accumulate_runstate_time() will do wrong thing.


>  	gnttab_resume();
>  
>  	if (!si->cancelled) {
> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
> index ac5f23f..cf3afb9 100644
> --- a/drivers/xen/time.c
> +++ b/drivers/xen/time.c
> @@ -19,6 +19,9 @@
>  /* runstate info updated by Xen */
>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>  
> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
> +static struct vcpu_runstate_info *runstate_delta;

I'd move this inside xen_accumulate_runstate_time() since that's the
only function that uses it. And why does it need to be
vcpu_runstate_info and not u64[4]?

> +
>  /* return an consistent snapshot of 64-bit time/counter value */
>  static u64 get64(const u64 *p)
>  {
> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>  	return ret;
>  }
>  
> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
> -					  unsigned int cpu)
> +static void xen_get_runstate_snapshot_cpu_delta(
> +			struct vcpu_runstate_info *res, unsigned int cpu)
>  {
>  	u64 state_time;
>  	struct vcpu_runstate_info *state;
> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>  		 (state_time & XEN_RUNSTATE_UPDATE));
>  }
>  
> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
> +					  unsigned int cpu)
> +{
> +	int i;
> +
> +	xen_get_runstate_snapshot_cpu_delta(res, cpu);
> +
> +	for (i = 0; i < RUNSTATE_max; i++)
> +		res->time[i] += per_cpu(old_runstate_time, cpu)[i];
> +}
> +
> +void xen_accumulate_runstate_time(int action)
> +{
> +	struct vcpu_runstate_info state;
> +	int cpu, i;
> +
> +	switch (action) {
> +	case -1: /* backup runstate time before suspend */
> +		WARN_ON_ONCE(unlikely(runstate_delta));

pr_warn_once(), to be consistent with the rest of the file. And then
should you return if this is true?

> +
> +		runstate_delta = kcalloc(num_possible_cpus(),
> +					 sizeof(*runstate_delta),
> +					 GFP_KERNEL);
> +		if (unlikely(!runstate_delta)) {
> +			pr_alert("%s: failed to allocate runstate_delta\n",
> +				    __func__);

pr_warn() should be sufficient. Below too.

Also, as a side question --- can we do kmalloc() at this point?

> +			return;
> +		}
> +
> +		for_each_possible_cpu(cpu) {
> +			xen_get_runstate_snapshot_cpu_delta(&state, cpu);
> +			memcpy(runstate_delta[cpu].time, state.time,
> +			      RUNSTATE_max * sizeof(*runstate_delta[cpu].time));

sizeof(runstate_delta[cpu].time).

> +		}
> +
> +		break;
> +
> +	case 0: /* backup runstate time after resume */
> +		if (unlikely(!runstate_delta)) {
> +			pr_alert("%s: cannot accumulate runstate time as runstate_delta is NULL\n",
> +				    __func__);
> +			return;
> +		}
> +
> +		for_each_possible_cpu(cpu) {
> +			for (i = 0; i < RUNSTATE_max; i++)
> +				per_cpu(old_runstate_time, cpu)[i] +=
> +					runstate_delta[cpu].time[i];
> +		}
> +		break;
> +
> +	default: /* do not accumulate runstate time for checkpointing */
> +		break;
> +	}
> +
> +	if (action != -1 && runstate_delta) {
> +		kfree(runstate_delta);
> +		runstate_delta = NULL;
> +	}
> +}
> +
>  /*
>   * Runstate accounting
>   */
> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
> index 98188c8..85e81ce 100644
> --- a/include/xen/interface/vcpu.h
> +++ b/include/xen/interface/vcpu.h
> @@ -110,6 +110,8 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_runstate_info);
>   */
>  #define RUNSTATE_offline  3
>  
> +#define RUNSTATE_max      4

This file is part of Xen ABI. While this macro technically doesn't
change anything I'd rather have those updates first appear in Xen code.

Besides, this change leaves vcpu_runstate_info.time[4] definition
intact. I think all RUNSTATE_* macros would need to be moved above
vcpu_runstate_info definition.

TBH, for purposes of this patch I'd use 4.


-boris

  parent reply	other threads:[~2017-10-30 13:33 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-30  8:03 [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen Dongli Zhang
2017-10-30 13:34 ` Boris Ostrovsky
2017-10-30 13:34 ` Boris Ostrovsky [this message]
2017-10-31  0:14   ` Dongli Zhang
2017-10-31  0:14   ` [Xen-devel] " Dongli Zhang
2017-10-31  0:58     ` Boris Ostrovsky
2017-10-31  0:58       ` Boris Ostrovsky
2017-10-31  3:13       ` Dongli Zhang
2017-10-31  3:13       ` [Xen-devel] " Dongli Zhang
2017-10-31 11:05         ` Boris Ostrovsky
2017-10-31 11:05           ` Boris Ostrovsky
2017-11-01 20:05 ` kbuild test robot
2017-11-01 20:05 ` kbuild test robot
2017-10-30  8:03 Dongli Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ce923518-73c1-e4eb-e618-668c74fc7fc0@oracle.com \
    --to=boris.ostrovsky@oracle.com \
    --cc=dongli.zhang@oracle.com \
    --cc=jgross@suse.com \
    --cc=joao.m.martins@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.