xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [Xen-devel] [RFC 0/9] Changes to time accounting
@ 2019-09-11 10:32 Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel, Dario Faggioli
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Meng Xu, Jan Beulich,
	Volodymyr Babchuk

From: Andrii Anisov <andrii_anisov@epam.com>

That is the second attempt of the changes for the time accounting in XEN.
The initial topic is here [1].

In this series it is introduced a time accounting infrastructure separated
from runstates, and made an attempt to use new infra solely for taking 
scheduling decissions.
This series still employs ideas of the initial series about what hypervisor,
idle and guest time are.

Yet, this series lacks of connection between new time accounting infra and
runstate. This is TODO, currently.

The main goal of this work is to improve time accounting precission as well
as avoiding charging guests for the other entities (guests or hypervisor itself)
work.

Changes from:

Initial RFC series [1]:
    - Time accounting made independent from runstates infrastructure.
    - Dropped odd patches to ARM64 code.

[1] https://lists.xenproject.org/archives/html/xen-devel/2019-07/msg01839.html

Andrii Anisov (9):
  schedule: Introduce per-pcpu time accounting
  sysctl: extend XEN_SYSCTL_getcpuinfo interface
  xentop: show CPU load information
  arm64: utilize time accounting
  tacc: Introduce a lockless interface for guest time
  sched:rtds: get guest time from time accounting code
  tacc: Introduce a locked interface for guest time
  sched:credit: get guest time from time accounting code
  sched:credit2: get guest time from time accounting code

 tools/xenstat/libxenstat/src/xenstat.c      |  50 +++++++++
 tools/xenstat/libxenstat/src/xenstat.h      |  15 +++
 tools/xenstat/libxenstat/src/xenstat_priv.h |   5 +
 tools/xenstat/xentop/xentop.c               |  36 ++++++
 xen/arch/arm/arm64/entry.S                  |  39 ++++++-
 xen/arch/arm/domain.c                       |   2 +
 xen/common/Kconfig                          |   5 +
 xen/common/sched_credit.c                   |  12 +-
 xen/common/sched_credit2.c                  |  17 +--
 xen/common/sched_rt.c                       |   4 +-
 xen/common/schedule.c                       | 168 +++++++++++++++++++++++++++-
 xen/common/sysctl.c                         |   4 +
 xen/include/public/sysctl.h                 |   4 +
 xen/include/xen/sched.h                     |  50 +++++++++
 14 files changed, 386 insertions(+), 25 deletions(-)

-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 18:01   ` Volodymyr Babchuk
  2019-10-28 14:28   ` Julien Grall
  2019-09-11 10:32 ` [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface Andrii Anisov
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

From: Andrii Anisov <andrii_anisov@epam.com>

Introduce per-pcpu time accounting what includes the following states:

TACC_HYP - the pcpu executes hypervisor code like softirq processing
           (including scheduling), tasklets and context switches
TACC_GUEST - the pcpu executes guests code
TACC_IDLE - the low-power state of the pcpu
TACC_IRQ - the pcpu performs interrupts processing, without separation to
           guest or hypervisor interrupts
TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
             from the guest. E.g. hypercall processing or io emulation.

Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
to state other than TACC_IRQ could happen until we return from nested
interrupts. IRQ time is accounted in a distinct way comparing to other states.
It is acumulated between other states transition moments, and is substracted
from the old state on states transion calculation.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h | 27 +++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 7b71581..6dd6603 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1539,6 +1539,87 @@ static void schedule(void)
     context_switch(prev, next);
 }
 
+DEFINE_PER_CPU(struct tacc, tacc);
+
+static void tacc_state_change(enum TACC_STATES new_state)
+{
+    s_time_t now, delta;
+    struct tacc* tacc = &this_cpu(tacc);
+    unsigned long flags;
+
+    local_irq_save(flags);
+
+    now = NOW();
+    delta = now - tacc->state_entry_time;
+
+    /* We do not expect states reenterability (at least through this function)*/
+    ASSERT(new_state != tacc->state);
+
+    tacc->state_time[tacc->state] += delta - tacc->irq_time;
+    tacc->state_time[TACC_IRQ] += tacc->irq_time;
+    tacc->irq_time = 0;
+    tacc->state = new_state;
+    tacc->state_entry_time = now;
+
+    local_irq_restore(flags);
+}
+
+void tacc_hyp(int place)
+{
+//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
+    tacc_state_change(TACC_HYP);
+}
+
+void tacc_guest(int place)
+{
+//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
+    tacc_state_change(TACC_GUEST);
+}
+
+void tacc_idle(int place)
+{
+//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
+    tacc_state_change(TACC_IDLE);
+}
+
+void tacc_gsync(int place)
+{
+//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
+    tacc_state_change(TACC_GSYNC);
+}
+
+void tacc_irq_enter(int place)
+{
+    struct tacc* tacc = &this_cpu(tacc);
+
+//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
+    ASSERT(!local_irq_is_enabled());
+    ASSERT(tacc->irq_cnt >= 0);
+
+    if ( tacc->irq_cnt == 0 )
+    {
+        tacc->irq_enter_time = NOW();
+    }
+
+    tacc->irq_cnt++;
+}
+
+void tacc_irq_exit(int place)
+{
+    struct tacc* tacc = &this_cpu(tacc);
+
+//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
+    ASSERT(!local_irq_is_enabled());
+    ASSERT(tacc->irq_cnt > 0);
+    if ( tacc->irq_cnt == 1 )
+    {
+        tacc->irq_time = NOW() - tacc->irq_enter_time;
+        tacc->irq_enter_time = 0;
+    }
+
+    tacc->irq_cnt--;
+}
+
 void context_saved(struct vcpu *prev)
 {
     /* Clear running flag /after/ writing context to memory. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index e3601c1..04a8724 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
 
 void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
 
+enum TACC_STATES {
+    TACC_HYP = 0,
+    TACC_GUEST = 1,
+    TACC_IDLE = 2,
+    TACC_IRQ = 3,
+    TACC_GSYNC = 4,
+    TACC_STATES_MAX
+};
+
+struct tacc
+{
+    s_time_t state_time[TACC_STATES_MAX];
+    s_time_t state_entry_time;
+    int state;
+
+    s_time_t guest_time;
+
+    s_time_t irq_enter_time;
+    s_time_t irq_time;
+    int irq_cnt;
+};
+
+DECLARE_PER_CPU(struct tacc, tacc);
+
+void tacc_hyp(int place);
+void tacc_idle(int place);
+
 #endif /* __SCHED_H__ */
 
 /*
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-10-28 14:52   ` Julien Grall
  2019-09-11 10:32 ` [Xen-devel] [RFC 3/9] xentop: show CPU load information Andrii Anisov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

From: Andrii Anisov <andrii_anisov@epam.com>

Extend XEN_SYSCTL_getcpuinfo interface with timing information
provided by introduced time accounting infrastructure.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/schedule.c       | 33 ++++++++++++++++++++++++++++-----
 xen/common/sysctl.c         |  4 ++++
 xen/include/public/sysctl.h |  4 ++++
 xen/include/xen/sched.h     |  4 ++++
 4 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 6dd6603..2007034 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -208,13 +208,36 @@ void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 
 uint64_t get_cpu_idle_time(unsigned int cpu)
 {
-    struct vcpu_runstate_info state = { 0 };
-    struct vcpu *v = idle_vcpu[cpu];
+    struct tacc *tacc = &per_cpu(tacc, cpu);
 
-    if ( cpu_online(cpu) && v )
-        vcpu_runstate_get(v, &state);
+    return tacc->state_time[TACC_IDLE];
+}
+
+uint64_t get_cpu_guest_time(unsigned int cpu)
+{
+    struct tacc *tacc = &per_cpu(tacc, cpu);
+
+    return tacc->state_time[TACC_GUEST];
+}
+
+uint64_t get_cpu_hyp_time(unsigned int cpu)
+{
+    struct tacc *tacc = &per_cpu(tacc, cpu);
+
+    return tacc->state_time[TACC_HYP];
+}
+
+uint64_t get_cpu_irq_time(unsigned int cpu)
+{
+    struct tacc *tacc = &per_cpu(tacc, cpu);
+
+    return tacc->state_time[TACC_IRQ];
+}
+uint64_t get_cpu_gsync_time(unsigned int cpu)
+{
+    struct tacc *tacc = &per_cpu(tacc, cpu);
 
-    return state.time[RUNSTATE_running];
+    return tacc->state_time[TACC_GSYNC];
 }
 
 /*
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 92b4ea0..b63083c 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -152,6 +152,10 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
         for ( i = 0; i < nr_cpus; i++ )
         {
             cpuinfo.idletime = get_cpu_idle_time(i);
+            cpuinfo.guesttime = get_cpu_guest_time(i);
+            cpuinfo.hyptime = get_cpu_hyp_time(i);
+            cpuinfo.gsynctime = get_cpu_gsync_time(i);
+            cpuinfo.irqtime = get_cpu_irq_time(i);
 
             if ( copy_to_guest_offset(op->u.getcpuinfo.info, i, &cpuinfo, 1) )
                 goto out;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index 5401f9c..cdada1f 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -168,6 +168,10 @@ struct xen_sysctl_debug_keys {
 /* XEN_SYSCTL_getcpuinfo */
 struct xen_sysctl_cpuinfo {
     uint64_aligned_t idletime;
+    uint64_aligned_t hyptime;
+    uint64_aligned_t guesttime;
+    uint64_aligned_t irqtime;
+    uint64_aligned_t gsynctime;
 };
 typedef struct xen_sysctl_cpuinfo xen_sysctl_cpuinfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpuinfo_t);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 04a8724..8167608 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -876,6 +876,10 @@ void restore_vcpu_affinity(struct domain *d);
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
 uint64_t get_cpu_idle_time(unsigned int cpu);
+uint64_t get_cpu_hyp_time(unsigned int cpu);
+uint64_t get_cpu_guest_time(unsigned int cpu);
+uint64_t get_cpu_gsync_time(unsigned int cpu);
+uint64_t get_cpu_irq_time(unsigned int cpu);
 
 /*
  * Used by idle loop to decide whether there is work to do:
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 3/9] xentop: show CPU load information
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 4/9] arm64: utilize time accounting Andrii Anisov
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Jackson, Andrii Anisov, Wei Liu

From: Andrii Anisov <andrii_anisov@epam.com>

Let xentop request and show information about CPU load provided
by new time accounting infrastructure.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 tools/xenstat/libxenstat/src/xenstat.c      | 50 +++++++++++++++++++++++++++++
 tools/xenstat/libxenstat/src/xenstat.h      | 15 +++++++++
 tools/xenstat/libxenstat/src/xenstat_priv.h |  5 +++
 tools/xenstat/xentop/xentop.c               | 36 +++++++++++++++++++++
 4 files changed, 106 insertions(+)

diff --git a/tools/xenstat/libxenstat/src/xenstat.c b/tools/xenstat/libxenstat/src/xenstat.c
index 6f93d4e..cfb6504 100644
--- a/tools/xenstat/libxenstat/src/xenstat.c
+++ b/tools/xenstat/libxenstat/src/xenstat.c
@@ -134,6 +134,9 @@ void xenstat_uninit(xenstat_handle * handle)
 xenstat_node *xenstat_get_node(xenstat_handle * handle, unsigned int flags)
 {
 #define DOMAIN_CHUNK_SIZE 256
+	xc_cpuinfo_t *cpuinfo;
+	int max_cpus, nr_cpus;
+
 	xenstat_node *node;
 	xc_physinfo_t physinfo = { 0 };
 	xc_domaininfo_t domaininfo[DOMAIN_CHUNK_SIZE];
@@ -163,6 +166,28 @@ xenstat_node *xenstat_get_node(xenstat_handle * handle, unsigned int flags)
 	    * handle->page_size;
 
 	node->freeable_mb = 0;
+
+	max_cpus = node->num_cpus;
+
+	cpuinfo = calloc(max_cpus, sizeof(xc_cpuinfo_t));
+	if (!cpuinfo)
+		return NULL;
+
+	if (xc_getcpuinfo(handle->xc_handle, max_cpus, cpuinfo, &nr_cpus) < 0) {
+		free(cpuinfo);
+		return NULL;
+	}
+
+	for ( i = 0; i < nr_cpus; i++) {
+		node->idle_time += cpuinfo[i].idletime;
+		node->hyp_time += cpuinfo[i].hyptime;
+		node->guest_time += cpuinfo[i].guesttime;
+		node->gsync_time += cpuinfo[i].gsynctime;
+		node->irq_time += cpuinfo[i].irqtime;
+	}
+
+	free(cpuinfo);
+
 	/* malloc(0) is not portable, so allocate a single domain.  This will
 	 * be resized below. */
 	node->domains = malloc(sizeof(xenstat_domain));
@@ -332,6 +357,31 @@ unsigned long long xenstat_node_cpu_hz(xenstat_node * node)
 	return node->cpu_hz;
 }
 
+unsigned long long xenstat_node_idle_time(xenstat_node * node)
+{
+	return node->idle_time;
+}
+
+unsigned long long xenstat_node_guest_time(xenstat_node * node)
+{
+	return node->guest_time;
+}
+
+unsigned long long xenstat_node_hyp_time(xenstat_node * node)
+{
+	return node->hyp_time;
+}
+
+unsigned long long xenstat_node_gsync_time(xenstat_node * node)
+{
+	return node->gsync_time;
+}
+
+unsigned long long xenstat_node_irq_time(xenstat_node * node)
+{
+	return node->irq_time;
+}
+
 /* Get the domain ID for this domain */
 unsigned xenstat_domain_id(xenstat_domain * domain)
 {
diff --git a/tools/xenstat/libxenstat/src/xenstat.h b/tools/xenstat/libxenstat/src/xenstat.h
index 76a660f..9881ce5 100644
--- a/tools/xenstat/libxenstat/src/xenstat.h
+++ b/tools/xenstat/libxenstat/src/xenstat.h
@@ -80,6 +80,21 @@ unsigned int xenstat_node_num_cpus(xenstat_node * node);
 /* Get information about the CPU speed */
 unsigned long long xenstat_node_cpu_hz(xenstat_node * node);
 
+/* Get information about the CPU idle time */
+unsigned long long xenstat_node_idle_time(xenstat_node * node);
+
+/* Get information about the CPU guest execution time */
+unsigned long long xenstat_node_guest_time(xenstat_node * node);
+
+/* Get information about the CPU hypervisor execution time */
+unsigned long long xenstat_node_hyp_time(xenstat_node * node);
+
+/* Get information about the CPU guest syncronous trap execution time */
+unsigned long long xenstat_node_gsync_time(xenstat_node * node);
+
+/* Get information about the CPU IRQ serving time */
+unsigned long long xenstat_node_irq_time(xenstat_node * node);
+
 /*
  * Domain functions - extract information from a xenstat_domain
  */
diff --git a/tools/xenstat/libxenstat/src/xenstat_priv.h b/tools/xenstat/libxenstat/src/xenstat_priv.h
index 4eb44a8..64387cd 100644
--- a/tools/xenstat/libxenstat/src/xenstat_priv.h
+++ b/tools/xenstat/libxenstat/src/xenstat_priv.h
@@ -45,6 +45,11 @@ struct xenstat_node {
 	unsigned int flags;
 	unsigned long long cpu_hz;
 	unsigned int num_cpus;
+	unsigned long long hyp_time;
+	unsigned long long guest_time;
+	unsigned long long idle_time;
+	unsigned long long gsync_time;
+	unsigned long long irq_time;
 	unsigned long long tot_mem;
 	unsigned long long free_mem;
 	unsigned int num_domains;
diff --git a/tools/xenstat/xentop/xentop.c b/tools/xenstat/xentop/xentop.c
index af11ebf..7514559 100644
--- a/tools/xenstat/xentop/xentop.c
+++ b/tools/xenstat/xentop/xentop.c
@@ -930,6 +930,40 @@ void adjust_field_widths(xenstat_domain *domain)
 		fields[FIELD_VBD_WSECT-1].default_width = length;
 }
 
+void do_utilization(void)
+{
+	double us_elapsed = 0.0,
+		   us_hyp = 0.0,
+		   us_guest = 0.0,
+		   us_idle = 0.0,
+		   us_gsync = 0.0,
+		   us_irq = 0.0;
+
+	/* Can't calculate CPU percentage without a current and a previous sample.*/
+	if(prev_node != NULL && cur_node != NULL) {
+
+		/* Calculate the time elapsed in microseconds */
+		us_elapsed = ((curtime.tv_sec-oldtime.tv_sec)*1000000.0
+				  +(curtime.tv_usec - oldtime.tv_usec));
+
+		/* In the following, nanoseconds must be multiplied by 1000.0 to
+		 * convert to microseconds, then divided by 100.0 to get a percentage,
+		 * resulting in a multiplication by 10.0 */
+		us_idle = ((xenstat_node_idle_time(cur_node) -
+				   xenstat_node_idle_time(prev_node))/10.0)/us_elapsed;
+		us_guest = ((xenstat_node_guest_time(cur_node) -
+				   xenstat_node_guest_time(prev_node))/10.0)/us_elapsed;
+		us_hyp = ((xenstat_node_hyp_time(cur_node) -
+				   xenstat_node_hyp_time(prev_node))/10.0)/us_elapsed;
+		us_gsync = ((xenstat_node_gsync_time(cur_node) -
+				   xenstat_node_gsync_time(prev_node))/10.0)/us_elapsed;
+		us_irq = ((xenstat_node_irq_time(cur_node) -
+				   xenstat_node_irq_time(prev_node))/10.0)/us_elapsed;
+	}
+
+	print("%%CPU(s): %6.1f gu, %6.1f gs, %6.1f ir, %6.1f hy, %6.1f id \n",
+		  us_guest, us_gsync, us_irq, us_hyp, us_idle);
+}
 
 /* Section printing functions */
 /* Prints the top summary, above the domain table */
@@ -972,6 +1006,8 @@ void do_summary(void)
 	used = xenstat_node_tot_mem(cur_node);
 	freeable_mb = 0;
 
+	do_utilization();
+
 	/* Dump node memory and cpu information */
 	if ( freeable_mb <= 0 )
 	     print("Mem: %lluk total, %lluk used, %lluk free    ",
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (2 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 3/9] xentop: show CPU load information Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 17:48   ` Volodymyr Babchuk
  2019-10-28 14:47   ` Julien Grall
  2019-09-11 10:32 ` [Xen-devel] [RFC 5/9] tacc: Introduce a lockless interface for guest time Andrii Anisov
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Volodymyr Babchuk, Julien Grall, Stefano Stabellini, Andrii Anisov

From: Andrii Anisov <andrii_anisov@epam.com>

Call time accounting hooks from appropriate transition points
of the ARM64 code.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/arch/arm/arm64/entry.S | 39 ++++++++++++++++++++++++++++++++++++---
 xen/arch/arm/domain.c      |  2 ++
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
index 2d9a271..6fb2fa9 100644
--- a/xen/arch/arm/arm64/entry.S
+++ b/xen/arch/arm/arm64/entry.S
@@ -143,12 +143,21 @@
 
         .endm
 
-        .macro  exit, hyp, compat
+        .macro  exit, hyp, compat, tacc=1
 
         .if \hyp == 0         /* Guest mode */
 
+	.if \tacc == 1
+
+        mov     x0, #1
+        bl      tacc_hyp
+
+	.endif
+
         bl      leave_hypervisor_tail /* Disables interrupts on return */
 
+	mov     x0, #1
+	bl      tacc_guest
         exit_guest \compat
 
         .endif
@@ -205,9 +214,15 @@ hyp_sync:
 
 hyp_irq:
         entry   hyp=1
+        mov     x0,#5
+        bl      tacc_irq_enter
         msr     daifclr, #4
         mov     x0, sp
         bl      do_trap_irq
+
+        mov     x0,#5
+        bl      tacc_irq_exit
+
         exit    hyp=1
 
 guest_sync:
@@ -291,6 +306,9 @@ guest_sync_slowpath:
          * to save them.
          */
         entry   hyp=0, compat=0, save_x0_x1=0
+        
+        mov     x0,#1
+        bl      tacc_gsync
         /*
          * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
          * is not set. If a vSError took place, the initial exception will be
@@ -307,6 +325,10 @@ guest_sync_slowpath:
 
 guest_irq:
         entry   hyp=0, compat=0
+
+        mov     x0,#6
+        bl      tacc_irq_enter
+
         /*
          * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
          * is not set. If a vSError took place, the initial exception will be
@@ -319,6 +341,8 @@ guest_irq:
         mov     x0, sp
         bl      do_trap_irq
 1:
+	mov	x0,#6
+        bl      tacc_irq_exit
         exit    hyp=0, compat=0
 
 guest_fiq_invalid:
@@ -334,6 +358,9 @@ guest_error:
 
 guest_sync_compat:
         entry   hyp=0, compat=1
+
+        mov     x0,#2
+        bl      tacc_gsync
         /*
          * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
          * is not set. If a vSError took place, the initial exception will be
@@ -350,6 +377,10 @@ guest_sync_compat:
 
 guest_irq_compat:
         entry   hyp=0, compat=1
+
+        mov     x0,#7
+        bl      tacc_irq_enter
+
         /*
          * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
          * is not set. If a vSError took place, the initial exception will be
@@ -362,6 +393,8 @@ guest_irq_compat:
         mov     x0, sp
         bl      do_trap_irq
 1:
+        mov     x0,#7
+        bl      tacc_irq_exit
         exit    hyp=0, compat=1
 
 guest_fiq_invalid_compat:
@@ -376,9 +409,9 @@ guest_error_compat:
         exit    hyp=0, compat=1
 
 ENTRY(return_to_new_vcpu32)
-        exit    hyp=0, compat=1
+        exit    hyp=0, compat=1, tacc=0
 ENTRY(return_to_new_vcpu64)
-        exit    hyp=0, compat=0
+        exit    hyp=0, compat=0, tacc=0
 
 return_from_trap:
         msr     daifset, #2 /* Mask interrupts */
diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index a9c4113..53ef630 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -51,11 +51,13 @@ static void do_idle(void)
     process_pending_softirqs();
 
     local_irq_disable();
+    tacc_idle(1);
     if ( cpu_is_haltable(cpu) )
     {
         dsb(sy);
         wfi();
     }
+    tacc_hyp(2);
     local_irq_enable();
 
     sched_tick_resume();
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 5/9] tacc: Introduce a lockless interface for guest time
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (3 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 4/9] arm64: utilize time accounting Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 6/9] sched:rtds: get guest time from time accounting code Andrii Anisov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Julien Grall, Jan Beulich

From: Andrii Anisov <andrii_anisov@epam.com>

The lockless interface to acquire guest time by scheduling code
is introduced. It can be used by schedulers what do not require
guest time from a different pcpu to take scheduling decission.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/schedule.c   | 10 ++++++++++
 xen/include/xen/sched.h |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 2007034..62df77e 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1643,6 +1643,16 @@ void tacc_irq_exit(int place)
     tacc->irq_cnt--;
 }
 
+s_time_t tacc_get_guest_time(struct tacc *tacc)
+{
+    s_time_t guest_time;
+
+    guest_time = tacc->state_time[TACC_GUEST];
+    guest_time += tacc->state_time[TACC_GSYNC];
+
+    return guest_time;
+}
+
 void context_saved(struct vcpu *prev)
 {
     /* Clear running flag /after/ writing context to memory. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 8167608..5b41805 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -266,6 +266,8 @@ struct vcpu
 
     struct evtchn_fifo_vcpu *evtchn_fifo;
 
+    s_time_t    pcpu_guest_time;
+
     /* vPCI per-vCPU area, used to store data for long running operations. */
     struct vpci_vcpu vpci;
 
@@ -1033,6 +1035,12 @@ DECLARE_PER_CPU(struct tacc, tacc);
 void tacc_hyp(int place);
 void tacc_idle(int place);
 
+s_time_t tacc_get_guest_time(struct tacc *tacc);
+inline s_time_t tacc_get_guest_time_delta(void)
+{
+    return tacc_get_guest_time(&this_cpu(tacc)) - current->pcpu_guest_time;
+}
+
 #endif /* __SCHED_H__ */
 
 /*
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 6/9] sched:rtds: get guest time from time accounting code
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (4 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 5/9] tacc: Introduce a lockless interface for guest time Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 7/9] tacc: Introduce a locked interface for guest time Andrii Anisov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel; +Cc: George Dunlap, Andrii Anisov, Meng Xu, Dario Faggioli

From: Andrii Anisov <andrii_anisov@epam.com>

While the RTDS scheduler code does not use guest time from the
other pcpu, we are free to go with lockless time accounting.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/sched_rt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
index e0e350b..2ce200b 100644
--- a/xen/common/sched_rt.c
+++ b/xen/common/sched_rt.c
@@ -945,8 +945,9 @@ burn_budget(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
     if ( is_idle_vcpu(svc->vcpu) )
         return;
 
+    ASSERT(svc->vcpu == current);
     /* burn at nanoseconds level */
-    delta = now - svc->last_start;
+    delta = tacc_get_guest_time_delta();
     /*
      * delta < 0 only happens in nested virtualization;
      * TODO: how should we handle delta < 0 in a better way?
@@ -960,7 +961,6 @@ burn_budget(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
     }
 
     svc->cur_budget -= delta;
-    svc->last_start = now;
 
     if ( svc->cur_budget <= 0 )
     {
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 7/9] tacc: Introduce a locked interface for guest time
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (5 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 6/9] sched:rtds: get guest time from time accounting code Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 8/9] sched:credit: get guest time from time accounting code Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 9/9] sched:credit2: " Andrii Anisov
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli

From: Andrii Anisov <andrii_anisov@epam.com>

The locked interface to acquire guest time by scheduling code
is introduced. It can be used by schedulers what do require
guest time from a different pcpu to take scheduling decission.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/Kconfig      |  3 +++
 xen/common/schedule.c   | 44 ++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/sched.h | 11 +++++++++++
 3 files changed, 58 insertions(+)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index 16829f6..c1748dd 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -221,6 +221,9 @@ config ARGO
 menu "Schedulers"
 	visible if EXPERT = "y"
 
+config TACC_NEEDS_LOCK
+	bool
+
 config SCHED_CREDIT
 	bool "Credit scheduler support"
 	default y
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 62df77e..98b739f 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1562,6 +1562,14 @@ static void schedule(void)
     context_switch(prev, next);
 }
 
+#ifdef CONFIG_TACC_NEEDS_LOCK
+#define     tacc_lock(tacc) spin_lock(&tacc->tacc_lock)
+#define     tacc_unlock(tacc) spin_unlock(&tacc->tacc_lock)
+#else
+#define     tacc_lock(tacc)
+#define     tacc_unlock(tacc)
+#endif
+
 DEFINE_PER_CPU(struct tacc, tacc);
 
 static void tacc_state_change(enum TACC_STATES new_state)
@@ -1571,6 +1579,7 @@ static void tacc_state_change(enum TACC_STATES new_state)
     unsigned long flags;
 
     local_irq_save(flags);
+    tacc_lock(tacc);
 
     now = NOW();
     delta = now - tacc->state_entry_time;
@@ -1584,6 +1593,7 @@ static void tacc_state_change(enum TACC_STATES new_state)
     tacc->state = new_state;
     tacc->state_entry_time = now;
 
+    tacc_unlock(tacc);
     local_irq_restore(flags);
 }
 
@@ -1621,7 +1631,9 @@ void tacc_irq_enter(int place)
 
     if ( tacc->irq_cnt == 0 )
     {
+        tacc_lock(tacc);
         tacc->irq_enter_time = NOW();
+        tacc_unlock(tacc);
     }
 
     tacc->irq_cnt++;
@@ -1636,8 +1648,10 @@ void tacc_irq_exit(int place)
     ASSERT(tacc->irq_cnt > 0);
     if ( tacc->irq_cnt == 1 )
     {
+        tacc_lock(tacc);
         tacc->irq_time = NOW() - tacc->irq_enter_time;
         tacc->irq_enter_time = 0;
+        tacc_unlock(tacc);
     }
 
     tacc->irq_cnt--;
@@ -1653,6 +1667,36 @@ s_time_t tacc_get_guest_time(struct tacc *tacc)
     return guest_time;
 }
 
+#ifdef CONFIG_TACC_NEEDS_LOCK
+s_time_t tacc_get_guest_time_cpu(int cpu)
+{
+    struct tacc* tacc = &per_cpu(tacc, cpu);
+    s_time_t guest_time;
+    s_time_t now;
+
+    tacc_lock(tacc);
+
+    now = NOW();
+    guest_time = tacc_get_guest_time(tacc);
+    if (tacc->state == TACC_GUEST || tacc->state == TACC_GSYNC)
+    {
+        guest_time += NOW() - tacc->state_entry_time;
+    }
+
+    if (tacc->irq_enter_time)
+    {
+        guest_time -= NOW() - tacc->irq_enter_time;
+    }
+
+    guest_time -= tacc->irq_time;
+
+    tacc_unlock(tacc);
+
+    return guest_time;
+}
+#endif
+
+
 void context_saved(struct vcpu *prev)
 {
     /* Clear running flag /after/ writing context to memory. */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 5b41805..a649d1f 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -1028,6 +1028,9 @@ struct tacc
     s_time_t irq_enter_time;
     s_time_t irq_time;
     int irq_cnt;
+#ifdef CONFIG_TACC_NEEDS_LOCK
+    spinlock_t tacc_lock;
+#endif
 };
 
 DECLARE_PER_CPU(struct tacc, tacc);
@@ -1041,6 +1044,14 @@ inline s_time_t tacc_get_guest_time_delta(void)
     return tacc_get_guest_time(&this_cpu(tacc)) - current->pcpu_guest_time;
 }
 
+#ifdef CONFIG_TACC_NEEDS_LOCK
+s_time_t tacc_get_guest_time_cpu(int cpu);
+inline s_time_t tacc_get_guest_time_delta_vcpu(struct vcpu* vcpu)
+{
+    return tacc_get_guest_time_cpu(vcpu->processor) - vcpu->pcpu_guest_time;
+}
+#endif
+
 #endif /* __SCHED_H__ */
 
 /*
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 8/9] sched:credit: get guest time from time accounting code
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (6 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 7/9] tacc: Introduce a locked interface for guest time Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  2019-09-11 10:32 ` [Xen-devel] [RFC 9/9] sched:credit2: " Andrii Anisov
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli

From: Andrii Anisov <andrii_anisov@epam.com>

While the Credit scheduler code uses guest time from the
other pcpu, we have to use locked time accounting.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/Kconfig        |  1 +
 xen/common/sched_credit.c | 12 +++++-------
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index c1748dd..d17a8b4 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -227,6 +227,7 @@ config TACC_NEEDS_LOCK
 config SCHED_CREDIT
 	bool "Credit scheduler support"
 	default y
+	select TACC_NEEDS_LOCK
 	---help---
 	  The traditional credit scheduler is a general purpose scheduler.
 
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index 81dee5e..ac6b9e6 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -324,16 +324,15 @@ runq_remove(struct csched_vcpu *svc)
     __runq_remove(svc);
 }
 
-static void burn_credits(struct csched_vcpu *svc, s_time_t now)
+static void burn_credits(struct csched_vcpu *svc, s_time_t delta)
 {
-    s_time_t delta;
     uint64_t val;
     unsigned int credits;
 
     /* Assert svc is current */
     ASSERT( svc == CSCHED_VCPU(curr_on_cpu(svc->vcpu->processor)) );
 
-    if ( (delta = now - svc->start_time) <= 0 )
+    if ( delta <= 0 )
         return;
 
     val = delta * CSCHED_CREDITS_PER_MSEC + svc->residual;
@@ -341,7 +340,6 @@ static void burn_credits(struct csched_vcpu *svc, s_time_t now)
     credits = val;
     ASSERT(credits == val); /* make sure we haven't truncated val */
     atomic_sub(credits, &svc->credit);
-    svc->start_time += (credits * MILLISECS(1)) / CSCHED_CREDITS_PER_MSEC;
 }
 
 static bool_t __read_mostly opt_tickle_one_idle = 1;
@@ -956,7 +954,7 @@ csched_vcpu_acct(struct csched_private *prv, unsigned int cpu)
     /*
      * Update credits
      */
-    burn_credits(svc, NOW());
+    burn_credits(svc, tacc_get_guest_time_delta_vcpu(svc->vcpu));
 
     /*
      * Put this VCPU and domain back on the active list if it was
@@ -1856,14 +1854,14 @@ csched_schedule(
                     (unsigned char *)&d);
     }
 
-    runtime = now - current->runstate.state_entry_time;
+    runtime = tacc_get_guest_time_delta();
     if ( runtime < 0 ) /* Does this ever happen? */
         runtime = 0;
 
     if ( !is_idle_vcpu(scurr->vcpu) )
     {
         /* Update credits of a non-idle VCPU. */
-        burn_credits(scurr, now);
+        burn_credits(scurr, runtime);
         scurr->start_time -= now;
         scurr->last_sched_time = now;
     }
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Xen-devel] [RFC 9/9] sched:credit2: get guest time from time accounting code
  2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
                   ` (7 preceding siblings ...)
  2019-09-11 10:32 ` [Xen-devel] [RFC 8/9] sched:credit: get guest time from time accounting code Andrii Anisov
@ 2019-09-11 10:32 ` Andrii Anisov
  8 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-11 10:32 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Tim Deegan, Julien Grall, Jan Beulich, Dario Faggioli

From: Andrii Anisov <andrii_anisov@epam.com>

While the Credit2 scheduler code uses guest time from the
other pcpu, we have to use locked time accounting.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/common/Kconfig         |  1 +
 xen/common/sched_credit2.c | 17 +++++++++--------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/xen/common/Kconfig b/xen/common/Kconfig
index d17a8b4..6408c18 100644
--- a/xen/common/Kconfig
+++ b/xen/common/Kconfig
@@ -234,6 +234,7 @@ config SCHED_CREDIT
 config SCHED_CREDIT2
 	bool "Credit2 scheduler support"
 	default y
+	select TACC_NEEDS_LOCK
 	---help---
 	  The credit2 scheduler is a general purpose scheduler that is
 	  optimized for lower latency and higher VM density.
diff --git a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c
index 6b77da7..3b3888b 100644
--- a/xen/common/sched_credit2.c
+++ b/xen/common/sched_credit2.c
@@ -1342,7 +1342,7 @@ static inline bool is_preemptable(const struct csched2_vcpu *svc,
         return true;
 
     ASSERT(svc->vcpu->is_running);
-    return now - svc->vcpu->runstate.state_entry_time >
+    return tacc_get_guest_time_delta_vcpu(svc->vcpu)>
            ratelimit - CSCHED2_RATELIMIT_TICKLE_TOLERANCE;
 }
 
@@ -1722,7 +1722,7 @@ void burn_credits(struct csched2_runqueue_data *rqd,
         return;
     }
 
-    delta = now - svc->start_time;
+    delta = tacc_get_guest_time_delta_vcpu(svc->vcpu);
 
     if ( unlikely(delta <= 0) )
     {
@@ -1739,7 +1739,7 @@ void burn_credits(struct csched2_runqueue_data *rqd,
     if ( has_cap(svc) )
         svc->budget -= delta;
 
-    svc->start_time = now;
+    svc->vcpu->pcpu_guest_time += delta;
 
  out:
     if ( unlikely(tb_init_done) )
@@ -3189,8 +3189,8 @@ csched2_runtime(const struct scheduler *ops, int cpu,
     {
         s_time_t ratelimit_min = MICROSECS(prv->ratelimit_us);
         if ( snext->vcpu->is_running )
-            ratelimit_min = snext->vcpu->runstate.state_entry_time +
-                            MICROSECS(prv->ratelimit_us) - now;
+            ratelimit_min = tacc_get_guest_time_delta_vcpu(snext->vcpu) +
+                            MICROSECS(prv->ratelimit_us);
         if ( ratelimit_min > min_time )
             min_time = ratelimit_min;
     }
@@ -3265,6 +3265,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
     struct csched2_vcpu *snext = NULL;
     struct csched2_private *prv = csched2_priv(per_cpu(scheduler, cpu));
     bool yield = false, soft_aff_preempt = false;
+    s_time_t guest_time;
 
     *skipped = 0;
 
@@ -3286,8 +3287,8 @@ runq_candidate(struct csched2_runqueue_data *rqd,
      * no point forcing it to do so until rate limiting expires.
      */
     if ( !yield && prv->ratelimit_us && vcpu_runnable(scurr->vcpu) &&
-         (now - scurr->vcpu->runstate.state_entry_time) <
-          MICROSECS(prv->ratelimit_us) )
+         ((guest_time = tacc_get_guest_time_delta_vcpu(scurr->vcpu)) <
+          MICROSECS(prv->ratelimit_us)))
     {
         if ( unlikely(tb_init_done) )
         {
@@ -3297,7 +3298,7 @@ runq_candidate(struct csched2_runqueue_data *rqd,
             } d;
             d.dom = scurr->vcpu->domain->domain_id;
             d.vcpu = scurr->vcpu->vcpu_id;
-            d.runtime = now - scurr->vcpu->runstate.state_entry_time;
+            d.runtime = guest_time;
             __trace_var(TRC_CSCHED2_RATELIMIT, 1,
                         sizeof(d),
                         (unsigned char *)&d);
-- 
2.7.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-11 10:32 ` [Xen-devel] [RFC 4/9] arm64: utilize time accounting Andrii Anisov
@ 2019-09-11 17:48   ` Volodymyr Babchuk
  2019-09-12 12:09     ` Andrii Anisov
  2019-10-28 14:47   ` Julien Grall
  1 sibling, 1 reply; 26+ messages in thread
From: Volodymyr Babchuk @ 2019-09-11 17:48 UTC (permalink / raw)
  To: Andrii Anisov
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Andrii Anisov,
	Volodymyr Babchuk


Hi Andrii,

Andrii Anisov writes:

> From: Andrii Anisov <andrii_anisov@epam.com>
>
> Call time accounting hooks from appropriate transition points
> of the ARM64 code.
I'd prefer more elaborate commit message. For example - what are
appropriate transition points? I mean - how you chose ones?

> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> ---
>  xen/arch/arm/arm64/entry.S | 39 ++++++++++++++++++++++++++++++++++++---
>  xen/arch/arm/domain.c      |  2 ++
>  2 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
> index 2d9a271..6fb2fa9 100644
> --- a/xen/arch/arm/arm64/entry.S
> +++ b/xen/arch/arm/arm64/entry.S
> @@ -143,12 +143,21 @@
>
>          .endm
>
> -        .macro  exit, hyp, compat
> +        .macro  exit, hyp, compat, tacc=1
>
>          .if \hyp == 0         /* Guest mode */
>
> +	.if \tacc == 1
There is a hard tab, instead of 8 spaces.

Also, while it is easy to guess what 'hyp' and 'compat' mean, it is hard
to tell what 'tacc' stands for. I think, you need either better
name for this or a comment, which explains all parameters.

> +
> +        mov     x0, #1
> +        bl      tacc_hyp
> +
> +	.endif
The same about hard tabs. Probably, there are more of them in this patch.

> +
>          bl      leave_hypervisor_tail /* Disables interrupts on return */
>
> +	mov     x0, #1
> +	bl      tacc_guest
>          exit_guest \compat
>
>          .endif
> @@ -205,9 +214,15 @@ hyp_sync:
>
>  hyp_irq:
>          entry   hyp=1
> +        mov     x0,#5
Space is missing before #5

> +        bl      tacc_irq_enter
>          msr     daifclr, #4
>          mov     x0, sp
>          bl      do_trap_irq
> +
> +        mov     x0,#5
Space is missing before #5

> +        bl      tacc_irq_exit
> +
>          exit    hyp=1
>
>  guest_sync:
> @@ -291,6 +306,9 @@ guest_sync_slowpath:
>           * to save them.
>           */
>          entry   hyp=0, compat=0, save_x0_x1=0
> +
There are trailing whitespaces. I sure that 'git diff' highlights
such mistakes...

> +        mov     x0,#1
Space is missing before #1

> +        bl      tacc_gsync
>          /*
>           * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>           * is not set. If a vSError took place, the initial exception will be
> @@ -307,6 +325,10 @@ guest_sync_slowpath:
>
>  guest_irq:
>          entry   hyp=0, compat=0
> +
> +        mov     x0,#6
Space is missing before #6

> +        bl      tacc_irq_enter
> +
>          /*
>           * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>           * is not set. If a vSError took place, the initial exception will be
> @@ -319,6 +341,8 @@ guest_irq:
>          mov     x0, sp
>          bl      do_trap_irq
>  1:
> +	mov	x0,#6
Space is missing before #6, also looks like there is a hard tab character.

> +        bl      tacc_irq_exit
>          exit    hyp=0, compat=0
>
>  guest_fiq_invalid:
> @@ -334,6 +358,9 @@ guest_error:
>
>  guest_sync_compat:
>          entry   hyp=0, compat=1
> +
> +        mov     x0,#2
Space is missing before #2.

> +        bl      tacc_gsync
>          /*
>           * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>           * is not set. If a vSError took place, the initial exception will be
> @@ -350,6 +377,10 @@ guest_sync_compat:
>
>  guest_irq_compat:
>          entry   hyp=0, compat=1
> +
> +        mov     x0,#7
Space is missing before #7.

> +        bl      tacc_irq_enter
> +
>          /*
>           * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>           * is not set. If a vSError took place, the initial exception will be
> @@ -362,6 +393,8 @@ guest_irq_compat:
>          mov     x0, sp
>          bl      do_trap_irq
>  1:
> +        mov     x0,#7
Space is missing before #7...

> +        bl      tacc_irq_exit
>          exit    hyp=0, compat=1
>
>  guest_fiq_invalid_compat:
> @@ -376,9 +409,9 @@ guest_error_compat:
>          exit    hyp=0, compat=1
>
>  ENTRY(return_to_new_vcpu32)
> -        exit    hyp=0, compat=1
> +        exit    hyp=0, compat=1, tacc=0
>  ENTRY(return_to_new_vcpu64)
> -        exit    hyp=0, compat=0
> +        exit    hyp=0, compat=0, tacc=0
>
>  return_from_trap:
>          msr     daifset, #2 /* Mask interrupts */


> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index a9c4113..53ef630 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -51,11 +51,13 @@ static void do_idle(void)
>      process_pending_softirqs();
>
>      local_irq_disable();
> +    tacc_idle(1);
1 and 2 (below) look like some magical values to me.
I believe, you need to define some enumeration.

>      if ( cpu_is_haltable(cpu) )
>      {
>          dsb(sy);
>          wfi();
>      }
> +    tacc_hyp(2);
>      local_irq_enable();
>
>      sched_tick_resume();


--
Volodymyr Babchuk at EPAM
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
@ 2019-09-11 18:01   ` Volodymyr Babchuk
  2019-09-12 10:26     ` Andrii Anisov
  2019-10-28 14:28   ` Julien Grall
  1 sibling, 1 reply; 26+ messages in thread
From: Volodymyr Babchuk @ 2019-09-11 18:01 UTC (permalink / raw)
  To: Andrii Anisov
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Dario Faggioli, Julien Grall, Jan Beulich, xen-devel,
	Ian Jackson


Andrii,

Andrii Anisov writes:

> From: Andrii Anisov <andrii_anisov@epam.com>
>
> Introduce per-pcpu time accounting what includes the following states:
>
> TACC_HYP - the pcpu executes hypervisor code like softirq processing
>            (including scheduling), tasklets and context switches
> TACC_GUEST - the pcpu executes guests code
> TACC_IDLE - the low-power state of the pcpu
Is it really low-power?

> TACC_IRQ - the pcpu performs interrupts processing, without separation to
>            guest or hypervisor interrupts
I think, word "distinguishing" would be better than "separation"

> TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
>              from the guest. E.g. hypercall processing or io emulation.
>
> Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
> to state other than TACC_IRQ could happen until we return from nested
> interrupts. IRQ time is accounted in a distinct way comparing to other states.
> It is acumulated between other states transition moments, and is substracted
> from the old state on states transion calculation.
>
> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> ---
>  xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
>  xen/include/xen/sched.h | 27 +++++++++++++++++
>  2 files changed, 108 insertions(+)
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 7b71581..6dd6603 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1539,6 +1539,87 @@ static void schedule(void)
>      context_switch(prev, next);
>  }
>  
> +DEFINE_PER_CPU(struct tacc, tacc);
> +
> +static void tacc_state_change(enum TACC_STATES new_state)
> +{
> +    s_time_t now, delta;
> +    struct tacc* tacc = &this_cpu(tacc);
> +    unsigned long flags;
> +
> +    local_irq_save(flags);
> +
> +    now = NOW();
> +    delta = now - tacc->state_entry_time;
> +
> +    /* We do not expect states reenterability (at least through this function)*/
> +    ASSERT(new_state != tacc->state);
> +
> +    tacc->state_time[tacc->state] += delta - tacc->irq_time;
> +    tacc->state_time[TACC_IRQ] += tacc->irq_time;
> +    tacc->irq_time = 0;
> +    tacc->state = new_state;
> +    tacc->state_entry_time = now;
> +
> +    local_irq_restore(flags);
> +}
> +
> +void tacc_hyp(int place)
I believe, you want some enum for this "place" parameter type
> +{
> +//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
Please, don't push commented-out code. BTW, I think, it is possible to
add some TACC_DEBUG facilities to enable/disable this traces during
compile-time.

Also, looks like you don't use "place" parameter at all.

Lastly, I believe that this function (and other similar functions below)
can be defined as "static inline" in a header file.

> +    tacc_state_change(TACC_HYP);
> +}
> +
> +void tacc_guest(int place)
> +{
> +//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_GUEST);
> +}
> +
> +void tacc_idle(int place)
> +{
> +//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_IDLE);
> +}
> +
> +void tacc_gsync(int place)
> +{
> +//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_GSYNC);
> +}
> +
> +void tacc_irq_enter(int place)
> +{
> +    struct tacc* tacc = &this_cpu(tacc);
> +
> +//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
> +    ASSERT(!local_irq_is_enabled());
> +    ASSERT(tacc->irq_cnt >= 0);
You can make irq_cnt unsigned and drop this assert.

> +
> +    if ( tacc->irq_cnt == 0 )
> +    {
> +        tacc->irq_enter_time = NOW();
> +    }
Coding style:

---
Braces should be omitted for blocks with a single statement. e.g.,

if ( condition )
    single_statement();
---

> +
> +    tacc->irq_cnt++;
> +}
> +
> +void tacc_irq_exit(int place)
> +{
> +    struct tacc* tacc = &this_cpu(tacc);
> +
> +//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
> +    ASSERT(!local_irq_is_enabled());
> +    ASSERT(tacc->irq_cnt > 0);
> +    if ( tacc->irq_cnt == 1 )
> +    {
> +        tacc->irq_time = NOW() - tacc->irq_enter_time;
> +        tacc->irq_enter_time = 0;
> +    }
> +
> +    tacc->irq_cnt--;
What if, you IRQ will arrive right after this? I believe, you will lose
some of the accumulated time.

> +}
> +
>  void context_saved(struct vcpu *prev)
>  {
>      /* Clear running flag /after/ writing context to memory. */
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index e3601c1..04a8724 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
>  
>  void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
>  
> +enum TACC_STATES {
If I remember correct, enum names should in lower case

> +    TACC_HYP = 0,
> +    TACC_GUEST = 1,
> +    TACC_IDLE = 2,
> +    TACC_IRQ = 3,
> +    TACC_GSYNC = 4,
> +    TACC_STATES_MAX
> +};
> +
> +struct tacc
> +{
> +    s_time_t state_time[TACC_STATES_MAX];
> +    s_time_t state_entry_time;
> +    int state;
enum, maybe?

> +
> +    s_time_t guest_time;
> +
> +    s_time_t irq_enter_time;
> +    s_time_t irq_time;
> +    int irq_cnt;
> +};
> +
> +DECLARE_PER_CPU(struct tacc, tacc);
> +
> +void tacc_hyp(int place);
> +void tacc_idle(int place);
What about functions from sched.c? Should they be declared there?

> +
>  #endif /* __SCHED_H__ */
>  
>  /*


-- 
Volodymyr Babchuk at EPAM
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-09-11 18:01   ` Volodymyr Babchuk
@ 2019-09-12 10:26     ` Andrii Anisov
  0 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-12 10:26 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Dario Faggioli, Julien Grall, Jan Beulich, xen-devel,
	Ian Jackson

Hello Volodymyr,

On 11.09.19 21:01, Volodymyr Babchuk wrote:
>> Introduce per-pcpu time accounting what includes the following states:
>>
>> TACC_HYP - the pcpu executes hypervisor code like softirq processing
>>             (including scheduling), tasklets and context switches
>> TACC_GUEST - the pcpu executes guests code
>> TACC_IDLE - the low-power state of the pcpu
> Is it really low-power?

It is rather matter of scheduling design. It differs from OS to OS, even from arch to arch. See [1].
Me personally tend to treat only low-power state as a true idle.
As a bad (IMO) example I can give the current XEN mainline. Pretty heavy tasks could be performed by the idle vcpu, and they are accounted as idle. This may mislead, for example, cpufreq governor.

>> TACC_IRQ - the pcpu performs interrupts processing, without separation to
>>             guest or hypervisor interrupts
> I think, word "distinguishing" would be better than "separation"

Why so?

> 
>> TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
>>               from the guest. E.g. hypercall processing or io emulation.
>>
>> Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
>> to state other than TACC_IRQ could happen until we return from nested
>> interrupts. IRQ time is accounted in a distinct way comparing to other states.
>> It is acumulated between other states transition moments, and is substracted
>> from the old state on states transion calculation.
>>
>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
>> ---
>>   xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   xen/include/xen/sched.h | 27 +++++++++++++++++
>>   2 files changed, 108 insertions(+)
>>
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index 7b71581..6dd6603 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -1539,6 +1539,87 @@ static void schedule(void)
>>       context_switch(prev, next);
>>   }
>>   
>> +DEFINE_PER_CPU(struct tacc, tacc);
>> +
>> +static void tacc_state_change(enum TACC_STATES new_state)
>> +{
>> +    s_time_t now, delta;
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +    unsigned long flags;
>> +
>> +    local_irq_save(flags);
>> +
>> +    now = NOW();
>> +    delta = now - tacc->state_entry_time;
>> +
>> +    /* We do not expect states reenterability (at least through this function)*/
>> +    ASSERT(new_state != tacc->state);
>> +
>> +    tacc->state_time[tacc->state] += delta - tacc->irq_time;
>> +    tacc->state_time[TACC_IRQ] += tacc->irq_time;
>> +    tacc->irq_time = 0;
>> +    tacc->state = new_state;
>> +    tacc->state_entry_time = now;
>> +
>> +    local_irq_restore(flags);
>> +}
>> +
>> +void tacc_hyp(int place)
> I believe, you want some enum for this "place" parameter type
>> +{
>> +//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
> Please, don't push commented-out code. BTW, I think, it is possible to
> add some TACC_DEBUG facilities to enable/disable this traces during
> compile-time.
> 
> Also, looks like you don't use "place" parameter at all.

Since that is the RFC, I've comforted myself with leaving my debug code in place. I hope it should not be confusing.

> 
> Lastly, I believe that this function (and other similar functions below)
> can be defined as "static inline" in a header file

Not this time. They are mostly called from asm (at least now).

> 
>> +    tacc_state_change(TACC_HYP);
>> +}
>> +
>> +void tacc_guest(int place)
>> +{
>> +//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_GUEST);
>> +}
>> +
>> +void tacc_idle(int place)
>> +{
>> +//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_IDLE);
>> +}
>> +
>> +void tacc_gsync(int place)
>> +{
>> +//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_GSYNC);
>> +}
>> +
>> +void tacc_irq_enter(int place)
>> +{
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +
>> +//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
>> +    ASSERT(!local_irq_is_enabled());
>> +    ASSERT(tacc->irq_cnt >= 0);
> You can make irq_cnt unsigned and drop this assert.

No. Otherwise one might miss proper call sequence when utilize this for the different arch, and have no notice from the debug assertion.

> 
>> +
>> +    if ( tacc->irq_cnt == 0 )
>> +    {
>> +        tacc->irq_enter_time = NOW();
>> +    }
> Coding style:
> 
> ---
> Braces should be omitted for blocks with a single statement. e.g.,
> 
> if ( condition )
>      single_statement();
> ---
> 

OK.

>> +
>> +    tacc->irq_cnt++;
>> +}
>> +
>> +void tacc_irq_exit(int place)
>> +{
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +
>> +//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
>> +    ASSERT(!local_irq_is_enabled());
>> +    ASSERT(tacc->irq_cnt > 0);
>> +    if ( tacc->irq_cnt == 1 )
>> +    {
>> +        tacc->irq_time = NOW() - tacc->irq_enter_time;
>> +        tacc->irq_enter_time = 0;
>> +    }
>> +
>> +    tacc->irq_cnt--;
> What if, you IRQ will arrive right after this? I believe, you will lose
> some of the accumulated time.

See ASSERT(!local_irq_is_enabled()) above.

> 
>> +}
>> +
>>   void context_saved(struct vcpu *prev)
>>   {
>>       /* Clear running flag /after/ writing context to memory. */
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index e3601c1..04a8724 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
>>   
>>   void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
>>   
>> +enum TACC_STATES {
> If I remember correct, enum names should in lower case

Ugh...

> 
>> +    TACC_HYP = 0,
>> +    TACC_GUEST = 1,
>> +    TACC_IDLE = 2,
>> +    TACC_IRQ = 3,
>> +    TACC_GSYNC = 4,
>> +    TACC_STATES_MAX
>> +};
>> +
>> +struct tacc
>> +{
>> +    s_time_t state_time[TACC_STATES_MAX];
>> +    s_time_t state_entry_time;
>> +    int state;
> enum, maybe?

Maybe.

> 
>> +
>> +    s_time_t guest_time;
>> +
>> +    s_time_t irq_enter_time;
>> +    s_time_t irq_time;
>> +    int irq_cnt;
>> +};
>> +
>> +DECLARE_PER_CPU(struct tacc, tacc);
>> +
>> +void tacc_hyp(int place);
>> +void tacc_idle(int place);
> What about functions from sched.c? Should they be declared there?

Maybe.

> 
>> +
>>   #endif /* __SCHED_H__ */
>>   
>>   /*
> 
> 

[1] https://elixir.bootlin.com/linux/latest/source/kernel/sched/cputime.c#L429

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-11 17:48   ` Volodymyr Babchuk
@ 2019-09-12 12:09     ` Andrii Anisov
  2019-09-12 12:17       ` Julien Grall
  0 siblings, 1 reply; 26+ messages in thread
From: Andrii Anisov @ 2019-09-12 12:09 UTC (permalink / raw)
  To: Volodymyr Babchuk
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Andrii Anisov

Hello Volodymyr,

On 11.09.19 20:48, Volodymyr Babchuk wrote:
> 
> Hi Andrii,
> 

As we agreed, I'll wipe out debugging remains as well as cleanup coding style nits and resend the series.

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-12 12:09     ` Andrii Anisov
@ 2019-09-12 12:17       ` Julien Grall
  2019-09-12 12:29         ` Andrii Anisov
  0 siblings, 1 reply; 26+ messages in thread
From: Julien Grall @ 2019-09-12 12:17 UTC (permalink / raw)
  To: Andrii Anisov
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Volodymyr Babchuk,
	Andrii Anisov


[-- Attachment #1.1: Type: text/plain, Size: 674 bytes --]

On Thu, 12 Sep 2019, 13:10 Andrii Anisov, <andrii.anisov@gmail.com> wrote:

> Hello Volodymyr,
>
> On 11.09.19 20:48, Volodymyr Babchuk wrote:
> >
> > Hi Andrii,
> >
>
> As we agreed, I'll wipe out debugging remains as well as cleanup coding
> style nits and resend the series.


This an RFC and I am sure there current state is enough to spark a
discussion. There are no need to waste time resending it and use filling up
inboxes.

Please wait for more time.

Cheers,


> --
> Sincerely,
> Andrii Anisov.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel

[-- Attachment #1.2: Type: text/html, Size: 1790 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-12 12:17       ` Julien Grall
@ 2019-09-12 12:29         ` Andrii Anisov
  0 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-09-12 12:29 UTC (permalink / raw)
  To: Julien Grall
  Cc: xen-devel, Julien Grall, Stefano Stabellini, Volodymyr Babchuk,
	Andrii Anisov



On 12.09.19 15:17, Julien Grall wrote:
> This an RFC and I am sure there current state is enough to spark a discussion. There are no need to waste time resending it and use filling up inboxes.
> 
> Please wait for more time.

Gotcha!

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
  2019-09-11 18:01   ` Volodymyr Babchuk
@ 2019-10-28 14:28   ` Julien Grall
  2019-11-06 11:24     ` Andrii Anisov
  1 sibling, 1 reply; 26+ messages in thread
From: Julien Grall @ 2019-10-28 14:28 UTC (permalink / raw)
  To: Andrii Anisov, xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Jan Beulich

Hi Andrii,

Sorry for the late answer. It would be good to get a review from the scheduler 
maintainers (Dario, George) to make sure they are happy with the suggested 
states here.

Please see my comments below.

On 11/09/2019 11:32, Andrii Anisov wrote:
> From: Andrii Anisov <andrii_anisov@epam.com>
> 
> Introduce per-pcpu time accounting what includes the following states:

I think we need a very detailed description of each states. Otherwise it will be 
hard to know how to categorize it.

> 
> TACC_HYP - the pcpu executes hypervisor code like softirq processing
>             (including scheduling), tasklets and context switches

IHMO, "like" is too weak here. What do you exactly plan to introduce?

For instance, on Arm, you consider that leave_hypervisor_tail() is part of 
TACC_HYP. This function will include some handling for synchronous trap.

> TACC_GUEST - the pcpu executes guests code

Looking at the arm64 code, you are executing some hypervisor code here. I agree 
this is impossible to not run any hypervisor code with TACC_GUEST, but I think 
this should be clarified in the documentation.

> TACC_IDLE - the low-power state of the pcpu

Did you intend to mean "idle vCPU" is in use?

> TACC_IRQ - the pcpu performs interrupts processing, without separation to
>             guest or hypervisor interrupts
> TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
>               from the guest. E.g. hypercall processing or io emulation.
> 
> Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
> to state other than TACC_IRQ could happen until we return from nested
> interrupts. IRQ time is accounted in a distinct way comparing to other states.

s/comparing/compare/

> It is acumulated between other states transition moments, and is substracted

s/acumulated/accumulated/ s/substracted/subtracted/

> from the old state on states transion calculation.

s/transion/transition/

> 
> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> ---
>   xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
>   xen/include/xen/sched.h | 27 +++++++++++++++++
>   2 files changed, 108 insertions(+)
> 
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 7b71581..6dd6603 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -1539,6 +1539,87 @@ static void schedule(void)
>       context_switch(prev, next);
>   }
>   
> +DEFINE_PER_CPU(struct tacc, tacc);
> +
> +static void tacc_state_change(enum TACC_STATES new_state)

This should never be called with the TACC_IRQ, right?

> +{
> +    s_time_t now, delta;
> +    struct tacc* tacc = &this_cpu(tacc);
> +    unsigned long flags;
> +
> +    local_irq_save(flags);
> +
> +    now = NOW();
> +    delta = now - tacc->state_entry_time;
> +
> +    /* We do not expect states reenterability (at least through this function)*/
> +    ASSERT(new_state != tacc->state);
> +
> +    tacc->state_time[tacc->state] += delta - tacc->irq_time;
> +    tacc->state_time[TACC_IRQ] += tacc->irq_time;
> +    tacc->irq_time = 0;
> +    tacc->state = new_state;
> +    tacc->state_entry_time = now;
> +
> +    local_irq_restore(flags);
> +}
> +
> +void tacc_hyp(int place)

Place is never used except for your commented printk. So what's the goal for it?

Also, is it really necessary to provide helper for each state? Couldn't we just 
introduce one functions doing all the state?

> +{
> +//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_HYP);
> +}
> +
> +void tacc_guest(int place)
> +{
> +//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_GUEST);
> +}
> +
> +void tacc_idle(int place)
> +{
> +//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_IDLE);
> +}
> +
> +void tacc_gsync(int place)
> +{
> +//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
> +    tacc_state_change(TACC_GSYNC);
> +}
> +
> +void tacc_irq_enter(int place)
> +{
> +    struct tacc* tacc = &this_cpu(tacc);
> +
> +//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
> +    ASSERT(!local_irq_is_enabled());
> +    ASSERT(tacc->irq_cnt >= 0);
> +
> +    if ( tacc->irq_cnt == 0 )
> +    {
> +        tacc->irq_enter_time = NOW();
> +    }
> +
> +    tacc->irq_cnt++;
> +}
> +
> +void tacc_irq_exit(int place)
> +{
> +    struct tacc* tacc = &this_cpu(tacc);
> +
> +//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
> +    ASSERT(!local_irq_is_enabled());
> +    ASSERT(tacc->irq_cnt > 0);
> +    if ( tacc->irq_cnt == 1 )
> +    {
> +        tacc->irq_time = NOW() - tacc->irq_enter_time;

If I understand correctly, you will use irq_time to update TACC_IRQ in 
tacc_state_change(). It may be possible to receive another interrupt before the 
state is changed (e.g. HYP -> GUEST). This means only the time for the last IRQ 
received would be accounted.

> +        tacc->irq_enter_time = 0;
> +    }
> +
> +    tacc->irq_cnt--;
> +}
> +
>   void context_saved(struct vcpu *prev)
>   {
>       /* Clear running flag /after/ writing context to memory. */
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index e3601c1..04a8724 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
>   
>   void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
>   
> +enum TACC_STATES {

We don't tend to use all uppercases for enum name.

> +    TACC_HYP = 0,

enum begins at 0 and increment by one every time. So there is no need to 
hardcode a number.

Also, looking at the code, I think you rely on the first state to be TACC_HYP. 
Am I correct?

> +    TACC_GUEST = 1,
> +    TACC_IDLE = 2,
> +    TACC_IRQ = 3,
> +    TACC_GSYNC = 4,
> +    TACC_STATES_MAX
> +};

It would be good to document all the states in the header as well.

> +
> +struct tacc

Please document the structure.

> +{
> +    s_time_t state_time[TACC_STATES_MAX];
> +    s_time_t state_entry_time;
> +    int state;

This should be the enum you used above here.

> +
> +    s_time_t guest_time;

This is not used.

> +
> +    s_time_t irq_enter_time;
> +    s_time_t irq_time;
> +    int irq_cnt;
Why do you need this to be signed?

> +};
> +
> +DECLARE_PER_CPU(struct tacc, tacc);
> +
> +void tacc_hyp(int place);
> +void tacc_idle(int place);
> +
>   #endif /* __SCHED_H__ */
>   
>   /*
> 

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-09-11 10:32 ` [Xen-devel] [RFC 4/9] arm64: utilize time accounting Andrii Anisov
  2019-09-11 17:48   ` Volodymyr Babchuk
@ 2019-10-28 14:47   ` Julien Grall
  2019-11-06 11:31     ` Andrii Anisov
  1 sibling, 1 reply; 26+ messages in thread
From: Julien Grall @ 2019-10-28 14:47 UTC (permalink / raw)
  To: Andrii Anisov, xen-devel
  Cc: George Dunlap, Volodymyr Babchuk, Dario Faggioli,
	Stefano Stabellini, Andrii Anisov

(+ George and Dario)

Hi,

On 11/09/2019 11:32, Andrii Anisov wrote:
> From: Andrii Anisov <andrii_anisov@epam.com>
> 
> Call time accounting hooks from appropriate transition points
> of the ARM64 code.
> 
> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> ---
>   xen/arch/arm/arm64/entry.S | 39 ++++++++++++++++++++++++++++++++++++---
>   xen/arch/arm/domain.c      |  2 ++
>   2 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
> index 2d9a271..6fb2fa9 100644
> --- a/xen/arch/arm/arm64/entry.S
> +++ b/xen/arch/arm/arm64/entry.S
> @@ -143,12 +143,21 @@
>   
>           .endm
>   
> -        .macro  exit, hyp, compat
> +        .macro  exit, hyp, compat, tacc=1
>   
>           .if \hyp == 0         /* Guest mode */
>   
> +	.if \tacc == 1

This is here because you may already be in the HYP state, right?

I noticed in the previous patch you mention that you only handle "re-entry" for 
the IRQ state.

As you don't have "exit" for states other than IRQ, then I would not consider 
this is as re-entry. It is more like you transition from one state to another 
(it happen this is the same state).

The problem of re-entry would be if you take an exception that is going to 
switch the state. But the concern would be exactly the same if you take an 
exception that switch the state (such as synchronous hypervisor exception).

This raises the question, how do you account SError interrupt/synchronous exception?

> +
> +        mov     x0, #1
> +        bl      tacc_hyp
> +
> +	.endif
> +
>           bl      leave_hypervisor_tail /* Disables interrupts on return */

As mentioned in the previous patch, leave_hypervisor_tail() may do some IO 
emulation that requires to be preemptible. So I don't think this is correct to 
get that time accounted to the hypervisor.

>   
> +	mov     x0, #1
> +	bl      tacc_guest
>           exit_guest \compat
>   
>           .endif
> @@ -205,9 +214,15 @@ hyp_sync:
>   
>   hyp_irq:
>           entry   hyp=1
> +        mov     x0,#5
> +        bl      tacc_irq_enter
>           msr     daifclr, #4
>           mov     x0, sp
>           bl      do_trap_irq
> +
> +        mov     x0,#5
> +        bl      tacc_irq_exit
> +
>           exit    hyp=1
>   
>   guest_sync:
> @@ -291,6 +306,9 @@ guest_sync_slowpath:
>            * to save them.
>            */
>           entry   hyp=0, compat=0, save_x0_x1=0
> +
> +        mov     x0,#1
> +        bl      tacc_gsync
>           /*
>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>            * is not set. If a vSError took place, the initial exception will be
> @@ -307,6 +325,10 @@ guest_sync_slowpath:
>   
>   guest_irq:
>           entry   hyp=0, compat=0
> +
> +        mov     x0,#6
> +        bl      tacc_irq_enter
> +
>           /*
>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>            * is not set. If a vSError took place, the initial exception will be
> @@ -319,6 +341,8 @@ guest_irq:
>           mov     x0, sp
>           bl      do_trap_irq
>   1:
> +	mov	x0,#6
> +        bl      tacc_irq_exit
>           exit    hyp=0, compat=0
>   
>   guest_fiq_invalid:
> @@ -334,6 +358,9 @@ guest_error:
>   
>   guest_sync_compat:
>           entry   hyp=0, compat=1
> +
> +        mov     x0,#2
> +        bl      tacc_gsync
>           /*
>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>            * is not set. If a vSError took place, the initial exception will be
> @@ -350,6 +377,10 @@ guest_sync_compat:
>   
>   guest_irq_compat:
>           entry   hyp=0, compat=1
> +
> +        mov     x0,#7
> +        bl      tacc_irq_enter
> +
>           /*
>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>            * is not set. If a vSError took place, the initial exception will be
> @@ -362,6 +393,8 @@ guest_irq_compat:
>           mov     x0, sp
>           bl      do_trap_irq
>   1:
> +        mov     x0,#7
> +        bl      tacc_irq_exit
>           exit    hyp=0, compat=1
>   
>   guest_fiq_invalid_compat:
> @@ -376,9 +409,9 @@ guest_error_compat:
>           exit    hyp=0, compat=1
>   
>   ENTRY(return_to_new_vcpu32)
> -        exit    hyp=0, compat=1
> +        exit    hyp=0, compat=1, tacc=0
>   ENTRY(return_to_new_vcpu64)
> -        exit    hyp=0, compat=0
> +        exit    hyp=0, compat=0, tacc=0
>   
>   return_from_trap:
>           msr     daifset, #2 /* Mask interrupts */
> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
> index a9c4113..53ef630 100644
> --- a/xen/arch/arm/domain.c
> +++ b/xen/arch/arm/domain.c
> @@ -51,11 +51,13 @@ static void do_idle(void)
>       process_pending_softirqs();
>   
>       local_irq_disable();
> +    tacc_idle(1);

Any reason to call this before the if and not inside?

>       if ( cpu_is_haltable(cpu) )
>       {
>           dsb(sy);
>           wfi();
>       }
> +    tacc_hyp(2);
>       local_irq_enable();
>   
>       sched_tick_resume();
> 

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface
  2019-09-11 10:32 ` [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface Andrii Anisov
@ 2019-10-28 14:52   ` Julien Grall
  2019-11-06 11:25     ` Andrii Anisov
  0 siblings, 1 reply; 26+ messages in thread
From: Julien Grall @ 2019-10-28 14:52 UTC (permalink / raw)
  To: Andrii Anisov, xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Jan Beulich

Hi,

On 11/09/2019 11:32, Andrii Anisov wrote:
> From: Andrii Anisov <andrii_anisov@epam.com>
> 
> Extend XEN_SYSCTL_getcpuinfo interface with timing information
> provided by introduced time accounting infrastructure.
> 
> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> ---
>   xen/common/schedule.c       | 33 ++++++++++++++++++++++++++++-----
>   xen/common/sysctl.c         |  4 ++++
>   xen/include/public/sysctl.h |  4 ++++
>   xen/include/xen/sched.h     |  4 ++++
>   4 files changed, 40 insertions(+), 5 deletions(-)
> 
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 6dd6603..2007034 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -208,13 +208,36 @@ void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
>   
>   uint64_t get_cpu_idle_time(unsigned int cpu)
>   {
> -    struct vcpu_runstate_info state = { 0 };
> -    struct vcpu *v = idle_vcpu[cpu];
> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>   
> -    if ( cpu_online(cpu) && v )
> -        vcpu_runstate_get(v, &state);
> +    return tacc->state_time[TACC_IDLE];

So what's the difference between the current idle time and the new version?

> +}
> +
> +uint64_t get_cpu_guest_time(unsigned int cpu)
> +{
> +    struct tacc *tacc = &per_cpu(tacc, cpu);
> +
> +    return tacc->state_time[TACC_GUEST];
> +}
> +
> +uint64_t get_cpu_hyp_time(unsigned int cpu)
> +{
> +    struct tacc *tacc = &per_cpu(tacc, cpu);
> +
> +    return tacc->state_time[TACC_HYP];
> +}
> +
> +uint64_t get_cpu_irq_time(unsigned int cpu)
> +{
> +    struct tacc *tacc = &per_cpu(tacc, cpu);
> +
> +    return tacc->state_time[TACC_IRQ];
> +}
> +uint64_t get_cpu_gsync_time(unsigned int cpu)
> +{
> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>   
> -    return state.time[RUNSTATE_running];
> +    return tacc->state_time[TACC_GSYNC];
>   }

You may want to introduce an helper retrieving the time for a given state rather 
than duplicating it.

>   
>   /*
> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
> index 92b4ea0..b63083c 100644
> --- a/xen/common/sysctl.c
> +++ b/xen/common/sysctl.c
> @@ -152,6 +152,10 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>           for ( i = 0; i < nr_cpus; i++ )
>           {
>               cpuinfo.idletime = get_cpu_idle_time(i);
> +            cpuinfo.guesttime = get_cpu_guest_time(i);
> +            cpuinfo.hyptime = get_cpu_hyp_time(i);
> +            cpuinfo.gsynctime = get_cpu_gsync_time(i);
> +            cpuinfo.irqtime = get_cpu_irq_time(i);

It feels to me we want a function that fills up the structure.

>   
>               if ( copy_to_guest_offset(op->u.getcpuinfo.info, i, &cpuinfo, 1) )
>                   goto out;
> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> index 5401f9c..cdada1f 100644
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h

As a side note, SYSCTL version will need to be bumped if this wasn't done before 
during the current release.

> @@ -168,6 +168,10 @@ struct xen_sysctl_debug_keys {
>   /* XEN_SYSCTL_getcpuinfo */
>   struct xen_sysctl_cpuinfo {
>       uint64_aligned_t idletime;
> +    uint64_aligned_t hyptime;
> +    uint64_aligned_t guesttime;
> +    uint64_aligned_t irqtime;
> +    uint64_aligned_t gsynctime;
>   };
>   typedef struct xen_sysctl_cpuinfo xen_sysctl_cpuinfo_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpuinfo_t);
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 04a8724..8167608 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -876,6 +876,10 @@ void restore_vcpu_affinity(struct domain *d);
>   
>   void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
>   uint64_t get_cpu_idle_time(unsigned int cpu);
> +uint64_t get_cpu_hyp_time(unsigned int cpu);
> +uint64_t get_cpu_guest_time(unsigned int cpu);
> +uint64_t get_cpu_gsync_time(unsigned int cpu);
> +uint64_t get_cpu_irq_time(unsigned int cpu);
>   
>   /*
>    * Used by idle loop to decide whether there is work to do:
> 

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-10-28 14:28   ` Julien Grall
@ 2019-11-06 11:24     ` Andrii Anisov
  2020-05-26  2:27       ` Volodymyr Babchuk
  0 siblings, 1 reply; 26+ messages in thread
From: Andrii Anisov @ 2019-11-06 11:24 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Jan Beulich

Hello Julien,

On 28.10.19 16:28, Julien Grall wrote:
> It would be good to get a review from the scheduler maintainers (Dario, George) to make sure they are happy with the suggested states here.

I would not say I'm completely happy with this set of states. I'd like to have a discussion on this topic with scheduler maintainers. Also because they could have a different view from x86 world.

>> Introduce per-pcpu time accounting what includes the following states:
> 
> I think we need a very detailed description of each states. Otherwise it will be hard to know how to categorize it.

I agree that we need a very detailed description of each states. Ask questions if something is not clear or doubtful. I guess we could have something better after Q&A process.

> 
>>
>> TACC_HYP - the pcpu executes hypervisor code like softirq processing
>>             (including scheduling), tasklets and context switches
> 
> IHMO, "like" is too weak here. What do you exactly plan to introduce?

I think this should be what hypervisor does except hypercall and IO emulation (what is TACC_GSYNC).

> 
> For instance, on Arm, you consider that leave_hypervisor_tail() is part of TACC_HYP. This function will include some handling for synchronous trap.

I guess you are saying about `p2m_flush_vm`. I doubt here, and open for suggestions.


>> TACC_GUEST - the pcpu executes guests code
> 
> Looking at the arm64 code, you are executing some hypervisor code here. I agree this is impossible to not run any hypervisor code with TACC_GUEST, but I think this should be clarified in the documentation.

Do you mean adding few words about still having some hypervisor code near the actual context switch from/to guest (entry/return_from_trap)?

> 
>> TACC_IDLE - the low-power state of the pcpu
> 
> Did you intend to mean "idle vCPU" is in use?

No. I did mean what is written.
Currently, the idle vcpu does hypervisor work (e.g. tasklets) along with the low-power mode. IMO we have to separate them.

> 
>> TACC_IRQ - the pcpu performs interrupts processing, without separation to
>>             guest or hypervisor interrupts
>> TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
>>               from the guest. E.g. hypercall processing or io emulation.
>>
>> Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
>> to state other than TACC_IRQ could happen until we return from nested
>> interrupts. IRQ time is accounted in a distinct way comparing to other states.
> 
> s/comparing/compare/

OK.

> 
>> It is acumulated between other states transition moments, and is substracted
> 
> s/acumulated/accumulated/ s/substracted/subtracted/

OK.

> 
>> from the old state on states transion calculation.
[1]
> 
> s/transion/transition/

OK.

> 
>>
>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
>> ---
>>   xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   xen/include/xen/sched.h | 27 +++++++++++++++++
>>   2 files changed, 108 insertions(+)
>>
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index 7b71581..6dd6603 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -1539,6 +1539,87 @@ static void schedule(void)
>>       context_switch(prev, next);
>>   }
>> +DEFINE_PER_CPU(struct tacc, tacc);
>> +
>> +static void tacc_state_change(enum TACC_STATES new_state)
> 
> This should never be called with the TACC_IRQ, right?

Yes. Actually, tacc->state should never be TACC_IRQ.
Because of TACC_IRQ reenterability it is handled through the tacc->irq_cnt and tacc->irq_enter_time.

> 
>> +{
>> +    s_time_t now, delta;
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +    unsigned long flags;
>> +
>> +    local_irq_save(flags);
>> +
>> +    now = NOW();
>> +    delta = now - tacc->state_entry_time;
>> +
>> +    /* We do not expect states reenterability (at least through this function)*/
>> +    ASSERT(new_state != tacc->state);
>> +
>> +    tacc->state_time[tacc->state] += delta - tacc->irq_time;
>> +    tacc->state_time[TACC_IRQ] += tacc->irq_time;
>> +    tacc->irq_time = 0;
>> +    tacc->state = new_state;
>> +    tacc->state_entry_time = now;
>> +
>> +    local_irq_restore(flags);
>> +}
>> +
>> +void tacc_hyp(int place)
> 
> Place is never used except for your commented printk. So what's the goal for it?

Place is just a piece of code used for debugging, as well as printk. I keept it here because this series is very RFC, yet it could be removed.

> Also, is it really necessary to provide helper for each state? Couldn't we just introduce one functions doing all the state?

I'd like calling that stuff from assembler without parameters. But have no strong opinion here.
  
>> +{
>> +//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_HYP);
>> +}
>> +
>> +void tacc_guest(int place)
>> +{
>> +//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_GUEST);
>> +}
>> +
>> +void tacc_idle(int place)
>> +{
>> +//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_IDLE);
>> +}
>> +
>> +void tacc_gsync(int place)
>> +{
>> +//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
>> +    tacc_state_change(TACC_GSYNC);
>> +}
>> +
>> +void tacc_irq_enter(int place)
>> +{
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +
>> +//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
>> +    ASSERT(!local_irq_is_enabled());
>> +    ASSERT(tacc->irq_cnt >= 0);
>> +
>> +    if ( tacc->irq_cnt == 0 )
>> +    {
>> +        tacc->irq_enter_time = NOW();
>> +    }
>> +
>> +    tacc->irq_cnt++;
>> +}
>> +
>> +void tacc_irq_exit(int place)
>> +{
>> +    struct tacc* tacc = &this_cpu(tacc);
>> +
>> +//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
>> +    ASSERT(!local_irq_is_enabled());
>> +    ASSERT(tacc->irq_cnt > 0);
>> +    if ( tacc->irq_cnt == 1 )
>> +    {
>> +        tacc->irq_time = NOW() - tacc->irq_enter_time;
> 
> If I understand correctly, you will use irq_time to update TACC_IRQ in tacc_state_change(). It may be possible to receive another interrupt before the state is changed (e.g. HYP -> GUEST). This means only the time for the last IRQ received would be accounted.

I do lock IRQs for state change. Shouldn't that protect it?

> 
>> +        tacc->irq_enter_time = 0;
>> +    }
>> +
>> +    tacc->irq_cnt--;
>> +}
>> +
>>   void context_saved(struct vcpu *prev)
>>   {
>>       /* Clear running flag /after/ writing context to memory. */
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index e3601c1..04a8724 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
>>   void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
>> +enum TACC_STATES {
> 
> We don't tend to use all uppercases for enum name.

OK.

> 
>> +    TACC_HYP = 0,
> 
> enum begins at 0 and increment by one every time. So there is no need to hardcode a number.
> 
> Also, looking at the code, I think you rely on the first state to be TACC_HYP. Am I correct?

TACC_HYP is expected to be the initial state of the PCPU.

> 
>> +    TACC_GUEST = 1,
>> +    TACC_IDLE = 2,
>> +    TACC_IRQ = 3,
>> +    TACC_GSYNC = 4,
>> +    TACC_STATES_MAX
>> +};
> > It would be good to document all the states in the header as well.

OK.

> 
>> +
>> +struct tacc
> 
> Please document the structure.

OK.

> 
>> +{
>> +    s_time_t state_time[TACC_STATES_MAX];
>> +    s_time_t state_entry_time;
>> +    int state;
> 
> This should be the enum you used above here.

Yep.

>> +
>> +    s_time_t guest_time;
> 
> This is not used.

Yep, will drop it.

> 
>> +
>> +    s_time_t irq_enter_time;
>> +    s_time_t irq_time;
>> +    int irq_cnt;
> Why do you need this to be signed?

For assertion.
  
>> +};
>> +
>> +DECLARE_PER_CPU(struct tacc, tacc);
>> +
>> +void tacc_hyp(int place);
>> +void tacc_idle(int place);
>> +
>>   #endif /* __SCHED_H__ */
>>   /*
>>
> 
> Cheers,
>

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface
  2019-10-28 14:52   ` Julien Grall
@ 2019-11-06 11:25     ` Andrii Anisov
  0 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-11-06 11:25 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: Tim Deegan, Stefano Stabellini, Andrii Anisov, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Ian Jackson,
	Dario Faggioli, Jan Beulich

Hello Julien

On 28.10.19 16:52, Julien Grall wrote:
> Hi,
> 
> On 11/09/2019 11:32, Andrii Anisov wrote:
>> From: Andrii Anisov <andrii_anisov@epam.com>
>>
>> Extend XEN_SYSCTL_getcpuinfo interface with timing information
>> provided by introduced time accounting infrastructure.
>>
>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
>> ---
>>   xen/common/schedule.c       | 33 ++++++++++++++++++++++++++++-----
>>   xen/common/sysctl.c         |  4 ++++
>>   xen/include/public/sysctl.h |  4 ++++
>>   xen/include/xen/sched.h     |  4 ++++
>>   4 files changed, 40 insertions(+), 5 deletions(-)
>>
>> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>> index 6dd6603..2007034 100644
>> --- a/xen/common/schedule.c
>> +++ b/xen/common/schedule.c
>> @@ -208,13 +208,36 @@ void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
>>   uint64_t get_cpu_idle_time(unsigned int cpu)
>>   {
>> -    struct vcpu_runstate_info state = { 0 };
>> -    struct vcpu *v = idle_vcpu[cpu];
>> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>> -    if ( cpu_online(cpu) && v )
>> -        vcpu_runstate_get(v, &state);
>> +    return tacc->state_time[TACC_IDLE];
> 
> So what's the difference between the current idle time and the new version?

Currently, idle time is the idle vcpu run time, what includes tasklets, scheduling, irq processing etc.
IMO it may confuse cpufreq and power management.

> 
>> +}
>> +
>> +uint64_t get_cpu_guest_time(unsigned int cpu)
>> +{
>> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>> +
>> +    return tacc->state_time[TACC_GUEST];
>> +}
>> +
>> +uint64_t get_cpu_hyp_time(unsigned int cpu)
>> +{
>> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>> +
>> +    return tacc->state_time[TACC_HYP];
>> +}
>> +
>> +uint64_t get_cpu_irq_time(unsigned int cpu)
>> +{
>> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>> +
>> +    return tacc->state_time[TACC_IRQ];
>> +}
>> +uint64_t get_cpu_gsync_time(unsigned int cpu)
>> +{
>> +    struct tacc *tacc = &per_cpu(tacc, cpu);
>> -    return state.time[RUNSTATE_running];
>> +    return tacc->state_time[TACC_GSYNC];
>>   }
> 
> You may want to introduce an helper retrieving the time for a given state rather than duplicating it.

Yep.

> 
>>   /*
>> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
>> index 92b4ea0..b63083c 100644
>> --- a/xen/common/sysctl.c
>> +++ b/xen/common/sysctl.c
>> @@ -152,6 +152,10 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>>           for ( i = 0; i < nr_cpus; i++ )
>>           {
>>               cpuinfo.idletime = get_cpu_idle_time(i);
>> +            cpuinfo.guesttime = get_cpu_guest_time(i);
>> +            cpuinfo.hyptime = get_cpu_hyp_time(i);
>> +            cpuinfo.gsynctime = get_cpu_gsync_time(i);
>> +            cpuinfo.irqtime = get_cpu_irq_time(i);
> 
> It feels to me we want a function that fills up the structure.

Maybe.

> 
>>               if ( copy_to_guest_offset(op->u.getcpuinfo.info, i, &cpuinfo, 1) )
>>                   goto out;
>> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
>> index 5401f9c..cdada1f 100644
>> --- a/xen/include/public/sysctl.h
>> +++ b/xen/include/public/sysctl.h
> 
> As a side note, SYSCTL version will need to be bumped if this wasn't done before during the current release.
> 
>> @@ -168,6 +168,10 @@ struct xen_sysctl_debug_keys {
>>   /* XEN_SYSCTL_getcpuinfo */
>>   struct xen_sysctl_cpuinfo {
>>       uint64_aligned_t idletime;
>> +    uint64_aligned_t hyptime;
>> +    uint64_aligned_t guesttime;
>> +    uint64_aligned_t irqtime;
>> +    uint64_aligned_t gsynctime;
>>   };
>>   typedef struct xen_sysctl_cpuinfo xen_sysctl_cpuinfo_t;
>>   DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpuinfo_t);
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 04a8724..8167608 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -876,6 +876,10 @@ void restore_vcpu_affinity(struct domain *d);
>>   void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
>>   uint64_t get_cpu_idle_time(unsigned int cpu);
>> +uint64_t get_cpu_hyp_time(unsigned int cpu);
>> +uint64_t get_cpu_guest_time(unsigned int cpu);
>> +uint64_t get_cpu_gsync_time(unsigned int cpu);
>> +uint64_t get_cpu_irq_time(unsigned int cpu);
>>   /*
>>    * Used by idle loop to decide whether there is work to do:
>>
> 
> Cheers,
> 

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 4/9] arm64: utilize time accounting
  2019-10-28 14:47   ` Julien Grall
@ 2019-11-06 11:31     ` Andrii Anisov
  0 siblings, 0 replies; 26+ messages in thread
From: Andrii Anisov @ 2019-11-06 11:31 UTC (permalink / raw)
  To: Julien Grall, xen-devel
  Cc: George Dunlap, Volodymyr Babchuk, Dario Faggioli,
	Stefano Stabellini, Andrii Anisov



On 28.10.19 16:47, Julien Grall wrote:
> (+ George and Dario)
> 
> Hi,
> 
> On 11/09/2019 11:32, Andrii Anisov wrote:
>> From: Andrii Anisov <andrii_anisov@epam.com>
>>
>> Call time accounting hooks from appropriate transition points
>> of the ARM64 code.
>>
>> Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
>> ---
>>   xen/arch/arm/arm64/entry.S | 39 ++++++++++++++++++++++++++++++++++++---
>>   xen/arch/arm/domain.c      |  2 ++
>>   2 files changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S
>> index 2d9a271..6fb2fa9 100644
>> --- a/xen/arch/arm/arm64/entry.S
>> +++ b/xen/arch/arm/arm64/entry.S
>> @@ -143,12 +143,21 @@
>>           .endm
>> -        .macro  exit, hyp, compat
>> +        .macro  exit, hyp, compat, tacc=1
>>           .if \hyp == 0         /* Guest mode */
>> +    .if \tacc == 1
> 
> This is here because you may already be in the HYP state, right?
> 
> I noticed in the previous patch you mention that you only handle "re-entry" for the IRQ state.
> 
> As you don't have "exit" for states other than IRQ, then I would not consider this is as re-entry. It is more like you transition from one state to another (it happen this is the same state).
> 
> The problem of re-entry would be if you take an exception that is going to switch the state. But the concern would be exactly the same if you take an exception that switch the state (such as synchronous hypervisor exception).
> > This raises the question, how do you account SError interrupt/synchronous exception?

Well, I guess I have to think more about it. Maybe I do not understand completely that problem now.

> 
>> +
>> +        mov     x0, #1
>> +        bl      tacc_hyp
>> +
>> +    .endif
>> +
>>           bl      leave_hypervisor_tail /* Disables interrupts on return */
> 
> As mentioned in the previous patch, leave_hypervisor_tail() may do some IO emulation that requires to be preemptible. So I don't think this is correct to get that time accounted to the hypervisor.

Could you please point me to the line of the code? Are you saying about GIC?

> 
>> +    mov     x0, #1
>> +    bl      tacc_guest
>>           exit_guest \compat
>>           .endif
>> @@ -205,9 +214,15 @@ hyp_sync:
>>   hyp_irq:
>>           entry   hyp=1
>> +        mov     x0,#5
>> +        bl      tacc_irq_enter
>>           msr     daifclr, #4
>>           mov     x0, sp
>>           bl      do_trap_irq
>> +
>> +        mov     x0,#5
>> +        bl      tacc_irq_exit
>> +
>>           exit    hyp=1
>>   guest_sync:
>> @@ -291,6 +306,9 @@ guest_sync_slowpath:
>>            * to save them.
>>            */
>>           entry   hyp=0, compat=0, save_x0_x1=0
>> +
>> +        mov     x0,#1
>> +        bl      tacc_gsync
>>           /*
>>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>>            * is not set. If a vSError took place, the initial exception will be
>> @@ -307,6 +325,10 @@ guest_sync_slowpath:
>>   guest_irq:
>>           entry   hyp=0, compat=0
>> +
>> +        mov     x0,#6
>> +        bl      tacc_irq_enter
>> +
>>           /*
>>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>>            * is not set. If a vSError took place, the initial exception will be
>> @@ -319,6 +341,8 @@ guest_irq:
>>           mov     x0, sp
>>           bl      do_trap_irq
>>   1:
>> +    mov    x0,#6
>> +        bl      tacc_irq_exit
>>           exit    hyp=0, compat=0
>>   guest_fiq_invalid:
>> @@ -334,6 +358,9 @@ guest_error:
>>   guest_sync_compat:
>>           entry   hyp=0, compat=1
>> +
>> +        mov     x0,#2
>> +        bl      tacc_gsync
>>           /*
>>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>>            * is not set. If a vSError took place, the initial exception will be
>> @@ -350,6 +377,10 @@ guest_sync_compat:
>>   guest_irq_compat:
>>           entry   hyp=0, compat=1
>> +
>> +        mov     x0,#7
>> +        bl      tacc_irq_enter
>> +
>>           /*
>>            * The vSError will be checked while SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT
>>            * is not set. If a vSError took place, the initial exception will be
>> @@ -362,6 +393,8 @@ guest_irq_compat:
>>           mov     x0, sp
>>           bl      do_trap_irq
>>   1:
>> +        mov     x0,#7
>> +        bl      tacc_irq_exit
>>           exit    hyp=0, compat=1
>>   guest_fiq_invalid_compat:
>> @@ -376,9 +409,9 @@ guest_error_compat:
>>           exit    hyp=0, compat=1
>>   ENTRY(return_to_new_vcpu32)
>> -        exit    hyp=0, compat=1
>> +        exit    hyp=0, compat=1, tacc=0
>>   ENTRY(return_to_new_vcpu64)
>> -        exit    hyp=0, compat=0
>> +        exit    hyp=0, compat=0, tacc=0
>>   return_from_trap:
>>           msr     daifset, #2 /* Mask interrupts */
>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>> index a9c4113..53ef630 100644
>> --- a/xen/arch/arm/domain.c
>> +++ b/xen/arch/arm/domain.c
>> @@ -51,11 +51,13 @@ static void do_idle(void)
>>       process_pending_softirqs();
>>       local_irq_disable();
>> +    tacc_idle(1);
> 
> Any reason to call this before the if and not inside?

Maybe having at least a bit of idle for non-haltable cpu. But from the second thought it would not work.

> 
>>       if ( cpu_is_haltable(cpu) )
>>       {
>>           dsb(sy);
>>           wfi();
>>       }
>> +    tacc_hyp(2);
>>       local_irq_enable();
>>       sched_tick_resume();
>>
> 
> Cheers,
> 

-- 
Sincerely,
Andrii Anisov.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2019-11-06 11:24     ` Andrii Anisov
@ 2020-05-26  2:27       ` Volodymyr Babchuk
  2020-05-29  8:48         ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: Volodymyr Babchuk @ 2020-05-26  2:27 UTC (permalink / raw)
  To: andrii.anisov, george.dunlap, dfaggioli, julien.grall, xen-devel
  Cc: sstabellini, wl, konrad.wilk, ian.jackson, tim, jbeulich, andrew.cooper3

Hello All,

This is gentle reminder about this RFC. 

Sadly, Andrii Anisov has left our team. But I'm commited to continue
his work on time accounting and real time scheduling.

I do realize, that proposed patches have become moldy. I can rebase
them onto current master, if it would help. 

On Wed, 2019-11-06 at 13:24 +0200, Andrii Anisov wrote:
> Hello Julien,
> 
> On 28.10.19 16:28, Julien Grall wrote:
> > It would be good to get a review from the scheduler maintainers (Dario, George) to make sure they are happy with the suggested states here.
> 
> I would not say I'm completely happy with this set of states. I'd like to have a discussion on this topic with scheduler maintainers. Also because they could have a different view from x86 world.

I would love to hear any inputs on this topc from general scheduling
approach standpoint and from x86 view.  

> > > Introduce per-pcpu time accounting what includes the following states:
> > 
> > I think we need a very detailed description of each states. Otherwise it will be hard to know how to categorize it.
> 
> I agree that we need a very detailed description of each states. Ask questions if something is not clear or doubtful. I guess we could have something better after Q&A process.
> 
> > > TACC_HYP - the pcpu executes hypervisor code like softirq processing
> > >             (including scheduling), tasklets and context switches
> > 
> > IHMO, "like" is too weak here. What do you exactly plan to introduce?
> 
> I think this should be what hypervisor does except hypercall and IO emulation (what is TACC_GSYNC).
> 
> > For instance, on Arm, you consider that leave_hypervisor_tail() is part of TACC_HYP. This function will include some handling for synchronous trap.
> 
> I guess you are saying about `p2m_flush_vm`. I doubt here, and open for suggestions.
> 
> 
> > > TACC_GUEST - the pcpu executes guests code
> > 
> > Looking at the arm64 code, you are executing some hypervisor code here. I agree this is impossible to not run any hypervisor code with TACC_GUEST, but I think this should be clarified in the documentation.
> 
> Do you mean adding few words about still having some hypervisor code near the actual context switch from/to guest (entry/return_from_trap)?
> 
> > > TACC_IDLE - the low-power state of the pcpu
> > 
> > Did you intend to mean "idle vCPU" is in use?
> 
> No. I did mean what is written.
> Currently, the idle vcpu does hypervisor work (e.g. tasklets) along with the low-power mode. IMO we have to separate them.
> 
> > > TACC_IRQ - the pcpu performs interrupts processing, without separation to
> > >             guest or hypervisor interrupts
> > > TACC_GSYNC - the pcpu executes hypervisor code to process synchronous trap
> > >               from the guest. E.g. hypercall processing or io emulation.
> > > 
> > > Currently, the only reenterant state is TACC_IRQ. It is assumed, no changes
> > > to state other than TACC_IRQ could happen until we return from nested
> > > interrupts. IRQ time is accounted in a distinct way comparing to other states.
> > 
> > s/comparing/compare/
> 
> OK.
> 
> > > It is acumulated between other states transition moments, and is substracted
> > 
> > s/acumulated/accumulated/ s/substracted/subtracted/
> 
> OK.
> 
> > > from the old state on states transion calculation.
> [1]
> > s/transion/transition/
> 
> OK.
> 
> > > Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
> > > ---
> > >   xen/common/schedule.c   | 81 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >   xen/include/xen/sched.h | 27 +++++++++++++++++
> > >   2 files changed, 108 insertions(+)
> > > 
> > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> > > index 7b71581..6dd6603 100644
> > > --- a/xen/common/schedule.c
> > > +++ b/xen/common/schedule.c
> > > @@ -1539,6 +1539,87 @@ static void schedule(void)
> > >       context_switch(prev, next);
> > >   }
> > > +DEFINE_PER_CPU(struct tacc, tacc);
> > > +
> > > +static void tacc_state_change(enum TACC_STATES new_state)
> > 
> > This should never be called with the TACC_IRQ, right?
> 
> Yes. Actually, tacc->state should never be TACC_IRQ.
> Because of TACC_IRQ reenterability it is handled through the tacc->irq_cnt and tacc->irq_enter_time.
> 
> > > +{
> > > +    s_time_t now, delta;
> > > +    struct tacc* tacc = &this_cpu(tacc);
> > > +    unsigned long flags;
> > > +
> > > +    local_irq_save(flags);
> > > +
> > > +    now = NOW();
> > > +    delta = now - tacc->state_entry_time;
> > > +
> > > +    /* We do not expect states reenterability (at least through this function)*/
> > > +    ASSERT(new_state != tacc->state);
> > > +
> > > +    tacc->state_time[tacc->state] += delta - tacc->irq_time;
> > > +    tacc->state_time[TACC_IRQ] += tacc->irq_time;
> > > +    tacc->irq_time = 0;
> > > +    tacc->state = new_state;
> > > +    tacc->state_entry_time = now;
> > > +
> > > +    local_irq_restore(flags);
> > > +}
> > > +
> > > +void tacc_hyp(int place)
> > 
> > Place is never used except for your commented printk. So what's the goal for it?
> 
> Place is just a piece of code used for debugging, as well as printk. I keept it here because this series is very RFC, yet it could be removed.
> 
> > Also, is it really necessary to provide helper for each state? Couldn't we just introduce one functions doing all the state?
> 
> I'd like calling that stuff from assembler without parameters. But have no strong opinion here.
>   
> > > +{
> > > +//    printk("\ttacc_hyp %u, place %d\n", smp_processor_id(), place);
> > > +    tacc_state_change(TACC_HYP);
> > > +}
> > > +
> > > +void tacc_guest(int place)
> > > +{
> > > +//    printk("\ttacc_guest %u, place %d\n", smp_processor_id(), place);
> > > +    tacc_state_change(TACC_GUEST);
> > > +}
> > > +
> > > +void tacc_idle(int place)
> > > +{
> > > +//    printk("\tidle cpu %u, place %d\n", smp_processor_id(), place);
> > > +    tacc_state_change(TACC_IDLE);
> > > +}
> > > +
> > > +void tacc_gsync(int place)
> > > +{
> > > +//    printk("\ttacc_gsync %u, place %d\n", smp_processor_id(), place);
> > > +    tacc_state_change(TACC_GSYNC);
> > > +}
> > > +
> > > +void tacc_irq_enter(int place)
> > > +{
> > > +    struct tacc* tacc = &this_cpu(tacc);
> > > +
> > > +//    printk("\ttacc_irq_enter %u, place %d, cnt %d\n", smp_processor_id(), place, this_cpu(tacc).irq_cnt);
> > > +    ASSERT(!local_irq_is_enabled());
> > > +    ASSERT(tacc->irq_cnt >= 0);
> > > +
> > > +    if ( tacc->irq_cnt == 0 )
> > > +    {
> > > +        tacc->irq_enter_time = NOW();
> > > +    }
> > > +
> > > +    tacc->irq_cnt++;
> > > +}
> > > +
> > > +void tacc_irq_exit(int place)
> > > +{
> > > +    struct tacc* tacc = &this_cpu(tacc);
> > > +
> > > +//    printk("\ttacc_irq_exit %u, place %d, cnt %d\n", smp_processor_id(), place, tacc->irq_cnt);
> > > +    ASSERT(!local_irq_is_enabled());
> > > +    ASSERT(tacc->irq_cnt > 0);
> > > +    if ( tacc->irq_cnt == 1 )
> > > +    {
> > > +        tacc->irq_time = NOW() - tacc->irq_enter_time;
> > 
> > If I understand correctly, you will use irq_time to update TACC_IRQ in tacc_state_change(). It may be possible to receive another interrupt before the state is changed (e.g. HYP -> GUEST). This means only the time for the last IRQ received would be accounted.
> 
> I do lock IRQs for state change. Shouldn't that protect it?
> 
> > > +        tacc->irq_enter_time = 0;
> > > +    }
> > > +
> > > +    tacc->irq_cnt--;
> > > +}
> > > +
> > >   void context_saved(struct vcpu *prev)
> > >   {
> > >       /* Clear running flag /after/ writing context to memory. */
> > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> > > index e3601c1..04a8724 100644
> > > --- a/xen/include/xen/sched.h
> > > +++ b/xen/include/xen/sched.h
> > > @@ -1002,6 +1002,33 @@ extern void dump_runq(unsigned char key);
> > >   void arch_do_physinfo(struct xen_sysctl_physinfo *pi);
> > > +enum TACC_STATES {
> > 
> > We don't tend to use all uppercases for enum name.
> 
> OK.
> 
> > > +    TACC_HYP = 0,
> > 
> > enum begins at 0 and increment by one every time. So there is no need to hardcode a number.
> > 
> > Also, looking at the code, I think you rely on the first state to be TACC_HYP. Am I correct?
> 
> TACC_HYP is expected to be the initial state of the PCPU.
> 
> > > +    TACC_GUEST = 1,
> > > +    TACC_IDLE = 2,
> > > +    TACC_IRQ = 3,
> > > +    TACC_GSYNC = 4,
> > > +    TACC_STATES_MAX
> > > +};
> > > It would be good to document all the states in the header as well.
> 
> OK.
> 
> > > +
> > > +struct tacc
> > 
> > Please document the structure.
> 
> OK.
> 
> > > +{
> > > +    s_time_t state_time[TACC_STATES_MAX];
> > > +    s_time_t state_entry_time;
> > > +    int state;
> > 
> > This should be the enum you used above here.
> 
> Yep.
> 
> > > +
> > > +    s_time_t guest_time;
> > 
> > This is not used.
> 
> Yep, will drop it.
> 
> > > +
> > > +    s_time_t irq_enter_time;
> > > +    s_time_t irq_time;
> > > +    int irq_cnt;
> > Why do you need this to be signed?
> 
> For assertion.
>   
> > > +};
> > > +
> > > +DECLARE_PER_CPU(struct tacc, tacc);
> > > +
> > > +void tacc_hyp(int place);
> > > +void tacc_idle(int place);
> > > +
> > >   #endif /* __SCHED_H__ */
> > >   /*
> > > 
> > 
> > Cheers,
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2020-05-26  2:27       ` Volodymyr Babchuk
@ 2020-05-29  8:48         ` Dario Faggioli
  2020-06-02  1:12           ` Volodymyr Babchuk
  0 siblings, 1 reply; 26+ messages in thread
From: Dario Faggioli @ 2020-05-29  8:48 UTC (permalink / raw)
  To: Volodymyr Babchuk, andrii.anisov, george.dunlap, julien.grall, xen-devel
  Cc: sstabellini, wl, konrad.wilk, ian.jackson, tim, jbeulich, andrew.cooper3

[-- Attachment #1: Type: text/plain, Size: 1107 bytes --]

On Tue, 2020-05-26 at 02:27 +0000, Volodymyr Babchuk wrote:
> Hello All,
> 
Hello Volodymyr,

> This is gentle reminder about this RFC. 
> 
> Sadly, Andrii Anisov has left our team. But I'm commited to continue
> his work on time accounting and real time scheduling.
> 
Ok, so, first of all, sorry that this has not been properly addressed.

I personally never forgot about it or anything... Still, I haven't been
able to look into it properly.

> I do realize, that proposed patches have become moldy. I can rebase
> them onto current master, if it would help. 
>
As a matter of fact, it would. Especially considering that, AFAICT,
this pre-dates core-scheduling.

So, if you're really keen doing such a rebase and resending, I will be
happy to have a look at how it ends up looking like.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2020-05-29  8:48         ` Dario Faggioli
@ 2020-06-02  1:12           ` Volodymyr Babchuk
  2020-06-03 15:22             ` Dario Faggioli
  0 siblings, 1 reply; 26+ messages in thread
From: Volodymyr Babchuk @ 2020-06-02  1:12 UTC (permalink / raw)
  To: dfaggioli, george.dunlap, julien.grall, xen-devel
  Cc: sstabellini, wl, konrad.wilk, ian.jackson, tim, jbeulich, andrew.cooper3

On Fri, 2020-05-29 at 10:48 +0200, Dario Faggioli wrote:
> On Tue, 2020-05-26 at 02:27 +0000, Volodymyr Babchuk wrote:
> > Hello All,
> > 
> Hello Volodymyr,
> 

Hi Dario,

> > This is gentle reminder about this RFC. 
> > 
> > Sadly, Andrii Anisov has left our team. But I'm commited to continue
> > his work on time accounting and real time scheduling.
> > 
> Ok, so, first of all, sorry that this has not been properly addressed.
> 
> I personally never forgot about it or anything... Still, I haven't been
> able to look into it properly.
> 

I see.. Anyways, thanks for the reply. 

Actually, I tried to not only rebase this patch series to the current
mainline, but also to add x86 support. This gave me deeper
unsterstanding of the inner workings. At least I hope so :)

Anyways, I want to discuss the matter before continuing reworking the
patches. The goal of those patches is to account guest time more
precisely. 

Right now I can see only two main reasons, when guest can be charged
for a time it dindn't used: interrupts and soft irqs. 

- do_softirq() is called every time we leave hypervisor mode. It is
used to do housekeeping for the hypervisor itself. But, some random
guest will charged for time spent in do_softirq() unless this function
is not called on a idle vcpu.

- also, pCPU can be interrupted by IRQ assigned to some other guest or
to hypervisor itself. But time spent in interrupt handler will be
charged for a guest being interrupted.

So, basically, to account guest time correctly, we need to substract
time spent in do_softirq() and in do_IRQ(). 

Actually, we can charge the correct guest for time spent in do_IRQ(),
because handler code will eventually know target vCPU for the
interrupt. There is technical problem with interrupt nesting. We will
need some stack to track nesting correctly. But this is doable.

Just for statistical purposes we can track hypervisor time somwhere,
but it is not needed for scheduling decisions.

Am I missing something?



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu time accounting
  2020-06-02  1:12           ` Volodymyr Babchuk
@ 2020-06-03 15:22             ` Dario Faggioli
  0 siblings, 0 replies; 26+ messages in thread
From: Dario Faggioli @ 2020-06-03 15:22 UTC (permalink / raw)
  To: Volodymyr Babchuk, george.dunlap, julien.grall, xen-devel
  Cc: sstabellini, wl, konrad.wilk, ian.jackson, tim, jbeulich, andrew.cooper3

[-- Attachment #1: Type: text/plain, Size: 2961 bytes --]

On Tue, 2020-06-02 at 01:12 +0000, Volodymyr Babchuk wrote:
> On Fri, 2020-05-29 at 10:48 +0200, Dario Faggioli wrote:
> > 
> Actually, I tried to not only rebase this patch series to the current
> mainline, but also to add x86 support. This gave me deeper
> unsterstanding of the inner workings. At least I hope so :)
> 
Right.

> Anyways, I want to discuss the matter before continuing reworking the
> patches. The goal of those patches is to account guest time more
> precisely. 
> 
Yes, I agree. IIRC, the patches are doing more than that, e.g.,
discriminating between the runtime of the idle vCPUs and the time
during which the CPUs were actually idle, and even trying to classify
somehow what the hypervisor was actually doing (guest sync, etc).

But, indeed, I would very much start with the one yous stated above, as
a goal.

> Right now I can see only two main reasons, when guest can be charged
> for a time it dindn't used: interrupts and soft irqs. 
> 
> - do_softirq() is called every time we leave hypervisor mode. It is
> used to do housekeeping for the hypervisor itself. But, some random
> guest will charged for time spent in do_softirq() unless this
> function
> is not called on a idle vcpu.
> 
> - also, pCPU can be interrupted by IRQ assigned to some other guest
> or
> to hypervisor itself. But time spent in interrupt handler will be
> charged for a guest being interrupted.
> 
I think those are the ones, yes.

> So, basically, to account guest time correctly, we need to substract
> time spent in do_softirq() and in do_IRQ(). 
> 
That's how I'd try to do this, if it were me doing it.

> Actually, we can charge the correct guest for time spent in do_IRQ(),
> because handler code will eventually know target vCPU for the
> interrupt. There is technical problem with interrupt nesting. We will
> need some stack to track nesting correctly. But this is doable.
> 
Yes, there's this, and maybe a few other "dependencies" that we may
discuss about, and try to track and account for, for even greather
fairness. But maybe this can come as a second step?

> Just for statistical purposes we can track hypervisor time somwhere,
> but it is not needed for scheduling decisions.
> 
What we need is, I think, a way to tell the used/admin that that time
is being spent in the hypervisor. E.g., if we were spending (let's
exaggerate) 20% of the time processing interrupts and softirqs, the
user would see some of this 20% load coming from each guest. It
certainly wasn't ideal, but we do not want for such 20% to suddenly
vanish either.

> Am I missing something?
>
To me, it seems you're not. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-06-03 15:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-11 10:32 [Xen-devel] [RFC 0/9] Changes to time accounting Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 1/9] schedule: Introduce per-pcpu " Andrii Anisov
2019-09-11 18:01   ` Volodymyr Babchuk
2019-09-12 10:26     ` Andrii Anisov
2019-10-28 14:28   ` Julien Grall
2019-11-06 11:24     ` Andrii Anisov
2020-05-26  2:27       ` Volodymyr Babchuk
2020-05-29  8:48         ` Dario Faggioli
2020-06-02  1:12           ` Volodymyr Babchuk
2020-06-03 15:22             ` Dario Faggioli
2019-09-11 10:32 ` [Xen-devel] [RFC 2/9] sysctl: extend XEN_SYSCTL_getcpuinfo interface Andrii Anisov
2019-10-28 14:52   ` Julien Grall
2019-11-06 11:25     ` Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 3/9] xentop: show CPU load information Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 4/9] arm64: utilize time accounting Andrii Anisov
2019-09-11 17:48   ` Volodymyr Babchuk
2019-09-12 12:09     ` Andrii Anisov
2019-09-12 12:17       ` Julien Grall
2019-09-12 12:29         ` Andrii Anisov
2019-10-28 14:47   ` Julien Grall
2019-11-06 11:31     ` Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 5/9] tacc: Introduce a lockless interface for guest time Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 6/9] sched:rtds: get guest time from time accounting code Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 7/9] tacc: Introduce a locked interface for guest time Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 8/9] sched:credit: get guest time from time accounting code Andrii Anisov
2019-09-11 10:32 ` [Xen-devel] [RFC 9/9] sched:credit2: " Andrii Anisov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).