All of lore.kernel.org
 help / color / mirror / Atom feed
* Introduce rt real-time scheduler for Xen
@ 2014-07-11  4:49 Meng Xu
  2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, lichong659, dgolomb

This serie of patches adds rt real-time scheduler to Xen.

In summary, It supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display each VCPU's parameters of each domain;
3) Supports CPU Pool

The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory in real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods; Each VCPU discards its unused budget at the end of each of its periods. If a VCPU runs out of budget in a period, it has to wait until next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism implemented for each VCPU.
The mechanism of deciding the priority of VCPUs at each scheduling point is based on the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with earliest deadline has highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without remaining budget.
At each part, VCPUs are sorted based on GEDF priority scheme.

Scheduling quanta: 1 ms; but accounting the budget is in microsecond.

-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0     10     10
Domain-0                             0    1     20     20
Domain-0                             0    2     30     30
Domain-0                             0    3     10     10
litmus1                              1    0     10      4
litmus1                              1    1     10      4



//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20 -b 10

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0     10      4
litmus1                              1    1     20     10

// list cpupool information
xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0

-----------------------------------------------------------------------------------------------------------------------------
The differences between this new rt real-time scheduler and the sedf scheduler are as follows:
1) rt scheduler supports global EDF scheduling, while sedf only supports partitioned scheduling. With the support of vcpu mask, rt scheduler can also be used as partitioned scheduling by setting each VCPU’s cpumask to a specific cpu.
2) rt scheduler supports setting and getting each VCPU’s parameters of a domain. A domain can have multiple vcpus with different parameters, rt scheduler can let user get/set the parameters of each VCPU of a specific domain; (sedf scheduler does not support it now)
3) rt scheduler supports cpupool.
4) rt scheduler uses deferrable server to burn/replenish budget of a VCPU, while sedf uses constrant bandwidth server to burn/replenish budget of a VCPU. This is just two options of implementing a global EDF real-time scheduler and both options’ real-time performance have already been proved in academic.

(Briefly speaking, the functionality that the *SEDF* scheduler plans to implement and improve in the future release has already been supported in this rt scheduler.)
(Although it’s unnecessary to implement two server mechanisms, we can simply modify the two functions of burning and replenishing vcpus’ budget to incorporate the CBS server mechanism or other server mechanisms into this rt scheduler.)

-----------------------------------------------------------------------------------------------------------------------------
TODO:
1) Improve the code of getting/setting each VCPU’s parameters. [easy]
    Right now, it create an array with LIBXL_XEN_LEGACY_MAX_VCPUS (i.e., 32) elements to bounce all VCPUs’ parameters of a domain between xen tool and xen to get all VCPUs’ parameters of a domain. It is unnecessary to have LIBXL_XEN_LEGACY_MAX_VCPUS elements for this array.
    The current work is to first get the exact number of VCPUs of a domain and then create an array with that exact number of elements to bounce between xen tool and xen.
2) Provide microsecond time precision in xl interface instead of millisecond time precision. [easy]
    Right now, rt scheduler let user to specify each VCPU’s parameters (period, budget) in millisecond (i.e., ms). In some real-time application, user may want to specify VCPUs’ parameters in  microsecond (i.e., us). The next work is to let user specify VCPUs’ parameters in microsecond and count the time in microsecond (or nanosecond) in xen rt scheduler as well.
3) Add Xen trace into the rt scheduler. [easy]
    We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in credit2 scheduler, in rt scheduler, to debug via tracing.
4) Method of improving the performance of rt scheduler [future work]
    VCPUs of the same domain may preempt each other based on the preemptive global EDF scheduling policy. This self-switch issue does not bring benefit to the domain but introduce more overhead. When this situation happens, we can simply promote the current running lower-priority VCPU’s priority and let it  borrow budget from higher priority VCPUs to avoid such self-swtich issue.

Timeline of implementing the TODOs:
We plan to finish the TODO 1), 2) and 3) within 3-4 weeks (or earlier).
Because TODO 4) will make the scheduling policy not pure GEDF, (people who wants the real GEDF may not be happy with this.) we look forward to hearing people’s opinions.

-----------------------------------------------------------------------------------------------------------------------------
Special huge thanks to Dario Faggioli for his helpful and detailed comments on the preview version of this rt scheduler. :-)

Any comment, question, and concerns are more than welcome! :-)

Thank you very much!

Meng

[PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
[PATCH RFC v1 2/4] xl for rt scheduler
[PATCH RFC v1 3/4] libxl for rt scheduler
[PATCH RFC v1 4/4] libxc for rt scheduler

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
@ 2014-07-11  4:49 ` Meng Xu
  2014-07-11 14:27   ` Dario Faggioli
  2014-07-11 14:37   ` Andrew Cooper
  2014-07-11  4:49 ` [PATCH RFC v1 2/4] xl for rt scheduler Meng Xu
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, Meng Xu, lichong659,
	dgolomb

This is the core rt scheduler patch. It adds the new real time scheduler to
the hypervisor, as the non-default scheduler.

This scheduler follows the pre-emptive Global EDF theory in real-time field.
Each VCPU can have a dedicated period and budget.
While scheduled, a VCPU burns its budget.
A VCPU has its budget replenished at the beginning of each of its periods;
The VCPU discards its unused budget at the end of each of its periods.
If a VCPU runs out of budget in a period, it has to wait until next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism
implemented for each VCPU.

Server mechanism: a VCPU is implemented as a deferable server.
When a VCPU has a task running on it, its budget is continuously
burned;
When a VCPU has no task but with budget left, its budget is
preserved.

Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
At any scheduling point, the VCPU with earliest deadline has highest
priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without budget.
At each part, VCPUs are sorted based on EDF priority scheme.

Scheduling quanta: 1 ms; but accounting the budget is in microsecond.

Note: cpumask and cpupool is supported.

This is still in the development phase.

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 xen/common/Makefile         |    1 +
 xen/common/sched_rt.c       |  984 +++++++++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c       |    1 +
 xen/include/public/domctl.h |   19 +
 xen/include/xen/sched-if.h  |    2 +-
 5 files changed, 1006 insertions(+), 1 deletion(-)
 create mode 100644 xen/common/sched_rt.c

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 3683ae3..5a23aa4 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,7 @@ obj-y += sched_credit.o
 obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += sched_arinc653.o
+obj-y += sched_rt.o
 obj-y += schedule.o
 obj-y += shutdown.o
 obj-y += softirq.o
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
new file mode 100644
index 0000000..41543a2
--- /dev/null
+++ b/xen/common/sched_rt.c
@@ -0,0 +1,984 @@
+/******************************************************************************
+ * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
+ * EDF scheduling is one of most popular real-time scheduling algorithm used in
+ * embedded field.
+ *
+ * by Sisu Xi, 2013, Washington University in Saint Louis
+ * and Meng Xu, 2014, University of Pennsylvania
+ *
+ * based on the code of credit Scheduler
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+#include <xen/cpu.h>
+#include <xen/keyhandler.h>
+#include <xen/trace.h>
+#include <xen/guest_access.h>
+
+/*
+ * TODO:
+ *
+ * Migration compensation and resist like credit2 to better use cache;
+ * Lock Holder Problem, using yield?
+ * Self switch problem: VCPUs of the same domain may preempt each other;
+ */
+
+/*
+ * Design:
+ *
+ * This scheduler follows the Preemptive Global EDF theory in real-time field.
+ * Each VCPU can have a dedicated period and budget. 
+ * While scheduled, a VCPU burns its budget.
+ * A VCPU has its budget replenished at the beginning of each of its periods;
+ * The VCPU discards its unused budget at the end of each of its periods.
+ * If a VCPU runs out of budget in a period, it has to wait until next period.
+ * The mechanism of how to burn a VCPU's budget depends on the server mechanism
+ * implemented for each VCPU.
+ *
+ * Server mechanism: a VCPU is implemented as a deferable server.
+ * When a VCPU has a task running on it, its budget is continuously burned;
+ * When a VCPU has no task but with budget left, its budget is preserved.
+ *
+ * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
+ * At any scheduling point, the VCPU with earliest deadline has highest priority.
+ *
+ * Queue scheme: A global runqueue for each CPU pool. 
+ * The runqueue holds all runnable VCPUs. 
+ * VCPUs in the runqueue are divided into two parts: with and without remaining budget. 
+ * At each part, VCPUs are sorted based on EDF priority scheme.
+ *
+ * Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
+ *
+ * Note: cpumask and cpupool is supported.
+ */
+
+/*
+ * Locking:
+ * Just like credit2, a global system lock is used to protect the RunQ.
+ * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
+ *
+ * The lock is already grabbed when calling wake/sleep/schedule/ functions in schedule.c
+ *
+ * The functions involes RunQ and needs to grab locks are:
+ *    dump, vcpu_insert, vcpu_remove, context_saved,
+ */
+
+
+/*
+ * Default parameters in ms
+ */
+#define RT_DEFAULT_PERIOD     10
+#define RT_DEFAULT_BUDGET      4
+
+/*
+ * Useful macros
+ */
+#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
+#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
+#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
+#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)
+
+/*
+ * Flags
+ */
+#define __RT_scheduled            1
+#define RT_scheduled (1<<__RT_scheduled)
+#define __RT_delayed_runq_add     2
+#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
+
+/*
+ * Used to printout debug information
+ */
+#define printtime()     ( printk("%d : %3ld.%3ld : %-19s ", smp_processor_id(), NOW()/MILLISECS(1), NOW()%MILLISECS(1)/1000, __func__) )
+
+/*
+ * Systme-wide private data, include a global RunQueue
+ * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
+ * It can be grabbed via vcpu_schedule_lock_irq()
+ */
+struct rt_private {
+    spinlock_t lock;        /* The global coarse grand lock */
+    struct list_head sdom;  /* list of availalbe domains, used for dump */
+    struct list_head runq;  /* Ordered list of runnable VMs */
+    cpumask_t cpus;         /* cpumask_t of available physical cpus */
+    cpumask_t tickled;      /* another cpu in the queue already ticked this one */
+};
+
+/*
+ * Virtual CPU
+ */
+struct rt_vcpu {
+    struct list_head runq_elem; /* On the runqueue list */
+    struct list_head sdom_elem; /* On the domain VCPU list */
+
+    /* Up-pointers */
+    struct rt_dom *sdom;
+    struct vcpu *vcpu;
+
+    /* VCPU parameters, in milliseconds */
+    s_time_t period;
+    s_time_t budget;
+
+    /* VCPU current infomation */
+    long cur_budget;             /* current budget in microseconds */
+    s_time_t last_start;        /* last start time, used to calculate budget */
+    s_time_t cur_deadline;      /* current deadline, used to do EDF */
+    unsigned flags;             /* mark __RT_scheduled, etc.. */
+};
+
+/*
+ * Domain
+ */
+struct rt_dom {
+    struct list_head vcpu;      /* link its VCPUs */
+    struct list_head sdom_elem; /* link list on rt_priv */
+    struct domain *dom;         /* pointer to upper domain */
+};
+
+/*
+ * RunQueue helper functions
+ */
+static int
+__vcpu_on_runq(struct rt_vcpu *svc)
+{
+   return !list_empty(&svc->runq_elem);
+}
+
+static struct rt_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, runq_elem);
+}
+
+static inline void
+__runq_remove(struct rt_vcpu *svc)
+{
+    if ( __vcpu_on_runq(svc) )
+        list_del_init(&svc->runq_elem);
+}
+
+/*
+ * Insert a vcpu in the RunQ basing on vcpu's deadline: 
+ * vcpus with shorter deadline are inserted first.
+ * EDF schedule policy: vcpu with smaller deadline has higher priority;
+ * When vcpus have the same deadline, insert the current one at the head of these vcpus.
+ */
+static void
+__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+   
+    ASSERT( spin_is_locked(per_cpu(schedule_data, svc->vcpu->processor).schedule_lock) );
+
+    if ( __vcpu_on_runq(svc) )
+        return;
+
+    list_for_each(iter, runq) {
+        struct rt_vcpu * iter_svc = __runq_elem(iter);
+
+        if ( svc->cur_budget > 0 ) { /* svc still has budget */
+            if ( iter_svc->cur_budget == 0 ||
+                 svc->cur_deadline <= iter_svc->cur_deadline )
+                    break;
+        } else { /* svc has no budget */
+            if ( iter_svc->cur_budget == 0 &&
+                 svc->cur_deadline <= iter_svc->cur_deadline )
+                    break;
+        }
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+}
+
+
+/*
+ * Debug related code, dump vcpu/cpu information
+ */
+static void
+rt_dump_vcpu(struct rt_vcpu *svc)
+{
+    if ( svc == NULL ) {
+        printk("NULL!\n");
+        return;
+    }
+#define cpustr keyhandler_scratch
+    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
+    printk("[%5d.%-2d] cpu %d, (%"PRId64", %"PRId64"), cur_b=%"PRId64" cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s ",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->vcpu->processor,
+            svc->period,
+            svc->budget,
+            svc->cur_budget,
+            svc->cur_deadline,
+            svc->last_start,
+            __vcpu_on_runq(svc),
+            vcpu_runnable(svc->vcpu),
+            cpustr);
+    memset(cpustr, 0, sizeof(char)*1024);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));
+    printk("cpupool=%s\n", cpustr);
+#undef cpustr
+}
+
+static void
+rt_dump_pcpu(const struct scheduler *ops, int cpu)
+{
+    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
+
+    printtime();
+    rt_dump_vcpu(svc);
+}
+
+/*
+ * should not need lock here. only showing stuff 
+ */
+static void
+rt_dump(const struct scheduler *ops)
+{
+    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
+    struct rt_private *prv = RT_PRIV(ops);
+    struct rt_vcpu *svc;
+    int cpu = 0;
+    int loop = 0;
+
+    printtime();
+    printk("Priority Scheme: EDF\n");
+
+    printk("PCPU info: \n");
+    for_each_cpu(cpu, &prv->cpus) {
+        rt_dump_pcpu(ops, cpu);
+    }
+
+    printk("Global RunQueue info: \n");
+    loop = 0;
+    runq = RUNQ(ops);
+    list_for_each( iter, runq ) {
+        svc = __runq_elem(iter);
+        printk("\t%3d: ", ++loop);
+        rt_dump_vcpu(svc);
+    }
+
+    printk("Domain info: \n");
+    loop = 0;
+    list_for_each( iter_sdom, &prv->sdom ) {
+        struct rt_dom *sdom;
+        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
+        printk("\tdomain: %d\n", sdom->dom->domain_id);
+
+        list_for_each( iter_svc, &sdom->vcpu ) {
+            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
+            printk("\t\t%3d: ", ++loop);
+            rt_dump_vcpu(svc);
+        }
+    }
+
+    printk("\n");
+}
+
+/*
+ * Init/Free related code
+ */
+static int
+rt_init(struct scheduler *ops)
+{
+    struct rt_private *prv;
+
+    prv = xzalloc(struct rt_private);
+    if ( prv == NULL ) {
+        printk("xzalloc failed at rt_private\n");
+        return -ENOMEM;
+    }
+
+    ops->sched_data = prv;
+    spin_lock_init(&prv->lock);
+    INIT_LIST_HEAD(&prv->sdom);
+    INIT_LIST_HEAD(&prv->runq);
+    cpumask_clear(&prv->tickled);
+    cpumask_clear(&prv->cpus);
+
+    printtime();
+    printk("\n");
+
+    return 0;
+}
+
+static void
+rt_deinit(const struct scheduler *ops)
+{
+    struct rt_private *prv;
+
+    printtime();
+    printk("\n");
+
+    prv = RT_PRIV(ops);
+    if ( prv )
+        xfree(prv);
+}
+
+/* 
+ * point per_cpu spinlock to the global system lock; all cpu have same global system lock 
+ */
+static void *
+rt_alloc_pdata(const struct scheduler *ops, int cpu)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+
+    cpumask_set_cpu(cpu, &prv->cpus);
+
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+
+    printtime();
+    printk("%s total cpus: %d", __FUNCTION__, cpumask_weight(&prv->cpus));
+    return (void *)1;
+}
+
+static void
+rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    cpumask_clear_cpu(cpu, &prv->cpus);
+    printtime();
+    printk("%s cpu=%d\n", __FUNCTION__, cpu);
+}
+
+static void *
+rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
+{
+    unsigned long flags;
+    struct rt_dom *sdom;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    sdom = xzalloc(struct rt_dom);
+    if ( sdom == NULL ) {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+
+    /* spinlock here to insert the dom */
+    spin_lock_irqsave(&prv->lock, flags);
+    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
+    spin_unlock_irqrestore(&prv->lock, flags);
+
+    return (void *)sdom;
+}
+
+static void
+rt_free_domdata(const struct scheduler *ops, void *data)
+{
+    unsigned long flags;
+    struct rt_dom *sdom = data;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    printtime();
+    printk("dom=%d\n", sdom->dom->domain_id);
+
+    spin_lock_irqsave(&prv->lock, flags);
+    list_del_init(&sdom->sdom_elem);
+    spin_unlock_irqrestore(&prv->lock, flags);
+    xfree(data);
+}
+
+static int
+rt_dom_init(const struct scheduler *ops, struct domain *dom)
+{
+    struct rt_dom *sdom;
+
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    /* IDLE Domain does not link on rt_private */
+    if ( is_idle_domain(dom) ) { return 0; }
+
+    sdom = rt_alloc_domdata(ops, dom);
+    if ( sdom == NULL ) {
+        printk("%s, failed\n", __func__);
+        return -ENOMEM;
+    }
+    dom->sched_priv = sdom;
+
+    return 0;
+}
+
+static void
+rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
+{
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    rt_free_domdata(ops, RT_DOM(dom));
+}
+
+static void *
+rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
+{
+    struct rt_vcpu *svc;
+    s_time_t now = NOW();
+    long count;
+
+    /* Allocate per-VCPU info */
+    svc = xzalloc(struct rt_vcpu);
+    if ( svc == NULL ) {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&svc->runq_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    svc->flags = 0U;
+    svc->sdom = dd;
+    svc->vcpu = vc;
+    svc->last_start = 0;            /* init last_start is 0 */
+
+    svc->period = RT_DEFAULT_PERIOD;
+    if ( !is_idle_vcpu(vc) && vc->domain->domain_id != 0 ) {
+        svc->budget = RT_DEFAULT_BUDGET;
+    } else {
+        svc->budget = RT_DEFAULT_PERIOD; /* give vcpus of dom0 100% utilization */
+    }
+
+    count = (now/MILLISECS(svc->period)) + 1;
+    /* sync all VCPU's start time to 0 */
+    svc->cur_deadline += count*MILLISECS(svc->period);
+
+    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
+    /* Debug only: dump new vcpu's info */
+    printtime();
+    rt_dump_vcpu(svc);
+
+    return svc;
+}
+
+static void
+rt_free_vdata(const struct scheduler *ops, void *priv)
+{
+    struct rt_vcpu *svc = priv;
+
+    /* Debug only: dump freed vcpu's info */
+    printtime();
+    rt_dump_vcpu(svc);
+    xfree(svc);
+}
+
+static void
+rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu *svc = RT_VCPU(vc);
+
+    /* Debug only: dump info of vcpu to insert */
+    printtime();
+    rt_dump_vcpu(svc);
+
+    /* IDLE VCPU not allowed on RunQ */
+    if ( is_idle_vcpu(vc) )
+        return;
+
+    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
+}
+
+static void
+rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    struct rt_dom * const sdom = svc->sdom;
+
+    printtime();
+    rt_dump_vcpu(svc);
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( __vcpu_on_runq(svc) );
+
+    if ( !is_idle_vcpu(vc) ) {
+        list_del_init(&svc->sdom_elem);
+    }
+}
+
+/* 
+ * Pick a valid CPU for the vcpu vc
+ * Valid CPU of a vcpu is intesection of vcpu's affinity and available cpus
+ */
+static int
+rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
+{
+    cpumask_t cpus;
+    cpumask_t *online;
+    int cpu;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, &prv->cpus, online);
+    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
+
+    cpu = cpumask_test_cpu(vc->processor, &cpus)
+            ? vc->processor 
+            : cpumask_cycle(vc->processor, &cpus);
+    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+
+    return cpu;
+}
+
+/*
+ * Implemented as deferrable server. 
+ * Different server mechanism has different implementation.
+ * burn budget at microsecond level. 
+ */
+static void
+burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) {
+    s_time_t delta;
+    unsigned int consume;
+    long count = 0;
+
+    /* first time called for this svc, update last_start */
+    if ( svc->last_start == 0 ) {
+        svc->last_start = now;
+        return;
+    }
+
+    /* don't burn budget for idle VCPU */
+    if ( is_idle_vcpu(svc->vcpu) ) {
+        return;
+    }
+
+    /* don't burn budget for Domain-0, RT-Xen use only */
+    if ( svc->sdom->dom->domain_id == 0 ) {
+        return;
+    }
+
+    /* update deadline info */
+    delta = now - svc->cur_deadline;
+    if ( delta >= 0 ) {
+        count = ( delta/MILLISECS(svc->period) ) + 1;
+        svc->cur_deadline += count * MILLISECS(svc->period);
+        svc->cur_budget = svc->budget * 1000;
+        return;
+    }
+
+    delta = now - svc->last_start;
+    if ( delta < 0 ) {
+        printk("%s, delta = %ld for ", __func__, delta);
+        rt_dump_vcpu(svc);
+        svc->last_start = now;  /* update last_start */
+        svc->cur_budget = 0;
+        return;
+    }
+
+    if ( svc->cur_budget == 0 ) return;
+
+    /* burn at microseconds level */
+    consume = ( delta/MICROSECS(1) );
+    if ( delta%MICROSECS(1) > MICROSECS(1)/2 ) consume++;
+
+    svc->cur_budget -= consume;
+    if ( svc->cur_budget < 0 ) svc->cur_budget = 0;
+}
+
+/* 
+ * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
+ * lock is grabbed before calling this function 
+ */
+static struct rt_vcpu *
+__runq_pick(const struct scheduler *ops, cpumask_t mask)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct rt_vcpu *svc = NULL;
+    struct rt_vcpu *iter_svc = NULL;
+    cpumask_t cpu_common;
+    cpumask_t *online;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    list_for_each(iter, runq) {
+        iter_svc = __runq_elem(iter);
+
+        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
+        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
+        cpumask_and(&cpu_common, online, &prv->cpus);
+        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
+        cpumask_and(&cpu_common, &mask, &cpu_common);
+        if ( cpumask_empty(&cpu_common) )
+            continue;
+
+        if ( iter_svc->cur_budget <= 0 )
+            continue;
+
+        svc = iter_svc;
+        break;
+    }
+
+    return svc;
+}
+
+/*
+ * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
+ * lock is grabbed before calling this function 
+ */
+static void
+__repl_update(const struct scheduler *ops, s_time_t now)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct list_head *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    s_time_t diff;
+    long count;
+
+    list_for_each_safe(iter, tmp, runq) {
+        svc = __runq_elem(iter);
+
+        diff = now - svc->cur_deadline;
+        if ( diff > 0 ) {
+            count = (diff/MILLISECS(svc->period)) + 1;
+            svc->cur_deadline += count * MILLISECS(svc->period);
+            svc->cur_budget = svc->budget * 1000;
+            __runq_remove(svc);
+            __runq_insert(ops, svc);
+        }
+    }
+}
+
+/* 
+ * schedule function for rt scheduler.
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static struct task_slice
+rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
+{
+    const int cpu = smp_processor_id();
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * const scurr = RT_VCPU(current);
+    struct rt_vcpu * snext = NULL;
+    struct task_slice ret;
+
+    /* clear ticked bit now that we've been scheduled */
+    if ( cpumask_test_cpu(cpu, &prv->tickled) )
+        cpumask_clear_cpu(cpu, &prv->tickled);
+
+    /* burn_budget would return for IDLE VCPU */
+    burn_budgets(ops, scurr, now);
+
+    __repl_update(ops, now);
+
+    if ( tasklet_work_scheduled ) {
+        snext = RT_VCPU(idle_vcpu[cpu]);
+    } else {
+        cpumask_t cur_cpu;
+        cpumask_clear(&cur_cpu);
+        cpumask_set_cpu(cpu, &cur_cpu);
+        snext = __runq_pick(ops, cur_cpu);
+        if ( snext == NULL )
+            snext = RT_VCPU(idle_vcpu[cpu]);
+
+        /* if scurr has higher priority and budget, still pick scurr */
+        if ( !is_idle_vcpu(current) &&
+             vcpu_runnable(current) &&
+             scurr->cur_budget > 0 &&
+             ( is_idle_vcpu(snext->vcpu) ||
+               scurr->cur_deadline <= snext->cur_deadline ) ) {
+            snext = scurr;
+        }
+    }
+
+    if ( snext != scurr &&
+         !is_idle_vcpu(current) &&
+         vcpu_runnable(current) ) {
+        set_bit(__RT_delayed_runq_add, &scurr->flags);
+    }
+
+    snext->last_start = now;
+    ret.migrated = 0;
+    if ( !is_idle_vcpu(snext->vcpu) ) {
+        if ( snext != scurr ) {
+            __runq_remove(snext);
+            set_bit(__RT_scheduled, &snext->flags);
+        }
+        if ( snext->vcpu->processor != cpu ) {
+            snext->vcpu->processor = cpu;
+            ret.migrated = 1;
+        }
+    }
+
+    ret.time = MILLISECS(1);
+    ret.task = snext->vcpu;
+
+    return ret;
+}
+
+/*
+ * Remove VCPU from RunQ
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static void
+rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( curr_on_cpu(vc->processor) == vc ) {
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+        return;
+    }
+
+    if ( __vcpu_on_runq(svc) ) {
+        __runq_remove(svc);
+    }
+
+    clear_bit(__RT_delayed_runq_add, &svc->flags);
+}
+
+/*
+ * Pick a vcpu on a cpu to kick out to place the running candidate
+ * Called by wake() and context_saved()
+ * We have a running candidate here, the kick logic is:
+ * Among all the cpus that are within the cpu affinity
+ * 1) if the new->cpu is idle, kick it. This could benefit cache hit
+ * 2) if there are any idle vcpu, kick it.
+ * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
+ *    if snext has higher priority, kick it.
+ *
+ * TODO:
+ * 1) what if these two vcpus belongs to the same domain?
+ *    replace a vcpu belonging to the same domain introduces more overhead
+ *
+ * lock is grabbed before calling this function 
+ */
+static void
+runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * scheduled = NULL;    /* lowest priority scheduled */
+    struct rt_vcpu * iter_svc;
+    struct vcpu * iter_vc;
+    int cpu = 0;
+    cpumask_t not_tickled;                      /* not tickled cpus */
+    cpumask_t *online;
+
+    if ( new == NULL || is_idle_vcpu(new->vcpu) ) return;
+
+    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
+    cpumask_and(&not_tickled, online, &prv->cpus);
+    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
+    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
+
+    /* 1) if new's previous cpu is idle, kick it for cache benefit */
+    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) {
+        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
+        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
+        return;
+    }
+
+    /* 2) if there are any idle pcpu, kick it */
+    /* The same loop also find the one with lowest priority */
+    for_each_cpu(cpu, &not_tickled) {
+        iter_vc = curr_on_cpu(cpu);
+        if ( is_idle_vcpu(iter_vc) ) {
+            cpumask_set_cpu(cpu, &prv->tickled);
+            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+            return;
+        }
+        iter_svc = RT_VCPU(iter_vc);
+        if ( scheduled == NULL || 
+             iter_svc->cur_deadline > scheduled->cur_deadline ) {
+            scheduled = iter_svc;
+        }
+    }
+
+    /* 3) candicate has higher priority, kick out the lowest priority vcpu */
+    if ( scheduled != NULL && new->cur_deadline < scheduled->cur_deadline ) {
+        cpumask_set_cpu(scheduled->vcpu->processor, &prv->tickled);
+        cpu_raise_softirq(scheduled->vcpu->processor, SCHEDULE_SOFTIRQ);
+    }
+    return;
+}
+
+/* 
+ * Should always wake up runnable vcpu, put it back to RunQ. 
+ * Check priority to raise interrupt 
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ * TODO: what if these two vcpus belongs to the same domain? 
+ */
+static void
+rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    s_time_t diff;
+    s_time_t now = NOW();
+    long count = 0;
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) return;
+
+    /* on RunQ, just update info is ok */
+    if ( unlikely(__vcpu_on_runq(svc)) ) return;
+
+    /* If context hasn't been saved yet, set flag so it will add later */
+    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) {
+        set_bit(__RT_delayed_runq_add, &svc->flags);
+        return;
+    }
+
+    /* update deadline info */
+    diff = now - svc->cur_deadline;
+    if ( diff >= 0 ) {
+        count = ( diff/MILLISECS(svc->period) ) + 1;
+        svc->cur_deadline += count * MILLISECS(svc->period);
+        svc->cur_budget = svc->budget * 1000;
+    }
+
+    __runq_insert(ops, svc);
+    __repl_update(ops, now);
+    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
+    runq_tickle(ops, snext);
+
+    return;
+}
+
+/* 
+ * scurr has finished context switch, insert it back to the RunQ,
+ * and then pick the highest priority vcpu from runq to run 
+ */
+static void
+rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * svc = RT_VCPU(vc);
+    struct rt_vcpu * snext = NULL;
+    struct rt_private * prv = RT_PRIV(ops);
+    spinlock_t *lock;
+
+    clear_bit(__RT_scheduled, &svc->flags);
+    if ( is_idle_vcpu(vc) ) return;
+
+    lock = vcpu_schedule_lock_irq(vc);
+    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
+         likely(vcpu_runnable(vc)) ) {
+        __runq_insert(ops, svc);
+        __repl_update(ops, NOW());
+        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
+        runq_tickle(ops, snext);
+    }
+    vcpu_schedule_unlock_irq(lock, vc);
+}
+
+/*
+ * set/get each vcpu info of each domain
+ */
+static int
+rt_dom_cntl(const struct scheduler *ops, struct domain *d, struct xen_domctl_scheduler_op *op)
+{
+    xen_domctl_sched_rt_params_t local_sched;
+    struct rt_dom * const sdom = RT_DOM(d);
+    struct list_head *iter;
+    int vcpu_index = 0;
+    int rc = -EINVAL;
+
+    switch ( op->cmd )
+    {
+    case XEN_DOMCTL_SCHEDOP_getinfo:
+        /* for debug use, whenever adjust Dom0 parameter, do global dump */
+        if ( d->domain_id == 0 ) {
+            rt_dump(ops);
+        }
+
+        local_sched.num_vcpus = 0;
+        list_for_each( iter, &sdom->vcpu ) {
+            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
+
+            ASSERT(vcpu_index < XEN_LEGACY_MAX_VCPUS);
+            local_sched.vcpus[vcpu_index].budget = svc->budget;
+            local_sched.vcpus[vcpu_index].period = svc->period;
+
+            vcpu_index++;
+        }
+        local_sched.num_vcpus = vcpu_index;
+        copy_to_guest(op->u.rt.schedule, &local_sched, 1);
+        rc = 0;
+        break;
+    case XEN_DOMCTL_SCHEDOP_putinfo:
+        copy_from_guest(&local_sched, op->u.rt.schedule, 1);
+        list_for_each( iter, &sdom->vcpu ) {
+            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
+
+            if ( local_sched.vcpu_index == svc->vcpu->vcpu_id ) { /* adjust per VCPU parameter */
+                vcpu_index = local_sched.vcpu_index;
+
+                if ( vcpu_index < 0 || vcpu_index > XEN_LEGACY_MAX_VCPUS) {
+                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
+                            vcpu_index);
+                }else{
+                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
+                            vcpu_index, local_sched.vcpus[vcpu_index].period, 
+                            local_sched.vcpus[vcpu_index].budget);
+                }
+
+                svc->period = local_sched.vcpus[vcpu_index].period;
+                svc->budget = local_sched.vcpus[vcpu_index].budget;
+
+                break;
+            }
+        }
+        rc = 0;
+        break;
+    }
+
+    return rc;
+}
+
+static struct rt_private _rt_priv;
+
+const struct scheduler sched_rt_def = {
+    .name           = "SMP RT Scheduler",
+    .opt_name       = "rt",
+    .sched_id       = XEN_SCHEDULER_RT,
+    .sched_data     = &_rt_priv,
+
+    .dump_cpu_state = rt_dump_pcpu,
+    .dump_settings  = rt_dump,
+    .init           = rt_init,
+    .deinit         = rt_deinit,
+    .alloc_pdata    = rt_alloc_pdata,
+    .free_pdata     = rt_free_pdata,
+    .alloc_domdata  = rt_alloc_domdata,
+    .free_domdata   = rt_free_domdata,
+    .init_domain    = rt_dom_init,
+    .destroy_domain = rt_dom_destroy,
+    .alloc_vdata    = rt_alloc_vdata,
+    .free_vdata     = rt_free_vdata,
+    .insert_vcpu    = rt_vcpu_insert,
+    .remove_vcpu    = rt_vcpu_remove,
+
+    .adjust         = rt_dom_cntl,
+
+    .pick_cpu       = rt_cpu_pick,
+    .do_schedule    = rt_schedule,
+    .sleep          = rt_vcpu_sleep,
+    .wake           = rt_vcpu_wake,
+    .context_saved  = rt_context_saved,
+
+    .yield          = NULL,
+    .migrate        = NULL,
+};
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e9eb0bc..2d13966 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
     &sched_credit2_def,
+    &sched_rt_def,
     &sched_arinc653_def,
 };
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 5b11bbf..d1a8201 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -339,6 +339,20 @@ struct xen_domctl_max_vcpus {
 typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
 
+/*
+ * This structure is used to pass to rt scheduler from a 
+ * privileged domain to Xen
+ */
+struct xen_domctl_sched_rt_params {
+    struct {
+        signed long period; /* s_time_t type */
+        signed long budget; 
+    } vcpus[XEN_LEGACY_MAX_VCPUS];
+    uint16_t num_vcpus;
+    uint16_t vcpu_index;
+};
+typedef struct xen_domctl_sched_rt_params xen_domctl_sched_rt_params_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_sched_rt_params_t);
 
 /* XEN_DOMCTL_scheduler_op */
 /* Scheduler types. */
@@ -346,6 +360,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
 #define XEN_SCHEDULER_CREDIT   5
 #define XEN_SCHEDULER_CREDIT2  6
 #define XEN_SCHEDULER_ARINC653 7
+#define XEN_SCHEDULER_RT       8
+
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
         struct xen_domctl_sched_credit2 {
             uint16_t weight;
         } credit2;
+        struct xen_domctl_sched_rt{
+            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) schedule;
+        } rt;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 4164dff..e452d32 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -169,7 +169,7 @@ extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
 extern const struct scheduler sched_credit2_def;
 extern const struct scheduler sched_arinc653_def;
-
+extern const struct scheduler sched_rt_def;
 
 struct cpupool
 {
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
  2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
@ 2014-07-11  4:49 ` Meng Xu
  2014-07-11 11:02   ` Wei Liu
  2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 50+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, Meng Xu, lichong659,
	dgolomb

Add xl command for rt scheduler

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 docs/man/xl.pod.1         |   40 +++++++++++++
 tools/libxl/xl.h          |    1 +
 tools/libxl/xl_cmdimpl.c  |  141 +++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/xl_cmdtable.c |    9 +++
 4 files changed, 191 insertions(+)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 30bd4bf..42aeedc 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -1019,6 +1019,46 @@ Restrict output to domains in the specified cpupool.
 
 =back
 
+=item B<sched-rt> [I<OPTIONS>]
+
+Set or get rt (Real Time) scheduler parameters. This rt scheduler applies 
+Preemptive Global Earliest Deadline First real-time scheduling algorithm to 
+schedule VCPUs in the system. Each VCPU has a dedicated period and budget.
+While scheduled, a VCPU burns its budget. 
+A VCPU has its budget replenished at the beginning of each of its periods;
+The VCPU discards its unused budget at the end of its periods.
+
+B<OPTIONS>
+
+=over 4
+
+=item B<-d DOMAIN>, B<--domain=DOMAIN>
+
+Specify domain for which scheduler parameters are to be modified or retrieved.
+Mandatory for modifying scheduler parameters.
+
+=item B<-v VCPU>, B<--vcpu=VCPU>
+
+Specify the index of VCPU whose parameters will be set. 
+A domain can have multiple VCPUs; Each VCPU has a unique index in this domain;
+When set domain's parameters, it needs to set each VCPU's parameters of this 
+domain.
+
+=item B<-p PERIOD>, B<--period=PERIOD>
+
+A VCPU replenish its budget in every period. Time unit is millisecond.
+
+=item B<-b BUDGET>, B<--budget=BUDGET>
+
+A VCPU has BUDGET amount of time to run for each period. 
+Time unit is millisecond.
+
+=item B<-c CPUPOOL>, B<--cpupool=CPUPOOL>
+
+Restrict output to domains in the specified cpupool.
+
+=back
+
 =back
 
 =head1 CPUPOOLS COMMANDS
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index 10a2e66..51b634a 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
 int main_sched_credit(int argc, char **argv);
 int main_sched_credit2(int argc, char **argv);
 int main_sched_sedf(int argc, char **argv);
+int main_sched_rt(int argc, char **argv);
 int main_domid(int argc, char **argv);
 int main_domname(int argc, char **argv);
 int main_rename(int argc, char **argv);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 68df548..c043f88 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4947,6 +4947,7 @@ int main_sharing(int argc, char **argv)
     return 0;
 }
 
+
 static int sched_domain_get(libxl_scheduler sched, int domid,
                             libxl_domain_sched_params *scinfo)
 {
@@ -5098,6 +5099,52 @@ static int sched_sedf_domain_output(
     return 0;
 }
 
+
+static int sched_rt_domain_output(
+    int domid)
+{
+    char *domname;
+    libxl_domain_sched_params scinfo;
+    int rc, i;
+
+    if (domid < 0) {
+        printf("%-33s %4s %4s %6s %6s\n", "Name", "ID", "VCPU", "Period", "Budget");
+        return 0;
+    }
+
+    libxl_domain_sched_params_init(&scinfo);
+    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid, &scinfo);
+    if (rc)
+        return rc;
+
+    domname = libxl_domid_to_name(ctx, domid);
+    for( i = 0; i < scinfo.rt.num_vcpus; i++ )
+    {
+        printf("%-33s %4d %4d %6d %6d\n",
+            domname,
+            domid,
+            i,
+            scinfo.rt.vcpus[i].period,
+            scinfo.rt.vcpus[i].budget);
+    }
+    free(domname);
+
+    libxl_domain_sched_params_dispose(&scinfo);
+
+    return 0;
+}
+
+static int sched_rt_pool_output(uint32_t poolid)
+{
+    char *poolname;
+
+    poolname = libxl_cpupoolid_to_name(ctx, poolid);
+    printf("Cpupool %s: sched=EDF\n", poolname);
+
+    free(poolname);
+    return 0;
+}
+
 static int sched_default_pool_output(uint32_t poolid)
 {
     char *poolname;
@@ -5465,6 +5512,100 @@ int main_sched_sedf(int argc, char **argv)
     return 0;
 }
 
+/*
+ * <nothing>            : List all domain paramters and sched params
+ * -d [domid]           : List domain params for domain
+ * -d [domid] [params]  : Set domain params for domain 
+ */
+int main_sched_rt(int argc, char **argv)
+{
+    const char *dom = NULL;
+    const char *cpupool = NULL;
+    int period = 10, opt_p = 0;
+    int budget = 4, opt_b = 0;
+    int vcpu_index = 0, opt_v = 0;
+    int opt, rc;
+    static struct option opts[] = {
+        {"domain", 1, 0, 'd'},
+        {"period", 1, 0, 'p'},
+        {"budget", 1, 0, 'b'},
+        {"vcpu", 1, 0, 'v'},
+        {"cpupool", 1, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
+
+    SWITCH_FOREACH_OPT(opt, "d:p:b:v:c:h", opts, "sched-rt", 0) {
+    case 'd':
+        dom = optarg;
+        break;
+    case 'p':
+        period = strtol(optarg, NULL, 10);
+        opt_p = 1;
+        break;
+    case 'b':
+        budget = strtol(optarg, NULL, 10);
+        opt_b = 1;
+        break;
+    case 'v':
+        vcpu_index = strtol(optarg, NULL, 10);
+        opt_v = 1;
+        break;
+    case 'c':
+        cpupool = optarg;
+        break;
+    }
+
+    if (cpupool && (dom || opt_p || opt_b || opt_v)) {
+        fprintf(stderr, "Specifying a cpupool is not allowed with other options.\n");
+        return 1;
+    }
+    if (!dom && (opt_p || opt_b || opt_v)) {
+        fprintf(stderr, "Must specify a domain.\n");
+        return 1;
+    }
+    if ( (opt_v || opt_p || opt_b) && (opt_p + opt_b + opt_v != 3) ) {
+        fprintf(stderr, "Must specify vcpu, period, budget\n");
+        return 1;
+    }
+    
+    if (!dom) { /* list all domain's rt scheduler info */
+        return -sched_domain_output(LIBXL_SCHEDULER_RT,
+                                    sched_rt_domain_output,
+                                    sched_rt_pool_output,
+                                    cpupool);
+    } else {
+        uint32_t domid = find_domain(dom);
+        if (!opt_p && !opt_b && !opt_v) { /* output rt scheduler info */
+            sched_rt_domain_output(-1);
+            return -sched_rt_domain_output(domid);
+        } else { /* set rt scheduler paramaters */
+            libxl_domain_sched_params scinfo;
+            libxl_domain_sched_params_init(&scinfo);
+            scinfo.rt.max_vcpus = LIBXL_XEN_LEGACY_MAX_VCPUS;
+             /* TODO: only alloc an array with the same num of dom's vcpus*/
+            scinfo.rt.num_vcpus = LIBXL_XEN_LEGACY_MAX_VCPUS;
+            scinfo.rt.vcpus = 
+                (libxl_vcpu*) malloc( sizeof(libxl_vcpu) * scinfo.rt.max_vcpus );
+            if ( scinfo.rt.vcpus == NULL ) {
+                fprintf(stderr, "Alloc memory for scinfo.rt.vcpus fails\n");
+                return 1;
+            }
+            scinfo.sched = LIBXL_SCHEDULER_RT;
+            scinfo.rt.vcpu_index = vcpu_index;
+            scinfo.rt.vcpus[vcpu_index].period = period;
+            scinfo.rt.vcpus[vcpu_index].budget = budget;
+
+            rc = sched_domain_set(domid, &scinfo);
+            libxl_domain_sched_params_dispose(&scinfo);
+            if (rc)
+                return -rc;
+        }
+    }
+
+    return 0;
+}
+
 int main_domid(int argc, char **argv)
 {
     uint32_t domid;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 4279b9f..70e2585 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -277,6 +277,15 @@ struct cmd_spec cmd_table[] = {
       "                               --period/--slice)\n"
       "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
     },
+    { "sched-rt",
+      &main_sched_rt, 0, 1,
+      "Get/set rt scheduler parameters",
+      "[-d <Domain> [-v[=VCPU]] [-p[=PERIOD]] [-b[=BUDGET]]]",
+      "-d DOMAIN, --domain=DOMAIN     Domain to modify\n"
+      "-v VCPU,   --vcpu=VCPU         VCPU\n"
+      "-p PERIOD, --period=PERIOD     Period (ms)\n"
+      "-b BUDGET, --budget=BUDGET     Budget (ms)\n"
+    },
     { "domid",
       &main_domid, 0, 0,
       "Convert a domain name to domain id",
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
  2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
  2014-07-11  4:49 ` [PATCH RFC v1 2/4] xl for rt scheduler Meng Xu
@ 2014-07-11  4:49 ` Meng Xu
  2014-07-11 11:05   ` Wei Liu
                     ` (2 more replies)
  2014-07-11  4:49 ` [PATCH RFC v1 4/4] libxc " Meng Xu
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, Meng Xu, lichong659,
	dgolomb

Add libxl functions to set/get domain's parameters for rt scheduler

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 tools/libxl/libxl.c         |  121 +++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl.h         |    7 +++
 tools/libxl/libxl_types.idl |   13 +++++
 3 files changed, 141 insertions(+)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 39f1c28..670420e 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5100,6 +5100,121 @@ static int sched_sedf_domain_set(libxl__gc *gc, uint32_t domid,
     return 0;
 }
 
+static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
+                               libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt_params sdom;
+    int rc, i;
+
+    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
+    if (rc != 0) {
+        LOGE(ERROR, "getting domain sched rt");
+        return ERROR_FAIL;
+    }
+
+    libxl_domain_sched_params_init(scinfo);
+    scinfo->sched = LIBXL_SCHEDULER_RT;
+    scinfo->rt.max_vcpus = LIBXL_XEN_LEGACY_MAX_VCPUS;
+    /*TODO: alloc array with exact num of dom's vcpus; and use GCNEW_ARRAY()*/
+    scinfo->rt.vcpus = (libxl_vcpu *)
+                    malloc( sizeof(libxl_vcpu) * scinfo->rt.max_vcpus );
+    if ( !scinfo->rt.vcpus ){
+        LOGE(ERROR, "Allocate lib_vcpu array fails\n");
+        return ERROR_INVAL;
+    }
+    scinfo->rt.num_vcpus = sdom.num_vcpus;
+    for( i = 0; i < sdom.num_vcpus; ++i)
+    {
+        scinfo->rt.vcpus[i].period = sdom.vcpus[i].period;
+        scinfo->rt.vcpus[i].budget = sdom.vcpus[i].budget;
+    }
+
+    return 0;
+}
+
+#define SCHED_RT_VCPU_PARAMS_MAX    4294967295 /* 2^32-1 */
+
+/*
+ * Sanity check of the scinfo parameters
+ * return 0 if all values are valid
+ * return 1 if one param is default value
+ * return 2 if the target vcpu's index, period or budget is out of range
+ */
+static int sched_rt_domain_set_validate_params(libxl__gc *gc,
+                                               const libxl_domain_sched_params *scinfo)
+{
+    int vcpu_index = scinfo->rt.vcpu_index;
+
+    if (scinfo->rt.vcpus == NULL ||
+        vcpu_index == LIBXL_DOMAIN_SCHED_PARAM_VCPU_DEFAULT ||
+        scinfo->rt.vcpus[vcpu_index].period == LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT ||
+        scinfo->rt.vcpus[vcpu_index].budget == LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT)
+    {
+        return 1;
+    }
+
+    if (vcpu_index < 0 || vcpu_index > scinfo->rt.num_vcpus)
+    {
+        LOG(ERROR, "VCPU index is not set or out of range, "
+                    "valid values are within range from 0 to %d", scinfo->rt.num_vcpus);
+        return 2;
+    }
+
+    if (scinfo->rt.vcpus[vcpu_index].period < 1 ||
+        scinfo->rt.vcpus[vcpu_index].period > SCHED_RT_VCPU_PARAMS_MAX)
+    {
+        LOG(ERROR, "VCPU period is not set or out of range, "
+                    "valid values are within range from 0 to %lu", SCHED_RT_VCPU_PARAMS_MAX);
+        return 2;
+    }
+
+    if (scinfo->rt.vcpus[vcpu_index].budget < 1 ||
+        scinfo->rt.vcpus[vcpu_index].budget > SCHED_RT_VCPU_PARAMS_MAX)
+    {
+        LOG(ERROR, "VCPU budget is not set or out of range, "
+                    "valid values are within range from 0 to %lu", SCHED_RT_VCPU_PARAMS_MAX);
+        return 2;
+    }
+
+    return 0;
+
+}
+
+static int sched_rt_domain_set(libxl__gc *gc, uint32_t domid,
+                               const libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt_params sdom;
+    int vcpu_index;
+    int rc;
+ 
+    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
+    if (rc != 0) {
+        LOGE(ERROR, "getting domain sched rt");
+        return ERROR_FAIL;
+    }
+    
+    rc = sched_rt_domain_set_validate_params(gc, scinfo);
+    if (rc == 2)
+        return ERROR_INVAL;
+    if (rc == 1)
+        return 0;
+    if (rc == 0)
+    {
+        vcpu_index = scinfo->rt.vcpu_index;
+        sdom.vcpu_index = scinfo->rt.vcpu_index;
+        sdom.vcpus[vcpu_index].period = scinfo->rt.vcpus[vcpu_index].period;
+        sdom.vcpus[vcpu_index].budget = scinfo->rt.vcpus[vcpu_index].budget;
+    }
+
+    rc = xc_sched_rt_domain_set(CTX->xch, domid, &sdom);
+    if ( rc < 0 ) {
+        LOGE(ERROR, "setting domain sched rt");
+        return ERROR_FAIL;
+    }
+
+    return 0;
+}
+
 int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
                                   const libxl_domain_sched_params *scinfo)
 {
@@ -5123,6 +5238,9 @@ int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_ARINC653:
         ret=sched_arinc653_domain_set(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT:
+        ret=sched_rt_domain_set(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
@@ -5153,6 +5271,9 @@ int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_CREDIT2:
         ret=sched_credit2_domain_get(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT:
+        ret=sched_rt_domain_get(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 459557d..42c8ede 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1217,6 +1217,13 @@ int libxl_sched_credit_params_set(libxl_ctx *ctx, uint32_t poolid,
 #define LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT   -1
 #define LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT -1
 
+#define LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT     -1
+#define LIBXL_DOMAIN_SCHED_PARAM_VCPU_DEFAULT       -1
+#define LIBXL_DOMAIN_SCHED_PARAM_NUM_VCPUS_DEFAULT  -1
+#define LIBXL_DOMAIN_SCHED_PARAM_VCPU_INDEX_DEFAULT -1
+/* Consistent with XEN_LEGACY_MAX_VCPUS xen/arch-x86/xen.h*/
+#define LIBXL_XEN_LEGACY_MAX_VCPUS                  32 
+
 int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
                                   libxl_domain_sched_params *params);
 int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index de25f42..00e00d8 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -139,6 +139,7 @@ libxl_scheduler = Enumeration("scheduler", [
     (5, "credit"),
     (6, "credit2"),
     (7, "arinc653"),
+    (8, "rt"),
     ])
 
 # Consistent with SHUTDOWN_* in sched.h (apart from UNKNOWN)
@@ -289,6 +290,17 @@ libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
     ])
 
+libxl_rt_vcpu = Struct("vcpu",[
+    ("period",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT'}),
+    ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
+    ])
+
+libxl_domain_sched_rt_params = Struct("domain_sched_rt_params",[
+    ("vcpus",        Array(libxl_rt_vcpu, "max_vcpus")),
+    ("num_vcpus",    integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_NUM_VCPUS_DEFAULT'}),
+    ("vcpu_index",   integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_VCPU_INDEX_DEFAULT'}),
+    ])
+
 libxl_domain_sched_params = Struct("domain_sched_params",[
     ("sched",        libxl_scheduler),
     ("weight",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_WEIGHT_DEFAULT'}),
@@ -297,6 +309,7 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("slice",        integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT'}),
     ("latency",      integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT'}),
     ("extratime",    integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT'}),
+    ("rt",     libxl_domain_sched_rt_params),
     ])
 
 libxl_domain_build_info = Struct("domain_build_info",[
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
                   ` (2 preceding siblings ...)
  2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
@ 2014-07-11  4:49 ` Meng Xu
  2014-07-11 14:49   ` Dario Faggioli
  2014-07-11 10:50 ` Introduce rt real-time scheduler for Xen Wei Liu
  2014-07-11 16:19 ` Dario Faggioli
  5 siblings, 1 reply; 50+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, Meng Xu, lichong659,
	dgolomb

Add xc_sched_rt_* functions to interact with Xen to set/get domain's
parameters for rt scheduler.

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 tools/libxc/Makefile  |    1 +
 tools/libxc/xc_rt.c   |   81 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h |    7 +++++
 3 files changed, 89 insertions(+)
 create mode 100644 tools/libxc/xc_rt.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index f77677c..7ad6dea 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -20,6 +20,7 @@ CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
 CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_arinc653.c
+CTRL_SRCS-y       += xc_rt.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff --git a/tools/libxc/xc_rt.c b/tools/libxc/xc_rt.c
new file mode 100644
index 0000000..a3e76a1
--- /dev/null
+++ b/tools/libxc/xc_rt.c
@@ -0,0 +1,81 @@
+/****************************************************************************
+ *
+ *        File: xc_rt.c
+ *      Author: Sisu Xi
+ *              Meng Xu
+ *
+ * Description: XC Interface to the rt scheduler
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xc_private.h"
+
+int
+xc_sched_rt_domain_set(
+    xc_interface *xch,
+    uint32_t domid,
+    struct xen_domctl_sched_rt_params *sdom)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(sdom, 
+        sizeof(*sdom), 
+        XC_HYPERCALL_BUFFER_BOUNCE_IN);
+
+    if ( xc_hypercall_bounce_pre(xch, sdom) )
+        return -1;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    set_xen_guest_handle(domctl.u.scheduler_op.u.rt.schedule, sdom);
+
+    rc = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, sdom);
+
+    return rc;
+}
+
+int
+xc_sched_rt_domain_get(
+    xc_interface *xch,
+    uint32_t domid,
+    struct xen_domctl_sched_rt_params *sdom)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(sdom, 
+        sizeof(*sdom), 
+        XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( xc_hypercall_bounce_pre(xch, sdom) )
+        return -1;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+    set_xen_guest_handle(domctl.u.scheduler_op.u.rt.schedule, sdom);
+
+    rc = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, sdom);
+
+    return rc;
+}
+
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 3578b09..b7b0c04 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -865,6 +865,13 @@ int xc_sched_credit2_domain_get(xc_interface *xch,
                                uint32_t domid,
                                struct xen_domctl_sched_credit2 *sdom);
 
+int xc_sched_rt_domain_set(xc_interface *xch,
+                               uint32_t domid,
+                               struct xen_domctl_sched_rt_params *sdom);
+int xc_sched_rt_domain_get(xc_interface *xch,
+                               uint32_t domid,
+                               struct xen_domctl_sched_rt_params *sdom);
+
 int
 xc_sched_arinc653_schedule_set(
     xc_interface *xch,
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: Introduce rt real-time scheduler for Xen
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
                   ` (3 preceding siblings ...)
  2014-07-11  4:49 ` [PATCH RFC v1 4/4] libxc " Meng Xu
@ 2014-07-11 10:50 ` Wei Liu
  2014-07-11 11:06   ` Dario Faggioli
  2014-07-11 16:19 ` Dario Faggioli
  5 siblings, 1 reply; 50+ messages in thread
From: Wei Liu @ 2014-07-11 10:50 UTC (permalink / raw)
  To: Meng Xu
  Cc: wei.liu2, ian.campbell, xisisu, stefano.stabellini,
	george.dunlap, dario.faggioli, ian.jackson, xen-devel,
	xumengpanda, lichong659, dgolomb

On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
[...]
> 
> [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> [PATCH RFC v1 2/4] xl for rt scheduler
> [PATCH RFC v1 3/4] libxl for rt scheduler
> [PATCH RFC v1 4/4] libxc for rt scheduler
> 

I have some general comments on how you arrange these patches.

At a glance of the title and code you should do them in the order of 1,
4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
libxc depends on hypervisor. You will break bisection with current
ordering.

And we normally write titles like
  xen: add rt scheduler
  libxl: introduce rt scheduler
  xl: XXXX
  etc.
start with component name and separate with colon.

Last but not least, you need to CC relevant maintainers. You can find
out maintainers with scripts/get_maintainers.pl.

Wei.

> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11  4:49 ` [PATCH RFC v1 2/4] xl for rt scheduler Meng Xu
@ 2014-07-11 11:02   ` Wei Liu
  2014-07-11 14:59     ` Meng Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Wei Liu @ 2014-07-11 11:02 UTC (permalink / raw)
  To: Meng Xu
  Cc: wei.liu2, ian.campbell, xisisu, stefano.stabellini,
	george.dunlap, dario.faggioli, ian.jackson, xen-devel,
	xumengpanda, lichong659, dgolomb

On Fri, Jul 11, 2014 at 12:49:56AM -0400, Meng Xu wrote:
[...]
>  
>  =head1 CPUPOOLS COMMANDS
> diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
> index 10a2e66..51b634a 100644
> --- a/tools/libxl/xl.h
> +++ b/tools/libxl/xl.h
> @@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
>  int main_sched_credit(int argc, char **argv);
>  int main_sched_credit2(int argc, char **argv);
>  int main_sched_sedf(int argc, char **argv);
> +int main_sched_rt(int argc, char **argv);
>  int main_domid(int argc, char **argv);
>  int main_domname(int argc, char **argv);
>  int main_rename(int argc, char **argv);
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 68df548..c043f88 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -4947,6 +4947,7 @@ int main_sharing(int argc, char **argv)
>      return 0;
>  }
>  
> +

Stray blank line.

>  static int sched_domain_get(libxl_scheduler sched, int domid,
>                              libxl_domain_sched_params *scinfo)
>  {
> @@ -5098,6 +5099,52 @@ static int sched_sedf_domain_output(
>      return 0;
>  }
>  
> +
> +static int sched_rt_domain_output(
> +    int domid)
> +{
> +    char *domname;
> +    libxl_domain_sched_params scinfo;
> +    int rc, i;
> +
> +    if (domid < 0) {
> +        printf("%-33s %4s %4s %6s %6s\n", "Name", "ID", "VCPU", "Period", "Budget");
> +        return 0;
> +    }
> +

Just print the header and nothing more?

> +    libxl_domain_sched_params_init(&scinfo);
> +    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid, &scinfo);
> +    if (rc)
> +        return rc;
> +

This is violating libxl type paradigm. See libxl.h.                             
                                                                                
 263  * libxl types                                                             
 264  *                                                                         
 265  * Most libxl types are defined by the libxl IDL (see                      
 266  * libxl_types.idl). The library provides a common set of methods for         
 267  * initialising and freeing these types.                                   
 268  *                                                                         
 269  * IDL-generated libxl types should be used as follows: the user must         
 270  * always call the "init" function before using a type, even if the        
 271  * variable is simply being passed by reference as an out parameter        
 272  * to a libxl function.  The user must always calls "dispose" exactly         
 273  * once afterwards, to clean up, regardless of whether operations on          
 274  * this object succeeded or failed.  See the xl code for examples.         
                                                                                
And you should check other places that use libxl types as well.  

Wei.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
@ 2014-07-11 11:05   ` Wei Liu
  2014-07-11 15:08   ` Dario Faggioli
  2014-07-17 15:36   ` Ian Campbell
  2 siblings, 0 replies; 50+ messages in thread
From: Wei Liu @ 2014-07-11 11:05 UTC (permalink / raw)
  To: Meng Xu
  Cc: wei.liu2, ian.campbell, xisisu, stefano.stabellini,
	george.dunlap, dario.faggioli, ian.jackson, xen-devel,
	xumengpanda, lichong659, dgolomb

On Fri, Jul 11, 2014 at 12:49:57AM -0400, Meng Xu wrote:
[...]
>  
>  # Consistent with SHUTDOWN_* in sched.h (apart from UNKNOWN)
> @@ -289,6 +290,17 @@ libxl_domain_restore_params = Struct("domain_restore_params", [
>      ("checkpointed_stream", integer),
>      ])
>  
> +libxl_rt_vcpu = Struct("vcpu",[
> +    ("period",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT'}),
> +    ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
> +    ])
> +
> +libxl_domain_sched_rt_params = Struct("domain_sched_rt_params",[
> +    ("vcpus",        Array(libxl_rt_vcpu, "max_vcpus")),

The array counter should be named "num_vcpus" (and you probably need to
rename the other "num_vcpus" to something else).

Wei.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Introduce rt real-time scheduler for Xen
  2014-07-11 10:50 ` Introduce rt real-time scheduler for Xen Wei Liu
@ 2014-07-11 11:06   ` Dario Faggioli
  2014-07-11 16:14     ` Meng Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 11:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1447 bytes --]

On ven, 2014-07-11 at 11:50 +0100, Wei Liu wrote:
> On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
> [...]
> > 
> > [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> > [PATCH RFC v1 2/4] xl for rt scheduler
> > [PATCH RFC v1 3/4] libxl for rt scheduler
> > [PATCH RFC v1 4/4] libxc for rt scheduler
> > 
> 
> I have some general comments on how you arrange these patches.
> 
> At a glance of the title and code you should do them in the order of 1,
> 4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
> libxc depends on hypervisor. You will break bisection with current
> ordering.
> 
Yep, I agree with Wei.

> And we normally write titles like
>   xen: add rt scheduler
>   libxl: introduce rt scheduler
>   xl: XXXX
>   etc.
> start with component name and separate with colon.
> 
Indeed we do, and this helps quite a bit.

> Last but not least, you need to CC relevant maintainers. You can find
> out maintainers with scripts/get_maintainers.pl.
> 
Yes, but this one, Meng almost got it right, I think.

Basically, Meng, you're missing hypervisors maintainers (at least for
patch 1). :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
@ 2014-07-11 14:27   ` Dario Faggioli
  2014-07-11 14:37   ` Andrew Cooper
  1 sibling, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 14:27 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Jan Beulich, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 36357 bytes --]

[Adding Jan]

On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> This is the core rt scheduler patch. It adds the new real time scheduler to
> the hypervisor, as the non-default scheduler.
> 
So, honestly, I think these two lines above can just go away...

The subject can be improved a bit too, probably. I don't know...
something like "scheduling: introduce XEN_SCHEDULER_RT" or "sched:
implement XEN_SCHEDULER_RT scheduling policy", etc.

> This scheduler follows the pre-emptive Global EDF theory in real-time field.
> Each VCPU can have a dedicated period and budget.
> While scheduled, a VCPU burns its budget.
> A VCPU has its budget replenished at the beginning of each of its periods;
> The VCPU discards its unused budget at the end of each of its periods.
> If a VCPU runs out of budget in a period, it has to wait until next period.
> The mechanism of how to burn a VCPU's budget depends on the server mechanism
> implemented for each VCPU.
> 
> Server mechanism: a VCPU is implemented as a deferable server.
> When a VCPU has a task running on it, its budget is continuously
> burned;
>
"When a VCPU is scheduled", or "When a VCPU executes on a PCPU". Let's
avoid messing with _why_ a VCPU executes, i.e., what the VCPU is
actually doing... That's guest's business, not ours! :-)

> When a VCPU has no task but with budget left, its budget is
> preserved.
> 
Ditto.

> Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> At any scheduling point, the VCPU with earliest deadline has highest
> priority.
> 
> Queue scheme: A global runqueue for each CPU pool.
  ^Runqueue, it's a bit more clear, I think.

> The runqueue holds all runnable VCPUs.
> VCPUs in the runqueue are divided into two parts: with and without budget.
> At each part, VCPUs are sorted based on EDF priority scheme.
> 
> Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
> 
> Note: cpumask and cpupool is supported.
>
No need to say this in the changelog. Actually, it's probably useful to
say it right now, given this is an RFC... Not so much when this won't be
an RFC any longer.

> This is still in the development phase.
> 
Right, and threfore let's concentrate on the code... we'll refine the
changelog version after version. :-)


> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..41543a2
> --- /dev/null
> +++ b/xen/common/sched_rt.c

> +/*
> + * TODO:
> + *
> + * Migration compensation and resist like credit2 to better use cache;
> + * Lock Holder Problem, using yield?
> + * Self switch problem: VCPUs of the same domain may preempt each other;
> + */
> +
This last one is an interesting point, even for non real-time
scheduling. I gave some thoughts at this a few times, but never had time
to properly gather ideas and/or collect some numbers... if you have
anything like that, feel free to share (in another thread, as it's, I
think, more general).

> +/*
> + * Locking:
> + * Just like credit2, a global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + *
Mmm... Actually, I think credit2 at least tries to do something
different. Well, each RunQ does have its own lock, but there are more
than 1 RunQ in credit2 (or, again, at least that's the design. There are
a few bugs in the credit2 implementation, but that does not matter
here!! :-P).

No big deal, of course, just perhaps try to describe a bit better what's
the intended locking scheme. :-)

> + * The lock is already grabbed when calling wake/sleep/schedule/ functions in schedule.c
> + *
> + * The functions involes RunQ and needs to grab locks are:
> + *    dump, vcpu_insert, vcpu_remove, context_saved,
> + */
> +
> +
> +/*
> + * Default parameters in ms
> + */
> +#define RT_DEFAULT_PERIOD     10
> +#define RT_DEFAULT_BUDGET      4
> +
ms? us? Can you put a comment saying what time units these 10 and 4 are.

> +/*
> + * Insert a vcpu in the RunQ basing on vcpu's deadline: 
> + * vcpus with shorter deadline are inserted first.
> + * EDF schedule policy: vcpu with smaller deadline has higher priority;
>
The two line above sound redundant. I'd ditch the latter.

> + * When vcpus have the same deadline, insert the current one at the head of these vcpus.
> + */
>
Knowing how ties are broken is useful. However, I'm not sure what you
mean with "insert the current one at the head of these vcpus". Can you
rephrase it?

> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +   
> +    ASSERT( spin_is_locked(per_cpu(schedule_data, svc->vcpu->processor).schedule_lock) );
> +
Coding style: this line looks very long. :-)

> +    if ( __vcpu_on_runq(svc) )
> +        return;
> +
Can this happen? I mean the fact that you try to put a VCPU on a RunQ
while it's already there. If yes, and if that's by design, ok, but
that's the sort of things I usually want to keep under control, as
they're frequently hiding bugs etc.

Again, I'm not saying this is the case. What I'd do is, if it can
happen, put a comment explaining under what circumstances that is the
case. If it can't, then put down another ASSERT rather than just
returning.

> +    list_for_each(iter, runq) {
> +        struct rt_vcpu * iter_svc = __runq_elem(iter);
> +
> +        if ( svc->cur_budget > 0 ) { /* svc still has budget */
>
Put the comment above (minor thing, though).

> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        } else { /* svc has no budget */
> +            if ( iter_svc->cur_budget == 0 &&
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        }
>
Bear with me, as I'm rally thinking out loud here. Have you though at,
or perhaps explored the solution of using two lists --one for vcpus with
budget the other for depleted ones-- instead of only one divided in two
parts.

Perhaps it's not a big deal, but looking at this code looks like
inserting a fully depleted vcpu potentially involves pointlessly going
through all the elements with budget>0, which also means holding the
global RunQ spinlock longer than one would expect. Thoughts?


Also, coding style: inside the hypervisor, we do:

if ( bla )
{
    blabla;
}
else    /* or else if (bleblu ) */
{
    bleah
}

(Of course, this needs fixing everywhere).

> +/*
> + * Init/Free related code
> + */
> +static int
> +rt_init(struct scheduler *ops)
> +{
> +    struct rt_private *prv;
> +
> +    prv = xzalloc(struct rt_private);
> +    if ( prv == NULL ) {
> +        printk("xzalloc failed at rt_private\n");
> +        return -ENOMEM;
> +    }
> +
> +    ops->sched_data = prv;
> +    spin_lock_init(&prv->lock);
> +    INIT_LIST_HEAD(&prv->sdom);
> +    INIT_LIST_HEAD(&prv->runq);
> +    cpumask_clear(&prv->tickled);
> +    cpumask_clear(&prv->cpus);
> +
> +    printtime();
> +    printk("\n");
> +
Perhaps printtime() can take a string argument and append it to the time
it prints. Well, it probably does not matter much, as I don't think
things like printtime() should make it to the final versions. I mean,
it's useful now, but when upstream, debugging happens via tracing, so it
hopefully will become useless.

Actually, I think adding a few xentrace tracepoints would be a great
plus, as it'll allow Xen devs to start investigating and debugging this
the usual 'Xen way'. Look at entries like TRC_CSCHED2_RUNQ_POS in
credit2, or similar ones in credit.

> +    return 0;
> +}

> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +    long count;
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;            /* init last_start is 0 */
> +
> +    svc->period = RT_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) && vc->domain->domain_id != 0 ) {
> +        svc->budget = RT_DEFAULT_BUDGET;
> +    } else {
> +        svc->budget = RT_DEFAULT_PERIOD; /* give vcpus of dom0 100% utilization */
> +    }
> +
Oh, so idle VCPUs and Dom0's VCPUs get 100%? Well, for one, write this
explicitly somewhere. Anyway, idle, not so much important, it's kind of
obvious actually. Dom0, it's certainly worth a few words, both here and
up in the 'Design' description.

I've got to think at whether or not that's a sane default setting. For
sure Dom0 is a very peculiar domain, but the beauty (well, one of the
beauties :-D) of Xen is that Dom0 is actually just one domain, there may
exist other "special" domains (driver/stub domains, etc)... So let's try
not making Dom0 itself too special, ok? :-P

> +    count = (now/MILLISECS(svc->period)) + 1;
> +    /* sync all VCPU's start time to 0 */
> +    svc->cur_deadline += count*MILLISECS(svc->period);
> +
> +    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
> +    /* Debug only: dump new vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    return svc;
> +}

> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    /* IDLE VCPU not allowed on RunQ */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
Mmm... Why are we speaking about the RunQ... this is
not the function adding a VCPU to the RunQ, is it?

> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
> +}
> +
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( !is_idle_vcpu(vc) ) {
> +        list_del_init(&svc->sdom_elem);
> +    }
> +}
> +
About these two functions (rtglobal_vcpu_insert and
rtglobal_vcpu_remove), in addition to what I wrote yesterday.

It looks to me that, when calling them, for instance, from
sched_move_domain() in schedule.c, what you do is removing the VCPU from
the list of VCPUs of its domain, and then add it back there. Sure the
net effect is ok, but it does certainly not look like the best possible
thing that should happen in such case.

I can't be positive about what I'm about to say, but it looks to me
that, given how the insert_vcpu hook is used, you may not want to define
it (just don't put any .insert_vcpu=xxx nor .remove_vcpu=yyy there when
initializing sched_rtglobal_def, at the bottom of the file). Of course
you'd need to find another place from where to add the vcpu to the list
of its own domain's VCPUs, but that should be easy.

Or am I missing something?

For sure, in the way it looks not, it is now, although probably
functional (it does look so, I haven't tested this but I trust you
did :-) ), it's rather inefficient and confusing to read.

> +/*
> + * Implemented as deferrable server. 
> + * Different server mechanism has different implementation.
>
Kill this line, until (if ever) we'll actually have more than one server
mechanism.

> + * burn budget at microsecond level. 
> + */
I.e., using microsecs everythere, including in the interface, would kill
the need of doing these conversions inside the scheduler logic, which
is, to me, is already a fair reason to go that way.

I see you've got this in your TODO list already, so good. :-)

> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) {
> +    s_time_t delta;
> +    unsigned int consume;
> +    long count = 0;
> +
> +    /* first time called for this svc, update last_start */
> +    if ( svc->last_start == 0 ) {
> +        svc->last_start = now;
> +        return;
> +    }
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) {
> +        return;
> +    }
> +
This check on is_idle_vcpu() can be the very first thing you do,
can't it? I mean, is it important that we update last_start for the idle
VCPUs? If not, move it up.

> +    /* don't burn budget for Domain-0, RT-Xen use only */
> +    if ( svc->sdom->dom->domain_id == 0 ) {
> +        return;
> +    }
> +
Oh, so Dom0 is not inside a (deferrable) server? Nope, that needs to
change. It's probably ok to give it 100% bandwidth by default (still
thinking about this), but it certainly needs to be possible for a
sysadmin to change that, assign to Dom0's VCPUs budgets and periods, and
have it scheduled like all the other domains.

Is it going to be hard to make happen? I honestly don't think
so. I think that just killing all of this (domain_id == 0) checks would
pretty much do...

> +    /* update deadline info */
>
Be a bit more explicit in the comment, in explaining what's going on
here, i.e., deadline is in the past, so we need to compensate, etc. etc.

Also, can you provide me a little insight on what could cause delta to
be non-negative? Have you observed it? If yes, a lot? Under what
circumstances?

More important, is it ok to just silently fix things, or should we warn
the user/sysadmin that we're lagging behind?

> +    delta = now - svc->cur_deadline;
> +    if ( delta >= 0 ) {
> +        count = ( delta/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +        return;
> +    }
> +
So, in case deadline was > 0, deadline was in the past. What about
cur_budget? I mean, what about overruns?

I know it's something that should not happen, but if it does, is
forgetting about it the best option?

However, I see budget is handled below, so I'll comment there.

> +    delta = now - svc->last_start;
> +    if ( delta < 0 ) {
> +        printk("%s, delta = %ld for ", __func__, delta);
> +        rt_dump_vcpu(svc);
> +        svc->last_start = now;  /* update last_start */
> +        svc->cur_budget = 0;
> +        return;
> +    }
> +
As said above, put down some words of comment on when this can happen
and why you think this is the best recovery action, please. :-)

> +    if ( svc->cur_budget == 0 ) return;
> +
> +    /* burn at microseconds level */
> +    consume = ( delta/MICROSECS(1) );
> +    if ( delta%MICROSECS(1) > MICROSECS(1)/2 ) consume++;
> +
ditto (about conversions).

> +    svc->cur_budget -= consume;
> +    if ( svc->cur_budget < 0 ) svc->cur_budget = 0;
>
What I've always done is having the task (in this case, that would be
the vcpu) to pay back for that. So if it's max_budget was 100 and I at
some point find out it managed to execute for 115, i.e., it overran by
15, I only replenish it to 100-15=85.

I'm saying this here because, even if it's not here that you replenish,
all the mechanisms I can think of to take care of overruns need the
current budget to actually be negative, when replenishment happens, in
order to be able to take overrun itself into account.

So what do you think? Can overrun happen in this scheduler? If yes, how
severely? Don't you think it's something important to consider?

> +}
> +
> +/* 
> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq) {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
>
What's in priv->cpus, BTW? How is it different form the cpupool online
mask (returned in 'online' by cpupool_scheduler_cpumask() )?

> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
> +        if ( iter_svc->cur_budget <= 0 )
> +            continue;
> +
> +        svc = iter_svc;
> +        break;
> +    }
> +
> +    return svc;
> +}
> +
> +/*
> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    s_time_t diff;
> +    long count;
> +
> +    list_for_each_safe(iter, tmp, runq) {
> +        svc = __runq_elem(iter);
> +
Wow, so we scan all the RunQ... :-/

Let's check when this is called, but I personally really don't like this
things. :-(

Can't we use something like a timer for replenishments?

> +        diff = now - svc->cur_deadline;
> +        if ( diff > 0 ) {
> +            count = (diff/MILLISECS(svc->period)) + 1;
> +            svc->cur_deadline += count * MILLISECS(svc->period);
> +            svc->cur_budget = svc->budget * 1000;
>
Payback for possible overrun again?

In any case, payback or not, this is the same code from inside
burn_budget() above. An helper function is probably what we want.

> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret;
> +
    struct task_slice ret = { .migrated = 0 };

should work, and should allow you to get rid of the ret.migrated=0
below. It's a minor thing, though, mostly a matter of taste, I think.

> +    /* clear ticked bit now that we've been scheduled */
> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> +        cpumask_clear_cpu(cpu, &prv->tickled);
> +
> +    /* burn_budget would return for IDLE VCPU */
> +    burn_budgets(ops, scurr, now);
> +
> +    __repl_update(ops, now);
> +
And here it is. So, at each tick (and possibly even more frequently) we
scan the full RunQ, just to update the deadlines and perform
replenishments.

I confirm that I don't like this very much. burn_budget() is something
that should indeed happen here. __repl_update(), OTOH, should somehow be
made event based. I can't provide a deferrable server example out of the
top of my head.

IIRC, in my Linux CBS implementation I was setting a timer when a task
is inserted in the runqueue, to fire at the task's absolute deadline
time (well, after one period, if you want to allow deadline different
than periods). At that point, you either find the task still running,
with some leftover budget, you find it still running with no (or very
few) budget, which means overrun (or something very close to overrun),
or (most of the cases) you find it stopped, because it ran out of
budget. That's when --in the timer handler-- you issue the replenishment
and postpone the deadline accordingly (depending on which one of the 3
above the actual situation was). When re-enqueueing the task, with new
budget and new deadline, I was also setting the timer again.
Of course, when the task was removed from the runqueue because it
blocked by itself (i.e., _not_ because he depleted its budget), I was
stopping the timer too.

I've also implemented the sporadic server on Linux, using the exact same
mechanism, although it was a bit more complex, as the sporadic server
handles replenishments in a more complicated way (it replenishes in
chunks, so I had to set the timer for each chunk, one after the other).

I'm sure something like this can be done here, in Xen, and with DS as an
algorithm.

It shouldn't even be too hard to just use either only one timer, or,
perhaps, one timer per PCPU. I'm quite sure we don't want one timer per
VCPU, but this is of course something we can discuss. For now, what do
you think about the timer idea?

> +    if ( tasklet_work_scheduled ) {
> +        snext = RT_VCPU(idle_vcpu[cpu]);
> +    } else {
> +        cpumask_t cur_cpu;
> +        cpumask_clear(&cur_cpu);
> +        cpumask_set_cpu(cpu, &cur_cpu);
> +        snext = __runq_pick(ops, cur_cpu);
> +        if ( snext == NULL )
> +            snext = RT_VCPU(idle_vcpu[cpu]);
> +
> +        /* if scurr has higher priority and budget, still pick scurr */
> +        if ( !is_idle_vcpu(current) &&
> +             vcpu_runnable(current) &&
> +             scurr->cur_budget > 0 &&
> +             ( is_idle_vcpu(snext->vcpu) ||
> +               scurr->cur_deadline <= snext->cur_deadline ) ) {
> +            snext = scurr;
> +        }
> +    }
> +
> +    if ( snext != scurr &&
> +         !is_idle_vcpu(current) &&
> +         vcpu_runnable(current) ) {
> +        set_bit(__RT_delayed_runq_add, &scurr->flags);
> +    }
> +
> +    snext->last_start = now;
> +    ret.migrated = 0;
> +    if ( !is_idle_vcpu(snext->vcpu) ) {
> +        if ( snext != scurr ) {
> +            __runq_remove(snext);
> +            set_bit(__RT_scheduled, &snext->flags);
> +        }
> +        if ( snext->vcpu->processor != cpu ) {
> +            snext->vcpu->processor = cpu;
> +            ret.migrated = 1;
> +        }
> +    }
> +
> +    ret.time = MILLISECS(1);
> +    ret.task = snext->vcpu;
> +
> +    return ret;
> +}
> +
> +/*
> + * Remove VCPU from RunQ
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static void
> +rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( curr_on_cpu(vc->processor) == vc ) {
> +        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    if ( __vcpu_on_runq(svc) ) {
> +        __runq_remove(svc);
> +    }
> +
As said, can it actually happen that we try to put to sleep a vcpu
that's not in the RunQ. If yes, fine, but I'd like a comment about why
that's the case. If not, let's convert the if to an ASSERT.

Let me state once more that I'm not questioning the correctness of your
code, I just want to be sure that the final result is as straightforward
to read and understand as possible, when it comes to study or debug it.
Well, this, in my opinion, is one of the spots where having the said aid
(comment or ASSERT) from the original implementors would help a lot with
that goal. :-)

> +    clear_bit(__RT_delayed_runq_add, &svc->flags);
> +}
> +
> +/*
> + * Pick a vcpu on a cpu to kick out to place the running candidate
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function 
> + */
Now this is a really good looking doc comment!! :-D

> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * scheduled = NULL;    /* lowest priority scheduled */
>
Can I suggest a better name for this variable? Maybe
latest_deadline_vcpu, or latest_scheduled, or just latest... In general,
I'd like it to reflect the concept that it points to the VCPU with the
latest deadline (i.e., the lowest prio, but as repeatedly said, we're
dealing with EDF only!).

> +    struct rt_vcpu * iter_svc;
> +    struct vcpu * iter_vc;
> +    int cpu = 0;
> +    cpumask_t not_tickled;                      /* not tickled cpus */
>
Comment not necessary, the name is talking enough by itself in this
case.

One issue here is that we really are tying hard to avoid putting
cpumask_t on the stack, as they can be big on large boxes, with lots of
cpus. However, let's consider this low prio for now.

> +    cpumask_t *online;
> +
> +    if ( new == NULL || is_idle_vcpu(new->vcpu) ) return;
> +
> +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> +    cpumask_and(&not_tickled, online, &prv->cpus);
> +    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
> +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> +
> +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) {
>
Remember: '{' goes on new line while inside the hypervisor.

> +        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    /* 2) if there are any idle pcpu, kick it */
> +    /* The same loop also find the one with lowest priority */
> +    for_each_cpu(cpu, &not_tickled) {
> +        iter_vc = curr_on_cpu(cpu);
> +        if ( is_idle_vcpu(iter_vc) ) {
> +            cpumask_set_cpu(cpu, &prv->tickled);
> +            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> +            return;
> +        }
> +        iter_svc = RT_VCPU(iter_vc);
> +        if ( scheduled == NULL || 
> +             iter_svc->cur_deadline > scheduled->cur_deadline ) {
> +            scheduled = iter_svc;
> +        }
> +    }
> +
> +    /* 3) candicate has higher priority, kick out the lowest priority vcpu */
> +    if ( scheduled != NULL && new->cur_deadline < scheduled->cur_deadline ) {
> +        cpumask_set_cpu(scheduled->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(scheduled->vcpu->processor, SCHEDULE_SOFTIRQ);
> +    }
> +    return;
> +}
> +
> +/* 
> + * Should always wake up runnable vcpu, put it back to RunQ. 
> + * Check priority to raise interrupt 
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + * TODO: what if these two vcpus belongs to the same domain? 
> + */
> +static void
> +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    s_time_t diff;
> +    s_time_t now = NOW();
> +    long count = 0;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) return;
> +
ASSERT or comment.

Also, here and everywhere else where this happens, 'return' (in general
the then branch of the if) goes on new line, even if it's a single
statement.

> +    /* on RunQ, just update info is ok */
> +    if ( unlikely(__vcpu_on_runq(svc)) ) return;
> +
This happens to have a comment, but not as good as the case
requires. :-P

> +    /* If context hasn't been saved yet, set flag so it will add later */
> +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) {
> +        set_bit(__RT_delayed_runq_add, &svc->flags);
> +        return;
> +    }
> +
Credit2 implementation has proper comments on what this flag (and the
other one too, actually) means and how it's used, both when the flags
are defined and used. Either you provide similar comments in here
(taking them from sched_credit2.c is ok), or you somehow reference those
in sched_credit2.c.

Personally, I don't like referencing other comments from other file, so
I'd go for the cut-&-past-ish approach.

> +    /* update deadline info */
> +    diff = now - svc->cur_deadline;
> +    if ( diff >= 0 ) {
> +        count = ( diff/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +    }
> +
And here it comes the same code again. Definitely, make an helper
function and call it from all the 3 spots.

> +    __runq_insert(ops, svc);
> +    __repl_update(ops, now);
>
So, every time a VCPU wakes up, we scan the full RunQ,, which also means
grabbing and holding the global RunQ lock, to check whether any task
needs a replenishment... Why is this required?

Once more, can't we use timers for replenishments, and get rid of this?
I'm sure it comes at a rather high cost, doesn't it?

> +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
> +    runq_tickle(ops, snext);
> +
> +    return;
> +}
> +
> +/* 
> + * scurr has finished context switch, insert it back to the RunQ,
> + * and then pick the highest priority vcpu from runq to run 
> + */
> +static void
> +rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * svc = RT_VCPU(vc);
> +    struct rt_vcpu * snext = NULL;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    spinlock_t *lock;
> +
> +    clear_bit(__RT_scheduled, &svc->flags);
>
You clear this flag outside of the critical section on lock... This is
different from what happens in credit2, and it does not sound safe to
me. Is it intentional?

> +    if ( is_idle_vcpu(vc) ) return;
> +
> +    lock = vcpu_schedule_lock_irq(vc);
> +    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
> +         likely(vcpu_runnable(vc)) ) {
> +        __runq_insert(ops, svc);
> +        __repl_update(ops, NOW());
ditto.

> +        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
> +        runq_tickle(ops, snext);
>
I also think that, if we use something like timers for replenishments,
we can avoid calling __runq_pick() here too, as we won't be re-sorting
the queue, so it'd be enough to runq_tickle() for svc... I think. :-)

> +    }
> +    vcpu_schedule_unlock_irq(lock, vc);
> +}
> +
> +/*
> + * set/get each vcpu info of each domain
> + */
> +static int
> +rt_dom_cntl(const struct scheduler *ops, struct domain *d, struct xen_domctl_scheduler_op *op)
>
Lots of long lines in this function...

> +{
> +    xen_domctl_sched_rt_params_t local_sched;
> +    struct rt_dom * const sdom = RT_DOM(d);
> +    struct list_head *iter;
> +    int vcpu_index = 0;
> +    int rc = -EINVAL;
> +
> +    switch ( op->cmd )
> +    {
> +    case XEN_DOMCTL_SCHEDOP_getinfo:
> +        /* for debug use, whenever adjust Dom0 parameter, do global dump */
> +        if ( d->domain_id == 0 ) {
> +            rt_dump(ops);
> +        }
> +
> +        local_sched.num_vcpus = 0;
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            ASSERT(vcpu_index < XEN_LEGACY_MAX_VCPUS);
> +            local_sched.vcpus[vcpu_index].budget = svc->budget;
> +            local_sched.vcpus[vcpu_index].period = svc->period;
> +
> +            vcpu_index++;
> +        }
> +        local_sched.num_vcpus = vcpu_index;
> +        copy_to_guest(op->u.rt.schedule, &local_sched, 1);
> +        rc = 0;
> +        break;
> +    case XEN_DOMCTL_SCHEDOP_putinfo:
> +        copy_from_guest(&local_sched, op->u.rt.schedule, 1);
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            if ( local_sched.vcpu_index == svc->vcpu->vcpu_id ) { /* adjust per VCPU parameter */
> +                vcpu_index = local_sched.vcpu_index;
> +
> +                if ( vcpu_index < 0 || vcpu_index > XEN_LEGACY_MAX_VCPUS) {
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
> +                            vcpu_index);
> +                }else{
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
> +                            vcpu_index, local_sched.vcpus[vcpu_index].period, 
> +                            local_sched.vcpus[vcpu_index].budget);
> +                }
> +
> +                svc->period = local_sched.vcpus[vcpu_index].period;
> +                svc->budget = local_sched.vcpus[vcpu_index].budget;
> +
> +                break;
> +            }
> +        }
> +        rc = 0;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +
> +static struct rt_private _rt_priv;
> +
> +const struct scheduler sched_rt_def = {
> +    .name           = "SMP RT Scheduler",
> +    .opt_name       = "rt",
> +    .sched_id       = XEN_SCHEDULER_RT,
> +    .sched_data     = &_rt_priv,
> +
> +    .dump_cpu_state = rt_dump_pcpu,
> +    .dump_settings  = rt_dump,
> +    .init           = rt_init,
> +    .deinit         = rt_deinit,
> +    .alloc_pdata    = rt_alloc_pdata,
> +    .free_pdata     = rt_free_pdata,
> +    .alloc_domdata  = rt_alloc_domdata,
> +    .free_domdata   = rt_free_domdata,
> +    .init_domain    = rt_dom_init,
> +    .destroy_domain = rt_dom_destroy,
> +    .alloc_vdata    = rt_alloc_vdata,
> +    .free_vdata     = rt_free_vdata,
> +    .insert_vcpu    = rt_vcpu_insert,
> +    .remove_vcpu    = rt_vcpu_remove,
> +
> +    .adjust         = rt_dom_cntl,
> +
> +    .pick_cpu       = rt_cpu_pick,
> +    .do_schedule    = rt_schedule,
> +    .sleep          = rt_vcpu_sleep,
> +    .wake           = rt_vcpu_wake,
> +    .context_saved  = rt_context_saved,
> +
> +    .yield          = NULL,
> +    .migrate        = NULL,
>
You can skip the hunks that you want to be NULL. If you don't, because
you think it's worthwhile to give a more visual and explicit info about
a particular hunk being not implemented, put down a comment on why that
is the case.

It think, in this case, it's good to show the reviewers and future
readers/hackers that we don't have yield and migrate here, so ack on the
"=NULL", but do add the comment on what that is.

> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -169,7 +169,7 @@ extern const struct scheduler sched_sedf_def;
>  extern const struct scheduler sched_credit_def;
>  extern const struct scheduler sched_credit2_def;
>  extern const struct scheduler sched_arinc653_def;
> -
> +extern const struct scheduler sched_rt_def;
>  
Don't kill blank lines, just add your stuff.

Phew... that's it for this patch... let's move to (hopefully) simpler
ones! :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
  2014-07-11 14:27   ` Dario Faggioli
@ 2014-07-11 14:37   ` Andrew Cooper
  2014-07-11 15:21     ` Dario Faggioli
  1 sibling, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2014-07-11 14:37 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, lichong659, dgolomb

On 11/07/14 05:49, Meng Xu wrote:
> This is the core rt scheduler patch. It adds the new real time scheduler to
> the hypervisor, as the non-default scheduler.
>
> This scheduler follows the pre-emptive Global EDF theory in real-time field.
> Each VCPU can have a dedicated period and budget.
> While scheduled, a VCPU burns its budget.
> A VCPU has its budget replenished at the beginning of each of its periods;
> The VCPU discards its unused budget at the end of each of its periods.
> If a VCPU runs out of budget in a period, it has to wait until next period.
> The mechanism of how to burn a VCPU's budget depends on the server mechanism
> implemented for each VCPU.
>
> Server mechanism: a VCPU is implemented as a deferable server.
> When a VCPU has a task running on it, its budget is continuously
> burned;
> When a VCPU has no task but with budget left, its budget is
> preserved.
>
> Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> At any scheduling point, the VCPU with earliest deadline has highest
> priority.
>
> Queue scheme: A global runqueue for each CPU pool.
> The runqueue holds all runnable VCPUs.
> VCPUs in the runqueue are divided into two parts: with and without budget.
> At each part, VCPUs are sorted based on EDF priority scheme.
>
> Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
>
> Note: cpumask and cpupool is supported.
>
> This is still in the development phase.
>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>

Some notes from a hypervisor point of view.

> ---
>  xen/common/Makefile         |    1 +
>  xen/common/sched_rt.c       |  984 +++++++++++++++++++++++++++++++++++++++++++
>  xen/common/schedule.c       |    1 +
>  xen/include/public/domctl.h |   19 +
>  xen/include/xen/sched-if.h  |    2 +-
>  5 files changed, 1006 insertions(+), 1 deletion(-)
>  create mode 100644 xen/common/sched_rt.c
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index 3683ae3..5a23aa4 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>  obj-y += sched_credit2.o
>  obj-y += sched_sedf.o
>  obj-y += sched_arinc653.o
> +obj-y += sched_rt.o
>  obj-y += schedule.o
>  obj-y += shutdown.o
>  obj-y += softirq.o
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..41543a2
> --- /dev/null
> +++ b/xen/common/sched_rt.c
> @@ -0,0 +1,984 @@
> +/******************************************************************************
> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
> + * EDF scheduling is one of most popular real-time scheduling algorithm used in
> + * embedded field.
> + *
> + * by Sisu Xi, 2013, Washington University in Saint Louis
> + * and Meng Xu, 2014, University of Pennsylvania
> + *
> + * based on the code of credit Scheduler
> + */
> +
> +#include <xen/config.h>
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +#include <xen/sched.h>
> +#include <xen/domain.h>
> +#include <xen/delay.h>
> +#include <xen/event.h>
> +#include <xen/time.h>
> +#include <xen/perfc.h>
> +#include <xen/sched-if.h>
> +#include <xen/softirq.h>
> +#include <asm/atomic.h>
> +#include <xen/errno.h>
> +#include <xen/trace.h>
> +#include <xen/cpu.h>
> +#include <xen/keyhandler.h>
> +#include <xen/trace.h>
> +#include <xen/guest_access.h>
> +
> +/*
> + * TODO:
> + *
> + * Migration compensation and resist like credit2 to better use cache;
> + * Lock Holder Problem, using yield?
> + * Self switch problem: VCPUs of the same domain may preempt each other;
> + */
> +
> +/*
> + * Design:
> + *
> + * This scheduler follows the Preemptive Global EDF theory in real-time field.
> + * Each VCPU can have a dedicated period and budget. 
> + * While scheduled, a VCPU burns its budget.
> + * A VCPU has its budget replenished at the beginning of each of its periods;
> + * The VCPU discards its unused budget at the end of each of its periods.
> + * If a VCPU runs out of budget in a period, it has to wait until next period.
> + * The mechanism of how to burn a VCPU's budget depends on the server mechanism
> + * implemented for each VCPU.
> + *
> + * Server mechanism: a VCPU is implemented as a deferable server.
> + * When a VCPU has a task running on it, its budget is continuously burned;
> + * When a VCPU has no task but with budget left, its budget is preserved.
> + *
> + * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> + * At any scheduling point, the VCPU with earliest deadline has highest priority.
> + *
> + * Queue scheme: A global runqueue for each CPU pool. 
> + * The runqueue holds all runnable VCPUs. 
> + * VCPUs in the runqueue are divided into two parts: with and without remaining budget. 
> + * At each part, VCPUs are sorted based on EDF priority scheme.
> + *
> + * Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
> + *
> + * Note: cpumask and cpupool is supported.
> + */
> +
> +/*
> + * Locking:
> + * Just like credit2, a global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + *
> + * The lock is already grabbed when calling wake/sleep/schedule/ functions in schedule.c
> + *
> + * The functions involes RunQ and needs to grab locks are:
> + *    dump, vcpu_insert, vcpu_remove, context_saved,
> + */
> +
> +
> +/*
> + * Default parameters in ms
> + */
> +#define RT_DEFAULT_PERIOD     10
> +#define RT_DEFAULT_BUDGET      4
> +
> +/*
> + * Useful macros
> + */
> +#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
> +#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
> +#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
> +#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)
> +
> +/*
> + * Flags
> + */
> +#define __RT_scheduled            1
> +#define RT_scheduled (1<<__RT_scheduled)
> +#define __RT_delayed_runq_add     2
> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
> +
> +/*
> + * Used to printout debug information
> + */
> +#define printtime()     ( printk("%d : %3ld.%3ld : %-19s ", smp_processor_id(), NOW()/MILLISECS(1), NOW()%MILLISECS(1)/1000, __func__) )

NOW() is fairly expensive, and using it twice like this is going to lead
to inaccurate times in the message.  Anything like this should be
XENLOG_DEBUG, and smp_processor_id() returns an unsigned number.

I would suggest something more like this:

#define printtime() \
({ s_time_t now = NOW(); \
    printk(XENLOG_DEBUG "%u : %3ld.%eld : %-19s\n", smp_processor_id(),\
              now / MILLISECS(1), (now % MILLISECS(1)) / 1000, \
              __func__); })

> +
> +/*
> + * Systme-wide private data, include a global RunQueue
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + * It can be grabbed via vcpu_schedule_lock_irq()
> + */
> +struct rt_private {
> +    spinlock_t lock;        /* The global coarse grand lock */
> +    struct list_head sdom;  /* list of availalbe domains, used for dump */
> +    struct list_head runq;  /* Ordered list of runnable VMs */
> +    cpumask_t cpus;         /* cpumask_t of available physical cpus */
> +    cpumask_t tickled;      /* another cpu in the queue already ticked this one */
> +};
> +
> +/*
> + * Virtual CPU
> + */
> +struct rt_vcpu {
> +    struct list_head runq_elem; /* On the runqueue list */
> +    struct list_head sdom_elem; /* On the domain VCPU list */
> +
> +    /* Up-pointers */
> +    struct rt_dom *sdom;
> +    struct vcpu *vcpu;
> +
> +    /* VCPU parameters, in milliseconds */
> +    s_time_t period;
> +    s_time_t budget;
> +
> +    /* VCPU current infomation */
> +    long cur_budget;             /* current budget in microseconds */

Can budget be negative ?

> +    s_time_t last_start;        /* last start time, used to calculate budget */
> +    s_time_t cur_deadline;      /* current deadline, used to do EDF */
> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
> +};
> +
> +/*
> + * Domain
> + */
> +struct rt_dom {
> +    struct list_head vcpu;      /* link its VCPUs */
> +    struct list_head sdom_elem; /* link list on rt_priv */
> +    struct domain *dom;         /* pointer to upper domain */
> +};
> +
> +/*
> + * RunQueue helper functions
> + */
> +static int
> +__vcpu_on_runq(struct rt_vcpu *svc)
> +{
> +   return !list_empty(&svc->runq_elem);
> +}
> +
> +static struct rt_vcpu *
> +__runq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, runq_elem);
> +}
> +
> +static inline void
> +__runq_remove(struct rt_vcpu *svc)
> +{
> +    if ( __vcpu_on_runq(svc) )
> +        list_del_init(&svc->runq_elem);
> +}
> +
> +/*
> + * Insert a vcpu in the RunQ basing on vcpu's deadline: 
> + * vcpus with shorter deadline are inserted first.
> + * EDF schedule policy: vcpu with smaller deadline has higher priority;
> + * When vcpus have the same deadline, insert the current one at the head of these vcpus.
> + */
> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +   

The line above this has trailing whitespace, which you should clean up

> +    ASSERT( spin_is_locked(per_cpu(schedule_data, svc->vcpu->processor).schedule_lock) );

Beware assertions involving spin_is_locked().  It checks that someone
somewhere has the schedule lock held, not necessarily that this current
pcpu is holding the schedule lock.

Having the assertion is better than not having it, but do be aware that
it doesn't catch every case you might expect it to.

> +
> +    if ( __vcpu_on_runq(svc) )
> +        return;
> +
> +    list_for_each(iter, runq) {

Coding style - braces on next line.

> +        struct rt_vcpu * iter_svc = __runq_elem(iter);
> +
> +        if ( svc->cur_budget > 0 ) { /* svc still has budget */
> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        } else { /* svc has no budget */
> +            if ( iter_svc->cur_budget == 0 &&
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        }
> +    }
> +
> +    list_add_tail(&svc->runq_elem, iter);
> +}
> +
> +
> +/*
> + * Debug related code, dump vcpu/cpu information
> + */
> +static void
> +rt_dump_vcpu(struct rt_vcpu *svc)
> +{
> +    if ( svc == NULL ) {
> +        printk("NULL!\n");
> +        return;
> +    }
> +#define cpustr keyhandler_scratch

char *cpustr = keyhandler_scratch;

Less macroism and more typesafety.

> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
> +    printk("[%5d.%-2d] cpu %d, (%"PRId64", %"PRId64"), cur_b=%"PRId64" cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s ",
> +            svc->vcpu->domain->domain_id,
> +            svc->vcpu->vcpu_id,
> +            svc->vcpu->processor,
> +            svc->period,
> +            svc->budget,
> +            svc->cur_budget,
> +            svc->cur_deadline,
> +            svc->last_start,
> +            __vcpu_on_runq(svc),
> +            vcpu_runnable(svc->vcpu),
> +            cpustr);
> +    memset(cpustr, 0, sizeof(char)*1024);
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));
> +    printk("cpupool=%s\n", cpustr);
> +#undef cpustr
> +}
> +
> +static void
> +rt_dump_pcpu(const struct scheduler *ops, int cpu)

cpu ids should be unsigned.

> +{
> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +}
> +
> +/*
> + * should not need lock here. only showing stuff 
> + */
> +static void
> +rt_dump(const struct scheduler *ops)
> +{
> +    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct rt_vcpu *svc;
> +    int cpu = 0;
> +    int loop = 0;
> +
> +    printtime();
> +    printk("Priority Scheme: EDF\n");
> +
> +    printk("PCPU info: \n");
> +    for_each_cpu(cpu, &prv->cpus) {
> +        rt_dump_pcpu(ops, cpu);
> +    }
> +
> +    printk("Global RunQueue info: \n");
> +    loop = 0;
> +    runq = RUNQ(ops);
> +    list_for_each( iter, runq ) {
> +        svc = __runq_elem(iter);
> +        printk("\t%3d: ", ++loop);
> +        rt_dump_vcpu(svc);
> +    }
> +
> +    printk("Domain info: \n");
> +    loop = 0;
> +    list_for_each( iter_sdom, &prv->sdom ) {
> +        struct rt_dom *sdom;
> +        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
> +        printk("\tdomain: %d\n", sdom->dom->domain_id);
> +
> +        list_for_each( iter_svc, &sdom->vcpu ) {
> +            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
> +            printk("\t\t%3d: ", ++loop);
> +            rt_dump_vcpu(svc);
> +        }
> +    }
> +
> +    printk("\n");
> +}
> +
> +/*
> + * Init/Free related code
> + */
> +static int
> +rt_init(struct scheduler *ops)
> +{
> +    struct rt_private *prv;
> +
> +    prv = xzalloc(struct rt_private);

These 3 lines can be condensed into 1.

> +    if ( prv == NULL ) {
> +        printk("xzalloc failed at rt_private\n");

SCHED_OP(..., init) will bail on error.  No need to printk() as well.

> +        return -ENOMEM;
> +    }
> +
> +    ops->sched_data = prv;
> +    spin_lock_init(&prv->lock);
> +    INIT_LIST_HEAD(&prv->sdom);
> +    INIT_LIST_HEAD(&prv->runq);
> +    cpumask_clear(&prv->tickled);
> +    cpumask_clear(&prv->cpus);

Using the xzalloc(), you don't need to re-0 elements, so these
cpumask_clear()s can be dropped.

> +
> +    printtime();
> +    printk("\n");
> +
> +    return 0;
> +}
> +
> +static void
> +rt_deinit(const struct scheduler *ops)
> +{
> +    struct rt_private *prv;
> +
> +    printtime();
> +    printk("\n");
> +
> +    prv = RT_PRIV(ops);

this line can be joined with the declaration of *prv.

> +    if ( prv )
> +        xfree(prv);

xfree() works just like free().  Please drop the NULL check.

> +}
> +
> +/* 
> + * point per_cpu spinlock to the global system lock; all cpu have same global system lock 
> + */
> +static void *
> +rt_alloc_pdata(const struct scheduler *ops, int cpu)

Eww.  Not your fault, but struct scheduler needs its idea of cpus to
turn less signed.

> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    cpumask_set_cpu(cpu, &prv->cpus);
> +
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +
> +    printtime();
> +    printk("%s total cpus: %d", __FUNCTION__, cpumask_weight(&prv->cpus));
> +    return (void *)1;

This is going to end in disaster, as this_cpu(schedule_data).sched_priv
now contains a bogus pointer.

> +}
> +
> +static void
> +rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    cpumask_clear_cpu(cpu, &prv->cpus);
> +    printtime();
> +    printk("%s cpu=%d\n", __FUNCTION__, cpu);
> +}
> +
> +static void *
> +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    sdom = xzalloc(struct rt_dom);
> +    if ( sdom == NULL ) {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&sdom->vcpu);
> +    INIT_LIST_HEAD(&sdom->sdom_elem);
> +    sdom->dom = dom;
> +
> +    /* spinlock here to insert the dom */
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +
> +    return (void *)sdom;
> +}
> +
> +static void
> +rt_free_domdata(const struct scheduler *ops, void *data)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom = data;
> +    struct rt_private * prv = RT_PRIV(ops);

Please be consistent with *s in pointers declarations.  Prevailing style
is lacking a space.

> +
> +    printtime();
> +    printk("dom=%d\n", sdom->dom->domain_id);
> +
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_del_init(&sdom->sdom_elem);
> +    spin_unlock_irqrestore(&prv->lock, flags);

Why irqsave ?

> +    xfree(data);
> +}
> +
> +static int
> +rt_dom_init(const struct scheduler *ops, struct domain *dom)
> +{
> +    struct rt_dom *sdom;
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    /* IDLE Domain does not link on rt_private */
> +    if ( is_idle_domain(dom) ) { return 0; }

Coding style - extranious braces and statement on same line as if.

> +
> +    sdom = rt_alloc_domdata(ops, dom);
> +    if ( sdom == NULL ) {
> +        printk("%s, failed\n", __func__);
> +        return -ENOMEM;
> +    }
> +    dom->sched_priv = sdom;
> +
> +    return 0;
> +}
> +
> +static void
> +rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
> +{
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    rt_free_domdata(ops, RT_DOM(dom));
> +}
> +
> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +    long count;
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;            /* init last_start is 0 */
> +
> +    svc->period = RT_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) && vc->domain->domain_id != 0 ) {
> +        svc->budget = RT_DEFAULT_BUDGET;
> +    } else {
> +        svc->budget = RT_DEFAULT_PERIOD; /* give vcpus of dom0 100% utilization */
> +    }
> +
> +    count = (now/MILLISECS(svc->period)) + 1;

Coding style = spaces around binary operators.

> +    /* sync all VCPU's start time to 0 */
> +    svc->cur_deadline += count*MILLISECS(svc->period);
> +
> +    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
> +    /* Debug only: dump new vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    return svc;
> +}
> +
> +static void
> +rt_free_vdata(const struct scheduler *ops, void *priv)
> +{
> +    struct rt_vcpu *svc = priv;
> +
> +    /* Debug only: dump freed vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +    xfree(svc);
> +}
> +
> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    /* IDLE VCPU not allowed on RunQ */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
> +}
> +
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;

Why are the pointers themselves const?

> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( !is_idle_vcpu(vc) ) {
> +        list_del_init(&svc->sdom_elem);
> +    }
> +}
> +
> +/* 
> + * Pick a valid CPU for the vcpu vc
> + * Valid CPU of a vcpu is intesection of vcpu's affinity and available cpus
> + */
> +static int
> +rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    cpumask_t cpus;
> +    cpumask_t *online;
> +    int cpu;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, &prv->cpus, online);
> +    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
> +
> +    cpu = cpumask_test_cpu(vc->processor, &cpus)
> +            ? vc->processor 
> +            : cpumask_cycle(vc->processor, &cpus);
> +    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +
> +    return cpu;
> +}
> +
> +/*
> + * Implemented as deferrable server. 
> + * Different server mechanism has different implementation.
> + * burn budget at microsecond level. 
> + */
> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) {
> +    s_time_t delta;
> +    unsigned int consume;
> +    long count = 0;
> +
> +    /* first time called for this svc, update last_start */
> +    if ( svc->last_start == 0 ) {
> +        svc->last_start = now;
> +        return;
> +    }
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) {
> +        return;
> +    }
> +
> +    /* don't burn budget for Domain-0, RT-Xen use only */
> +    if ( svc->sdom->dom->domain_id == 0 ) {
> +        return;
> +    }
> +
> +    /* update deadline info */
> +    delta = now - svc->cur_deadline;
> +    if ( delta >= 0 ) {
> +        count = ( delta/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +        return;
> +    }
> +
> +    delta = now - svc->last_start;
> +    if ( delta < 0 ) {
> +        printk("%s, delta = %ld for ", __func__, delta);
> +        rt_dump_vcpu(svc);
> +        svc->last_start = now;  /* update last_start */
> +        svc->cur_budget = 0;
> +        return;
> +    }
> +
> +    if ( svc->cur_budget == 0 ) return;
> +
> +    /* burn at microseconds level */
> +    consume = ( delta/MICROSECS(1) );
> +    if ( delta%MICROSECS(1) > MICROSECS(1)/2 ) consume++;
> +
> +    svc->cur_budget -= consume;
> +    if ( svc->cur_budget < 0 ) svc->cur_budget = 0;
> +}
> +
> +/* 
> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq) {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
> +        if ( iter_svc->cur_budget <= 0 )
> +            continue;
> +
> +        svc = iter_svc;
> +        break;
> +    }
> +
> +    return svc;
> +}
> +
> +/*
> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    s_time_t diff;
> +    long count;
> +
> +    list_for_each_safe(iter, tmp, runq) {
> +        svc = __runq_elem(iter);
> +
> +        diff = now - svc->cur_deadline;
> +        if ( diff > 0 ) {
> +            count = (diff/MILLISECS(svc->period)) + 1;
> +            svc->cur_deadline += count * MILLISECS(svc->period);
> +            svc->cur_budget = svc->budget * 1000;
> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret;
> +
> +    /* clear ticked bit now that we've been scheduled */
> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> +        cpumask_clear_cpu(cpu, &prv->tickled);
> +
> +    /* burn_budget would return for IDLE VCPU */
> +    burn_budgets(ops, scurr, now);
> +
> +    __repl_update(ops, now);
> +
> +    if ( tasklet_work_scheduled ) {
> +        snext = RT_VCPU(idle_vcpu[cpu]);
> +    } else {
> +        cpumask_t cur_cpu;
> +        cpumask_clear(&cur_cpu);
> +        cpumask_set_cpu(cpu, &cur_cpu);
> +        snext = __runq_pick(ops, cur_cpu);
> +        if ( snext == NULL )
> +            snext = RT_VCPU(idle_vcpu[cpu]);
> +
> +        /* if scurr has higher priority and budget, still pick scurr */
> +        if ( !is_idle_vcpu(current) &&
> +             vcpu_runnable(current) &&
> +             scurr->cur_budget > 0 &&
> +             ( is_idle_vcpu(snext->vcpu) ||
> +               scurr->cur_deadline <= snext->cur_deadline ) ) {
> +            snext = scurr;
> +        }
> +    }
> +
> +    if ( snext != scurr &&
> +         !is_idle_vcpu(current) &&
> +         vcpu_runnable(current) ) {
> +        set_bit(__RT_delayed_runq_add, &scurr->flags);
> +    }
> +
> +    snext->last_start = now;
> +    ret.migrated = 0;
> +    if ( !is_idle_vcpu(snext->vcpu) ) {
> +        if ( snext != scurr ) {
> +            __runq_remove(snext);
> +            set_bit(__RT_scheduled, &snext->flags);
> +        }
> +        if ( snext->vcpu->processor != cpu ) {
> +            snext->vcpu->processor = cpu;
> +            ret.migrated = 1;
> +        }
> +    }
> +
> +    ret.time = MILLISECS(1);
> +    ret.task = snext->vcpu;
> +
> +    return ret;
> +}
> +
> +/*
> + * Remove VCPU from RunQ
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static void
> +rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( curr_on_cpu(vc->processor) == vc ) {
> +        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    if ( __vcpu_on_runq(svc) ) {
> +        __runq_remove(svc);
> +    }
> +
> +    clear_bit(__RT_delayed_runq_add, &svc->flags);
> +}
> +
> +/*
> + * Pick a vcpu on a cpu to kick out to place the running candidate
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belongs to the same domain?
> + *    replace a vcpu belonging to the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function 
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * scheduled = NULL;    /* lowest priority scheduled */
> +    struct rt_vcpu * iter_svc;
> +    struct vcpu * iter_vc;
> +    int cpu = 0;
> +    cpumask_t not_tickled;                      /* not tickled cpus */
> +    cpumask_t *online;
> +
> +    if ( new == NULL || is_idle_vcpu(new->vcpu) ) return;
> +
> +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> +    cpumask_and(&not_tickled, online, &prv->cpus);
> +    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
> +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> +
> +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) {
> +        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    /* 2) if there are any idle pcpu, kick it */
> +    /* The same loop also find the one with lowest priority */
> +    for_each_cpu(cpu, &not_tickled) {
> +        iter_vc = curr_on_cpu(cpu);
> +        if ( is_idle_vcpu(iter_vc) ) {
> +            cpumask_set_cpu(cpu, &prv->tickled);
> +            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> +            return;
> +        }
> +        iter_svc = RT_VCPU(iter_vc);
> +        if ( scheduled == NULL || 
> +             iter_svc->cur_deadline > scheduled->cur_deadline ) {
> +            scheduled = iter_svc;
> +        }
> +    }
> +
> +    /* 3) candicate has higher priority, kick out the lowest priority vcpu */
> +    if ( scheduled != NULL && new->cur_deadline < scheduled->cur_deadline ) {
> +        cpumask_set_cpu(scheduled->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(scheduled->vcpu->processor, SCHEDULE_SOFTIRQ);
> +    }
> +    return;
> +}
> +
> +/* 
> + * Should always wake up runnable vcpu, put it back to RunQ. 
> + * Check priority to raise interrupt 
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + * TODO: what if these two vcpus belongs to the same domain? 
> + */
> +static void
> +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    s_time_t diff;
> +    s_time_t now = NOW();
> +    long count = 0;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) return;
> +
> +    /* on RunQ, just update info is ok */
> +    if ( unlikely(__vcpu_on_runq(svc)) ) return;
> +
> +    /* If context hasn't been saved yet, set flag so it will add later */
> +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) {
> +        set_bit(__RT_delayed_runq_add, &svc->flags);
> +        return;
> +    }
> +
> +    /* update deadline info */
> +    diff = now - svc->cur_deadline;
> +    if ( diff >= 0 ) {
> +        count = ( diff/MILLISECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MILLISECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +    }
> +
> +    __runq_insert(ops, svc);
> +    __repl_update(ops, now);
> +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
> +    runq_tickle(ops, snext);
> +
> +    return;
> +}
> +
> +/* 
> + * scurr has finished context switch, insert it back to the RunQ,
> + * and then pick the highest priority vcpu from runq to run 
> + */
> +static void
> +rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * svc = RT_VCPU(vc);
> +    struct rt_vcpu * snext = NULL;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    spinlock_t *lock;
> +
> +    clear_bit(__RT_scheduled, &svc->flags);
> +    if ( is_idle_vcpu(vc) ) return;
> +
> +    lock = vcpu_schedule_lock_irq(vc);
> +    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
> +         likely(vcpu_runnable(vc)) ) {
> +        __runq_insert(ops, svc);
> +        __repl_update(ops, NOW());
> +        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
> +        runq_tickle(ops, snext);
> +    }
> +    vcpu_schedule_unlock_irq(lock, vc);
> +}
> +
> +/*
> + * set/get each vcpu info of each domain
> + */
> +static int
> +rt_dom_cntl(const struct scheduler *ops, struct domain *d, struct xen_domctl_scheduler_op *op)
> +{
> +    xen_domctl_sched_rt_params_t local_sched;
> +    struct rt_dom * const sdom = RT_DOM(d);
> +    struct list_head *iter;
> +    int vcpu_index = 0;
> +    int rc = -EINVAL;
> +
> +    switch ( op->cmd )
> +    {
> +    case XEN_DOMCTL_SCHEDOP_getinfo:
> +        /* for debug use, whenever adjust Dom0 parameter, do global dump */
> +        if ( d->domain_id == 0 ) {
> +            rt_dump(ops);
> +        }
> +
> +        local_sched.num_vcpus = 0;
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            ASSERT(vcpu_index < XEN_LEGACY_MAX_VCPUS);
> +            local_sched.vcpus[vcpu_index].budget = svc->budget;
> +            local_sched.vcpus[vcpu_index].period = svc->period;
> +
> +            vcpu_index++;
> +        }
> +        local_sched.num_vcpus = vcpu_index;
> +        copy_to_guest(op->u.rt.schedule, &local_sched, 1);
> +        rc = 0;
> +        break;
> +    case XEN_DOMCTL_SCHEDOP_putinfo:
> +        copy_from_guest(&local_sched, op->u.rt.schedule, 1);
> +        list_for_each( iter, &sdom->vcpu ) {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            if ( local_sched.vcpu_index == svc->vcpu->vcpu_id ) { /* adjust per VCPU parameter */
> +                vcpu_index = local_sched.vcpu_index;
> +
> +                if ( vcpu_index < 0 || vcpu_index > XEN_LEGACY_MAX_VCPUS) {
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
> +                            vcpu_index);
> +                }else{
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
> +                            vcpu_index, local_sched.vcpus[vcpu_index].period, 
> +                            local_sched.vcpus[vcpu_index].budget);
> +                }
> +
> +                svc->period = local_sched.vcpus[vcpu_index].period;
> +                svc->budget = local_sched.vcpus[vcpu_index].budget;
> +
> +                break;
> +            }
> +        }
> +        rc = 0;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +
> +static struct rt_private _rt_priv;
> +
> +const struct scheduler sched_rt_def = {
> +    .name           = "SMP RT Scheduler",
> +    .opt_name       = "rt",
> +    .sched_id       = XEN_SCHEDULER_RT,
> +    .sched_data     = &_rt_priv,
> +
> +    .dump_cpu_state = rt_dump_pcpu,
> +    .dump_settings  = rt_dump,
> +    .init           = rt_init,
> +    .deinit         = rt_deinit,
> +    .alloc_pdata    = rt_alloc_pdata,
> +    .free_pdata     = rt_free_pdata,
> +    .alloc_domdata  = rt_alloc_domdata,
> +    .free_domdata   = rt_free_domdata,
> +    .init_domain    = rt_dom_init,
> +    .destroy_domain = rt_dom_destroy,
> +    .alloc_vdata    = rt_alloc_vdata,
> +    .free_vdata     = rt_free_vdata,
> +    .insert_vcpu    = rt_vcpu_insert,
> +    .remove_vcpu    = rt_vcpu_remove,
> +
> +    .adjust         = rt_dom_cntl,
> +
> +    .pick_cpu       = rt_cpu_pick,
> +    .do_schedule    = rt_schedule,
> +    .sleep          = rt_vcpu_sleep,
> +    .wake           = rt_vcpu_wake,
> +    .context_saved  = rt_context_saved,
> +
> +    .yield          = NULL,
> +    .migrate        = NULL,
> +};
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index e9eb0bc..2d13966 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
>      &sched_sedf_def,
>      &sched_credit_def,
>      &sched_credit2_def,
> +    &sched_rt_def,
>      &sched_arinc653_def,
>  };
>  
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 5b11bbf..d1a8201 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -339,6 +339,20 @@ struct xen_domctl_max_vcpus {
>  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>  
> +/*
> + * This structure is used to pass to rt scheduler from a 
> + * privileged domain to Xen
> + */
> +struct xen_domctl_sched_rt_params {
> +    struct {
> +        signed long period; /* s_time_t type */
> +        signed long budget; 

No implicitly-width'd types in the public headers.  int64_t's should do.

> +    } vcpus[XEN_LEGACY_MAX_VCPUS];

This is a limit which should be avoided in newly-added public API, as it
will force a change at some point in the future.

> +    uint16_t num_vcpus;
> +    uint16_t vcpu_index;

Can you explain what the purpose of these are?

As a gut feel, I think that these two ints should be part of the struct
xen_domctl_sched_rt below, with a guest handle pointing to an array of {
period, budget } structures, which avoids the LEGACY_MAX limit.

> +};
> +typedef struct xen_domctl_sched_rt_params xen_domctl_sched_rt_params_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_sched_rt_params_t);
>  
>  /* XEN_DOMCTL_scheduler_op */
>  /* Scheduler types. */
> @@ -346,6 +360,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>  #define XEN_SCHEDULER_CREDIT   5
>  #define XEN_SCHEDULER_CREDIT2  6
>  #define XEN_SCHEDULER_ARINC653 7
> +#define XEN_SCHEDULER_RT       8
> +
>  /* Set or get info? */
>  #define XEN_DOMCTL_SCHEDOP_putinfo 0
>  #define XEN_DOMCTL_SCHEDOP_getinfo 1
> @@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
>          struct xen_domctl_sched_credit2 {
>              uint16_t weight;
>          } credit2;
> +        struct xen_domctl_sched_rt{
> +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) schedule;
> +        } rt;

Thinking more generally, is rt an appropriate name here?  It seems a
little generic, as there are several classes of realtime schedulers
employing different semantics and parameters.

>  nul
>      } u;
>  };
>  typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index 4164dff..e452d32 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -169,7 +169,7 @@ extern const struct scheduler sched_sedf_def;
>  extern const struct scheduler sched_credit_def;
>  extern const struct scheduler sched_credit2_def;
>  extern const struct scheduler sched_arinc653_def;
> -
> +extern const struct scheduler sched_rt_def;

If you keep the double newline, this diff gets smaller.

~Andrew

>  
>  struct cpupool
>  {

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11  4:49 ` [PATCH RFC v1 4/4] libxc " Meng Xu
@ 2014-07-11 14:49   ` Dario Faggioli
  2014-07-11 16:23     ` Meng Xu
  2014-07-17 15:29     ` Ian Campbell
  0 siblings, 2 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 14:49 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 3550 bytes --]

On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> Add xc_sched_rt_* functions to interact with Xen to set/get domain's

> --- /dev/null
> +++ b/tools/libxc/xc_rt.c

> +#include "xc_private.h"
> +
> +int
> +xc_sched_rt_domain_set(
> +    xc_interface *xch,
> +    uint32_t domid,
> +    struct xen_domctl_sched_rt_params *sdom)
>
So, both here...

> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BOUNCE(sdom, 
> +        sizeof(*sdom), 
> +        XC_HYPERCALL_BUFFER_BOUNCE_IN);
> +
... and here... I know libxc is not terribly consistent when it comes to
coding style. Still, I like stuff to be aligned to the opening brace of
the function/macro.

Have a look at xc_domain.c, e.g., 

...
int xc_vcpu_setaffinity(xc_interface *xch,
                        uint32_t domid,
                        int vcpu,
                        xc_cpumap_t cpumap_hard_inout,
                        xc_cpumap_t cpumap_soft_inout,
                        uint32_t flags)
{
    DECLARE_DOMCTL;
    DECLARE_HYPERCALL_BOUNCE(cpumap_hard_inout, 0,
                             XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
    DECLARE_HYPERCALL_BOUNCE(cpumap_soft_inout, 0,
                             XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
...

> +    if ( xc_hypercall_bounce_pre(xch, sdom) )
> +        return -1;
> +
> +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> +    domctl.domain = (domid_t) domid;
> +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT;
> +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
> +    set_xen_guest_handle(domctl.u.scheduler_op.u.rt.schedule, sdom);
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_bounce_post(xch, sdom);
> +
So, the bouncing logic seems fine. Looking at what other schedulers do,
there should be no particular need for bouncing anything.

xen/include/public/domctl.h:

struct xen_domctl_scheduler_op {
    uint32_t sched_id;  /* XEN_SCHEDULER_* */
    uint32_t cmd;       /* XEN_DOMCTL_SCHEDOP_* */
    union {
        struct xen_domctl_sched_sedf {
            uint64_aligned_t period;
            uint64_aligned_t slice;
            uint64_aligned_t latency;
            uint32_t extratime;
            uint32_t weight;
        } sedf;
        struct xen_domctl_sched_credit {
            uint16_t weight;
            uint16_t cap;
        } credit;
        struct xen_domctl_sched_credit2 {
            uint16_t weight;
        } credit2;
    } u;
};

I see you have a TODO item relate to this interface, so I guess I'm
leaving this alone.

My view here is, since per-VCPU scheduling parameters are important for
this scheduler, you either:
 - provide an interface for getting and setting the parameters of _one_
   _VCPU_, and the call them repeatedly, if for instance you need to act
   on a domain. In this case, no array and no bouncing should be
   necessary.
 - provide both the above, and another interface, that one with an array
   as big as the number of VCPUs of the domain, and use it to provide 
   the hypervisor with the parameters of all the VCPUs, and act 
   appropriately inside Xen.

Personally, I'd start with the former, and add the latter later, if we
deem it necessary.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11 11:02   ` Wei Liu
@ 2014-07-11 14:59     ` Meng Xu
  2014-07-11 15:07       ` Dario Faggioli
  0 siblings, 1 reply; 50+ messages in thread
From: Meng Xu @ 2014-07-11 14:59 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Dario Faggioli, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 3844 bytes --]

Hi Wei,

First, thank you very much for your comments!


> >  =head1 CPUPOOLS COMMANDS
> > diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
> > index 10a2e66..51b634a 100644
> > --- a/tools/libxl/xl.h
> > +++ b/tools/libxl/xl.h
> > @@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
> >  int main_sched_credit(int argc, char **argv);
> >  int main_sched_credit2(int argc, char **argv);
> >  int main_sched_sedf(int argc, char **argv);
> > +int main_sched_rt(int argc, char **argv);
> >  int main_domid(int argc, char **argv);
> >  int main_domname(int argc, char **argv);
> >  int main_rename(int argc, char **argv);
> > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> > index 68df548..c043f88 100644
> > --- a/tools/libxl/xl_cmdimpl.c
> > +++ b/tools/libxl/xl_cmdimpl.c
> > @@ -4947,6 +4947,7 @@ int main_sharing(int argc, char **argv)
> >      return 0;
> >  }
> >
> > +
>
> Stray blank line.
>

​I will delete the blank line.
​

>
> >  static int sched_domain_get(libxl_scheduler sched, int domid,
> >                              libxl_domain_sched_params *scinfo)
> >  {
> > @@ -5098,6 +5099,52 @@ static int sched_sedf_domain_output(
> >      return 0;
> >  }
> >
> > +
> > +static int sched_rt_domain_output(
> > +    int domid)
> > +{
> > +    char *domname;
> > +    libxl_domain_sched_params scinfo;
> > +    int rc, i;
> > +
> > +    if (domid < 0) {
> > +        printf("%-33s %4s %4s %6s %6s\n", "Name", "ID", "VCPU",
> "Period", "Budget");
> > +        return 0;
> > +    }
> > +
>
> Just print the header and nothing more?
>

​We just print out the header here. This function will call
​sched_domain_get(LIBXL_SCHEDULER_RT, domid, &scinfo) and print out the
VCPUs parameters of each domain in the calling function. ​



>
> > +    libxl_domain_sched_params_init(&scinfo);
> > +    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid, &scinfo);
> > +    if (rc)
> > +        return rc;
> > +
>
> This is violating libxl type paradigm. See libxl.h.


>  263  * libxl types
>  264  *
>  265  * Most libxl types are defined by the libxl IDL (see
>  266  * libxl_types.idl). The library provides a common set of methods for
>  267  * initialising and freeing these types.
>  268  *
>  269  * IDL-generated libxl types should be used as follows: the user must
>  270  * always call the "init" function before using a type, even if the
>  271  * variable is simply being passed by reference as an out parameter
>  272  * to a libxl function.  The user must always calls "dispose" exactly
>  273  * once afterwards, to clean up, regardless of whether operations on
>  274  * this object succeeded or failed.  See the xl code for examples.
>
> And you should check other places that use libxl types as well.
>

​Thank you very much for pasting the rules here! I really appreciate it.
However, I didn't quite get why it violate the libxl type paradigm and how
I should correct it. (Sorry. :-()​

We actually followed the way credit scheduler does in main_sched_credit(int
argc, char **argv)

        } else { /* set credit scheduler paramaters */
            libxl_domain_sched_params scinfo;
            libxl_domain_sched_params_init(&scinfo);
            scinfo.sched = LIBXL_SCHEDULER_CREDIT;
            if (opt_w)
                scinfo.weight = weight;
            if (opt_c)
                scinfo.cap = cap;
            rc = sched_domain_set(domid, &scinfo);
            libxl_domain_sched_params_dispose(&scinfo);

Could you help let me know how I should modify it? I will check the rest to
modify when I know how to do it.

Thank you very much!

Best,

Meng




-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 6705 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11 14:59     ` Meng Xu
@ 2014-07-11 15:07       ` Dario Faggioli
  2014-07-11 16:25         ` Meng Xu
  2014-07-13 12:58         ` Meng Xu
  0 siblings, 2 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 15:07 UTC (permalink / raw)
  To: Meng Xu
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 1829 bytes --]

On ven, 2014-07-11 at 10:59 -0400, Meng Xu wrote:


>         > +    libxl_domain_sched_params_init(&scinfo);
>         > +    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid,
>         &scinfo);
>         > +    if (rc)
>         > +        return rc;
>         > +

> ​Thank you very much for pasting the rules here! I really appreciate
> it. However, I didn't quite get why it violate the libxl type paradigm
> and how I should correct it. (Sorry. :-()​
> 
<<The user must always calls "dispose" exactly once afterwards, to clean
up, regardless of whether operations on this object succeeded or
failed>>

While, above, you're exiting, if rc is true, without calling dispose.

It depens a lot on the function, but what you usually do, is grouping
the calls to the various dispose under a label (typically 'out:') and
goto there to exit.

Look around, both in xl and libxl, you'll find plenty of examples of
that.

> We actually followed the way credit scheduler does
> in main_sched_credit(int argc, char **argv)
> 
> 
>         } else { /* set credit scheduler paramaters */
>             libxl_domain_sched_params scinfo;
>             libxl_domain_sched_params_init(&scinfo);
>             scinfo.sched = LIBXL_SCHEDULER_CREDIT;
>             if (opt_w)
>                 scinfo.weight = weight;
>             if (opt_c)
>                 scinfo.cap = cap;
>             rc = sched_domain_set(domid, &scinfo);
>             libxl_domain_sched_params_dispose(&scinfo);
> 
And in fact, here's dispose! :-)

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
  2014-07-11 11:05   ` Wei Liu
@ 2014-07-11 15:08   ` Dario Faggioli
  2014-07-12 18:16     ` Meng Xu
  2014-07-17 15:34     ` Ian Campbell
  2014-07-17 15:36   ` Ian Campbell
  2 siblings, 2 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 15:08 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 2218 bytes --]

On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:

> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c

> +static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
> +                               libxl_domain_sched_params *scinfo)
> +{
> +    struct xen_domctl_sched_rt_params sdom;
> +    int rc, i;
> +
> +    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
> +    if (rc != 0) {
> +        LOGE(ERROR, "getting domain sched rt");
> +        return ERROR_FAIL;
> +    }
> +
> +    libxl_domain_sched_params_init(scinfo);
> +    scinfo->sched = LIBXL_SCHEDULER_RT;
> +    scinfo->rt.max_vcpus = LIBXL_XEN_LEGACY_MAX_VCPUS;
> +    /*TODO: alloc array with exact num of dom's vcpus; and use GCNEW_ARRAY()*/
>
Indeed. That, or just do an hypercall for each VCPU, at least as a first
step.

> +    scinfo->rt.vcpus = (libxl_vcpu *)
> +                    malloc( sizeof(libxl_vcpu) * scinfo->rt.max_vcpus );
>
Yes, you said it yourself, I see: GCNEW_ARRAY() and no malloc() in
libxl.

> +    if ( !scinfo->rt.vcpus ){
> +        LOGE(ERROR, "Allocate lib_vcpu array fails\n");
> +        return ERROR_INVAL;
> +    }
> +    scinfo->rt.num_vcpus = sdom.num_vcpus;
> +    for( i = 0; i < sdom.num_vcpus; ++i)
> +    {
> +        scinfo->rt.vcpus[i].period = sdom.vcpus[i].period;
> +        scinfo->rt.vcpus[i].budget = sdom.vcpus[i].budget;
> +    }
> +
> +    return 0;
> +}
> +
> +#define SCHED_RT_VCPU_PARAMS_MAX    4294967295 /* 2^32-1 */
> +
I don't like the name. I'd go for something like
SCHED_RT_VCPU_PERIOD_MAX and SCHED_RT_VCPU_BUDGET_MAX.

You well can define both to the same value, of course.

As per the value, yes, even if we turn the interface in usecs, 2^32
usecs is still ~4000 secs, which ought to be enough as a period or
budget for everyone! :-)

For the rest of this file, I think I'll wait for v2, as the interface is
subject to be changed.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11 14:37   ` Andrew Cooper
@ 2014-07-11 15:21     ` Dario Faggioli
  2014-07-11 15:40       ` Andrew Cooper
  0 siblings, 1 reply; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 15:21 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 3441 bytes --]

On ven, 2014-07-11 at 15:37 +0100, Andrew Cooper wrote:
> On 11/07/14 05:49, Meng Xu wrote:

> > +/*
> > + * Virtual CPU
> > + */
> > +struct rt_vcpu {
> > +    struct list_head runq_elem; /* On the runqueue list */
> > +    struct list_head sdom_elem; /* On the domain VCPU list */
> > +
> > +    /* Up-pointers */
> > +    struct rt_dom *sdom;
> > +    struct vcpu *vcpu;
> > +
> > +    /* VCPU parameters, in milliseconds */
> > +    s_time_t period;
> > +    s_time_t budget;
> > +
> > +    /* VCPU current infomation */
> > +    long cur_budget;             /* current budget in microseconds */
> 
> Can budget be negative ?
> 
It depends on how you implement the core of the algorithm. Current SEDF
uses a cputime counter, which always increase, and checks whether budget
is being overcome with something lien (cputime >= budget).

Meng's implementation here does the opposite, initializes cur_budget
with budget, and subtract actual execution time from there, until it
reaches zero or, in case of overrun, below zero.

So, yes, if we stick with this version of things (which is perfectly
fine), I think cur_budget should be allowed to go negative.

However, I'd use s_time_t for it it, rather than long.

> >  #define XEN_SCHEDULER_CREDIT   5
> >  #define XEN_SCHEDULER_CREDIT2  6
> >  #define XEN_SCHEDULER_ARINC653 7
> > +#define XEN_SCHEDULER_RT       8
> > +
> >  /* Set or get info? */
> >  #define XEN_DOMCTL_SCHEDOP_putinfo 0
> >  #define XEN_DOMCTL_SCHEDOP_getinfo 1
> > @@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
> >          struct xen_domctl_sched_credit2 {
> >              uint16_t weight;
> >          } credit2;
> > +        struct xen_domctl_sched_rt{
> > +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) schedule;
> > +        } rt;
> 
> Thinking more generally, is rt an appropriate name here?  It seems a
> little generic, as there are several classes of realtime schedulers
> employing different semantics and parameters.
> 
That's true. The challenge here is finding a name which is
representative enough of the specific characteristics of the scheduler,
without it being known only in real-time research rooms! :-O

I'd stick with sched_rt.c for the filename, as it's an implementation of
EDF, after all, on top of which (read: in the same file) you can
potentially implement a bunch of different scheduling algorithms and
policies.

This particular one is called "Deferrable Server" (DS). Since it's on
top of EDF, which is a dynamic priority algorithm, it is sometimes
called "Dynamic Deferrable Server" (DDS). It's also part of a class of
algorithms known as "Resource Reservation" algorithms, which is how
(well, one of the ways with which) the RT community refers to algos with
a budgetting scheme inside it.

So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS (Resource
Reservation Deferrable Server)? Just DS (Deferrable Server)?

I'm not sure I like much any of these...

What was it that you had in mind when you said "there are several
classes of realtime schedulers employing different semantics and
parameters" ?

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11 15:21     ` Dario Faggioli
@ 2014-07-11 15:40       ` Andrew Cooper
  2014-07-11 15:48         ` Dario Faggioli
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Cooper @ 2014-07-11 15:40 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb

On 11/07/14 16:21, Dario Faggioli wrote:

>>>  #define XEN_SCHEDULER_CREDIT   5
>>>  #define XEN_SCHEDULER_CREDIT2  6
>>>  #define XEN_SCHEDULER_ARINC653 7
>>> +#define XEN_SCHEDULER_RT       8
>>> +
>>>  /* Set or get info? */
>>>  #define XEN_DOMCTL_SCHEDOP_putinfo 0
>>>  #define XEN_DOMCTL_SCHEDOP_getinfo 1
>>> @@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
>>>          struct xen_domctl_sched_credit2 {
>>>              uint16_t weight;
>>>          } credit2;
>>> +        struct xen_domctl_sched_rt{
>>> +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) schedule;
>>> +        } rt;
>>
>> Thinking more generally, is rt an appropriate name here?  It seems a
>> little generic, as there are several classes of realtime schedulers
>> employing different semantics and parameters.
>>
> That's true. The challenge here is finding a name which is
> representative enough of the specific characteristics of the scheduler,
> without it being known only in real-time research rooms! :-O
>
> I'd stick with sched_rt.c for the filename, as it's an implementation of
> EDF, after all, on top of which (read: in the same file) you can
> potentially implement a bunch of different scheduling algorithms and
> policies.
>
> This particular one is called "Deferrable Server" (DS). Since it's on
> top of EDF, which is a dynamic priority algorithm, it is sometimes
> called "Dynamic Deferrable Server" (DDS). It's also part of a class of
> algorithms known as "Resource Reservation" algorithms, which is how
> (well, one of the ways with which) the RT community refers to algos with
> a budgetting scheme inside it.
>
> So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS (Resource
> Reservation Deferrable Server)? Just DS (Deferrable Server)?
>
> I'm not sure I like much any of these...
>
> What was it that you had in mind when you said "there are several
> classes of realtime schedulers employing different semantics and
> parameters" ?
>
> Dario
>

Hmm - naming things is difficult.  How about bob?

My concern is that naming it "rt" seems too generic, and will cause
confusion of a) what type of realtime scheduler it is, and b) further
problems if someone wants to come and implement a different realtime
scheduler in the future.

XEN_SCHEDULER_RT_DS doesn't sound too bad.

~Andrew

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11 15:40       ` Andrew Cooper
@ 2014-07-11 15:48         ` Dario Faggioli
  2014-07-16 17:05           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 15:48 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1279 bytes --]

On ven, 2014-07-11 at 16:40 +0100, Andrew Cooper wrote:
> On 11/07/14 16:21, Dario Faggioli wrote:

> > So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS (Resource
> > Reservation Deferrable Server)? Just DS (Deferrable Server)?
> >
> > I'm not sure I like much any of these...
> >
> > What was it that you had in mind when you said "there are several
> > classes of realtime schedulers employing different semantics and
> > parameters" ?

> Hmm - naming things is difficult.  How about bob?
> 
Well, if we go that route, I prefer Alice! :-D :-D

> My concern is that naming it "rt" seems too generic, and will cause
> confusion of a) what type of realtime scheduler it is, and b) further
> problems if someone wants to come and implement a different realtime
> scheduler in the future.
> 
Totally understood and agreed.

> XEN_SCHEDULER_RT_DS doesn't sound too bad.
> 
It does not, indeed. Let's hear Meng, Sisu, and the other xen-devel
folks...

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Introduce rt real-time scheduler for Xen
  2014-07-11 11:06   ` Dario Faggioli
@ 2014-07-11 16:14     ` Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-11 16:14 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 1524 bytes --]

​Hi Wei and Dario,
​
2014-07-11 7:06 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:

> On ven, 2014-07-11 at 11:50 +0100, Wei Liu wrote:
> > On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
> > [...]
> > >
> > > [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> > > [PATCH RFC v1 2/4] xl for rt scheduler
> > > [PATCH RFC v1 3/4] libxl for rt scheduler
> > > [PATCH RFC v1 4/4] libxc for rt scheduler
> > >
> >
> > I have some general comments on how you arrange these patches.
> >
> > At a glance of the title and code you should do them in the order of 1,
> > 4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
> > libxc depends on hypervisor. You will break bisection with current
> > ordering.
> >
> Yep, I agree with Wei.
>
> > And we normally write titles like
> >   xen: add rt scheduler
> >   libxl: introduce rt scheduler
> >   xl: XXXX
> >   etc.
> > start with component name and separate with colon.
> >
> Indeed we do, and this helps quite a bit.
>
> > Last but not least, you need to CC relevant maintainers. You can find
> > out maintainers with scripts/get_maintainers.pl.
> >
> Yes, but this one, Meng almost got it right, I think.
>
> Basically, Meng, you're missing hypervisors maintainers (at least for
> patch 1). :-)
>
>
​Thank you very much for your advice!
I will modify them in the next version of the scheduler: 1) rearrange the
patch order, 2) change the commit log; 3) add all maintainers.

​Meng​

[-- Attachment #1.2: Type: text/html, Size: 2486 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: Introduce rt real-time scheduler for Xen
  2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
                   ` (4 preceding siblings ...)
  2014-07-11 10:50 ` Introduce rt real-time scheduler for Xen Wei Liu
@ 2014-07-11 16:19 ` Dario Faggioli
  5 siblings, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 16:19 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 6899 bytes --]

On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> This serie of patches adds rt real-time scheduler to Xen.
> 
He Meng, Sisu!

Nice to see you here on xen-devel with this nice code-drop! :-P

> In summary, It supports:
> 1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
> 2) Assign/display each VCPU's parameters of each domain;
> 3) Supports CPU Pool
> 
Great, thanks for doing the effort of extracting this from your code
base, and submit it here. :-)

Having look at the series carefully, I think it's a nice piece of work
already. There's quite a few modification and cleanups to do, and I
think there's room for quite a bit of improvement, but I really like the
fact that all the features are basically there already.

In particular, proper SMP support, per-VCPU scheduling parameters, and a
sane and theoretically sound budgetting scheme is what we're missing in
SEDF[*], and we need these things badly!

[*] Josh's RFC is improving this, but only wrt to the latter one (sane
scheduling algorithm).

> -----------------------------------------------------------------------------------------------------------------------------
> One scenario to show the functionality of this rt scheduler is as follows:
> //list each vcpu's parameters of each domain in cpu pools using rt scheduler
> #xl sched-rt
> Cpupool Pool-0: sched=EDF
> Name                                ID VCPU Period Budget
> Domain-0                             0    0     10     10
> Domain-0                             0    1     20     20
> Domain-0                             0    2     30     30
> Domain-0                             0    3     10     10
> litmus1                              1    0     10      4
> litmus1                              1    1     10      4
> 
> [...]
>
Thanks for showing this also.

> -----------------------------------------------------------------------------------------------------------------------------
> The differences between this new rt real-time scheduler and the sedf scheduler are as follows:
> 1) rt scheduler supports global EDF scheduling, while sedf only supports partitioned scheduling. With the support of vcpu mask, rt scheduler can also be used as partitioned scheduling by setting each VCPU’s cpumask to a specific cpu.
>
Which is be biggest and most important difference. In fact, although the
implementation of this scheduler can be improved (AFAICT) wrt this
aspect too, adding SMP support to SEDF would be much much harder...

> 2) rt scheduler supports setting and getting each VCPU’s parameters of a domain. A domain can have multiple vcpus with different parameters, rt scheduler can let user get/set the parameters of each VCPU of a specific domain; (sedf scheduler does not support it now)
> 3) rt scheduler supports cpupool.
>
Right. Well, to be fair, SEDF supports cpupools as well. :-)

> 4) rt scheduler uses deferrable server to burn/replenish budget of a VCPU, while sedf uses constrant bandwidth server to burn/replenish budget of a VCPU. This is just two options of implementing a global EDF real-time scheduler and both options’ real-time performance have already been proved in academic.
> 
So, can you put some links to some of your works on top of RT-Xen, which
is from which this scheduler comes from? Or, if that's not possible, at
least the titles?

I really don't expect people to jump on research papers, but the I've
seen a few, and the experimental sections were nice to read and quite
useful.

> -----------------------------------------------------------------------------------------------------------------------------
> TODO:
>
Allow me to add a few items here, in some sort of priority order (at
least mine one):

  *) Deal with budget overrun in the algorithm [medium]
  *) Split runnable and depleted (=no budget left) VCPU queues [easy]
> 1) Improve the code of getting/setting each VCPU’s parameters. [easy]
>     Right now, it create an array with LIBXL_XEN_LEGACY_MAX_VCPUS (i.e., 32) elements to bounce all VCPUs’ parameters of a domain between xen tool and xen to get all VCPUs’ parameters of a domain. It is unnecessary to have LIBXL_XEN_LEGACY_MAX_VCPUS elements for this array.
>     The current work is to first get the exact number of VCPUs of a domain and then create an array with that exact number of elements to bounce between xen tool and xen.
> 2) Provide microsecond time precision in xl interface instead of millisecond time precision. [easy]
>     Right now, rt scheduler let user to specify each VCPU’s parameters (period, budget) in millisecond (i.e., ms). In some real-time application, user may want to specify VCPUs’ parameters in  microsecond (i.e., us). The next work is to let user specify VCPUs’ parameters in microsecond and count the time in microsecond (or nanosecond) in xen rt scheduler as well.
>
  *) Subject Dom0 to the EDF+DS scheduling, as all other domains [easy]
      We can discuss what default Dom0 parameters should be, but we
      certainly want it to be scheduled as all other domains, and not
      getting too much of a special treatment.

> 3) Add Xen trace into the rt scheduler. [easy]
>     We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in credit2 scheduler, in rt scheduler, to debug via tracing.
>
  *) Try using timers for replenishment, instead of scanning the full
     runqueue every now and then [medium]

> 4) Method of improving the performance of rt scheduler [future work]
>     VCPUs of the same domain may preempt each other based on the preemptive global EDF scheduling policy. This self-switch issue does not bring benefit to the domain but introduce more overhead. When this situation happens, we can simply promote the current running lower-priority VCPU’s priority and let it  borrow budget from higher priority VCPUs to avoid such self-swtich issue.
> 
> Timeline of implementing the TODOs:
> We plan to finish the TODO 1), 2) and 3) within 3-4 weeks (or earlier).
> Because TODO 4) will make the scheduling policy not pure GEDF, (people who wants the real GEDF may not be happy with this.) we look forward to hearing people’s opinions.
>
That one is definitely something we can concentrate on later.

> -----------------------------------------------------------------------------------------------------------------------------
> Special huge thanks to Dario Faggioli for his helpful and detailed comments on the preview version of this rt scheduler. :-)
> 
:-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11 14:49   ` Dario Faggioli
@ 2014-07-11 16:23     ` Meng Xu
  2014-07-11 16:35       ` Dario Faggioli
  2014-07-17 15:29     ` Ian Campbell
  1 sibling, 1 reply; 50+ messages in thread
From: Meng Xu @ 2014-07-11 16:23 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Chong Li, Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 4472 bytes --]

Hi Dario,


2014-07-11 10:49 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:

> On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> > Add xc_sched_rt_* functions to interact with Xen to set/get domain's
>
> > --- /dev/null
> > +++ b/tools/libxc/xc_rt.c
>
> > +#include "xc_private.h"
> > +
> > +int
> > +xc_sched_rt_domain_set(
> > +    xc_interface *xch,
> > +    uint32_t domid,
> > +    struct xen_domctl_sched_rt_params *sdom)
> >
> So, both here...
>
> > +{
> > +    int rc;
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BOUNCE(sdom,
> > +        sizeof(*sdom),
> > +        XC_HYPERCALL_BUFFER_BOUNCE_IN);
> > +
> ... and here... I know libxc is not terribly consistent when it comes to
> coding style. Still, I like stuff to be aligned to the opening brace of
> the function/macro.
>
> Have a look at xc_domain.c, e.g.,
>
> ...
> int xc_vcpu_setaffinity(xc_interface *xch,
>                         uint32_t domid,
>                         int vcpu,
>                         xc_cpumap_t cpumap_hard_inout,
>                         xc_cpumap_t cpumap_soft_inout,
>                         uint32_t flags)
> {
>     DECLARE_DOMCTL;
>     DECLARE_HYPERCALL_BOUNCE(cpumap_hard_inout, 0,
>                              XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
>     DECLARE_HYPERCALL_BOUNCE(cpumap_soft_inout, 0,
>                              XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
> ...
>

​It's so nice of you to giving me such an example! I will modify them in
the next version.

​

>
> > +    if ( xc_hypercall_bounce_pre(xch, sdom) )
> > +        return -1;
> > +
> > +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> > +    domctl.domain = (domid_t) domid;
> > +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT;
> > +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
> > +    set_xen_guest_handle(domctl.u.scheduler_op.u.rt.schedule, sdom);
> > +
> > +    rc = do_domctl(xch, &domctl);
> > +
> > +    xc_hypercall_bounce_post(xch, sdom);
> > +
> So, the bouncing logic seems fine. Looking at what other schedulers do,
> there should be no particular need for bouncing anything.
>
> xen/include/public/domctl.h:
>
> struct xen_domctl_scheduler_op {
>     uint32_t sched_id;  /* XEN_SCHEDULER_* */
>     uint32_t cmd;       /* XEN_DOMCTL_SCHEDOP_* */
>     union {
>         struct xen_domctl_sched_sedf {
>             uint64_aligned_t period;
>             uint64_aligned_t slice;
>             uint64_aligned_t latency;
>             uint32_t extratime;
>             uint32_t weight;
>         } sedf;
>         struct xen_domctl_sched_credit {
>             uint16_t weight;
>             uint16_t cap;
>         } credit;
>         struct xen_domctl_sched_credit2 {
>             uint16_t weight;
>         } credit2;
>     } u;
> };
>
> I see you have a TODO item relate to this interface, so I guess I'm
> leaving this alone.
>
> My view here is, since per-VCPU scheduling parameters are important for
> this scheduler, you either:
>  - provide an interface for getting and setting the parameters of _one_
>    _VCPU_, and the call them repeatedly, if for instance you need to act
>    on a domain. In this case, no array and no bouncing should be
>    necessary.
>  - provide both the above, and another interface, that one with an array
>    as big as the number of VCPUs of the domain, and use it to provide
>    the hypervisor with the parameters of all the VCPUs, and act
>    appropriately inside Xen.
>
> Personally, I'd start with the former, and add the latter later, if we
> deem it necessary.
>

Actually, we did the former one in the very early version of the rt
scheduler. The only concern I have about the former version is that it has
to issue the hypercall for many times to get all VCPUs' information of a
domain, while the later one only issue the hypercall once with bouncing.

My concern is: Will the former one has worse performance than the later
one?  ​

Thanks,

Meng




>
> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
>


-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 6605 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11 15:07       ` Dario Faggioli
@ 2014-07-11 16:25         ` Meng Xu
  2014-07-13 12:58         ` Meng Xu
  1 sibling, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-11 16:25 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 1271 bytes --]

Hi Dario,

2014-07-11 11:07 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:

> On ven, 2014-07-11 at 10:59 -0400, Meng Xu wrote:
>
>
> >         > +    libxl_domain_sched_params_init(&scinfo);
> >         > +    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid,
> >         &scinfo);
> >         > +    if (rc)
> >         > +        return rc;
> >         > +
>
> > ​Thank you very much for pasting the rules here! I really appreciate
> > it. However, I didn't quite get why it violate the libxl type paradigm
> > and how I should correct it. (Sorry. :-()​
> >
> <<The user must always calls "dispose" exactly once afterwards, to clean
> up, regardless of whether operations on this object succeeded or
> failed>>
>
> While, above, you're exiting, if rc is true, without calling dispose.
>
> It depens a lot on the function, but what you usually do, is grouping
> the calls to the various dispose under a label (typically 'out:') and
> goto there to exit.
>
> Look around, both in xl and libxl, you'll find plenty of examples of
> that.
>

​Now I got it. Thank you very much! I will modify it. :-)

Best regards,

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 2175 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11 16:23     ` Meng Xu
@ 2014-07-11 16:35       ` Dario Faggioli
  2014-07-11 16:49         ` Andrew Cooper
  2014-07-12 19:46         ` Meng Xu
  0 siblings, 2 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-11 16:35 UTC (permalink / raw)
  To: Meng Xu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Chong Li, Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 3028 bytes --]

An unrelated thing, can you set your mail client to send text message,
instead of HTML ones?

On ven, 2014-07-11 at 12:23 -0400, Meng Xu wrote:


>         My view here is, since per-VCPU scheduling parameters are
>         important for
>         this scheduler, you either:
>          - provide an interface for getting and setting the parameters
>         of _one_
>            _VCPU_, and the call them repeatedly, if for instance you
>         need to act
>            on a domain. In this case, no array and no bouncing should
>         be
>            necessary.
>          - provide both the above, and another interface, that one
>         with an array
>            as big as the number of VCPUs of the domain, and use it to
>         provide
>            the hypervisor with the parameters of all the VCPUs, and
>         act
>            appropriately inside Xen.
>         
>         Personally, I'd start with the former, and add the latter
>         later, if we
>         deem it necessary.
> 
> 
> Actually, we did the former one in the very early version of the rt
> scheduler. The only concern I have about the former version is that it
> has to issue the hypercall for many times to get all VCPUs'
> information of a domain, while the later one only issue the hypercall
> once with bouncing. 
> 
True. However, if you only want to get or set one particular vcpu, it's
annoying to have to deal with the array, isn't it?

So, both solutions have up and down sides, depending on what you
actually need. :-)

This is why I was suggesting to implement both. Of course, it does not
matter much which one you implement first. Se, feel free to implement
the array variant properly first, and then we'll see if we also need the
single VCPU variant.

> My concern is: Will the former one has worse performance than the
> later one?  ​
> 
Well, Xen has its tricks, but yes, I think performing e.g., 8 or 16
hypercalls in a loop is worse than issueing one returning an array.

It also must be noted that, getting and setting scheduling parameters,
should not happen in very critical path, so it probably won't matter
much.

Anyway, as said above, go for the array first if you like it better.
When you do that, consider Andrew's comments to the bottom of patch 1:

"As a gut feel, I think that these two ints should be part of the struct
xen_domctl_sched_rt below, with a guest handle pointing to an array of {
period, budget } structures, which avoids the LEGACY_MAX limit."

I don't think you'll need vcpu_index any longer, and I'm not sure
whether you need max_vcpus to be part of the interface of this hcall, or
if you're better out fetching it from some other call, though.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11 16:35       ` Dario Faggioli
@ 2014-07-11 16:49         ` Andrew Cooper
  2014-07-12 19:46         ` Meng Xu
  1 sibling, 0 replies; 50+ messages in thread
From: Andrew Cooper @ 2014-07-11 16:49 UTC (permalink / raw)
  To: Dario Faggioli, Meng Xu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Chong Li, Dagaen Golomb

On 11/07/14 17:35, Dario Faggioli wrote:
>
>> My concern is: Will the former one has worse performance than the
>> later one? ​
>>
> Well, Xen has its tricks, but yes, I think performing e.g., 8 or 16
> hypercalls in a loop is worse than issueing one returning an array.

As far as hypercalls go, the path to Xen is a long one. From userspace,
a hypercall involves a context switch into the kernel (which is a double
context switch through Xen) and a context switch from the kernel into
Xen, and back again. The return from guest kernel to guest userspace
requires a TLB flush.

So, 6 context switches and a TLB flushes per hypercall (assuming a 64bit
dom0).

Use batches wherever possible, even for 2 items. It will be about twice
as fast as not batching.

~Andrew


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11 15:08   ` Dario Faggioli
@ 2014-07-12 18:16     ` Meng Xu
  2014-07-14 10:38       ` Dario Faggioli
  2014-07-17 15:34     ` Ian Campbell
  1 sibling, 1 reply; 50+ messages in thread
From: Meng Xu @ 2014-07-12 18:16 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1547 bytes --]

Hi Dario,


> > +    if ( !scinfo->rt.vcpus ){
> > +        LOGE(ERROR, "Allocate lib_vcpu array fails\n");
> > +        return ERROR_INVAL;
> > +    }
> > +    scinfo->rt.num_vcpus = sdom.num_vcpus;
> > +    for( i = 0; i < sdom.num_vcpus; ++i)
> > +    {
> > +        scinfo->rt.vcpus[i].period = sdom.vcpus[i].period;
> > +        scinfo->rt.vcpus[i].budget = sdom.vcpus[i].budget;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +#define SCHED_RT_VCPU_PARAMS_MAX    4294967295 /* 2^32-1 */
> > +
> I don't like the name. I'd go for something like
> SCHED_RT_VCPU_PERIOD_MAX and SCHED_RT_VCPU_BUDGET_MAX.
>
>
I will modify this.


> You well can define both to the same value, of course.
>
> As per the value, yes, even if we turn the interface in usecs, 2^32
> usecs is still ~4000 secs, which ought to be enough as a period or
> budget for everyone! :-)
>
>
Actually, I'm not very sure about the range (max value) for the xl sched-rt
interface now: I will change the type of period and budget in Xen to
s_time_t which is signed long (64bits). (I can also change the type of
period and budget in tool to be 64bits)

My question is:
Do we really need to do this range check and set the max value of the xl
sched-rt to (2^64 - 1)/1000 (us)? (64 bits can have 2^61-1 ns, so it's
(2^64-1)/1000 us)
If we look at the actually range value (2^64-1)/1000 ns, it's equal to
5.8*10^11 years, which is too large to be realistic.

Thanks,

Meng




-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 2238 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11 16:35       ` Dario Faggioli
  2014-07-11 16:49         ` Andrew Cooper
@ 2014-07-12 19:46         ` Meng Xu
  1 sibling, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-12 19:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Chong Li, Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 1631 bytes --]

Hi Dario,


>
> > My concern is: Will the former one has worse performance than the
> > later one?  ​
> >
> Well, Xen has its tricks, but yes, I think performing e.g., 8 or 16
> hypercalls in a loop is worse than issueing one returning an array.
>
> It also must be noted that, getting and setting scheduling parameters,
> should not happen in very critical path, so it probably won't matter
> much.
>
> Anyway, as said above, go for the array first if you like it better.
> When you do that, consider Andrew's comments to the bottom of patch 1:
>
> "As a gut feel, I think that these two ints should be part of the struct
> xen_domctl_sched_rt below, with a guest handle pointing to an array of {
> period, budget } structures, which avoids the LEGACY_MAX limit."
>
> I don't think you'll need vcpu_index any longer, and I'm not sure
> whether you need max_vcpus to be part of the interface of this hcall, or
> if you're better out fetching it from some other call, though.
>
> ​Yes, I don't need the max_vcpus any more after I improve this interface.
I think I will use the array for the get operation, since as Andrew
mentioned that batching is better than doing hypercall one by one when it
involves more than one hypercalls. As to the set operation, I think I can
avoid using array. Anyway, I will do this first and let people decide if we
need to make further change. :-)

(BTW, I will modify the code base on your other comments in the previous
email.)

Thanks,

Meng



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 2533 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-11 15:07       ` Dario Faggioli
  2014-07-11 16:25         ` Meng Xu
@ 2014-07-13 12:58         ` Meng Xu
  2014-07-14  7:40           ` Dario Faggioli
                             ` (2 more replies)
  1 sibling, 3 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-13 12:58 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 2552 bytes --]

Hi Dario,

>
>
> >         > +    libxl_domain_sched_params_init(&scinfo);
> >         > +    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid,
> >         &scinfo);
> >         > +    if (rc)
> >         > +        return rc;
> >         > +
>
> > ​Thank you very much for pasting the rules here! I really appreciate
> > it. However, I didn't quite get why it violate the libxl type paradigm
> > and how I should correct it. (Sorry. :-()​
> >
> <<The user must always calls "dispose" exactly once afterwards, to clean
> up, regardless of whether operations on this object succeeded or
> failed>>
>
> While, above, you're exiting, if rc is true, without calling dispose.
>
> It depens a lot on the function, but what you usually do, is grouping
> the calls to the various dispose under a label (typically 'out:') and
> goto there to exit.
>
> Look around, both in xl and libxl, you'll find plenty of examples of
> that.
>
> > We actually followed the way credit scheduler does
> > in main_sched_credit(int argc, char **argv)
> >
> >
> >         } else { /* set credit scheduler paramaters */
> >             libxl_domain_sched_params scinfo;
> >             libxl_domain_sched_params_init(&scinfo);
> >             scinfo.sched = LIBXL_SCHEDULER_CREDIT;
> >             if (opt_w)
> >                 scinfo.weight = weight;
> >             if (opt_c)
> >                 scinfo.cap = cap;
> >             rc = sched_domain_set(domid, &scinfo);
> >             libxl_domain_sched_params_dispose(&scinfo);
> >
> And in fact, here's dispose! :-)
>
>
I think you are right! But I think the implementation of this functionality
for credit scheduler also has the exactly same issue: scinfo will not be
disposed when hypercall returns false.

Here is the code for sched_credit_domain_output(int domid) in
tools/libxl/xl_cmdimpl.c.
    rc = sched_domain_get(LIBXL_SCHEDULER_CREDIT, domid, &scinfo);

    if (rc)

        return rc;

    domname = libxl_domid_to_name(ctx, domid);

    printf("%-33s %4d %6d %4d\n",

        domname,

        domid,

        scinfo.weight,

        scinfo.cap);

    free(domname);

    libxl_domain_sched_params_dispose(&scinfo);

(Note: sched_domain_get() init the scinfo, but didn't dispose it. )

 As you can see, it has the exact issue I had. :-) If it's the case, I can
submit a separate patch for this. :-)

Thanks,

Meng


-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 4247 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-13 12:58         ` Meng Xu
@ 2014-07-14  7:40           ` Dario Faggioli
  2014-07-14  9:31           ` Wei Liu
  2014-07-17 15:39           ` Ian Campbell
  2 siblings, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-14  7:40 UTC (permalink / raw)
  To: Meng Xu
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb


[-- Attachment #1.1: Type: text/plain, Size: 900 bytes --]

On dom, 2014-07-13 at 08:58 -0400, Meng Xu wrote:

> I think you are right! But I think the implementation of this
> functionality for credit scheduler also has the exactly same issue:
> scinfo will not be disposed when hypercall returns false. 

> (Note: sched_domain_get() init the scinfo, but didn't dispose it. )
> 
ISWYM.

Yes, I think either in sched_domain_get(), or at the three (four
considering your code) callsites, a _dispose() would be good.

>  As you can see, it has the exact issue I had. :-) If it's the case, I
> can submit a separate patch for this. :-)
>
Go ahead. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-13 12:58         ` Meng Xu
  2014-07-14  7:40           ` Dario Faggioli
@ 2014-07-14  9:31           ` Wei Liu
  2014-07-17 15:39           ` Ian Campbell
  2 siblings, 0 replies; 50+ messages in thread
From: Wei Liu @ 2014-07-14  9:31 UTC (permalink / raw)
  To: Meng Xu
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Dario Faggioli, Ian Jackson, xen-devel, Meng Xu,
	Chong Li, Dagaen Golomb

On Sun, Jul 13, 2014 at 08:58:01AM -0400, Meng Xu wrote:
[...]
> >
> >
> I think you are right! But I think the implementation of this functionality
> for credit scheduler also has the exactly same issue: scinfo will not be
> disposed when hypercall returns false.
> 

The paradigm was documented not long ago, so I wouldn't be surprised if
there's existing code that violates that paradigm.

In any case we need to make sure new code follows our rule. :-)

> Here is the code for sched_credit_domain_output(int domid) in
> tools/libxl/xl_cmdimpl.c.
>     rc = sched_domain_get(LIBXL_SCHEDULER_CREDIT, domid, &scinfo);
> 
>     if (rc)
> 
>         return rc;
> 
>     domname = libxl_domid_to_name(ctx, domid);
> 
>     printf("%-33s %4d %6d %4d\n",
> 
>         domname,
> 
>         domid,
> 
>         scinfo.weight,
> 
>         scinfo.cap);
> 
>     free(domname);
> 
>     libxl_domain_sched_params_dispose(&scinfo);
> 
> (Note: sched_domain_get() init the scinfo, but didn't dispose it. )
> 
>  As you can see, it has the exact issue I had. :-) If it's the case, I can
> submit a separate patch for this. :-)
> 

That's a good idea.

Wei.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-12 18:16     ` Meng Xu
@ 2014-07-14 10:38       ` Dario Faggioli
  0 siblings, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-14 10:38 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 2579 bytes --]

On sab, 2014-07-12 at 14:16 -0400, Meng Xu wrote:

> 
>         You well can define both to the same value, of course.
>         
>         As per the value, yes, even if we turn the interface in usecs,
>         2^32
>         usecs is still ~4000 secs, which ought to be enough as a
>         period or
>         budget for everyone! :-)
>         
> 
> 
> Actually, I'm not very sure about the range (max value) for the xl
> sched-rt interface now: I will change the type of period and budget in
> Xen to s_time_t which is signed long (64bits). (I can also change the
> type of period and budget in tool to be 64bits)
> 
In libxc, you probably should transition to uint64_t, yes.

In libxl, you probably will have to use 'integer', as sedf does:
libxl_domain_sched_params = Struct("domain_sched_params",[
    ("sched",        libxl_scheduler),
    ("weight",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_WEIGHT_DEFAULT'}),
    ("cap",          integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_CAP_DEFAULT'}),
    ("period",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT'}),
    ("slice",        integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT'}),
    ("latency",      integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT'}),
    ("extratime",    integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT'}),
    ])

> My question is:
> Do we really need to do this range check and set the max value of the
> xl sched-rt to (2^64 - 1)/1000 (us)? (64 bits can have 2^61-1 ns, so
> it's (2^64-1)/1000 us)
> If we look at the actually range value (2^64-1)/1000 ns, it's equal to
> 5.8*10^11 years, which is too large to be realistic. 
> 
I concur. It's quite hard to decide what is a sane upper bound for
period and budget, as user may want to try doing all kind of crazy
stuff! :-O

For sure, if we want to put an upper bound, it has to come from some
reasoning about the scheduler's usage, rather than from limitations
coming from the actual type used (which makes sense, actually).

If, for instance, you keep 2^32 (no matter whether you use 32 or 64
bits), you end up with, if I've done the math correctly, a bit more than
1 hour period and budget, which I think is fine, do you?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-11 15:48         ` Dario Faggioli
@ 2014-07-16 17:05           ` Konrad Rzeszutek Wilk
  2014-07-17 10:12             ` Meng Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-16 17:05 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	Andrew Cooper, ian.jackson, xen-devel, xumengpanda, Meng Xu,
	lichong659, dgolomb

On Fri, Jul 11, 2014 at 05:48:17PM +0200, Dario Faggioli wrote:
> On ven, 2014-07-11 at 16:40 +0100, Andrew Cooper wrote:
> > On 11/07/14 16:21, Dario Faggioli wrote:
> 
> > > So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS (Resource
> > > Reservation Deferrable Server)? Just DS (Deferrable Server)?
> > >
> > > I'm not sure I like much any of these...
> > >
> > > What was it that you had in mind when you said "there are several
> > > classes of realtime schedulers employing different semantics and
> > > parameters" ?
> 
> > Hmm - naming things is difficult.  How about bob?
> > 
> Well, if we go that route, I prefer Alice! :-D :-D
> 
> > My concern is that naming it "rt" seems too generic, and will cause
> > confusion of a) what type of realtime scheduler it is, and b) further
> > problems if someone wants to come and implement a different realtime
> > scheduler in the future.
> > 
> Totally understood and agreed.
> 
> > XEN_SCHEDULER_RT_DS doesn't sound too bad.
> > 
> It does not, indeed. Let's hear Meng, Sisu, and the other xen-devel
> folks...

DS9 :-)

XEN_SCHEDULER_RT_PGEDF ?

And then we can also have
 XEN_SCHEDULER_RT_CBS ?

> 
> Thanks and Regards,
> Dario
> 
> -- 
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 



> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-16 17:05           ` Konrad Rzeszutek Wilk
@ 2014-07-17 10:12             ` Meng Xu
  2014-07-17 15:12               ` Dario Faggioli
  2014-07-18 18:40               ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-17 10:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	Andrew Cooper, Dario Faggioli, ian.jackson, xen-devel, Meng Xu,
	lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 3545 bytes --]

Hi Dario and Konrad,

First, thank you very much for your detailed and very useful comments on
this patch! I'm modifying it based on your comments now. (I'm sorry I
didn't reply promptly because I was in travel these days and cannot always
get access to network. :-()

Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>于2014年7月16日星期三写道:

> On Fri, Jul 11, 2014 at 05:48:17PM +0200, Dario Faggioli wrote:
> > On ven, 2014-07-11 at 16:40 +0100, Andrew Cooper wrote:
> > > On 11/07/14 16:21, Dario Faggioli wrote:
> >
> > > > So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS
> (Resource
> > > > Reservation Deferrable Server)? Just DS (Deferrable Server)?
> > > >
> > > > I'm not sure I like much any of these...
> > > >
> > > > What was it that you had in mind when you said "there are several
> > > > classes of realtime schedulers employing different semantics and
> > > > parameters" ?
> >
> > > Hmm - naming things is difficult.  How about bob?
> > >
> > Well, if we go that route, I prefer Alice! :-D :-D
> >
> > > My concern is that naming it "rt" seems too generic, and will cause
> > > confusion of a) what type of realtime scheduler it is, and b) further
> > > problems if someone wants to come and implement a different realtime
> > > scheduler in the future.
> > >
> > Totally understood and agreed.
> >
> > > XEN_SCHEDULER_RT_DS doesn't sound too bad.
> > >
> > It does not, indeed. Let's hear Meng, Sisu, and the other xen-devel
> > folks...
>
> DS9 :-)
>
> XEN_SCHEDULER_RT_PGEDF ?
>
> And then we can also have
>  XEN_SCHEDULER_RT_CBS ?
>
>
I see your concerns about the name here.
The strength of the name you proposed, such as  XEN_SCHEDULER_RT_PGEDF or
XEN_SCHEDULER_RT_DS, is that it clearly states the characteristics of this
scheduler.

However, if people want to implement a new server mechanism for the EDF
scheduling policy, they actually don't have to implement a fresh new one
and add new sched_*.c files. They only need to modify the budget_update()
and burn_budget() functions based on the new server mechanism. (Actually,
we can make this scheduler more generic to make it easier to add a new
server mechanism just as the schedule.c does.)

In addition, if people want to implement a new real-time scheduling policy,
like Rate Monotonic scheduling, they only need to modify a few lines in
sched_rt.c. (We actually had the Rate Monotonic scheduling already, but
only extract the EDF one for Xen right now to avoid causing unnecessary
confusion.)

So adding new server mechanisms and adding new scheduling policy (i.e.,
priority schemes) only requires modify several functions in sched_rt.c,
instead of creating a fresh new file and scheduler. If we use the too
specific name, like XEN_SCHEDULER_RT_DS, we will have to modify the name in
the future when we extend the scheduler.  Of course, we could also
implement a fresh new scheduler, such as XEN_SCHEDULER_RT_CBS, or
XEN_SCHEDULER_RT_PS (periodic server), in a brand new file, but it's
actually unnecessary to introduce a new file.

Of course, I'm not against using the more specific name, such as
XEN_SCHEDULER_RT_DS. What I'm concerned is: when we implement a new server
mechanism or new scheduling policy inside the current sched_rt.c, the too
specific name will become impropriate and we will have to change name
again. :-)

Thanks,

Meng


-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 4233 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-17 10:12             ` Meng Xu
@ 2014-07-17 15:12               ` Dario Faggioli
  2014-07-18  5:46                 ` Meng Xu
  2014-07-18 18:40               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 50+ messages in thread
From: Dario Faggioli @ 2014-07-17 15:12 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	Andrew Cooper, ian.jackson, xen-devel, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 4585 bytes --]

On gio, 2014-07-17 at 06:12 -0400, Meng Xu wrote:

> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>于2014年7月16日星期三写
> 道:

>         DS9 :-)
>         
>         XEN_SCHEDULER_RT_PGEDF ?
>         
>         And then we can also have
>          XEN_SCHEDULER_RT_CBS ?
>         
> 
> 
> I see your concerns about the name here. 
> The strength of the name you proposed, such as  XEN_SCHEDULER_RT_PGEDF
> or XEN_SCHEDULER_RT_DS, is that it clearly states the characteristics
> of this scheduler. 
> 
Agreed, and I'd go fo XEN_SCHEDULER_RT_DS. It means people have to learn
at least the basics of what _D_eferrable _S_erver means (which in turns
mean we must provide decent documentation! :-P), but if we want to go
for being specific, let's go there at full blast :-)

(and being specific means mentioning the actual algorithm, while GEDF is
sort of a 'basis', that many algorithms can share, and I don't think I
got what the 'P' in PGEDF stands for... did you perhaps mean
'PG'='Partitioned & Global'?)

> However, if people want to implement a new server mechanism for the
> EDF scheduling policy, they actually don't have to implement a fresh
> new one and add new sched_*.c files. They only need to modify the
> budget_update() and burn_budget() functions based on the new server
> mechanism. (Actually, we can make this scheduler more generic to make
> it easier to add a new server mechanism just as the schedule.c does.) 
>
> In addition, if people want to implement a new real-time scheduling
> policy, like Rate Monotonic scheduling, they only need to modify a few
> lines in sched_rt.c. (We actually had the Rate Monotonic scheduling
> already, but only extract the EDF one for Xen right now to avoid
> causing unnecessary confusion.)  
>
> So adding new server mechanisms and adding new scheduling policy
> (i.e., priority schemes) only requires modify several functions in
> sched_rt.c, instead of creating a fresh new file and scheduler. If we
> use the too specific name, like XEN_SCHEDULER_RT_DS, we will have to
> modify the name in the future when we extend the scheduler.  Of
> course, we could also implement a fresh new scheduler, such as
> XEN_SCHEDULER_RT_CBS, or XEN_SCHEDULER_RT_PS (periodic server), in a
> brand new file, but it's actually unnecessary to introduce a new
> file. 
> 
This is all true. However, I think I agree with Andrew and Konrad about
the need of being a bit more specific with naming.

Probably, I'd distinguish the name of the source file and the name of
the scheduling policy(ies). In fact, as you say, sched_rt.cc is probably
fine, as one may just implement new/different things hacking its
content, without the need of adding a new one.

At that point, you can well introduce, say, XEN_SCHEDULER_RT_CBS, which,
as far as the implementation is concerned, only modifies a function or
two in sched_rt.c, but for the user perspective, it's a different
scheduling policy, and it makes sense for it to have a different name.

Both (or more, if we like) scheduling policies may well even coexist,
share most of the code in sched_rt.c, and yet being two different
scheduling policies...

> Of course, I'm not against using the more specific name, such as
> XEN_SCHEDULER_RT_DS. What I'm concerned is: when we implement a new
> server mechanism or new scheduling policy inside the current
> sched_rt.c, the too specific name will become impropriate and we will
> have to change name again. :-)
> 
As I said, it's always going to be an RT related scheduling policy so,
if the algorithm will be similar enough, it could be a first class
citizen inside sched_rt.c.

If we want to have more than a scheduling policy, we *need* names (of
the policies, not of the source file) to be specific.

The only downside of the specific name is if we at some point will want
to _replace_, rather than add, the DS algorithm with some other thing.
But even in that case, I don't think there will be much issues: at the
Xen level (not much compatibility requirements) we can just kill the old
and introduce the new. At the libxl level, we'll handle API stability in
the usual ways.

So, yes, I think I'm all for XEN_SCHEDULER_RT_DS implemented in
xen/common/sched_rt.c.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-11 14:49   ` Dario Faggioli
  2014-07-11 16:23     ` Meng Xu
@ 2014-07-17 15:29     ` Ian Campbell
  2014-07-17 15:34       ` George Dunlap
  1 sibling, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2014-07-17 15:29 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xisisu, stefano.stabellini, george.dunlap, ian.jackson,
	xen-devel, xumengpanda, Meng Xu, lichong659, dgolomb

On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:

> So, the bouncing logic seems fine. Looking at what other schedulers do,
> there should be no particular need for bouncing anything.

Seems like there is some confusing precedent around the use of sysctl vs
domctl for sched parameters here. 

Most schedulers use domctl but arinc uses sysctl.

All the existing domctl variants use the "inline parameters" scheme
which you quoted, where as arcinc+sysctl uses a pointer to a struct and
bouncing.

This new interface seems to combine the two and uses a domctl with a
pointer. I think it should follow either the existing sysctl or domctl,
not mix the two forms.

Also looking at the new interface added in the first patch I don't think
that array of XEN_LEGACY_MAX_VCPUS elements is going to be sufficient.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11 15:08   ` Dario Faggioli
  2014-07-12 18:16     ` Meng Xu
@ 2014-07-17 15:34     ` Ian Campbell
  1 sibling, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2014-07-17 15:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: xisisu, stefano.stabellini, george.dunlap, ian.jackson,
	xen-devel, xumengpanda, Meng Xu, lichong659, dgolomb

On Fri, 2014-07-11 at 17:08 +0200, Dario Faggioli wrote:

> > +    if ( !scinfo->rt.vcpus ){
> > +        LOGE(ERROR, "Allocate lib_vcpu array fails\n");
> > +        return ERROR_INVAL;
> > +    }
> > +    scinfo->rt.num_vcpus = sdom.num_vcpus;
> > +    for( i = 0; i < sdom.num_vcpus; ++i)
> > +    {
> > +        scinfo->rt.vcpus[i].period = sdom.vcpus[i].period;
> > +        scinfo->rt.vcpus[i].budget = sdom.vcpus[i].budget;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +#define SCHED_RT_VCPU_PARAMS_MAX    4294967295 /* 2^32-1 */
> > +
> I don't like the name. I'd go for something like
> SCHED_RT_VCPU_PERIOD_MAX and SCHED_RT_VCPU_BUDGET_MAX.
> 
> You well can define both to the same value, of course.

If you want the largest unsigned which can be represented in 32-bits
then please use UINT32_MAX from stdint.h

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-17 15:29     ` Ian Campbell
@ 2014-07-17 15:34       ` George Dunlap
  2014-07-17 22:16         ` Meng Xu
  2014-07-18  9:47         ` Ian Campbell
  0 siblings, 2 replies; 50+ messages in thread
From: George Dunlap @ 2014-07-17 15:34 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Sisu Xi, Stefano Stabellini, Dario Faggioli, Ian Jackson,
	xen-devel, Meng Xu, Meng Xu, lichong659, dgolomb

On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
>
>> So, the bouncing logic seems fine. Looking at what other schedulers do,
>> there should be no particular need for bouncing anything.
>
> Seems like there is some confusing precedent around the use of sysctl vs
> domctl for sched parameters here.
>
> Most schedulers use domctl but arinc uses sysctl.

They're controlling different things.

domctl is controlling parameters related to *a particular domain* (for
instance, weight or cap); sysctl is relating to setting parameters
*for the scheduler as a whole* (for example, timeslice).

 -George

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
  2014-07-11 11:05   ` Wei Liu
  2014-07-11 15:08   ` Dario Faggioli
@ 2014-07-17 15:36   ` Ian Campbell
  2014-07-18 11:05     ` Meng Xu
  2 siblings, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2014-07-17 15:36 UTC (permalink / raw)
  To: Meng Xu
  Cc: xisisu, stefano.stabellini, george.dunlap, dario.faggioli,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb

On Fri, 2014-07-11 at 00:49 -0400, Meng Xu wrote:

> +/* Consistent with XEN_LEGACY_MAX_VCPUS xen/arch-x86/xen.h*/
> +#define LIBXL_XEN_LEGACY_MAX_VCPUS                  32 

As the name suggests this limit is legacy. Modern Xen and modern guests
can support considerably more VCPUs than this.

You need to base your interface on a dynamically sized/allocated array.
Perhaps this needs to be carried down all the way to the domctl
interface and into Xen, I haven't looked.

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 2/4] xl for rt scheduler
  2014-07-13 12:58         ` Meng Xu
  2014-07-14  7:40           ` Dario Faggioli
  2014-07-14  9:31           ` Wei Liu
@ 2014-07-17 15:39           ` Ian Campbell
  2 siblings, 0 replies; 50+ messages in thread
From: Ian Campbell @ 2014-07-17 15:39 UTC (permalink / raw)
  To: Meng Xu
  Cc: Wei Liu, Sisu Xi, Stefano Stabellini, George Dunlap,
	Dario Faggioli, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb

On Sun, 2014-07-13 at 08:58 -0400, Meng Xu wrote:

>         Look around, both in xl and libxl, you'll find plenty of
>         examples of
>         that.
>         
>         > We actually followed the way credit scheduler does
>         > in main_sched_credit(int argc, char **argv)
>         >
>         >
>         >         } else { /* set credit scheduler paramaters */
>         >             libxl_domain_sched_params scinfo;
>         >             libxl_domain_sched_params_init(&scinfo);
>         >             scinfo.sched = LIBXL_SCHEDULER_CREDIT;
>         >             if (opt_w)
>         >                 scinfo.weight = weight;
>         >             if (opt_c)
>         >                 scinfo.cap = cap;
>         >             rc = sched_domain_set(domid, &scinfo);
>         >             libxl_domain_sched_params_dispose(&scinfo);
>         >
>         And in fact, here's dispose! :-)
>         
> 
> 
> I think you are right! But I think the implementation of this
> functionality for credit scheduler also has the exactly same issue:
> scinfo will not be disposed when hypercall returns false. 

Yes, there's quite a lot of code in libxl and xl which is not correct in
this area but which happens to be "OK" because the struct in question
has no dynamically allocated members.

So this is a latent bug since if we were to add a dynamic member to the
struct all this code would become immediately buggy. We should try to
stamp these out as we notice them (as we have done here).

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-17 15:34       ` George Dunlap
@ 2014-07-17 22:16         ` Meng Xu
  2014-07-18  9:49           ` Dario Faggioli
  2014-07-18  9:51           ` Ian Campbell
  2014-07-18  9:47         ` Ian Campbell
  1 sibling, 2 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-17 22:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Dario Faggioli,
	Ian Jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1530 bytes --]

Hi Ian and George,

George Dunlap <George.Dunlap@eu.citrix.com
<javascript:_e(%7B%7D,'cvml','George.Dunlap@eu.citrix.com');>
>于2014年7月17日星期四写道:

> On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell <Ian.Campbell@citrix.com>
> wrote:
> > On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
> >
> >> So, the bouncing logic seems fine. Looking at what other schedulers do,
> >> there should be no particular need for bouncing anything.
> >
> > Seems like there is some confusing precedent around the use of sysctl vs
> > domctl for sched parameters here.
> >
> > Most schedulers use domctl but arinc uses sysctl.
>
> They're controlling different things.
>
> domctl is controlling parameters related to *a particular domain* (for
> instance, weight or cap); sysctl is relating to setting parameters
> *for the scheduler as a whole* (for example, timeslice).
>
>
Do we have to use inline parameters in domctl?

Right now, we used the domctl to set/get the parameters of each vcpu of a
domain. So from what the functionality does, it should be in domctl.  If we
don't have to use inline parameters, I would prefer to use domctl. :-)

The reason why ARINC653 scheduler uses sysctl to set parameters is as
George said. It sets the scheduling sequence for the ARINC653 scheduler,
instead of setting the parameters of a domain.

Thank you very much!

Best,

Meng




-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 1986 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-17 15:12               ` Dario Faggioli
@ 2014-07-18  5:46                 ` Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-18  5:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	Andrew Cooper, ian.jackson, xen-devel, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1317 bytes --]

Hi Dario,


> > Of course, I'm not against using the more specific name, such as
> > XEN_SCHEDULER_RT_DS. What I'm concerned is: when we implement a new
> > server mechanism or new scheduling policy inside the current
> > sched_rt.c, the too specific name will become impropriate and we will
> > have to change name again. :-)
> >
> As I said, it's always going to be an RT related scheduling policy so,
> if the algorithm will be similar enough, it could be a first class
> citizen inside sched_rt.c.
>
> If we want to have more than a scheduling policy, we *need* names (of
> the policies, not of the source file) to be specific.
>
> The only downside of the specific name is if we at some point will want
> to _replace_, rather than add, the DS algorithm with some other thing.
> But even in that case, I don't think there will be much issues: at the
> Xen level (not much compatibility requirements) we can just kill the old
> and introduce the new. At the libxl level, we'll handle API stability in
> the usual ways.
>
> So, yes, I think I'm all for XEN_SCHEDULER_RT_DS implemented in
> xen/common/sched_rt.c.
>
>
OK. I got it. I changed the name from XEN_SCHEDULER_RT to
XEN_SCHEDULER_RT_DS.

Thanks,

Meng



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 2365 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-17 15:34       ` George Dunlap
  2014-07-17 22:16         ` Meng Xu
@ 2014-07-18  9:47         ` Ian Campbell
  2014-07-18 10:00           ` Dario Faggioli
  1 sibling, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2014-07-18  9:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: Sisu Xi, Stefano Stabellini, Dario Faggioli, Ian Jackson,
	xen-devel, Meng Xu, Meng Xu, lichong659, dgolomb

On Thu, 2014-07-17 at 16:34 +0100, George Dunlap wrote:
> On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
> >
> >> So, the bouncing logic seems fine. Looking at what other schedulers do,
> >> there should be no particular need for bouncing anything.
> >
> > Seems like there is some confusing precedent around the use of sysctl vs
> > domctl for sched parameters here.
> >
> > Most schedulers use domctl but arinc uses sysctl.
> 
> They're controlling different things.
> 
> domctl is controlling parameters related to *a particular domain* (for
> instance, weight or cap); sysctl is relating to setting parameters
> *for the scheduler as a whole* (for example, timeslice).

Ah, that makes sense.

It seems that arinc handles some degree of per-dom settings via the
sysctl interface though, since xen_sysctl_arinc653_schedule takes an
array of things which contain domid+vcpuid+params. Perhaps that's
peculiar to the way arinc works though. (Although it seems rather
incorrect to me for the array to be fixed size instead of a guest
handle...)

Lets not repeat that mistake with this new scheduler, if the domctl (or
a sysctl for that matter) needs per VCPU information then it should IMHO
take the parameters inline in the struct (like the others), anything
dynamic like the per-vcpu stuff should then be a GUEST_HANDLE pointer to
a dynamically sized array (accessed with copy_from_guest_offset et al).

Ian.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-17 22:16         ` Meng Xu
@ 2014-07-18  9:49           ` Dario Faggioli
  2014-07-18  9:51           ` Ian Campbell
  1 sibling, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-18  9:49 UTC (permalink / raw)
  To: Meng Xu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 2618 bytes --]

On gio, 2014-07-17 at 18:16 -0400, Meng Xu wrote:
> Hi Ian and George,
> 
> George Dunlap <George.Dunlap@eu.citrix.com>于2014年7月17日星期四写道:
>         On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell
>         <Ian.Campbell@citrix.com> wrote:
>         > On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
>         >
>         >> So, the bouncing logic seems fine. Looking at what other
>         schedulers do,
>         >> there should be no particular need for bouncing anything.
>         >
>         > Seems like there is some confusing precedent around the use
>         of sysctl vs
>         > domctl for sched parameters here.
>         >
>         > Most schedulers use domctl but arinc uses sysctl.
>         
>         They're controlling different things.
>         
>         domctl is controlling parameters related to *a particular
>         domain* (for
>         instance, weight or cap); sysctl is relating to setting
>         parameters
>         *for the scheduler as a whole* (for example, timeslice).
>         
>  
> Do we have to use inline parameters in domctl?  
> 
I think you should use inline parameters as much as possible. Of course,
if you need to transfer the parameters for N vcpus (of the same domain)
all at once, you can use an array (i.e., an handle, of course), but only
for those parameters.

So, if you, for instance, you can have something like this:

{
    integer: num_vcpus
    array[]: { time_t: budget, time_t: period }
}

This, in case you want to batch the transfer of all the parameters for
all the vcpus in one hcall. BTW, I'm not sure you need num_vcpu to be
there, it depends if you have it available already from other hcall
performed before, etc.

For an hypercall that only retrieve the parameters for _one_ _vcpu_, it
should be something like this:

{
    time_t budget
    time_t period
}

(forgive the meta-language)

You don't have to implement the second hypercall if you think we don't
need it for now, it was just an example.

> Right now, we used the domctl to set/get the parameters of each vcpu
> of a domain. So from what the functionality does, it should be in
> domctl.  If we don't have to use inline parameters, I would prefer to
> use domctl. :-)
> 
You shall use domctl! :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-17 22:16         ` Meng Xu
  2014-07-18  9:49           ` Dario Faggioli
@ 2014-07-18  9:51           ` Ian Campbell
  2014-07-18 12:11             ` Meng Xu
  1 sibling, 1 reply; 50+ messages in thread
From: Ian Campbell @ 2014-07-18  9:51 UTC (permalink / raw)
  To: Meng Xu
  Cc: Sisu Xi, Stefano Stabellini, George Dunlap, Dario Faggioli,
	Ian Jackson, xen-devel, Meng Xu, lichong659, dgolomb

On Thu, 2014-07-17 at 18:16 -0400, Meng Xu wrote:
> Hi Ian and George,
> 
> George Dunlap <George.Dunlap@eu.citrix.com>于2014年7月17日星期四写道:
>         On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell
>         <Ian.Campbell@citrix.com> wrote:
>         > On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
>         >
>         >> So, the bouncing logic seems fine. Looking at what other
>         schedulers do,
>         >> there should be no particular need for bouncing anything.
>         >
>         > Seems like there is some confusing precedent around the use
>         of sysctl vs
>         > domctl for sched parameters here.
>         >
>         > Most schedulers use domctl but arinc uses sysctl.
>         
>         They're controlling different things.
>         
>         domctl is controlling parameters related to *a particular
>         domain* (for
>         instance, weight or cap); sysctl is relating to setting
>         parameters
>         *for the scheduler as a whole* (for example, timeslice).
>         
>  
> Do we have to use inline parameters in domctl?  
> 
> 
> Right now, we used the domctl to set/get the parameters of each vcpu
> of a domain. So from what the functionality does, it should be in
> domctl.  If we don't have to use inline parameters, I would prefer to
> use domctl. :-)

The important thing is that the array of per-VCPU parameters needs to be
a GUEST_HANDLE and not a statically sized array with some arbitrary
size.

AFAICT that means that the xen_domctl_shceduler_op change should be:
+/*
+ * This structure is used to pass to rt scheduler from a 
+ * privileged domain to Xen
+ */
+struct xen_domctl_sched_rt_vcpus_params {
+    signed long period; /* s_time_t type */
+    signed long budget; 
+};
+typedef struct xen_domctl_sched_rt...
+DEFINE_XEN_GUE....

@@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
         struct xen_domctl_sched_credit2 {
             uint16_t weight;
         } credit2;
+        struct xen_domctl_sched_rt {
+            type some_per_domain_parameter;
+            uint16_t nr_vcpus;
+            uint16_t vcpu_index;
+            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_vcpu_params_t) vcpu;
+        } rt;
     } u;

Currently you don't appear to have any per_domain_parameters, but I have
included one as an illustration.

Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-18  9:47         ` Ian Campbell
@ 2014-07-18 10:00           ` Dario Faggioli
  0 siblings, 0 replies; 50+ messages in thread
From: Dario Faggioli @ 2014-07-18 10:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Sisu Xi, Stefano Stabellini, George Dunlap, Ian Jackson,
	xen-devel, Meng Xu, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1263 bytes --]

On ven, 2014-07-18 at 10:47 +0100, Ian Campbell wrote:

> Lets not repeat that mistake with this new scheduler, if the domctl (or
> a sysctl for that matter) needs per VCPU information then it should IMHO
> take the parameters inline in the struct (like the others), anything
> dynamic like the per-vcpu stuff should then be a GUEST_HANDLE pointer to
> a dynamically sized array (accessed with copy_from_guest_offset et al).
> 
Indeed.

There is one other subtle difference, between the new scheduler Meng is
working on, and the existing ones (even leaving ARINC out).

In fact, SEDF, Credit and Credit2, only allows per domain parameters,
while Meng's scheduler supports per-vcpu parameters, and hence he needs
a guest handle to deal with the array of these per-vcpu parameters.

BTW, while writing this, I saw your
<1405677108.13883.9.camel@kazak.uk.xensource.com>, which goes straight
to the point I was trying to make, and with which I totally concur. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 3/4] libxl for rt scheduler
  2014-07-17 15:36   ` Ian Campbell
@ 2014-07-18 11:05     ` Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-18 11:05 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xisisu, stefano.stabellini, george.dunlap, dario.faggioli,
	ian.jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 865 bytes --]

Hi Ian,

Ian Campbell <Ian.Campbell@citrix.com>于2014年7月17日星期四写道:

> On Fri, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
>
> > +/* Consistent with XEN_LEGACY_MAX_VCPUS xen/arch-x86/xen.h*/
> > +#define LIBXL_XEN_LEGACY_MAX_VCPUS                  32
>
> As the name suggests this limit is legacy. Modern Xen and modern guests
> can support considerably more VCPUs than this.
>
> You need to base your interface on a dynamically sized/allocated array.
> Perhaps this needs to be carried down all the way to the domctl
> interface and into Xen, I haven't looked.
>

Yes. You are right! I'm going to dynamically allocate the array, instead of
use the LIBXL_XEN_LEGACY_MAX_VCPUS, in the next version.

Thanks,

Meng



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 1204 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 4/4] libxc for rt scheduler
  2014-07-18  9:51           ` Ian Campbell
@ 2014-07-18 12:11             ` Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-18 12:11 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Sisu Xi, Stefano Stabellini, George Dunlap, Dario Faggioli,
	Ian Jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 2904 bytes --]

Hi Ian,

Ian Campbell <Ian.Campbell@citrix.com>于2014年7月18日星期五写道:

> On Thu, 2014-07-17 at 18:16 -0400, Meng Xu wrote:
> > Hi Ian and George,
> >
> > George Dunlap <George.Dunlap@eu.citrix.com <javascript:;>
> >于2014年7月17日星期四写道:
> >         On Thu, Jul 17, 2014 at 4:29 PM, Ian Campbell
> >         <Ian.Campbell@citrix.com <javascript:;>> wrote:
> >         > On Fri, 2014-07-11 at 16:49 +0200, Dario Faggioli wrote:
> >         >
> >         >> So, the bouncing logic seems fine. Looking at what other
> >         schedulers do,
> >         >> there should be no particular need for bouncing anything.
> >         >
> >         > Seems like there is some confusing precedent around the use
> >         of sysctl vs
> >         > domctl for sched parameters here.
> >         >
> >         > Most schedulers use domctl but arinc uses sysctl.
> >
> >         They're controlling different things.
> >
> >         domctl is controlling parameters related to *a particular
> >         domain* (for
> >         instance, weight or cap); sysctl is relating to setting
> >         parameters
> >         *for the scheduler as a whole* (for example, timeslice).
> >
> >
> > Do we have to use inline parameters in domctl?
> >
> >
> > Right now, we used the domctl to set/get the parameters of each vcpu
> > of a domain. So from what the functionality does, it should be in
> > domctl.  If we don't have to use inline parameters, I would prefer to
> > use domctl. :-)
>
> The important thing is that the array of per-VCPU parameters needs to be
> a GUEST_HANDLE and not a statically sized array with some arbitrary
> size.
>
> AFAICT that means that the xen_domctl_shceduler_op change should be:
> +/*
> + * This structure is used to pass to rt scheduler from a
> + * privileged domain to Xen
> + */
> +struct xen_domctl_sched_rt_vcpus_params {
> +    signed long period; /* s_time_t type */
> +    signed long budget;
> +};
> +typedef struct xen_domctl_sched_rt...
> +DEFINE_XEN_GUE....
>
> @@ -367,6 +383,9 @@ struct xen_domctl_scheduler_op {
>          struct xen_domctl_sched_credit2 {
>              uint16_t weight;
>          } credit2;
> +        struct xen_domctl_sched_rt {
> +            type some_per_domain_parameter;
> +            uint16_t nr_vcpus;
> +            uint16_t vcpu_index;
> +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_vcpu_params_t) vcpu;
> +        } rt;
>      } u;
>
> Currently you don't appear to have any per_domain_parameters, but I have
> included one as an illustration.
>
>
Thank you so much for your detailed explanation! This is exactly what I'm
doing right now. :-) It will be in the next version of the patch. :-)

Best,

Meng




-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

[-- Attachment #1.2: Type: text/html, Size: 3801 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
  2014-07-17 10:12             ` Meng Xu
  2014-07-17 15:12               ` Dario Faggioli
@ 2014-07-18 18:40               ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-18 18:40 UTC (permalink / raw)
  To: Meng Xu, josh.whitehead, robert.vanvossen, Paul.Skentzos,
	Steve.VanderLeest
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	Andrew Cooper, Dario Faggioli, ian.jackson, xen-devel, Meng Xu,
	lichong659, dgolomb

On Thu, Jul 17, 2014 at 06:12:22AM -0400, Meng Xu wrote:
> Hi Dario and Konrad,
> 
> First, thank you very much for your detailed and very useful comments on
> this patch! I'm modifying it based on your comments now. (I'm sorry I
> didn't reply promptly because I was in travel these days and cannot always
> get access to network. :-()

Hey,

CCing some of the DornerWorks folks.
> 
> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>于2014年7月16日星期三写道:
> 
> > On Fri, Jul 11, 2014 at 05:48:17PM +0200, Dario Faggioli wrote:
> > > On ven, 2014-07-11 at 16:40 +0100, Andrew Cooper wrote:
> > > > On 11/07/14 16:21, Dario Faggioli wrote:
> > >
> > > > > So, ideas? RTDS (as per Real-Time Deferrable-Server)? RRESDS
> > (Resource
> > > > > Reservation Deferrable Server)? Just DS (Deferrable Server)?
> > > > >
> > > > > I'm not sure I like much any of these...
> > > > >
> > > > > What was it that you had in mind when you said "there are several
> > > > > classes of realtime schedulers employing different semantics and
> > > > > parameters" ?
> > >
> > > > Hmm - naming things is difficult.  How about bob?
> > > >
> > > Well, if we go that route, I prefer Alice! :-D :-D
> > >
> > > > My concern is that naming it "rt" seems too generic, and will cause
> > > > confusion of a) what type of realtime scheduler it is, and b) further
> > > > problems if someone wants to come and implement a different realtime
> > > > scheduler in the future.
> > > >
> > > Totally understood and agreed.
> > >
> > > > XEN_SCHEDULER_RT_DS doesn't sound too bad.
> > > >
> > > It does not, indeed. Let's hear Meng, Sisu, and the other xen-devel
> > > folks...
> >
> > DS9 :-)
> >
> > XEN_SCHEDULER_RT_PGEDF ?
> >
> > And then we can also have
> >  XEN_SCHEDULER_RT_CBS ?
> >
> >
> I see your concerns about the name here.
> The strength of the name you proposed, such as  XEN_SCHEDULER_RT_PGEDF or
> XEN_SCHEDULER_RT_DS, is that it clearly states the characteristics of this
> scheduler.
> 
> However, if people want to implement a new server mechanism for the EDF
> scheduling policy, they actually don't have to implement a fresh new one
> and add new sched_*.c files. They only need to modify the budget_update()
> and burn_budget() functions based on the new server mechanism. (Actually,
> we can make this scheduler more generic to make it easier to add a new
> server mechanism just as the schedule.c does.)
> 
> In addition, if people want to implement a new real-time scheduling policy,
> like Rate Monotonic scheduling, they only need to modify a few lines in
> sched_rt.c. (We actually had the Rate Monotonic scheduling already, but
> only extract the EDF one for Xen right now to avoid causing unnecessary
> confusion.)
> 
> So adding new server mechanisms and adding new scheduling policy (i.e.,
> priority schemes) only requires modify several functions in sched_rt.c,
> instead of creating a fresh new file and scheduler. If we use the too
> specific name, like XEN_SCHEDULER_RT_DS, we will have to modify the name in
> the future when we extend the scheduler.  Of course, we could also
> implement a fresh new scheduler, such as XEN_SCHEDULER_RT_CBS, or
> XEN_SCHEDULER_RT_PS (periodic server), in a brand new file, but it's
> actually unnecessary to introduce a new file.

Joshua, Robert, et. al,

Would it be feasible to rebase the core changes on top of this
file instead of using the SEDF?
> 
> Of course, I'm not against using the more specific name, such as
> XEN_SCHEDULER_RT_DS. What I'm concerned is: when we implement a new server
> mechanism or new scheduling policy inside the current sched_rt.c, the too
> specific name will become impropriate and we will have to change name
> again. :-)

Right.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Introduce rt real-time scheduler for Xen
@ 2014-09-07 19:40 Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-09-07 19:40 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, JBeulich,
	chaowang, lichong659, dgolomb

This serie of patches adds rt real-time scheduler to Xen.

In summary, It supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display VCPUs' parameters of each domain (All VCPUs of each domain have the same period and budget);
3) Supports CPU Pool
Note: 
a) Although the toolstack only allows users to set the paramters of all VCPUs of the same domain to the same number, the scheduler supports to schedule VCPUs with different parameters of the same domain. In Xen 4.6, we plan to support assign/display each VCPU's parameters of each domain. 
b) Parameters of a domain at tool stack is in microsecond, instead of millisecond.

Compared with the PATCH v1, this set of patch has the following modifications:
    a) Toolstack only allows users to set the parameters of all VCPUs of the same domain to be the same value; Toostack only displays the VCPUs' period and budget of a domain. (In PATCH v1, toolstack can assign/display each VCPU's parameters of each domain, but because it is hard to reach an agreement with the libxl interface for this functionality, we decide to delay this functionality to Xen 4.6 after the scheduler is merged into Xen 4.5.)
    b) Miscellous modification of the scheduler in sched_rt.c based on Dario's detailed comments.
    c) Code style correction in libxl. 

-----------------------------------------------------------------------------------------------------------------------------
TODO after Xen 4.5:
    a) Burn budget in finer granularity instead of 1ms; [medium]
    b) Use separate timer per vcpu for each vcpu's budget replenishment, instead of scanning the full runqueue every now and then [medium]
    c) Handle time stolen from domU by hypervisor. When it runs on a machine with many sockets and lots of cores, the spin-lock for global RunQ used in rt scheduler could eat up time from domU, which could make domU have less budget than it requires. [not sure about difficulty right now] (Thank Konrad Rzeszutek to point this out in the XenSummit. :-)) 
    d) Toolstack supports assiging/display each VCPU's parameters of each domain.

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This scheduler follows the Preemptive Global Earliest Deadline First (EDF) theory in real-time field.
At any scheduling point, the VCPU with earlier deadline has higher priority. The scheduler always picks the highest priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is idle or has a lower-priority VCPU running on it.)

Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each of its periods;
A VCPU has its budget replenished at the beginning of each of its periods;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each of its periods.
If a VCPU runs out of budget in a period, it has to wait until next period.

Each VCPU is implemented as a deferable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without budget.
At the first part, VCPUs are sorted based on EDF priority scheme.

Note: cpumask and cpupool is supported.

If you are intersted in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf), which will be published in EMSOFT14. This paper has the following details:
    a) Desgin of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc.
    c) Comparison of this rt scheduler and credit scheduler in terms of the real-time performance.

If you are interested in other real-time schedulers in Xen, please refer to the RT-Xen project's website (https://sites.google.com/site/realtimexen/). It also supports Preemptive Global Rate Monotonic schedulers.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list the domains
#xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0  3344     4     r-----     146.1
vm1                                          1   512     2     r-----     155.1

//list VCPUs' parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     10000      4000
vm1                                  1     10000      4000

//set VCPUs' parameters of each domain to new value
xl sched-rt -d Domain-0 -p 20000 -b 10000
//Now all vcpus of Domain-0 have period 20000us and budget 10000us.
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     20000     10000
vm1                                  1     10000      4000

// list cpupool information              
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               4     rt_ds       y          2
#xl cpupool-list -c    
Name               CPU list    
Pool-0             0,1,2,3

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 3
#xl cpupool-cpu-remove Pool-0 2
#xl cpupool-create name=\"test\" sched=\"rt_ds\"
#xl cpupool-cpu-add test 3 
#xl cpupool-cpu-add test 2
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               2     rt_ds       y          2
test                 2     rt_ds       y          0

//migrate vm1 from cpupool Pool-0 to cpupool test.    
#xl cpupool-migrate vm1 test

//now vm1 is in cpupool test
# xl sched-rt
pupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     20000     10000
Cpupool test: sched=EDF                 
Name                                ID    Period    Budget
vm1                                  1     10000      4000

-----------------------------------------------------------------------------------------------------------------------------
Any comment, question, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

[PATCH v2 1/4] xen: add real time scheduler rt
[PATCH v2 2/4] libxc: add rt scheduler
[PATCH v2 3/4] libxl: add rt scheduler
[PATCH v2 4/4] xl: introduce rt scheduler

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Introduce rt real-time scheduler for Xen
@ 2014-08-24 22:58 Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-08-24 22:58 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

Hi all,

This serie of patches adds rt real-time scheduler to Xen.

In summary, It supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display each VCPU's parameters of each domain;
3) Supports CPU Pool

Compared with the set of patches in version RFC v2, this set of patch has the following improvement:
    a) added rt scheduler specific TRACE facility for rt scheduler
    b) more efficient RunQ implementation to avoid scanning the whole RunQ when insert a vcpu without budget.
    c) bug fix for cpupool support.

-----------------------------------------------------------------------------------------------------------------------------
TODO:
    a) Burn budget in finer granularity instead of 1ms; [medium]
    b) Use separate timer per vcpu for each vcpu's budget replenishment, instead of scanning the full runqueue every now and then [medium]
    c) Handle time stolen from domU by hypervisor. When it runs on a machine with many sockets and lots of cores, the spin-lock for global RunQ used in rt scheduler could eat up time from domU, which could make domU have less budget than it requires. [not sure about difficulty right now] (Thank Konrad Rzeszutek to point this out in the XenSummit. :-))

Plan:
    We will work on TODO a) and b) and try to finish these two items before September 10th. (We will also tackle the comments raised in the review of this set of patches.)

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory in real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods; Each VCPU discards its unused budget at the end of each of its periods. If a VCPU runs out of budget in a period, it has to wait until next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism implemented for each VCPU.
The mechanism of deciding the priority of VCPUs at each scheduling point is based on the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with earliest deadline has highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without remaining budget.
At each part, VCPUs are sorted based on GEDF priority scheme.

Scheduling quanta: 1 ms.

If you are intersted in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf), which will be published in EMSOFT14. This paper has the following details:
    a) Desgin of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc.
    c) Comparison of this rt scheduler and credit scheduler in terms of the real-time performance.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0  10000  10000
Domain-0                             0    1  20000  20000
Domain-0                             0    2  30000  30000
Domain-0                             0    3  10000  10000
litmus1                              1    0  10000   4000
litmus1                              1    1  10000   4000

//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20000 -b 10000

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0  10000   4000
litmus1                              1    1  20000  10000

// list cpupool information
xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0   

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit 
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0 

-----------------------------------------------------------------------------------------------------------------------------
This set of patches is tested by using the above scenario and running cpu intensive tasks inside each guest domain. We manually check if a domain can have its required resource without being interferenced by other domains; We also manually checked if the scheduling sequence of vcpus follows the Earliest Deadline First scheduling policy.

Any comment, question, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

[PATCH v1 1/4] xen: add real time scheduler rt
[PATCH v1 2/4] libxc: add rt scheduler
[PATCH v1 3/4] libxl: add rt scheduler
[PATCH v1 4/4] xl: introduce rt scheduler

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Introduce rt real-time scheduler for Xen
@ 2014-07-29  1:52 Meng Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Meng Xu @ 2014-07-29  1:52 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, JBeulich, lichong659, dgolomb

Hi all,

This serie of patches adds rt real-time scheduler to Xen.

In summary, It supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display each VCPU's parameters of each domain;
3) Supports CPU Pool

Based on the review comments/suggestion on version 1, the version 2 has the following improvements:
    a) Changed the interface of get/set vcpu's parameter from staticlly allocating a large array to dynamically allocating memory based on the number of vcpus of a domain. (This is a major change from v1 to v2 because many comments on v1 is about the code related to this functionality.)
    b) Changed the time unit at user interferce from 1ms to 1us; Changed the type of period and buget of a VCPU from uint16 to s_time_s.
    c) Changed the code style, rearrange the patch order, add comments to better explain the code.
    d) Domain 0 is no longer treated as a special domain. Domain 0's VCPUs are changed to be the same with domUs' VCPUs: domain 0's VCPUs have the same default value with domUs' VCPUs and are scheduled as domUs' VCPUs.
    e) Add more ASSERT(), e.g., in  __runq_insert() in sched_rt.c

-----------------------------------------------------------------------------------------------------------------------------
TODO:
    a) Add TRACE() in sched_rt.c functions.[easy]
       We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in credit2 scheduler, in rt scheduler, to debug via tracing.
    b) Split runnable and depleted (=no budget left) VCPU queues.[easy]
    c) Deal with budget overrun in the algorithm [medium]
    d) Try using timers for replenishment, instead of scanning the full runqueue every now and then [medium]
    e) Reconsider the rt_vcpu_insert() and rt_vcpu_remove() for cpu pool support. 
    f) Method of improving the performance of rt scheduler [future work]
       VCPUs of the same domain may preempt each other based on the preemptive global EDF scheduling policy. This self-switch issue does not bring benefit to the domain but introduce more overhead. When this situation happens, we can simply promote the current running lower-priority VCPU’s priority and let it  borrow budget from higher priority VCPUs to avoid such self-swtich issue.

Plan: 
    TODO a) and b) are expected in RFC v3; (2 weeks)
    TODO c), d) and e) are expected in RFC v4, v5; (3-4 weeks)
    TODO f) will be delayed after this scheduler is upstreamed because the improvement will make the scheduler not a pure global EDF scheduler.

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory in real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods; Each VCPU discards its unused budget at the end of each of its periods. If a VCPU runs out of budget in a period, it has to wait until next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism implemented for each VCPU.
The mechanism of deciding the priority of VCPUs at each scheduling point is based on the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with earliest deadline has highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without remaining budget.
At each part, VCPUs are sorted based on GEDF priority scheme.

Scheduling quanta: 1 ms.

If you are intersted in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf) in EMSOFT14. This paper has the following details:
    a) Desgin of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc. 
    c) Comparison of this rt scheduler and credit scheduler in terms of the real-time performance.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0  10000  10000
Domain-0                             0    1  20000  20000
Domain-0                             0    2  30000  30000
Domain-0                             0    3  10000  10000
litmus1                              1    0  10000   4000
litmus1                              1    1  10000   4000

//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20000 -b 10000

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0  10000   4000
litmus1                              1    1  20000  10000

// list cpupool information
xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0

-----------------------------------------------------------------------------------------------------------------------------
[PATCH RFC v2 1/4] xen: add real time scheduler rt
[PATCH RFC v2 2/4] libxc: add rt scheduler
[PATCH RFC v2 3/4] libxl: add rt scheduler
[PATCH RFC v2 4/4] xl: introduce rt scheduler
-----------------------------------------------------------------------------------------------------------------------------
Thanks for Dario, Wei, Ian, Andrew, George, and Konrad for your valuable comments and suggestions!

Any comment, question, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2014-09-07 19:40 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-11  4:49 Introduce rt real-time scheduler for Xen Meng Xu
2014-07-11  4:49 ` [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor Meng Xu
2014-07-11 14:27   ` Dario Faggioli
2014-07-11 14:37   ` Andrew Cooper
2014-07-11 15:21     ` Dario Faggioli
2014-07-11 15:40       ` Andrew Cooper
2014-07-11 15:48         ` Dario Faggioli
2014-07-16 17:05           ` Konrad Rzeszutek Wilk
2014-07-17 10:12             ` Meng Xu
2014-07-17 15:12               ` Dario Faggioli
2014-07-18  5:46                 ` Meng Xu
2014-07-18 18:40               ` Konrad Rzeszutek Wilk
2014-07-11  4:49 ` [PATCH RFC v1 2/4] xl for rt scheduler Meng Xu
2014-07-11 11:02   ` Wei Liu
2014-07-11 14:59     ` Meng Xu
2014-07-11 15:07       ` Dario Faggioli
2014-07-11 16:25         ` Meng Xu
2014-07-13 12:58         ` Meng Xu
2014-07-14  7:40           ` Dario Faggioli
2014-07-14  9:31           ` Wei Liu
2014-07-17 15:39           ` Ian Campbell
2014-07-11  4:49 ` [PATCH RFC v1 3/4] libxl " Meng Xu
2014-07-11 11:05   ` Wei Liu
2014-07-11 15:08   ` Dario Faggioli
2014-07-12 18:16     ` Meng Xu
2014-07-14 10:38       ` Dario Faggioli
2014-07-17 15:34     ` Ian Campbell
2014-07-17 15:36   ` Ian Campbell
2014-07-18 11:05     ` Meng Xu
2014-07-11  4:49 ` [PATCH RFC v1 4/4] libxc " Meng Xu
2014-07-11 14:49   ` Dario Faggioli
2014-07-11 16:23     ` Meng Xu
2014-07-11 16:35       ` Dario Faggioli
2014-07-11 16:49         ` Andrew Cooper
2014-07-12 19:46         ` Meng Xu
2014-07-17 15:29     ` Ian Campbell
2014-07-17 15:34       ` George Dunlap
2014-07-17 22:16         ` Meng Xu
2014-07-18  9:49           ` Dario Faggioli
2014-07-18  9:51           ` Ian Campbell
2014-07-18 12:11             ` Meng Xu
2014-07-18  9:47         ` Ian Campbell
2014-07-18 10:00           ` Dario Faggioli
2014-07-11 10:50 ` Introduce rt real-time scheduler for Xen Wei Liu
2014-07-11 11:06   ` Dario Faggioli
2014-07-11 16:14     ` Meng Xu
2014-07-11 16:19 ` Dario Faggioli
2014-07-29  1:52 Meng Xu
2014-08-24 22:58 Meng Xu
2014-09-07 19:40 Meng Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.