* Introduce rt real-time scheduler for Xen
@ 2014-09-07 19:40 Meng Xu
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-07 19:40 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, JBeulich,
	chaowang, lichong659, dgolomb

This series of patches adds an rt real-time scheduler to Xen.

In summary, it supports:
1) Preemptive Global Earliest Deadline First scheduling policy, using a global RunQ for the scheduler;
2) Assigning/displaying the VCPU parameters of each domain (all VCPUs of a domain have the same period and budget);
3) CPU pools
Note:
a) Although the toolstack only allows users to set the parameters of all VCPUs of the same domain to the same value, the scheduler supports scheduling VCPUs of the same domain with different parameters. In Xen 4.6, we plan to support assigning/displaying each VCPU's parameters of each domain.
b) Parameters of a domain at the toolstack level are in microseconds, not milliseconds.

Compared with PATCH v1, this set of patches has the following modifications:
    a) The toolstack only allows users to set the parameters of all VCPUs of the same domain to the same value; the toolstack only displays a domain's VCPU period and budget. (In PATCH v1, the toolstack could assign/display each VCPU's parameters of each domain, but because it is hard to reach an agreement on the libxl interface for this functionality, we decided to delay it to Xen 4.6, after the scheduler is merged in Xen 4.5.)
    b) Miscellaneous modifications of the scheduler in sched_rt.c, based on Dario's detailed comments.
    c) Code style corrections in libxl.

-----------------------------------------------------------------------------------------------------------------------------
TODO after Xen 4.5:
    a) Burn budget at a finer granularity, instead of 1ms; [medium]
    b) Use a separate timer per vcpu for each vcpu's budget replenishment, instead of scanning the full runqueue every now and then; [medium]
    c) Handle time stolen from domU by the hypervisor. On a machine with many sockets and lots of cores, the spin-lock for the global RunQ used in the rt scheduler could eat up time from domU, which could leave domU with less budget than it requires. [not sure about difficulty right now] (Thanks to Konrad Rzeszutek for pointing this out at the XenSummit. :-))
    d) Toolstack support for assigning/displaying each VCPU's parameters of each domain.

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This scheduler follows the Preemptive Global Earliest Deadline First (EDF) theory in the real-time field.
At any scheduling point, the VCPU with the earlier deadline has higher priority. The scheduler always picks the highest-priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on it and the PCPU is either idle or running a lower-priority VCPU.

Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each of its periods;
A VCPU has its budget replenished at the beginning of each of its periods;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each of its periods.
If a VCPU runs out of budget in a period, it has to wait until the next period.
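The period/budget rules above can be sketched as stand-alone C. This is a minimal illustration, not the patch's actual code (the scheduler implements this in rt_update_helper() in sched_rt.c); the type and struct names here are made up for the example, and times are in nanoseconds as in the scheduler itself:

```c
#include <assert.h>

typedef long long s_time_t;

/* Hypothetical, simplified per-VCPU state for illustration only */
struct vcpu_sketch {
    s_time_t period;       /* replenishment period */
    s_time_t budget;       /* full budget per period */
    s_time_t cur_deadline; /* end of the current period */
    s_time_t cur_budget;   /* budget remaining */
};

/*
 * If 'now' has passed the deadline, move the deadline forward a whole
 * number of periods and refill the budget; any unused budget from the
 * old period is discarded.
 */
static void update_deadline(s_time_t now, struct vcpu_sketch *v)
{
    if ( now >= v->cur_deadline )
    {
        /* 'now' may be several periods past the old deadline */
        s_time_t missed = (now - v->cur_deadline) / v->period + 1;
        v->cur_deadline += missed * v->period;
        v->cur_budget = v->budget;
    }
}
```

For example, a VCPU with a 10ms period whose deadline was at t=10ms and which is next examined at t=35ms skips ahead three periods, to a deadline of t=40ms, with a full budget.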

Each VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but budget left, its budget is preserved.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without budget.
In the first part, VCPUs are sorted by the EDF priority scheme.
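The queue ordering rule can be sketched as follows, using a plain array instead of Xen's linked list (the names here are illustrative, not the patch's; the real insertion is done by __runq_insert() in sched_rt.c). A newly runnable VCPU with budget is placed before the first entry that is either depleted or has a later deadline:

```c
#include <assert.h>

typedef long long s_time_t;

/*
 * Given parallel arrays of remaining budgets and current deadlines for
 * the queued VCPUs (budget-holding VCPUs first, deadline-sorted, then
 * depleted VCPUs), return the index at which a VCPU with deadline 'd'
 * and budget left would be inserted.
 */
static int runq_insert_pos(const s_time_t *cur_budget,
                           const s_time_t *cur_deadline,
                           int len, s_time_t d)
{
    int i;

    for ( i = 0; i < len; i++ )
        if ( cur_budget[i] == 0 || d <= cur_deadline[i] )
            break;
    return i;
}
```

Depleted VCPUs stay queued (they are still runnable once replenished) but never get ahead of any VCPU that still has budget.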

Note: cpumask and cpupool are supported.

If you are interested in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf), which will be published at EMSOFT14. This paper has the following details:
    a) Design of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc.
    c) Comparison of this rt scheduler and credit scheduler in terms of the real-time performance.

If you are interested in other real-time schedulers in Xen, please refer to the RT-Xen project's website (https://sites.google.com/site/realtimexen/). It also supports Preemptive Global Rate Monotonic schedulers.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list the domains
#xl list
Name                                        ID   Mem VCPUs  State   Time(s)
Domain-0                                     0  3344     4     r-----     146.1
vm1                                          1   512     2     r-----     155.1

//list VCPUs' parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     10000      4000
vm1                                  1     10000      4000

//set VCPUs' parameters of each domain to new value
#xl sched-rt -d Domain-0 -p 20000 -b 10000
//Now all vcpus of Domain-0 have period 20000us and budget 10000us.
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     20000     10000
vm1                                  1     10000      4000

// list cpupool information              
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               4     rt_ds       y          2
#xl cpupool-list -c    
Name               CPU list    
Pool-0             0,1,2,3

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 3
#xl cpupool-cpu-remove Pool-0 2
#xl cpupool-create name=\"test\" sched=\"rt_ds\"
#xl cpupool-cpu-add test 3 
#xl cpupool-cpu-add test 2
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               2     rt_ds       y          2
test                 2     rt_ds       y          0

//migrate vm1 from cpupool Pool-0 to cpupool test.    
#xl cpupool-migrate vm1 test

//now vm1 is in cpupool test
# xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID    Period    Budget
Domain-0                             0     20000     10000
Cpupool test: sched=EDF                 
Name                                ID    Period    Budget
vm1                                  1     10000      4000

-----------------------------------------------------------------------------------------------------------------------------
Any comments, questions, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

[PATCH v2 1/4] xen: add real time scheduler rt
[PATCH v2 2/4] libxc: add rt scheduler
[PATCH v2 3/4] libxl: add rt scheduler
[PATCH v2 4/4] xl: introduce rt scheduler

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-07 19:40 Introduce rt real-time scheduler for Xen Meng Xu
@ 2014-09-07 19:40 ` Meng Xu
  2014-09-08 14:32   ` George Dunlap
                     ` (2 more replies)
  2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-07 19:40 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, Meng Xu,
	JBeulich, chaowang, lichong659, dgolomb

This scheduler follows the Preemptive Global Earliest Deadline First
(EDF) theory in the real-time field.
At any scheduling point, the VCPU with the earlier deadline has higher
priority. The scheduler always picks the highest-priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on it and the PCPU is
either idle or running a lower-priority VCPU.

Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each of its periods;
A VCPU has its budget replenished at the beginning of each of its periods;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each of its periods.
If a VCPU runs out of budget in a period, it has to wait until the next period.

Each VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but budget left, its budget is preserved.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without budget.
In the first part, VCPUs are sorted by the EDF priority scheme.

Note: cpumask and cpupool are supported.

This is an experimental scheduler.

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
---
 xen/common/Makefile         |    1 +
 xen/common/sched_rt.c       | 1146 +++++++++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c       |    1 +
 xen/include/public/domctl.h |    6 +
 xen/include/public/trace.h  |    1 +
 xen/include/xen/sched-if.h  |    1 +
 6 files changed, 1156 insertions(+)
 create mode 100644 xen/common/sched_rt.c

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 3683ae3..5a23aa4 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,7 @@ obj-y += sched_credit.o
 obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += sched_arinc653.o
+obj-y += sched_rt.o
 obj-y += schedule.o
 obj-y += shutdown.o
 obj-y += softirq.o
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
new file mode 100644
index 0000000..412f8b1
--- /dev/null
+++ b/xen/common/sched_rt.c
@@ -0,0 +1,1146 @@
+/******************************************************************************
+ * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
+ * EDF scheduling is a real-time scheduling algorithm used in the embedded field.
+ *
+ * by Sisu Xi, 2013, Washington University in Saint Louis
+ * and Meng Xu, 2014, University of Pennsylvania
+ *
+ * based on the code of credit Scheduler
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+#include <xen/cpu.h>
+#include <xen/keyhandler.h>
+#include <xen/trace.h>
+#include <xen/guest_access.h>
+
+/*
+ * TODO:
+ *
+ * Migration compensation and resist like credit2 to better use cache;
+ * Lock Holder Problem, using yield?
+ * Self switch problem: VCPUs of the same domain may preempt each other;
+ */
+
+/*
+ * Design:
+ *
+ * This scheduler follows the Preemptive Global Earliest Deadline First (EDF)
+ * theory in the real-time field.
+ * At any scheduling point, the VCPU with the earlier deadline has higher priority.
+ * The scheduler always picks the highest-priority VCPU to run on a feasible PCPU.
+ * A PCPU is feasible if the VCPU can run on this PCPU and the PCPU is idle or
+ * has a lower-priority VCPU running on it.
+ * 
+ * Each VCPU has a dedicated period and budget.
+ * The deadline of a VCPU is at the end of each of its periods;
+ * A VCPU has its budget replenished at the beginning of each of its periods;
+ * While scheduled, a VCPU burns its budget.
+ * The VCPU needs to finish its budget before its deadline in each period;
+ * The VCPU discards its unused budget at the end of each of its periods.
+ * If a VCPU runs out of budget in a period, it has to wait until next period.
+ * 
+ * Each VCPU is implemented as a deferrable server.
+ * When a VCPU has a task running on it, its budget is continuously burned;
+ * When a VCPU has no task but budget left, its budget is preserved.
+ *
+ * Queue scheme: A global runqueue for each CPU pool. 
+ * The runqueue holds all runnable VCPUs. 
+ * VCPUs in the runqueue are divided into two parts: 
+ * with and without remaining budget. 
+ * In the first part, VCPUs are sorted by the EDF priority scheme.
+ *
+ * Note: cpumask and cpupool are supported.
+ */
+
+/*
+ * Locking:
+ * A global system lock is used to protect the RunQ.
+ * The global lock is referenced by schedule_data.schedule_lock 
+ * from all physical cpus.
+ *
+ * The lock is already grabbed when calling the wake/sleep/schedule functions
+ * in schedule.c
+ *
+ * The functions that involve the RunQ and need to grab the lock are:
+ *    vcpu_insert, vcpu_remove, context_saved, __runq_insert
+ */
+
+
+/*
+ * Default parameters: 
+ * Default period and budget are 10 ms and 4 ms, respectively
+ */
+#define RT_DS_DEFAULT_PERIOD     (MICROSECS(10000))
+#define RT_DS_DEFAULT_BUDGET     (MICROSECS(4000))
+
+/*
+ * Flags
+ */ 
+/*
+ * RT_scheduled: Is this vcpu either running on, or context-switching off,
+ * a physical cpu?
+ * + Accessed only with Runqueue lock held.
+ * + Set when chosen as next in rt_schedule().
+ * + Cleared after context switch has been saved in rt_context_saved()
+ * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we should
+ *   set RT_delayed_runq_add
+ * + Checked to be false in runq_insert.
+ */
+#define __RT_scheduled            1
+#define RT_scheduled (1<<__RT_scheduled)
+/* 
+ * RT_delayed_runq_add: Do we need to add this to the Runqueue once it's done
+ * being context switched out?
+ * + Set when scheduling out in rt_schedule() if prev is runable
+ * + Set in rt_vcpu_wake if it finds RT_scheduled set
+ * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
+ *   clears the bit.
+ */
+#define __RT_delayed_runq_add     2
+#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
+
+/*
+ * Debug only. Used to print out debug information
+ */
+#define printtime()\
+        ({s_time_t now = NOW(); \
+          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
+          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
+
+/*
+ * rt tracing events ("only" 512 available!). Check
+ * include/public/trace.h for more details.
+ */
+#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
+#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
+#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
+#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
+#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
+#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
+
+/*
+ * System-wide private data, including a global RunQueue
+ * Global lock is referenced by schedule_data.schedule_lock from all 
+ * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
+ */
+struct rt_private {
+    spinlock_t lock;           /* The global coarse-grained lock */
+    struct list_head sdom;     /* list of available domains, used for dump */
+    struct list_head runq;     /* Ordered list of runnable VMs */
+    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
+    cpumask_t cpus;            /* cpumask_t of available physical cpus */
+    cpumask_t tickled;         /* cpus that have been tickled */
+};
+
+/*
+ * Virtual CPU
+ */
+struct rt_vcpu {
+    struct list_head runq_elem; /* On the runqueue list */
+    struct list_head sdom_elem; /* On the domain VCPU list */
+
+    /* Up-pointers */
+    struct rt_dom *sdom;
+    struct vcpu *vcpu;
+
+    /* VCPU parameters, in nanoseconds */
+    s_time_t period;
+    s_time_t budget;
+
+    /* Current VCPU information, in nanoseconds */
+    s_time_t cur_budget;        /* current budget */
+    s_time_t last_start;        /* last start time */
+    s_time_t cur_deadline;      /* current deadline for EDF */
+
+    unsigned flags;             /* mark __RT_scheduled, etc.. */
+};
+
+/*
+ * Domain
+ */
+struct rt_dom {
+    struct list_head vcpu;      /* link its VCPUs */
+    struct list_head sdom_elem; /* link list on rt_priv */
+    struct domain *dom;         /* pointer to upper domain */
+};
+
+/*
+ * Useful inline functions
+ */
+static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
+{
+    return ops->sched_data;
+}
+
+static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
+{
+    return vcpu->sched_priv;
+}
+
+static inline struct rt_dom *RT_DOM(const struct domain *dom)
+{
+    return dom->sched_priv;
+}
+
+static inline struct list_head *RUNQ(const struct scheduler *ops)
+{
+    return &RT_PRIV(ops)->runq;
+}
+
+/*
+ * RunQueue helper functions
+ */
+static int
+__vcpu_on_runq(const struct rt_vcpu *svc)
+{
+   return !list_empty(&svc->runq_elem);
+}
+
+static struct rt_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, runq_elem);
+}
+
+/*
+ * Debug related code, dump vcpu/cpu information
+ */
+static void
+rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+    char cpustr[1024];
+    cpumask_t *cpupool_mask;
+
+    ASSERT(svc != NULL);
+    /* flag vcpu */
+    if( svc->sdom == NULL )
+        return;
+
+    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
+    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
+           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
+           " onR=%d runnable=%d cpu_hard_affinity=%s ",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->vcpu->processor,
+            svc->period,
+            svc->budget,
+            svc->cur_budget,
+            svc->cur_deadline,
+            svc->last_start,
+            __vcpu_on_runq(svc),
+            vcpu_runnable(svc->vcpu),
+            cpustr);
+    memset(cpustr, 0, sizeof(cpustr));
+    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
+    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
+    printk("cpupool=%s ", cpustr);
+    memset(cpustr, 0, sizeof(cpustr));
+    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
+    printk("prv->cpus=%s\n", cpustr);
+    
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned processor;
+            unsigned cur_budget_lo, cur_budget_hi;
+            unsigned cur_deadline_lo, cur_deadline_hi;
+            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.processor = svc->vcpu->processor;
+        d.cur_budget_lo = (unsigned) svc->cur_budget;
+        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
+        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
+        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
+        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
+        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
+        trace_var(TRC_RT_VCPU_DUMP, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+}
+
+static void
+rt_dump_pcpu(const struct scheduler *ops, int cpu)
+{
+    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
+
+    printtime();
+    rt_dump_vcpu(ops, svc);
+}
+
+/*
+ * No lock should be needed here; we are only dumping information
+ */
+static void
+rt_dump(const struct scheduler *ops)
+{
+    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
+    struct rt_private *prv = RT_PRIV(ops);
+    struct rt_vcpu *svc;
+    unsigned int cpu = 0;
+
+    printtime();
+
+    printk("PCPU info:\n");
+    for_each_cpu(cpu, &prv->cpus) 
+        rt_dump_pcpu(ops, cpu);
+
+    printk("Global RunQueue info:\n");
+    runq = RUNQ(ops);
+    list_for_each( iter, runq ) 
+    {
+        svc = __runq_elem(iter);
+        rt_dump_vcpu(ops, svc);
+    }
+
+    printk("Domain info:\n");
+    list_for_each( iter_sdom, &prv->sdom ) 
+    {
+        struct rt_dom *sdom;
+        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
+        printk("\tdomain: %d\n", sdom->dom->domain_id);
+
+        list_for_each( iter_svc, &sdom->vcpu ) 
+        {
+            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
+            rt_dump_vcpu(ops, svc);
+        }
+    }
+
+    printk("\n");
+}
+
+/*
+ * Update deadline and budget when the deadline is in the past;
+ * it needs to be moved forward to the current period
+ */
+static void
+rt_update_helper(s_time_t now, struct rt_vcpu *svc)
+{
+    s_time_t diff = now - svc->cur_deadline;
+
+    if ( diff >= 0 ) 
+    {
+        /* now can be later by several periods */
+        long count = ( diff/svc->period ) + 1;
+        svc->cur_deadline += count * svc->period;
+        svc->cur_budget = svc->budget;
+
+        /* TRACE */
+        {
+            struct {
+                unsigned dom:16,vcpu:16;
+                unsigned cur_budget_lo, cur_budget_hi;
+            } d;
+            d.dom = svc->vcpu->domain->domain_id;
+            d.vcpu = svc->vcpu->vcpu_id;
+            d.cur_budget_lo = (unsigned) svc->cur_budget;
+            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
+            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
+                      sizeof(d),
+                      (unsigned char *) &d);
+        }
+
+        return;
+    }
+}
+
+static inline void
+__runq_remove(struct rt_vcpu *svc)
+{
+    if ( __vcpu_on_runq(svc) )
+        list_del_init(&svc->runq_elem);
+}
+
+/*
+ * Insert svc into the RunQ according to EDF: vcpus with smaller deadlines
+ * go first.
+ */
+static void
+__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    spinlock_t *schedule_lock;
+    
+    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
+    ASSERT( spin_is_locked(schedule_lock) );
+    
+    ASSERT( !__vcpu_on_runq(svc) );
+
+    /* svc still has budget */
+    if ( svc->cur_budget > 0 ) 
+    {
+        list_for_each(iter, runq) 
+        {
+            struct rt_vcpu * iter_svc = __runq_elem(iter);
+            if ( iter_svc->cur_budget == 0 ||
+                 svc->cur_deadline <= iter_svc->cur_deadline )
+                    break;
+         }
+        list_add_tail(&svc->runq_elem, iter);
+     }
+    else 
+    {
+        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
+    }
+}
+
+/*
+ * Init/Free related code
+ */
+static int
+rt_init(struct scheduler *ops)
+{
+    struct rt_private *prv = xzalloc(struct rt_private);
+
+    printk("Initializing RT scheduler\n" \
+           " WARNING: This is experimental software in development.\n" \
+           " Use at your own risk.\n");
+
+    if ( prv == NULL )
+        return -ENOMEM;
+
+    spin_lock_init(&prv->lock);
+    INIT_LIST_HEAD(&prv->sdom);
+    INIT_LIST_HEAD(&prv->runq);
+
+    prv->flag_vcpu = xzalloc(struct rt_vcpu);
+    prv->flag_vcpu->cur_budget = 0;
+    prv->flag_vcpu->sdom = NULL; /* distinguish this vcpu from others */
+    list_add(&prv->flag_vcpu->runq_elem, &prv->runq);
+
+    cpumask_clear(&prv->cpus);
+    cpumask_clear(&prv->tickled);
+
+    ops->sched_data = prv;
+
+    printtime();
+    printk("\n");
+
+    return 0;
+}
+
+static void
+rt_deinit(const struct scheduler *ops)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+
+    printtime();
+    printk("\n");
+    xfree(prv->flag_vcpu);
+    xfree(prv);
+}
+
+/* 
+ * Point the per_cpu spinlock to the global system lock;
+ * all cpus share the same global system lock
+ */
+static void *
+rt_alloc_pdata(const struct scheduler *ops, int cpu)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+
+    cpumask_set_cpu(cpu, &prv->cpus);
+
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+
+    printtime();
+    printk("%s total cpus: %d\n", __func__, cpumask_weight(&prv->cpus));
+    /* 1 indicates alloc. succeed in schedule.c */
+    return (void *)1;
+}
+
+static void
+rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    cpumask_clear_cpu(cpu, &prv->cpus);
+}
+
+static void *
+rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
+{
+    unsigned long flags;
+    struct rt_dom *sdom;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    sdom = xzalloc(struct rt_dom);
+    if ( sdom == NULL ) 
+    {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+
+    /* spinlock here to insert the dom */
+    spin_lock_irqsave(&prv->lock, flags);
+    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
+    spin_unlock_irqrestore(&prv->lock, flags);
+
+    return sdom;
+}
+
+static void
+rt_free_domdata(const struct scheduler *ops, void *data)
+{
+    unsigned long flags;
+    struct rt_dom *sdom = data;
+    struct rt_private *prv = RT_PRIV(ops);
+
+    spin_lock_irqsave(&prv->lock, flags);
+    list_del_init(&sdom->sdom_elem);
+    spin_unlock_irqrestore(&prv->lock, flags);
+    xfree(data);
+}
+
+static int
+rt_dom_init(const struct scheduler *ops, struct domain *dom)
+{
+    struct rt_dom *sdom;
+
+    /* IDLE Domain does not link on rt_private */
+    if ( is_idle_domain(dom) ) 
+        return 0;
+
+    sdom = rt_alloc_domdata(ops, dom);
+    if ( sdom == NULL ) 
+    {
+        printk("%s, failed\n", __func__);
+        return -ENOMEM;
+    }
+    dom->sched_priv = sdom;
+
+    return 0;
+}
+
+static void
+rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
+{
+    rt_free_domdata(ops, RT_DOM(dom));
+}
+
+static void *
+rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
+{
+    struct rt_vcpu *svc;
+    s_time_t now = NOW();
+
+    /* Allocate per-VCPU info */
+    svc = xzalloc(struct rt_vcpu);
+    if ( svc == NULL ) 
+    {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&svc->runq_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    svc->flags = 0U;
+    svc->sdom = dd;
+    svc->vcpu = vc;
+    svc->last_start = 0;
+
+    svc->period = RT_DS_DEFAULT_PERIOD;
+    if ( !is_idle_vcpu(vc) )
+        svc->budget = RT_DS_DEFAULT_BUDGET;
+
+    rt_update_helper(now, svc);
+
+    /* Debug only: dump new vcpu's info */
+    rt_dump_vcpu(ops, svc);
+
+    return svc;
+}
+
+static void
+rt_free_vdata(const struct scheduler *ops, void *priv)
+{
+    struct rt_vcpu *svc = priv;
+
+    /* Debug only: dump freed vcpu's info */
+    rt_dump_vcpu(ops, svc);
+    xfree(svc);
+}
+
+/*
+ * This function is called in sched_move_domain() in schedule.c
+ * when moving a domain to a new cpupool.
+ * It inserts the vcpus of the moving domain into the scheduler's RunQ in
+ * the dest. cpupool, and inserts the rt_vcpu svc into the
+ * scheduler-specific vcpu list of the dom
+ */
+static void
+rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu *svc = RT_VCPU(vc);
+
+    /* Debug only: dump info of vcpu to insert */
+    rt_dump_vcpu(ops, svc);
+
+    /* do not add the idle vcpu to the dom vcpu list */
+    if ( is_idle_vcpu(vc) )
+        return;
+
+    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
+        __runq_insert(ops, svc);
+
+    /* add rt_vcpu svc to scheduler-specific vcpu list of the dom */
+    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
+}
+
+/*
+ * Remove rt_vcpu svc from the old scheduler in source cpupool; and
+ * Remove rt_vcpu svc from scheduler-specific vcpu list of the dom
+ */
+static void
+rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    struct rt_dom * const sdom = svc->sdom;
+
+    rt_dump_vcpu(ops, svc);
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( __vcpu_on_runq(svc) );
+
+    if ( __vcpu_on_runq(svc) )
+        __runq_remove(svc);
+
+    if ( !is_idle_vcpu(vc) ) 
+        list_del_init(&svc->sdom_elem);
+}
+
+/* 
+ * Pick a valid CPU for the vcpu vc
+ * A valid CPU for a vcpu is the intersection of the vcpu's affinity
+ * and the available cpus
+ */
+static int
+rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
+{
+    cpumask_t cpus;
+    cpumask_t *online;
+    int cpu;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, &prv->cpus, online);
+    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
+
+    cpu = cpumask_test_cpu(vc->processor, &cpus)
+            ? vc->processor 
+            : cpumask_cycle(vc->processor, &cpus);
+    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+
+    return cpu;
+}
+
+/*
+ * Burn budget at nanosecond granularity
+ */
+static void
+burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) 
+{
+    s_time_t delta;
+
+    /* don't burn budget for idle VCPU */
+    if ( is_idle_vcpu(svc->vcpu) ) 
+        return;
+
+    rt_update_helper(now, svc);
+
+    /* do not burn budget when the vcpu misses its deadline */
+    if ( now >= svc->cur_deadline )
+        return;
+
+    /* burn at nanoseconds level */
+    delta = now - svc->last_start;
+    /* 
+     * delta < 0 only happens in nested virtualization;
+     * TODO: how should we handle delta < 0 in a better way? 
+     */
+    if ( delta < 0 ) 
+    {
+        printk("%s, ATTENTION: now is behind last_start! delta = %ld",
+                __func__, delta);
+        rt_dump_vcpu(ops, svc);
+        svc->last_start = now;
+        svc->cur_budget = 0;
+        return;
+    }
+
+    if ( svc->cur_budget == 0 ) 
+        return;
+
+    svc->cur_budget -= delta;
+    if ( svc->cur_budget < 0 ) 
+        svc->cur_budget = 0;
+
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16, vcpu:16;
+            unsigned cur_budget_lo;
+            unsigned cur_budget_hi;
+            int delta;
+        } d;
+        d.dom = svc->vcpu->domain->domain_id;
+        d.vcpu = svc->vcpu->vcpu_id;
+        d.cur_budget_lo = (unsigned) svc->cur_budget;
+        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
+        d.delta = delta;
+        trace_var(TRC_RT_BUDGET_BURN, 1,
+                  sizeof(d),
+                  (unsigned char *) &d);
+    }
+}
+
+/* 
+ * The RunQ is sorted. Pick the first vcpu within the cpumask; if none, return NULL.
+ * The lock is grabbed before calling this function
+ */
+static struct rt_vcpu *
+__runq_pick(const struct scheduler *ops, cpumask_t mask)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct rt_vcpu *svc = NULL;
+    struct rt_vcpu *iter_svc = NULL;
+    cpumask_t cpu_common;
+    cpumask_t *online;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    list_for_each(iter, runq) 
+    {
+        iter_svc = __runq_elem(iter);
+
+        /* flag vcpu */
+        if(iter_svc->sdom == NULL)
+            break;
+
+        /* mask cpu_hard_affinity & cpupool & priv->cpus */
+        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
+        cpumask_and(&cpu_common, online, &prv->cpus);
+        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
+        cpumask_and(&cpu_common, &mask, &cpu_common);
+        if ( cpumask_empty(&cpu_common) )
+            continue;
+
+        ASSERT( iter_svc->cur_budget > 0 );
+
+        svc = iter_svc;
+        break;
+    }
+
+    /* TRACE */
+    {
+        if( svc != NULL )
+        {
+            struct {
+                unsigned dom:16, vcpu:16;
+                unsigned cur_deadline_lo, cur_deadline_hi;
+                unsigned cur_budget_lo, cur_budget_hi;
+            } d;
+            d.dom = svc->vcpu->domain->domain_id;
+            d.vcpu = svc->vcpu->vcpu_id;
+            d.cur_deadline_lo = (unsigned) svc->cur_deadline;
+            d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
+            d.cur_budget_lo = (unsigned) svc->cur_budget;
+            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
+            trace_var(TRC_RT_RUNQ_PICK, 1,
+                      sizeof(d),
+                      (unsigned char *) &d);
+        }
+        else
+            trace_var(TRC_RT_RUNQ_PICK, 1, 0, NULL);
+    }
+
+    return svc;
+}
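Because the RunQ is kept in EDF (deadline) order, picking reduces to a linear scan for the first entry whose affinity intersects the requested mask. A stand-alone sketch, with cpumasks modeled as 64-bit words (all names here are illustrative, not the Xen API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The queue is assumed sorted by cur_deadline (EDF order), so the first
 * entry that can run on a cpu in the mask is the winner. */
typedef struct {
    int64_t  cur_deadline;
    uint64_t affinity;   /* bit i set: may run on cpu i */
} rq_entry;

static const rq_entry *runq_pick(const rq_entry *runq, size_t n,
                                 uint64_t mask)
{
    for ( size_t i = 0; i < n; i++ )
        if ( runq[i].affinity & mask )
            return &runq[i];
    return NULL;   /* nothing runnable on these cpus */
}
```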
+
+/*
+ * Update each vcpu's budget/deadline and keep the runq sorted by
+ * re-inserting modified vcpus. Lock must be held on entry.
+ */
+static void
+__repl_update(const struct scheduler *ops, s_time_t now)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct list_head *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    list_for_each_safe(iter, tmp, runq) 
+    {
+        svc = __runq_elem(iter);
+
+        /* do not update the flag vcpu's budget */
+        if(svc->sdom == NULL)
+            continue;
+
+        rt_update_helper(now, svc);
+        /* reinsert the vcpu if its deadline is updated */
+        if ( now >= 0 )
+        {
+            __runq_remove(svc);
+            __runq_insert(ops, svc);
+        }
+    }
+}
+
+/* 
+ * schedule function for rt scheduler.
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static struct task_slice
+rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
+{
+    const int cpu = smp_processor_id();
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * const scurr = RT_VCPU(current);
+    struct rt_vcpu * snext = NULL;
+    struct task_slice ret = { .migrated = 0 };
+
+    /* clear ticked bit now that we've been scheduled */
+    if ( cpumask_test_cpu(cpu, &prv->tickled) )
+        cpumask_clear_cpu(cpu, &prv->tickled);
+
+    /* burn_budgets() returns immediately for the idle vcpu */
+    burn_budgets(ops, scurr, now);
+
+    __repl_update(ops, now);
+
+    if ( tasklet_work_scheduled ) 
+    {
+        snext = RT_VCPU(idle_vcpu[cpu]);
+    } 
+    else 
+    {
+        cpumask_t cur_cpu;
+        cpumask_clear(&cur_cpu);
+        cpumask_set_cpu(cpu, &cur_cpu);
+        snext = __runq_pick(ops, cur_cpu);
+        if ( snext == NULL )
+            snext = RT_VCPU(idle_vcpu[cpu]);
+
+        /* if scurr has higher priority and budget, still pick scurr */
+        if ( !is_idle_vcpu(current) &&
+             vcpu_runnable(current) &&
+             scurr->cur_budget > 0 &&
+             ( is_idle_vcpu(snext->vcpu) ||
+               scurr->cur_deadline <= snext->cur_deadline ) ) 
+            snext = scurr;
+    }
+
+    if ( snext != scurr &&
+         !is_idle_vcpu(current) &&
+         vcpu_runnable(current) )
+        set_bit(__RT_delayed_runq_add, &scurr->flags);
+
+    snext->last_start = now;
+    if ( !is_idle_vcpu(snext->vcpu) ) 
+    {
+        if ( snext != scurr ) 
+        {
+            __runq_remove(snext);
+            set_bit(__RT_scheduled, &snext->flags);
+        }
+        if ( snext->vcpu->processor != cpu ) 
+        {
+            snext->vcpu->processor = cpu;
+            ret.migrated = 1;
+        }
+    }
+
+    ret.time = MILLISECS(1); /* sched quantum */
+    ret.task = snext->vcpu;
+
+    /* TRACE */
+    {
+        struct {
+            unsigned dom:16,vcpu:16;
+            unsigned cur_deadline_lo, cur_deadline_hi;
+            unsigned cur_budget_lo, cur_budget_hi;
+        } d;
+        d.dom = snext->vcpu->domain->domain_id;
+        d.vcpu = snext->vcpu->vcpu_id;
+        d.cur_deadline_lo = (unsigned) snext->cur_deadline;
+        d.cur_deadline_hi = (unsigned) (snext->cur_deadline >> 32);
+        d.cur_budget_lo = (unsigned) snext->cur_budget;
+        d.cur_budget_hi = (unsigned) (snext->cur_budget >> 32);
+        trace_var(TRC_RT_SCHED_TASKLET, 1,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    return ret;
+}
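The core preemption decision in rt_schedule() — keep the current vcpu when it is runnable, still has budget, and its deadline is no later than the queue head's — can be expressed as a small predicate. A stand-alone sketch under those assumptions (names illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int     runnable;
    int64_t cur_budget;
    int64_t cur_deadline;
} sched_vcpu;

/* returns 1 when scurr should keep the pcpu, 0 when snext preempts */
static int keep_current(const sched_vcpu *scurr, const sched_vcpu *snext)
{
    if ( !scurr->runnable || scurr->cur_budget <= 0 )
        return 0;
    if ( snext == NULL )   /* queue empty: nothing to preempt with */
        return 1;
    /* EDF: earlier (or equal) deadline means higher (or equal) priority */
    return scurr->cur_deadline <= snext->cur_deadline;
}
```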
+
+/*
+ * Remove VCPU from RunQ
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static void
+rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( curr_on_cpu(vc->processor) == vc ) 
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+    else if ( __vcpu_on_runq(svc) ) 
+        __runq_remove(svc);
+    else if ( test_bit(__RT_delayed_runq_add, &svc->flags) )
+        clear_bit(__RT_delayed_runq_add, &svc->flags);
+}
+
+/*
+ * Pick a cpu to kick, to make room for the runnable candidate "new".
+ * Called by wake() and context_saved().
+ * The kick logic, among all the cpus within the candidate's affinity, is:
+ * 1) if new's previous cpu is idle, kick it: this benefits cache locality;
+ * 2) otherwise, if any pcpu is idle, kick that one;
+ * 3) otherwise all pcpus are busy: among the running vcpus, find the one
+ *    with the lowest priority (latest deadline) and, if new has higher
+ *    priority (earlier deadline), kick its cpu.
+ *
+ * TODO:
+ * 1) what if the two vcpus belong to the same domain?
+ *    Replacing a vcpu of the same domain introduces more overhead.
+ *
+ * The lock must be held when calling this function.
+ */
+static void
+runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * latest_deadline_vcpu = NULL;    /* lowest priority scheduled */
+    struct rt_vcpu * iter_svc;
+    struct vcpu * iter_vc;
+    int cpu = 0, cpu_to_tickle = 0;
+    cpumask_t not_tickled;
+    cpumask_t *online;
+
+    if ( new == NULL || is_idle_vcpu(new->vcpu) ) 
+        return;
+
+    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
+    cpumask_and(&not_tickled, online, &prv->cpus);
+    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
+    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
+
+    /* 1) if new's previous cpu is idle, kick it for cache benefit */
+    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) 
+    {
+        cpu_to_tickle = new->vcpu->processor;
+        goto out;
+    }
+
+    /* 2) if there is any idle pcpu, kick it */
+    /* the same loop also finds the vcpu with the lowest priority */
+    for_each_cpu(cpu, &not_tickled) 
+    {
+        iter_vc = curr_on_cpu(cpu);
+        if ( is_idle_vcpu(iter_vc) ) 
+        {
+            cpu_to_tickle = cpu;
+            goto out;
+        }
+        iter_svc = RT_VCPU(iter_vc);
+        if ( latest_deadline_vcpu == NULL || 
+             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
+            latest_deadline_vcpu = iter_svc;
+    }
+
+    /* 3) candidate has higher priority: kick out the lowest-priority vcpu */
+    if ( latest_deadline_vcpu != NULL && new->cur_deadline < latest_deadline_vcpu->cur_deadline ) 
+    {
+        cpu_to_tickle = latest_deadline_vcpu->vcpu->processor;
+        goto out;
+    }
+
+out:
+    /* TRACE */ 
+    {
+        struct {
+            unsigned cpu:8, pad:24;
+        } d;
+        d.cpu = cpu_to_tickle;
+        d.pad = 0;
+        trace_var(TRC_RT_TICKLE, 0,
+                  sizeof(d),
+                  (unsigned char *)&d);
+    }
+
+    cpumask_set_cpu(cpu_to_tickle, &prv->tickled);
+    cpu_raise_softirq(cpu_to_tickle, SCHEDULE_SOFTIRQ);
+    return;    
+}
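The three-step tickle policy can be modeled with per-cpu deadlines, where IDLE marks an idle pcpu. This is a sketch of the intended policy only (names and the -1 "nothing to tickle" convention are illustrative; note the patch as posted falls through to `out` and still tickles a cpu in that case):

```c
#include <assert.h>
#include <stdint.h>

#define IDLE INT64_MAX   /* deadline value modeling an idle pcpu */

static int pick_cpu_to_tickle(const int64_t *deadlines, int ncpus,
                              int prev_cpu, int64_t new_deadline)
{
    /* 1) new's previous cpu is idle: reuse it for cache warmth */
    if ( deadlines[prev_cpu] == IDLE )
        return prev_cpu;

    /* 2) any other idle cpu; the same scan remembers the latest deadline */
    int latest = -1;
    for ( int cpu = 0; cpu < ncpus; cpu++ )
    {
        if ( deadlines[cpu] == IDLE )
            return cpu;
        if ( latest < 0 || deadlines[cpu] > deadlines[latest] )
            latest = cpu;
    }

    /* 3) all busy: preempt the latest-deadline vcpu if new beats it */
    if ( new_deadline < deadlines[latest] )
        return latest;

    return -1;   /* nothing worth tickling */
}
```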
+
+/*
+ * Wake up a runnable vcpu and put it back on the RunQ;
+ * check priorities to decide whether to raise a scheduler interrupt.
+ * The lock is already grabbed in schedule.c, no need to lock here.
+ * TODO: what if the two vcpus belong to the same domain?
+ */
+static void
+rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    s_time_t now = NOW();
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) 
+        return;
+
+    /* on RunQ, just update info is ok */
+    if ( unlikely(__vcpu_on_runq(svc)) ) 
+        return;
+
+    /* If context hasn't been saved for this vcpu yet, we can't put it on
+     * the RunQ. Instead, set a flag so that it will be put on the RunQ
+     * after the context has been saved.
+     */
+    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) 
+    {
+        set_bit(__RT_delayed_runq_add, &svc->flags);
+        return;
+    }
+
+    rt_update_helper(now, svc);
+
+    __runq_insert(ops, svc);
+    __repl_update(ops, now);
+    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
+    runq_tickle(ops, snext);
+
+    return;
+}
+
+/* 
+ * scurr has finished context switch, insert it back to the RunQ,
+ * and then pick the highest priority vcpu from runq to run 
+ */
+static void
+rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * svc = RT_VCPU(vc);
+    struct rt_vcpu * snext = NULL;
+    struct rt_private * prv = RT_PRIV(ops);
+    spinlock_t *lock = vcpu_schedule_lock_irq(vc);
+
+    clear_bit(__RT_scheduled, &svc->flags);
+    /* do not insert the idle vcpu into the runq */
+    if ( is_idle_vcpu(vc) ) 
+        goto out;
+
+    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
+         likely(vcpu_runnable(vc)) ) 
+    {
+        __runq_insert(ops, svc);
+        __repl_update(ops, NOW());
+        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
+        runq_tickle(ops, snext);
+    }
+out:
+    vcpu_schedule_unlock_irq(lock, vc);
+}
+
+/*
+ * set/get scheduling parameters of a domain (shared by all its VCPUs)
+ */
+static int
+rt_dom_cntl(
+    const struct scheduler *ops, 
+    struct domain *d, 
+    struct xen_domctl_scheduler_op *op)
+{
+    struct rt_dom * const sdom = RT_DOM(d);
+    struct rt_vcpu * svc;
+    struct list_head *iter;
+    int rc = 0;
+
+    switch ( op->cmd )
+    {
+    case XEN_DOMCTL_SCHEDOP_getinfo:
+        /* for debugging: do a global dump whenever Dom0's parameters are adjusted */
+        if ( d->domain_id == 0 ) 
+            rt_dump(ops);
+
+        svc = list_entry(sdom->vcpu.next, struct rt_vcpu, sdom_elem);
+        op->u.rt.period = svc->period / MICROSECS(1); /* convert to microseconds */
+        op->u.rt.budget = svc->budget / MICROSECS(1);
+        break;
+    case XEN_DOMCTL_SCHEDOP_putinfo:
+        list_for_each( iter, &sdom->vcpu ) 
+        {
+            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
+            svc->period = MICROSECS(op->u.rt.period); /* convert to nanoseconds */
+            svc->budget = MICROSECS(op->u.rt.budget);
+        }
+        break;
+    }
+
+    return rc;
+}
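The domctl interface carries period and budget in microseconds, while the scheduler stores s_time_t values in nanoseconds; MICROSECS() performs the scaling. A stand-alone model of the round trip (the helper names are illustrative, only the MICROSECS definition mirrors Xen's):

```c
#include <assert.h>
#include <stdint.h>

/* Xen's MICROSECS() scales a microsecond count to nanoseconds */
#define MICROSECS(us) ((int64_t)(us) * 1000LL)

/* scheduler-internal ns value -> interface us value */
static uint32_t to_interface_us(int64_t ns)
{
    return (uint32_t)(ns / MICROSECS(1));
}

/* interface us value -> scheduler-internal ns value */
static int64_t to_sched_ns(uint32_t us)
{
    return MICROSECS(us);
}
```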
+
+static struct rt_private _rt_priv;
+
+const struct scheduler sched_rt_ds_def = {
+    .name           = "SMP RT DS Scheduler",
+    .opt_name       = "rt_ds",
+    .sched_id       = XEN_SCHEDULER_RT_DS,
+    .sched_data     = &_rt_priv,
+
+    .dump_cpu_state = rt_dump_pcpu,
+    .dump_settings  = rt_dump,
+    .init           = rt_init,
+    .deinit         = rt_deinit,
+    .alloc_pdata    = rt_alloc_pdata,
+    .free_pdata     = rt_free_pdata,
+    .alloc_domdata  = rt_alloc_domdata,
+    .free_domdata   = rt_free_domdata,
+    .init_domain    = rt_dom_init,
+    .destroy_domain = rt_dom_destroy,
+    .alloc_vdata    = rt_alloc_vdata,
+    .free_vdata     = rt_free_vdata,
+    .insert_vcpu    = rt_vcpu_insert,
+    .remove_vcpu    = rt_vcpu_remove,
+
+    .adjust         = rt_dom_cntl,
+
+    .pick_cpu       = rt_cpu_pick,
+    .do_schedule    = rt_schedule,
+    .sleep          = rt_vcpu_sleep,
+    .wake           = rt_vcpu_wake,
+    .context_saved  = rt_context_saved,
+};
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 73cc2ea..dc4f749 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -69,6 +69,7 @@ static const struct scheduler *schedulers[] = {
     &sched_credit_def,
     &sched_credit2_def,
     &sched_arinc653_def,
+    &sched_rt_ds_def,
 };
 
 static struct scheduler __read_mostly ops;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 69a8b44..11654d0 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -347,6 +347,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
 #define XEN_SCHEDULER_CREDIT   5
 #define XEN_SCHEDULER_CREDIT2  6
 #define XEN_SCHEDULER_ARINC653 7
+#define XEN_SCHEDULER_RT_DS    8
+
 /* Set or get info? */
 #define XEN_DOMCTL_SCHEDOP_putinfo 0
 #define XEN_DOMCTL_SCHEDOP_getinfo 1
@@ -368,6 +370,10 @@ struct xen_domctl_scheduler_op {
         struct xen_domctl_sched_credit2 {
             uint16_t weight;
         } credit2;
+        struct xen_domctl_sched_rt{
+            uint32_t period;
+            uint32_t budget;
+        } rt;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
index cfcf4aa..87340c4 100644
--- a/xen/include/public/trace.h
+++ b/xen/include/public/trace.h
@@ -77,6 +77,7 @@
 #define TRC_SCHED_CSCHED2  1
 #define TRC_SCHED_SEDF     2
 #define TRC_SCHED_ARINC653 3
+#define TRC_SCHED_RT       4
 
 /* Per-scheduler tracing */
 #define TRC_SCHED_CLASS_EVT(_c, _e) \
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 4164dff..04d81dc 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -169,6 +169,7 @@ extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
 extern const struct scheduler sched_credit2_def;
 extern const struct scheduler sched_arinc653_def;
+extern const struct scheduler sched_rt_ds_def;
 
 
 struct cpupool
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/4] libxc: add rt scheduler
  2014-09-07 19:40 Introduce rt real-time scheduler for Xen Meng Xu
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
@ 2014-09-07 19:40 ` Meng Xu
  2014-09-08 14:38   ` George Dunlap
                     ` (2 more replies)
  2014-09-07 19:41 ` [PATCH v2 3/4] libxl: " Meng Xu
  2014-09-07 19:41 ` [PATCH v2 4/4] xl: introduce " Meng Xu
  3 siblings, 3 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-07 19:40 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, Meng Xu,
	JBeulich, chaowang, lichong659, dgolomb

Add xc_sched_rt_* functions to interact with Xen to set/get a domain's
parameters for the rt scheduler.
Note: VCPU's information (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
---
 tools/libxc/Makefile  |    1 +
 tools/libxc/xc_rt.c   |   65 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h |    7 ++++++
 3 files changed, 73 insertions(+)
 create mode 100644 tools/libxc/xc_rt.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 3b04027..8db0d97 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -20,6 +20,7 @@ CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
 CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_arinc653.c
+CTRL_SRCS-y       += xc_rt.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff --git a/tools/libxc/xc_rt.c b/tools/libxc/xc_rt.c
new file mode 100644
index 0000000..e62f745
--- /dev/null
+++ b/tools/libxc/xc_rt.c
@@ -0,0 +1,65 @@
+/****************************************************************************
+ *
+ *        File: xc_rt.c
+ *      Author: Sisu Xi 
+ *              Meng Xu
+ *
+ * Description: XC Interface to the rt scheduler
+ * Note: VCPU's parameter (period, budget) is in microsecond (us).
+ *       All VCPUs of the same domain have the same period and budget.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xc_private.h"
+
+int xc_sched_rt_domain_set(xc_interface *xch,
+                           uint32_t domid,
+                           struct xen_domctl_sched_rt *sdom)
+{
+    int rc;
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.rt.period = sdom->period;
+    domctl.u.scheduler_op.u.rt.budget = sdom->budget;
+
+    rc = do_domctl(xch, &domctl);
+
+    return rc;
+}
+
+int xc_sched_rt_domain_get(xc_interface *xch,
+                           uint32_t domid,
+                           struct xen_domctl_sched_rt *sdom)
+{
+    int rc;
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+
+    rc = do_domctl(xch, &domctl);
+
+    if ( rc == 0 )
+        *sdom = domctl.u.scheduler_op.u.rt;
+
+    return rc;
+}
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 1c8aa42..a61b2a7 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -875,6 +875,13 @@ int xc_sched_credit2_domain_get(xc_interface *xch,
                                uint32_t domid,
                                struct xen_domctl_sched_credit2 *sdom);
 
+int xc_sched_rt_domain_set(xc_interface *xch,
+                          uint32_t domid,
+                          struct xen_domctl_sched_rt *sdom);
+int xc_sched_rt_domain_get(xc_interface *xch,
+                          uint32_t domid,
+                          struct xen_domctl_sched_rt *sdom);
+
 int
 xc_sched_arinc653_schedule_set(
     xc_interface *xch,
-- 
1.7.9.5


* [PATCH v2 3/4] libxl: add rt scheduler
  2014-09-07 19:40 Introduce rt real-time scheduler for Xen Meng Xu
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
  2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
@ 2014-09-07 19:41 ` Meng Xu
  2014-09-08 15:19   ` George Dunlap
  2014-09-07 19:41 ` [PATCH v2 4/4] xl: introduce " Meng Xu
  3 siblings, 1 reply; 31+ messages in thread
From: Meng Xu @ 2014-09-07 19:41 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, Meng Xu,
	JBeulich, chaowang, lichong659, dgolomb

Add libxl functions to set/get a domain's parameters for the rt scheduler.
Note: VCPU's information (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
---
 tools/libxl/libxl.c         |   75 +++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl.h         |    1 +
 tools/libxl/libxl_types.idl |    2 ++
 3 files changed, 78 insertions(+)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 2ae5fca..6840c92 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5155,6 +5155,75 @@ static int sched_sedf_domain_set(libxl__gc *gc, uint32_t domid,
     return 0;
 }
 
+static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
+                               libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt sdom;
+    int rc;
+
+    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
+    if (rc != 0) {
+        LOGE(ERROR, "getting domain sched rt");
+        return ERROR_FAIL;
+    }
+
+    libxl_domain_sched_params_init(scinfo);
+    
+    scinfo->sched = LIBXL_SCHEDULER_RT_DS;
+    scinfo->period = sdom.period;
+    scinfo->budget = sdom.budget;
+    
+    return 0;
+}
+
+#define SCHED_RT_DS_VCPU_PERIOD_UINT_MAX    4294967295U /* 2^32 - 1 us */
+#define SCHED_RT_DS_VCPU_BUDGET_UINT_MAX    SCHED_RT_DS_VCPU_PERIOD_UINT_MAX
+
+static int sched_rt_domain_set(libxl__gc *gc, uint32_t domid,
+                               const libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt sdom;
+    int rc;
+ 
+    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
+
+    if (scinfo->period != LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT) {
+        if (scinfo->period < 1 ||
+            scinfo->period > SCHED_RT_DS_VCPU_PERIOD_UINT_MAX) {
+            LOG(ERROR, "VCPU period is out of range, "
+                       "valid values are within range from 1 to %u",
+                       SCHED_RT_DS_VCPU_PERIOD_UINT_MAX);
+            return ERROR_INVAL;
+        }
+        sdom.period = scinfo->period;
+    }
+
+    if (scinfo->budget != LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT) {
+        if (scinfo->budget < 1 ||
+            scinfo->budget > SCHED_RT_DS_VCPU_BUDGET_UINT_MAX) {
+            LOG(ERROR, "VCPU budget is out of range, "
+                       "valid values are within range from 1 to %u",
+                       SCHED_RT_DS_VCPU_BUDGET_UINT_MAX);
+            return ERROR_INVAL;
+        }
+        sdom.budget = scinfo->budget;
+    }
+
+    if (sdom.budget > sdom.period) {
+        LOG(ERROR, "VCPU budget is larger than VCPU period; "
+                   "the budget must be no larger than the period");
+        return ERROR_INVAL;
+    }
+
+    rc = xc_sched_rt_domain_set(CTX->xch, domid, &sdom);
+    if (rc < 0) {
+        LOGE(ERROR, "setting domain sched rt");
+        return ERROR_FAIL;
+    }
+
+    return 0;
+}
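The validation above reduces to: period and budget each in [1, 2^32-1] microseconds, and budget no larger than the period. A stand-alone sketch of that predicate (function and macro names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define RT_PARAM_MAX 4294967295U   /* 2^32 - 1 us */

/* returns 1 when the (period, budget) pair is acceptable */
static int rt_params_valid(int64_t period, int64_t budget)
{
    if ( period < 1 || period > RT_PARAM_MAX )
        return 0;
    if ( budget < 1 || budget > RT_PARAM_MAX )
        return 0;
    return budget <= period;   /* cannot run longer than the period */
}
```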
+
 int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
                                   const libxl_domain_sched_params *scinfo)
 {
@@ -5178,6 +5247,9 @@ int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_ARINC653:
         ret=sched_arinc653_domain_set(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT_DS:
+        ret=sched_rt_domain_set(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
@@ -5208,6 +5280,9 @@ int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_CREDIT2:
         ret=sched_credit2_domain_get(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT_DS:
+        ret=sched_rt_domain_get(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 460207b..dbe736c 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1280,6 +1280,7 @@ int libxl_sched_credit_params_set(libxl_ctx *ctx, uint32_t poolid,
 #define LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT     -1
 #define LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT   -1
 #define LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT -1
+#define LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT     -1
 
 int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
                                   libxl_domain_sched_params *params);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 931c9e9..72f24fe 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -153,6 +153,7 @@ libxl_scheduler = Enumeration("scheduler", [
     (5, "credit"),
     (6, "credit2"),
     (7, "arinc653"),
+    (8, "rt_ds"),
     ])
 
 # Consistent with SHUTDOWN_* in sched.h (apart from UNKNOWN)
@@ -315,6 +316,7 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("slice",        integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT'}),
     ("latency",      integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT'}),
     ("extratime",    integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT'}),
+    ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])
 
 libxl_domain_build_info = Struct("domain_build_info",[
-- 
1.7.9.5


* [PATCH v2 4/4] xl: introduce rt scheduler
  2014-09-07 19:40 Introduce rt real-time scheduler for Xen Meng Xu
                   ` (2 preceding siblings ...)
  2014-09-07 19:41 ` [PATCH v2 3/4] libxl: " Meng Xu
@ 2014-09-07 19:41 ` Meng Xu
  2014-09-08 16:06   ` George Dunlap
  3 siblings, 1 reply; 31+ messages in thread
From: Meng Xu @ 2014-09-07 19:41 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	dario.faggioli, ian.jackson, ptxlinh, xumengpanda, Meng Xu,
	JBeulich, chaowang, lichong659, dgolomb

Add an xl command for the rt scheduler.
Note: VCPU's parameter (period, budget) is in microsecond (us).

Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
Signed-off-by: Sisu Xi <xisisu@gmail.com>
---
 docs/man/xl.pod.1         |   34 +++++++++++++
 tools/libxl/xl.h          |    1 +
 tools/libxl/xl_cmdimpl.c  |  119 +++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/xl_cmdtable.c |    8 +++
 4 files changed, 162 insertions(+)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 9d1c2a5..c2532cb 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -1035,6 +1035,40 @@ Restrict output to domains in the specified cpupool.
 
 =back
 
+=item B<sched-rt> [I<OPTIONS>]
+
+Set or get rt (Real Time) scheduler parameters. The rt scheduler applies the
+preemptive Global Earliest Deadline First real-time scheduling algorithm to
+schedule VCPUs in the system. Each VCPU has a dedicated period and budget;
+all VCPUs in the same domain have the same period and budget (in Xen 4.5).
+While scheduled, a VCPU burns its budget.
+A VCPU has its budget replenished at the beginning of each of its periods,
+and discards any unused budget at the end of each period.
+
+B<OPTIONS>
+
+=over 4
+
+=item B<-d DOMAIN>, B<--domain=DOMAIN>
+
+Specify domain for which scheduler parameters are to be modified or retrieved.
+Mandatory for modifying scheduler parameters.
+
+=item B<-p PERIOD>, B<--period=PERIOD>
+
+A VCPU replenishes its budget at the start of every period. The time unit is microsecond (us).
+
+=item B<-b BUDGET>, B<--budget=BUDGET>
+
+A VCPU has BUDGET amount of time to run in each period.
+The time unit is microsecond (us).
+
+=item B<-c CPUPOOL>, B<--cpupool=CPUPOOL>
+
+Restrict output to domains in the specified cpupool.
+
+=back
+
 =back
 
 =head1 CPUPOOLS COMMANDS
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index 10a2e66..51b634a 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
 int main_sched_credit(int argc, char **argv);
 int main_sched_credit2(int argc, char **argv);
 int main_sched_sedf(int argc, char **argv);
+int main_sched_rt(int argc, char **argv);
 int main_domid(int argc, char **argv);
 int main_domname(int argc, char **argv);
 int main_rename(int argc, char **argv);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index e6b9615..92037b1 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -5212,6 +5212,47 @@ static int sched_sedf_domain_output(
     return 0;
 }
 
+static int sched_rt_domain_output(
+    int domid)
+{
+    char *domname;
+    libxl_domain_sched_params scinfo;
+    int rc = 0;
+
+    if (domid < 0) {
+        printf("%-33s %4s %9s %9s\n", "Name", "ID", "Period", "Budget");
+        return 0;
+    }
+
+    libxl_domain_sched_params_init(&scinfo);
+    rc = sched_domain_get(LIBXL_SCHEDULER_RT_DS, domid, &scinfo);
+    if (rc)
+        goto out;
+
+    domname = libxl_domid_to_name(ctx, domid);
+    printf("%-33s %4d %9d %9d\n",
+        domname,
+        domid,
+        scinfo.period,
+        scinfo.budget);
+    free(domname);
+
+out:
+    libxl_domain_sched_params_dispose(&scinfo);
+    return rc;
+}
+
+static int sched_rt_pool_output(uint32_t poolid)
+{
+    char *poolname;
+
+    poolname = libxl_cpupoolid_to_name(ctx, poolid);
+    printf("Cpupool %s: sched=EDF\n", poolname);
+
+    free(poolname);
+    return 0;
+}
+
 static int sched_default_pool_output(uint32_t poolid)
 {
     char *poolname;
@@ -5579,6 +5620,84 @@ int main_sched_sedf(int argc, char **argv)
     return 0;
 }
 
+/*
+ * <nothing>            : List all domains' scheduling parameters
+ * -d [domid]           : List scheduling parameters for a domain
+ * -d [domid] [params]  : Set scheduling parameters for a domain
+ */
+int main_sched_rt(int argc, char **argv)
+{
+    const char *dom = NULL;
+    const char *cpupool = NULL;
+    int period = 10, opt_p = 0; /* period is in microsecond */
+    int budget = 4, opt_b = 0; /* budget is in microsecond */
+    int opt, rc;
+    static struct option opts[] = {
+        {"domain", 1, 0, 'd'},
+        {"period", 1, 0, 'p'},
+        {"budget", 1, 0, 'b'},
+        {"cpupool", 1, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
+
+    SWITCH_FOREACH_OPT(opt, "d:p:b:c:h", opts, "sched-rt", 0) {
+    case 'd':
+        dom = optarg;
+        break;
+    case 'p':
+        period = strtol(optarg, NULL, 10);
+        opt_p = 1;
+        break;
+    case 'b':
+        budget = strtol(optarg, NULL, 10);
+        opt_b = 1;
+        break;
+    case 'c':
+        cpupool = optarg;
+        break;
+    }
+
+    if (cpupool && (dom || opt_p || opt_b)) {
+        fprintf(stderr, "Specifying a cpupool is not allowed with other options.\n");
+        return 1;
+    }
+    if (!dom && (opt_p || opt_b)) {
+        fprintf(stderr, "Must specify a domain.\n");
+        return 1;
+    }
+    if ((opt_p || opt_b) && (opt_p + opt_b != 2)) {
+        fprintf(stderr, "Must specify period and budget\n");
+        return 1;
+    }
+    
+    if (!dom) { /* list all domains' rt scheduler info */
+        return -sched_domain_output(LIBXL_SCHEDULER_RT_DS,
+                                    sched_rt_domain_output,
+                                    sched_rt_pool_output,
+                                    cpupool);
+    } else {
+        uint32_t domid = find_domain(dom);
+        if (!opt_p && !opt_b) { /* output rt scheduler info */
+            sched_rt_domain_output(-1);
+            return -sched_rt_domain_output(domid);
+        } else { /* set rt scheduler parameters */
+            libxl_domain_sched_params scinfo;
+            libxl_domain_sched_params_init(&scinfo);
+            scinfo.sched = LIBXL_SCHEDULER_RT_DS;
+            scinfo.period = period;
+            scinfo.budget = budget;
+
+            rc = sched_domain_set(domid, &scinfo);
+            libxl_domain_sched_params_dispose(&scinfo);
+            if (rc)
+                return -rc;
+        }
+    }
+
+    return 0;
+}
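The option checks in main_sched_rt() form a small consistency predicate: -c excludes the other options, and -p/-b require -d and each other. A stand-alone model (the return codes and the function name are illustrative, not the xl code):

```c
#include <assert.h>

/* returns 0 when the option combination is acceptable,
 * a nonzero code identifying the first violated rule otherwise */
static int check_opts(int has_dom, int has_period, int has_budget,
                      int has_cpupool)
{
    if ( has_cpupool && (has_dom || has_period || has_budget) )
        return 1;   /* cpupool is exclusive */
    if ( !has_dom && (has_period || has_budget) )
        return 2;   /* parameters need a domain */
    if ( (has_period || has_budget) && !(has_period && has_budget) )
        return 3;   /* period and budget come as a pair */
    return 0;
}
```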
+
 int main_domid(int argc, char **argv)
 {
     uint32_t domid;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 7b7fa92..0c0e06e 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -277,6 +277,14 @@ struct cmd_spec cmd_table[] = {
       "                               --period/--slice)\n"
       "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
     },
+    { "sched-rt",
+      &main_sched_rt, 0, 1,
+      "Get/set rt scheduler parameters",
      "[-d <Domain> [-p <PERIOD>] [-b <BUDGET>]]",
+      "-d DOMAIN, --domain=DOMAIN     Domain to modify\n"
+      "-p PERIOD, --period=PERIOD     Period (us)\n"
+      "-b BUDGET, --budget=BUDGET     Budget (us)\n"
+    },
     { "domid",
       &main_domid, 0, 0,
       "Convert a domain name to domain id",
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
@ 2014-09-08 14:32   ` George Dunlap
  2014-09-08 18:44   ` George Dunlap
  2014-09-09 16:57   ` Dario Faggioli
  2 siblings, 0 replies; 31+ messages in thread
From: George Dunlap @ 2014-09-08 14:32 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, dario.faggioli,
	ian.jackson, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

Interface comments on the first pass; I'll dig into the algorithm more 
on the second pass.

On 09/07/2014 08:40 PM, Meng Xu wrote:
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 73cc2ea..dc4f749 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -69,6 +69,7 @@ static const struct scheduler *schedulers[] = {
>       &sched_credit_def,
>       &sched_credit2_def,
>       &sched_arinc653_def,
> +    &sched_rt_ds_def,

I think it would be nicer to leave the _ out of the middle -- just call 
this the "rtds" server (here and elsewhere).

>   };
>   
>   static struct scheduler __read_mostly ops;
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 69a8b44..11654d0 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -347,6 +347,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>   #define XEN_SCHEDULER_CREDIT   5
>   #define XEN_SCHEDULER_CREDIT2  6
>   #define XEN_SCHEDULER_ARINC653 7
> +#define XEN_SCHEDULER_RT_DS    8
> +
>   /* Set or get info? */
>   #define XEN_DOMCTL_SCHEDOP_putinfo 0
>   #define XEN_DOMCTL_SCHEDOP_getinfo 1
> @@ -368,6 +370,10 @@ struct xen_domctl_scheduler_op {
>           struct xen_domctl_sched_credit2 {
>               uint16_t weight;
>           } credit2;
> +        struct xen_domctl_sched_rt{
> +            uint32_t period;
> +            uint32_t budget;
> +        } rt;

I'm not sure if you meant to leave this as "rt" instead of "rtds", but I 
don't think we can assume that every other server is going to expose 
"period" and "budget": the sEDF scheduler had "slice" instead, for 
instance.  I would prefer this to be "xen_domctl_sched_rtds".

>       } u;
>   };
>   typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
> diff --git a/xen/include/public/trace.h b/xen/include/public/trace.h
> index cfcf4aa..87340c4 100644
> --- a/xen/include/public/trace.h
> +++ b/xen/include/public/trace.h
> @@ -77,6 +77,7 @@
>   #define TRC_SCHED_CSCHED2  1
>   #define TRC_SCHED_SEDF     2
>   #define TRC_SCHED_ARINC653 3
> +#define TRC_SCHED_RT       4

TRC_SCHED_RTDS

  -George


* Re: [PATCH v2 2/4] libxc: add rt scheduler
  2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
@ 2014-09-08 14:38   ` George Dunlap
  2014-09-08 14:50   ` Ian Campbell
  2014-09-08 14:53   ` Dario Faggioli
  2 siblings, 0 replies; 31+ messages in thread
From: George Dunlap @ 2014-09-08 14:38 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, dario.faggioli,
	ian.jackson, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

On 09/07/2014 08:40 PM, Meng Xu wrote:
> Add xc_sched_rt_* functions to interact with Xen to set/get domain's
> parameters for rt scheduler.
> Note: VCPU's information (period, budget) is in microsecond (us).

s/rt/rtds/g; and it looks good to me.

  -George

>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> ---
>   tools/libxc/Makefile  |    1 +
>   tools/libxc/xc_rt.c   |   65 +++++++++++++++++++++++++++++++++++++++++++++++++
>   tools/libxc/xenctrl.h |    7 ++++++
>   3 files changed, 73 insertions(+)
>   create mode 100644 tools/libxc/xc_rt.c
>
> diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
> index 3b04027..8db0d97 100644
> --- a/tools/libxc/Makefile
> +++ b/tools/libxc/Makefile
> @@ -20,6 +20,7 @@ CTRL_SRCS-y       += xc_sedf.c
>   CTRL_SRCS-y       += xc_csched.c
>   CTRL_SRCS-y       += xc_csched2.c
>   CTRL_SRCS-y       += xc_arinc653.c
> +CTRL_SRCS-y       += xc_rt.c
>   CTRL_SRCS-y       += xc_tbuf.c
>   CTRL_SRCS-y       += xc_pm.c
>   CTRL_SRCS-y       += xc_cpu_hotplug.c
> diff --git a/tools/libxc/xc_rt.c b/tools/libxc/xc_rt.c
> new file mode 100644
> index 0000000..e62f745
> --- /dev/null
> +++ b/tools/libxc/xc_rt.c
> @@ -0,0 +1,65 @@
> +/****************************************************************************
> + *
> + *        File: xc_rt.c
> + *      Author: Sisu Xi
> + *              Meng Xu
> + *
> + * Description: XC Interface to the rt scheduler
> + * Note: VCPU's parameter (period, budget) is in microsecond (us).
> + *       All VCPUs of the same domain have same period and budget.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation;
> + * version 2.1 of the License.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +
> +#include "xc_private.h"
> +
> +int xc_sched_rt_domain_set(xc_interface *xch,
> +                           uint32_t domid,
> +                           struct xen_domctl_sched_rt *sdom)
> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +
> +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> +    domctl.domain = (domid_t) domid;
> +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
> +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
> +    domctl.u.scheduler_op.u.rt.period = sdom->period;
> +    domctl.u.scheduler_op.u.rt.budget = sdom->budget;
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    return rc;
> +}
> +
> +int xc_sched_rt_domain_get(xc_interface *xch,
> +                           uint32_t domid,
> +                           struct xen_domctl_sched_rt *sdom)
> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +
> +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> +    domctl.domain = (domid_t) domid;
> +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
> +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    if ( rc == 0 )
> +        *sdom = domctl.u.scheduler_op.u.rt;
> +
> +    return rc;
> +}
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index 1c8aa42..a61b2a7 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -875,6 +875,13 @@ int xc_sched_credit2_domain_get(xc_interface *xch,
>                                  uint32_t domid,
>                                  struct xen_domctl_sched_credit2 *sdom);
>   
> +int xc_sched_rt_domain_set(xc_interface *xch,
> +                          uint32_t domid,
> +                          struct xen_domctl_sched_rt *sdom);
> +int xc_sched_rt_domain_get(xc_interface *xch,
> +                          uint32_t domid,
> +                          struct xen_domctl_sched_rt *sdom);
> +
>   int
>   xc_sched_arinc653_schedule_set(
>       xc_interface *xch,


* Re: [PATCH v2 2/4] libxc: add rt scheduler
  2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
  2014-09-08 14:38   ` George Dunlap
@ 2014-09-08 14:50   ` Ian Campbell
  2014-09-08 14:53   ` Dario Faggioli
  2 siblings, 0 replies; 31+ messages in thread
From: Ian Campbell @ 2014-09-08 14:50 UTC (permalink / raw)
  To: Meng Xu
  Cc: xisisu, stefano.stabellini, george.dunlap, lu, dario.faggioli,
	ian.jackson, xen-devel, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

On Sun, 2014-09-07 at 15:40 -0400, Meng Xu wrote:
> Add xc_sched_rt_* functions to interact with Xen to set/get domain's
> parameters for rt scheduler.
> Note: VCPU's information (period, budget) is in microsecond (us).
> 
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>

These look like correct bindings of the hypercall to me, so if the
h/visor side folks are happy with the interface:
        Acked-by: Ian Campbell <ian.campbell@citrix.com>


* Re: [PATCH v2 2/4] libxc: add rt scheduler
  2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
  2014-09-08 14:38   ` George Dunlap
  2014-09-08 14:50   ` Ian Campbell
@ 2014-09-08 14:53   ` Dario Faggioli
  2 siblings, 0 replies; 31+ messages in thread
From: Dario Faggioli @ 2014-09-08 14:53 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	ian.jackson, xen-devel, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb



On dom, 2014-09-07 at 15:40 -0400, Meng Xu wrote:
> Add xc_sched_rt_* functions to interact with Xen to set/get domain's
> parameters for rt scheduler.
> Note: VCPU's information (period, budget) is in microsecond (us).
> 
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>

This looks fine.

With the scheduler name properly updated (as George is saying, and as
pointed out below):

Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

> diff --git a/tools/libxc/xc_rt.c b/tools/libxc/xc_rt.c
> new file mode 100644
> index 0000000..e62f745
> --- /dev/null
> +++ b/tools/libxc/xc_rt.c
> @@ -0,0 +1,65 @@
> +/****************************************************************************
> + *
> + *        File: xc_rt.c
> + *      Author: Sisu Xi 
> + *              Meng Xu
> + *
> + * Description: XC Interface to the rt scheduler
> + * Note: VCPU's parameter (period, budget) is in microsecond (us).
> + *       All VCPUs of the same domain have same period and budget.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation;
> + * version 2.1 of the License.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
> + */
> +
> +#include "xc_private.h"
> +
> +int xc_sched_rt_domain_set(xc_interface *xch,
> +                           uint32_t domid,
> +                           struct xen_domctl_sched_rt *sdom)
                                     xen_domctl_sched_rtds

> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +
> +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> +    domctl.domain = (domid_t) domid;
> +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
                                                      RTDS

> +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
> +    domctl.u.scheduler_op.u.rt.period = sdom->period;
                               rtds
> +    domctl.u.scheduler_op.u.rt.budget = sdom->budget;
                               rtds

> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    return rc;
> +}
> +
> +int xc_sched_rt_domain_get(xc_interface *xch,
> +                           uint32_t domid,
> +                           struct xen_domctl_sched_rt *sdom)
                                                      rtds

> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +
> +    domctl.cmd = XEN_DOMCTL_scheduler_op;
> +    domctl.domain = (domid_t) domid;
> +    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
                                                      RTDS

> +    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
> +
> +    rc = do_domctl(xch, &domctl);
> +
> +    if ( rc == 0 )
> +        *sdom = domctl.u.scheduler_op.u.rt;
                                           rtds

> +
> +    return rc;
> +}
>
Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [PATCH v2 3/4] libxl: add rt scheduler
  2014-09-07 19:41 ` [PATCH v2 3/4] libxl: " Meng Xu
@ 2014-09-08 15:19   ` George Dunlap
  2014-09-09 12:59     ` Meng Xu
  0 siblings, 1 reply; 31+ messages in thread
From: George Dunlap @ 2014-09-08 15:19 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, dario.faggioli,
	ian.jackson, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

On 09/07/2014 08:41 PM, Meng Xu wrote:
> Add libxl functions to set/get domain's parameters for rt scheduler
> Note: VCPU's information (period, budget) is in microsecond (us).
>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> ---
>   tools/libxl/libxl.c         |   75 +++++++++++++++++++++++++++++++++++++++++++
>   tools/libxl/libxl.h         |    1 +
>   tools/libxl/libxl_types.idl |    2 ++
>   3 files changed, 78 insertions(+)
>
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 2ae5fca..6840c92 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -5155,6 +5155,75 @@ static int sched_sedf_domain_set(libxl__gc *gc, uint32_t domid,
>       return 0;
>   }
>   
> +static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
> +                               libxl_domain_sched_params *scinfo)
> +{
> +    struct xen_domctl_sched_rt sdom;
> +    int rc;
> +
> +    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
> +    if (rc != 0) {
> +        LOGE(ERROR, "getting domain sched rt");
> +        return ERROR_FAIL;
> +    }
> +
> +    libxl_domain_sched_params_init(scinfo);
> +
> +    scinfo->sched = LIBXL_SCHEDULER_RT_DS;
> +    scinfo->period = sdom.period;
> +    scinfo->budget = sdom.budget;
> +
> +    return 0;
> +}
> +
> +#define SCHED_RT_DS_VCPU_PERIOD_UINT_MAX    4294967295U /* 2^32 - 1 us */
> +#define SCHED_RT_DS_VCPU_BUDGET_UINT_MAX    SCHED_RT_DS_VCPU_PERIOD_UINT_MAX

I think what Dario was looking for was this:

#define SCHED_RT_DS_VCPU_PERIOD_MAX UINT_MAX

I.e., use the already-defined #defines with meaningful names (like 
UINT_MAX), and avoid open-coding (i.e., typing out a "magic" number, 
like 429....U).

> +
> +static int sched_rt_domain_set(libxl__gc *gc, uint32_t domid,
> +                               const libxl_domain_sched_params *scinfo)
> +{
> +    struct xen_domctl_sched_rt sdom;
> +    int rc;
> +
> +    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);

You need to check the return value here and bail out on an error.

> +
> +    if (scinfo->period != LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT) {
> +        if (scinfo->period < 1 ||
> +            scinfo->period > SCHED_RT_DS_VCPU_PERIOD_UINT_MAX) {

...but this isn't right anyway, right?  scinfo->period is a signed 
integer.  You shouldn't be comparing it to an unsigned int; and the 
comparison can never be true anyway, because even if the value is 
automatically converted to unsigned, the type isn't big enough to 
exceed UINT_MAX.

If period is allowed to be anything up to INT_MAX, then there's no need 
to check the upper bound.  Checking to make sure it's >= 1 should be 
sufficient.  Then you can just get rid of the #defines above.

> +            LOG(ERROR, "VCPU period is not set or out of range, "
> +                       "valid values are within range from 0 to %u",
> +                       SCHED_RT_DS_VCPU_PERIOD_UINT_MAX);
> +            return ERROR_INVAL;
> +        }
> +        sdom.period = scinfo->period;
> +    }
> +
> +    if (scinfo->budget != LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT) {
> +        if (scinfo->budget < 1 ||
> +            scinfo->budget > SCHED_RT_DS_VCPU_BUDGET_UINT_MAX) {

Same here.

> +            LOG(ERROR, "VCPU budget is not set or out of range, "
> +                       "valid values are within range from 0 to %u",
> +                       SCHED_RT_DS_VCPU_BUDGET_UINT_MAX);
> +            return ERROR_INVAL;
> +        }
> +        sdom.budget = scinfo->budget;
> +    }
> +
> +    if (sdom.budget > sdom.period) {
> +        LOG(ERROR, "VCPU budget is larger than VCPU period, "
> +                   "VCPU budget should be no larger than VCPU period");
> +        return ERROR_INVAL;
> +    }
> +
> +    rc = xc_sched_rt_domain_set(CTX->xch, domid, &sdom);
> +    if (rc < 0) {
> +        LOGE(ERROR, "setting domain sched rt");
> +        return ERROR_FAIL;
> +    }
> +
> +    return 0;
> +}
> +
>   int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
>                                     const libxl_domain_sched_params *scinfo)
>   {
> @@ -5178,6 +5247,9 @@ int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
>       case LIBXL_SCHEDULER_ARINC653:
>           ret=sched_arinc653_domain_set(gc, domid, scinfo);
>           break;
> +    case LIBXL_SCHEDULER_RT_DS:
> +        ret=sched_rt_domain_set(gc, domid, scinfo);
> +        break;
>       default:
>           LOG(ERROR, "Unknown scheduler");
>           ret=ERROR_INVAL;
> @@ -5208,6 +5280,9 @@ int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
>       case LIBXL_SCHEDULER_CREDIT2:
>           ret=sched_credit2_domain_get(gc, domid, scinfo);
>           break;
> +    case LIBXL_SCHEDULER_RT_DS:
> +        ret=sched_rt_domain_get(gc, domid, scinfo);
> +        break;
>       default:
>           LOG(ERROR, "Unknown scheduler");
>           ret=ERROR_INVAL;
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> index 460207b..dbe736c 100644
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -1280,6 +1280,7 @@ int libxl_sched_credit_params_set(libxl_ctx *ctx, uint32_t poolid,
>   #define LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT     -1
>   #define LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT   -1
>   #define LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT -1
> +#define LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT     -1
>   
>   int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
>                                     libxl_domain_sched_params *params);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index 931c9e9..72f24fe 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -153,6 +153,7 @@ libxl_scheduler = Enumeration("scheduler", [
>       (5, "credit"),
>       (6, "credit2"),
>       (7, "arinc653"),
> +    (8, "rt_ds"),

rtds

Other than that, looks good.

  -George


* Re: [PATCH v2 4/4] xl: introduce rt scheduler
  2014-09-07 19:41 ` [PATCH v2 4/4] xl: introduce " Meng Xu
@ 2014-09-08 16:06   ` George Dunlap
  2014-09-08 16:16     ` Dario Faggioli
  2014-09-09 13:14     ` Meng Xu
  0 siblings, 2 replies; 31+ messages in thread
From: George Dunlap @ 2014-09-08 16:06 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, dario.faggioli,
	ian.jackson, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

On 09/07/2014 08:41 PM, Meng Xu wrote:
> Add xl command for rt scheduler
> Note: VCPU's parameter (period, budget) is in microsecond (us).
>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> ---
>   docs/man/xl.pod.1         |   34 +++++++++++++
>   tools/libxl/xl.h          |    1 +
>   tools/libxl/xl_cmdimpl.c  |  119 +++++++++++++++++++++++++++++++++++++++++++++
>   tools/libxl/xl_cmdtable.c |    8 +++
>   4 files changed, 162 insertions(+)
>
> diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
> index 9d1c2a5..c2532cb 100644
> --- a/docs/man/xl.pod.1
> +++ b/docs/man/xl.pod.1
> @@ -1035,6 +1035,40 @@ Restrict output to domains in the specified cpupool.
>   
>   =back
>   
> +=item B<sched-rt> [I<OPTIONS>]

sched-rtds, I think.

> +
> +Set or get rt (Real Time) scheduler parameters. This rt scheduler applies
> +Preemptive Global Earliest Deadline First real-time scheduling algorithm to
> +schedule VCPUs in the system. Each VCPU has a dedicated period and budget.
> +VCPUs in the same domain have the same period and budget (in Xen 4.5).
> +While scheduled, a VCPU burns its budget.
> +A VCPU has its budget replenished at the beginning of each of its periods;
> +The VCPU discards its unused budget at the end of its periods.

I think I would say, "A VCPU has its budget replenished at the beginning 
of each period; unused budget is discarded at the end of each period."

> +
> +B<OPTIONS>
> +
> +=over 4
> +
> +=item B<-d DOMAIN>, B<--domain=DOMAIN>
> +
> +Specify domain for which scheduler parameters are to be modified or retrieved.
> +Mandatory for modifying scheduler parameters.
> +
> +=item B<-p PERIOD>, B<--period=PERIOD>
> +
> +A VCPU replenish its budget in every period. Time unit is millisecond.

I think I'd say: "Period of time, in milliseconds, over which to 
replenish the budget."

> +
> +=item B<-b BUDGET>, B<--budget=BUDGET>
> +
> +A VCPU has BUDGET amount of time to run for each period.
> +Time unit is millisecond.

"Amount of time, in milliseconds, that the VCPU will be allowed to run 
every period."

> +
> +=item B<-c CPUPOOL>, B<--cpupool=CPUPOOL>
> +
> +Restrict output to domains in the specified cpupool.
> +
> +=back
> +
>   =back
>   
>   =head1 CPUPOOLS COMMANDS
> diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
> index 10a2e66..51b634a 100644
> --- a/tools/libxl/xl.h
> +++ b/tools/libxl/xl.h
> @@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
>   int main_sched_credit(int argc, char **argv);
>   int main_sched_credit2(int argc, char **argv);
>   int main_sched_sedf(int argc, char **argv);
> +int main_sched_rt(int argc, char **argv);

main_sched_rtds

>   int main_domid(int argc, char **argv);
>   int main_domname(int argc, char **argv);
>   int main_rename(int argc, char **argv);
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index e6b9615..92037b1 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -5212,6 +5212,47 @@ static int sched_sedf_domain_output(
>       return 0;
>   }
>   
> +static int sched_rt_domain_output(
> +    int domid)
> +{
> +    char *domname;
> +    libxl_domain_sched_params scinfo;
> +    int rc = 0;
> +
> +    if (domid < 0) {
> +        printf("%-33s %4s %9s %9s\n", "Name", "ID", "Period", "Budget");
> +        return 0;
> +    }
> +
> +    libxl_domain_sched_params_init(&scinfo);
> +    rc = sched_domain_get(LIBXL_SCHEDULER_RT_DS, domid, &scinfo);

Hmm, the other callers of sched_domain_get() don't call 
libxl_domain_sched_params_init(); but reading through libxl.h, it looks 
like that's actually a mistake:

  * ...the user must
  * always call the "init" function before using a type, even if the
  * variable is simply being passed by reference as an out parameter
  * to a libxl function.

Meng, would you be willing to put on your "to-do list" to send a 
follow-up patch to clean this up?

I think what should probably actually be done is that sched_domain_get() 
should call libxl_domain_sched_params_init() before calling 
libxl_domain_sched_params_get().  But I'm sure IanJ will have opinions 
on that.

> +    if (rc)
> +        goto out;
> +
> +    domname = libxl_domid_to_name(ctx, domid);
> +    printf("%-33s %4d %9d %9d\n",
> +        domname,
> +        domid,
> +        scinfo.period,
> +        scinfo.budget);
> +    free(domname);
> +
> +out:
> +    libxl_domain_sched_params_dispose(&scinfo);
> +    return rc;
> +}
> +
> +static int sched_rt_pool_output(uint32_t poolid)
> +{
> +    char *poolname;
> +
> +    poolname = libxl_cpupoolid_to_name(ctx, poolid);
> +    printf("Cpupool %s: sched=EDF\n", poolname);

Should we change this to "RTDS"?

> +
> +    free(poolname);
> +    return 0;
> +}
> +
>   static int sched_default_pool_output(uint32_t poolid)
>   {
>       char *poolname;
> @@ -5579,6 +5620,84 @@ int main_sched_sedf(int argc, char **argv)
>       return 0;
>   }
>   
> +/*
> + * <nothing>            : List all domain paramters and sched params
> + * -d [domid]           : List domain params for domain
> + * -d [domid] [params]  : Set domain params for domain
> + */
> +int main_sched_rt(int argc, char **argv)
> +{
> +    const char *dom = NULL;
> +    const char *cpupool = NULL;
> +    int period = 10, opt_p = 0; /* period is in microsecond */
> +    int budget = 4, opt_b = 0; /* budget is in microsecond */

We might as well make opt_p and opt_b  of type "bool".

Why are you setting the values for period and budget here?  It looks 
like they're either never used (if either one or both are not set on the 
command line), or they're clobbered (when both are set).

If gcc doesn't complain, just leave them uninitialized.  If it does 
complain, then just initialize them to 0 -- that will make sure that it 
returns an error if there ever *is* a path which doesn't actually set 
the value.

> +    int opt, rc;
> +    static struct option opts[] = {
> +        {"domain", 1, 0, 'd'},
> +        {"period", 1, 0, 'p'},
> +        {"budget", 1, 0, 'b'},
> +        {"cpupool", 1, 0, 'c'},
> +        COMMON_LONG_OPTS,
> +        {0, 0, 0, 0}
> +    };
> +
> +    SWITCH_FOREACH_OPT(opt, "d:p:b:c:h", opts, "sched-rt", 0) {
> +    case 'd':
> +        dom = optarg;
> +        break;
> +    case 'p':
> +        period = strtol(optarg, NULL, 10);
> +        opt_p = 1;
> +        break;
> +    case 'b':
> +        budget = strtol(optarg, NULL, 10);
> +        opt_b = 1;
> +        break;
> +    case 'c':
> +        cpupool = optarg;
> +        break;
> +    }
> +
> +    if (cpupool && (dom || opt_p || opt_b)) {
> +        fprintf(stderr, "Specifying a cpupool is not allowed with other options.\n");
> +        return 1;
> +    }
> +    if (!dom && (opt_p || opt_b)) {
> +        fprintf(stderr, "Must specify a domain.\n");
> +        return 1;
> +    }
> +    if ((opt_p || opt_b) && (opt_p + opt_b != 2)) {

Maybe, "if (opt_p != opt_b)"?

> +        fprintf(stderr, "Must specify period and budget\n");
> +        return 1;
> +    }
> +
> +    if (!dom) { /* list all domain's rt scheduler info */
> +        return -sched_domain_output(LIBXL_SCHEDULER_RT_DS,
> +                                    sched_rt_domain_output,
> +                                    sched_rt_pool_output,
> +                                    cpupool);
> +    } else {
> +        uint32_t domid = find_domain(dom);
> +        if (!opt_p && !opt_b) { /* output rt scheduler info */
> +            sched_rt_domain_output(-1);
> +            return -sched_rt_domain_output(domid);
> +        } else { /* set rt scheduler paramaters */
> +            libxl_domain_sched_params scinfo;
> +            libxl_domain_sched_params_init(&scinfo);
> +            scinfo.sched = LIBXL_SCHEDULER_RT_DS;
> +            scinfo.period = period;
> +            scinfo.budget = budget;
> +
> +            rc = sched_domain_set(domid, &scinfo);
> +            libxl_domain_sched_params_dispose(&scinfo);
> +            if (rc)
> +                return -rc;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>   int main_domid(int argc, char **argv)
>   {
>       uint32_t domid;
> diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
> index 7b7fa92..0c0e06e 100644
> --- a/tools/libxl/xl_cmdtable.c
> +++ b/tools/libxl/xl_cmdtable.c
> @@ -277,6 +277,14 @@ struct cmd_spec cmd_table[] = {
>         "                               --period/--slice)\n"
>         "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
>       },
> +    { "sched-rt",

sched-rtds

Right, starting to get close. :-)

  -George

> +      &main_sched_rt, 0, 1,
> +      "Get/set rt scheduler parameters",
> +      "[-d <Domain> [-p[=PERIOD]] [-b[=BUDGET]]]",
> +      "-d DOMAIN, --domain=DOMAIN     Domain to modify\n"
> +      "-p PERIOD, --period=PERIOD     Period (us)\n"
> +      "-b BUDGET, --budget=BUDGET     Budget (us)\n"
> +    },
>       { "domid",
>         &main_domid, 0, 0,
>         "Convert a domain name to domain id",


* Re: [PATCH v2 4/4] xl: introduce rt scheduler
  2014-09-08 16:06   ` George Dunlap
@ 2014-09-08 16:16     ` Dario Faggioli
  2014-09-09 13:14     ` Meng Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Dario Faggioli @ 2014-09-08 16:16 UTC (permalink / raw)
  To: George Dunlap
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, ian.jackson,
	xen-devel, ptxlinh, xumengpanda, Meng Xu, JBeulich, chaowang,
	lichong659, dgolomb



On lun, 2014-09-08 at 17:06 +0100, George Dunlap wrote:
> On 09/07/2014 08:41 PM, Meng Xu wrote:

> > diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
> > index 7b7fa92..0c0e06e 100644
> > --- a/tools/libxl/xl_cmdtable.c
> > +++ b/tools/libxl/xl_cmdtable.c
> > @@ -277,6 +277,14 @@ struct cmd_spec cmd_table[] = {
> >         "                               --period/--slice)\n"
> >         "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
> >       },
> > +    { "sched-rt",
> 
> sched-rtds
> 
> Right, starting to get close. :-)
> 
>   -George
> 
So, Meng, I've just skimmed George's comments on patches 3 and 4, and I
agree with all the points he makes.

I'll have a closer look and see if I've got any other comments of my
own, but not before I've looked carefully at and commented on the new
version of patch 1.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
  2014-09-08 14:32   ` George Dunlap
@ 2014-09-08 18:44   ` George Dunlap
  2014-09-09  9:42     ` Dario Faggioli
  2014-09-09 12:46     ` Meng Xu
  2014-09-09 16:57   ` Dario Faggioli
  2 siblings, 2 replies; 31+ messages in thread
From: George Dunlap @ 2014-09-08 18:44 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, dario.faggioli,
	ian.jackson, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

On 09/07/2014 08:40 PM, Meng Xu wrote:
> This scheduler follows the Preemptive Global Earliest Deadline First
> (EDF) theory from the real-time field.
> At any scheduling point, the VCPU with earlier deadline has higher
> priority. The scheduler always picks the highest priority VCPU to run on a
> feasible PCPU.
> A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
> idle or has a lower-priority VCPU running on it.)
>
> Each VCPU has a dedicated period and budget.
> The deadline of a VCPU is at the end of each of its periods;
> A VCPU has its budget replenished at the beginning of each of its periods;
> While scheduled, a VCPU burns its budget.
> The VCPU needs to finish its budget before its deadline in each period;
> The VCPU discards its unused budget at the end of each of its periods.
> If a VCPU runs out of budget in a period, it has to wait until next period.
>
> Each VCPU is implemented as a deferrable server.
> When a VCPU has a task running on it, its budget is continuously burned;
> When a VCPU has no task but with budget left, its budget is preserved.
>
> Queue scheme: A global runqueue for each CPU pool.
> The runqueue holds all runnable VCPUs.
> VCPUs in the runqueue are divided into two parts: with and without budget.
> In the first part, VCPUs are sorted by the EDF priority scheme.
>
> Note: cpumask and cpupool is supported.
>
> This is an experimental scheduler.
>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> ---
>   xen/common/Makefile         |    1 +
>   xen/common/sched_rt.c       | 1146 +++++++++++++++++++++++++++++++++++++++++++
>   xen/common/schedule.c       |    1 +
>   xen/include/public/domctl.h |    6 +
>   xen/include/public/trace.h  |    1 +
>   xen/include/xen/sched-if.h  |    1 +
>   6 files changed, 1156 insertions(+)
>   create mode 100644 xen/common/sched_rt.c
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index 3683ae3..5a23aa4 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>   obj-y += sched_credit2.o
>   obj-y += sched_sedf.o
>   obj-y += sched_arinc653.o
> +obj-y += sched_rt.o
>   obj-y += schedule.o
>   obj-y += shutdown.o
>   obj-y += softirq.o
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..412f8b1
> --- /dev/null
> +++ b/xen/common/sched_rt.c
> @@ -0,0 +1,1146 @@
> +/******************************************************************************
> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
> + * EDF scheduling is a real-time scheduling algorithm used in the embedded field.
> + *
> + * by Sisu Xi, 2013, Washington University in Saint Louis
> + * and Meng Xu, 2014, University of Pennsylvania
> + *
> + * based on the code of credit Scheduler
> + */
> +
> +#include <xen/config.h>
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +#include <xen/sched.h>
> +#include <xen/domain.h>
> +#include <xen/delay.h>
> +#include <xen/event.h>
> +#include <xen/time.h>
> +#include <xen/perfc.h>
> +#include <xen/sched-if.h>
> +#include <xen/softirq.h>
> +#include <asm/atomic.h>
> +#include <xen/errno.h>
> +#include <xen/trace.h>
> +#include <xen/cpu.h>
> +#include <xen/keyhandler.h>
> +#include <xen/trace.h>
> +#include <xen/guest_access.h>
> +
> +/*
> + * TODO:
> + *
> + * Migration compensation and resist like credit2 to better use cache;
> + * Lock Holder Problem, using yield?
> + * Self switch problem: VCPUs of the same domain may preempt each other;
> + */
> +
> +/*
> + * Design:
> + *
> + * This scheduler follows the Preemptive Global Earliest Deadline First (EDF)
> + * theory from the real-time field.
> + * At any scheduling point, the VCPU with earlier deadline has higher priority.
> + * The scheduler always picks the highest priority VCPU to run on a feasible PCPU.
> + * A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is idle or
> + * has a lower-priority VCPU running on it.)
> + *
> + * Each VCPU has a dedicated period and budget.
> + * The deadline of a VCPU is at the end of each of its periods;
> + * A VCPU has its budget replenished at the beginning of each of its periods;
> + * While scheduled, a VCPU burns its budget.
> + * The VCPU needs to finish its budget before its deadline in each period;
> + * The VCPU discards its unused budget at the end of each of its periods.
> + * If a VCPU runs out of budget in a period, it has to wait until next period.
> + *
> + * Each VCPU is implemented as a deferrable server.
> + * When a VCPU has a task running on it, its budget is continuously burned;
> + * When a VCPU has no task but with budget left, its budget is preserved.
> + *
> + * Queue scheme: A global runqueue for each CPU pool.
> + * The runqueue holds all runnable VCPUs.
> + * VCPUs in the runqueue are divided into two parts:
> + * with and without remaining budget.
> + * In the first part, VCPUs are sorted by the EDF priority scheme.
> + *
> + * Note: cpumask and cpupool is supported.
> + */
> +
> +/*
> + * Locking:
> + * A global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock
> + * from all physical cpus.
> + *
> + * The lock is already grabbed when calling wake/sleep/schedule/ functions
> + * in schedule.c
> + *
> + * The functions that involve the RunQ and need to grab locks are:
> + *    vcpu_insert, vcpu_remove, context_saved, __runq_insert
> + */
> +
> +
> +/*
> + * Default parameters:
> + * The default period and budget are 10 and 4 ms, respectively
> + */
> +#define RT_DS_DEFAULT_PERIOD     (MICROSECS(10000))
> +#define RT_DS_DEFAULT_BUDGET     (MICROSECS(4000))
> +
> +/*
> + * Flags
> + */
> +/*
> + * RT_scheduled: Is this vcpu either running on, or context-switching off,
> + * a physical cpu?
> + * + Accessed only with Runqueue lock held.
> + * + Set when chosen as next in rt_schedule().
> + * + Cleared after context switch has been saved in rt_context_saved()
> + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we should
> + *   set RT_delayed_runq_add
> + * + Checked to be false in runq_insert.
> + */
> +#define __RT_scheduled            1
> +#define RT_scheduled (1<<__RT_scheduled)
> +/*
> + * RT_delayed_runq_add: Do we need to add this to the Runqueue once it's done
> + * being context-switched out?
> + * + Set when scheduling out in rt_schedule() if prev is runnable
> + * + Set in rt_vcpu_wake if it finds RT_scheduled set
> + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
> + *   clears the bit.
> + */
> +#define __RT_delayed_runq_add     2
> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
> +
> +/*
> + * Debug only. Used to print out debug information
> + */
> +#define printtime()\
> +        ({s_time_t now = NOW(); \
> +          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
> +          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
> +
> +/*
> + * rt tracing events ("only" 512 available!). Check
> + * include/public/trace.h for more details.
> + */
> +#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
> +#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
> +#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
> +#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
> +#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
> +#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
> +
> +/*
> + * System-wide private data, including a global RunQueue
> + * Global lock is referenced by schedule_data.schedule_lock from all
> + * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
> + */
> +struct rt_private {
> +    spinlock_t lock;           /* The global coarse-grained lock */
> +    struct list_head sdom;     /* list of available domains, used for dump */
> +    struct list_head runq;     /* Ordered list of runnable VMs */
> +    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
> +    cpumask_t cpus;            /* cpumask_t of available physical cpus */
> +    cpumask_t tickled;         /* cpus been tickled */
> +};
> +
> +/*
> + * Virtual CPU
> + */
> +struct rt_vcpu {
> +    struct list_head runq_elem; /* On the runqueue list */
> +    struct list_head sdom_elem; /* On the domain VCPU list */
> +
> +    /* Up-pointers */
> +    struct rt_dom *sdom;
> +    struct vcpu *vcpu;
> +
> +    /* VCPU parameters, in nanoseconds */
> +    s_time_t period;
> +    s_time_t budget;
> +
> +    /* VCPU current information, in nanoseconds */
> +    s_time_t cur_budget;        /* current budget */
> +    s_time_t last_start;        /* last start time */
> +    s_time_t cur_deadline;      /* current deadline for EDF */
> +
> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
> +};
> +
> +/*
> + * Domain
> + */
> +struct rt_dom {
> +    struct list_head vcpu;      /* link its VCPUs */
> +    struct list_head sdom_elem; /* link list on rt_priv */
> +    struct domain *dom;         /* pointer to upper domain */
> +};
> +
> +/*
> + * Useful inline functions
> + */
> +static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
> +{
> +    return ops->sched_data;
> +}
> +
> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
> +{
> +    return vcpu->sched_priv;
> +}
> +
> +static inline struct rt_dom *RT_DOM(const struct domain *dom)
> +{
> +    return dom->sched_priv;
> +}
> +
> +static inline struct list_head *RUNQ(const struct scheduler *ops)
> +{
> +    return &RT_PRIV(ops)->runq;
> +}
> +
> +/*
> + * RunQueue helper functions
> + */
> +static int
> +__vcpu_on_runq(const struct rt_vcpu *svc)
> +{
> +   return !list_empty(&svc->runq_elem);
> +}
> +
> +static struct rt_vcpu *
> +__runq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, runq_elem);
> +}
> +
> +/*
> + * Debug related code, dump vcpu/cpu information
> + */
> +static void
> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +    char cpustr[1024];
> +    cpumask_t *cpupool_mask;
> +
> +    ASSERT(svc != NULL);
> +    /* flag vcpu */
> +    if( svc->sdom == NULL )
> +        return;
> +
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
> +            svc->vcpu->domain->domain_id,
> +            svc->vcpu->vcpu_id,
> +            svc->vcpu->processor,
> +            svc->period,
> +            svc->budget,
> +            svc->cur_budget,
> +            svc->cur_deadline,
> +            svc->last_start,
> +            __vcpu_on_runq(svc),
> +            vcpu_runnable(svc->vcpu),
> +            cpustr);
> +    memset(cpustr, 0, sizeof(cpustr));
> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
> +    printk("cpupool=%s ", cpustr);
> +    memset(cpustr, 0, sizeof(cpustr));
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
> +    printk("prv->cpus=%s\n", cpustr);
> +
> +    /* TRACE */
> +    {
> +        struct {
> +            unsigned dom:16,vcpu:16;
> +            unsigned processor;
> +            unsigned cur_budget_lo, cur_budget_hi;
> +            unsigned cur_deadline_lo, cur_deadline_hi;
> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
> +        } d;
> +        d.dom = svc->vcpu->domain->domain_id;
> +        d.vcpu = svc->vcpu->vcpu_id;
> +        d.processor = svc->vcpu->processor;
> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
> +        trace_var(TRC_RT_VCPU_DUMP, 1,
> +                  sizeof(d),
> +                  (unsigned char *)&d);
> +    }
> +}
> +
> +static void
> +rt_dump_pcpu(const struct scheduler *ops, int cpu)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
> +
> +    printtime();
> +    rt_dump_vcpu(ops, svc);
> +}
> +
> +/*
> + * should not need lock here. only showing stuff
> + */

This isn't true -- you're walking both the runqueue and the lists of 
domains and vcpus, each of which may change under your feet.

> +static void
> +rt_dump(const struct scheduler *ops)
> +{
> +    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct rt_vcpu *svc;
> +    unsigned int cpu = 0;
> +
> +    printtime();
> +
> +    printk("PCPU info:\n");
> +    for_each_cpu(cpu, &prv->cpus)
> +        rt_dump_pcpu(ops, cpu);
> +
> +    printk("Global RunQueue info:\n");
> +    runq = RUNQ(ops);
> +    list_for_each( iter, runq )
> +    {
> +        svc = __runq_elem(iter);
> +        rt_dump_vcpu(ops, svc);
> +    }
> +
> +    printk("Domain info:\n");
> +    list_for_each( iter_sdom, &prv->sdom )
> +    {
> +        struct rt_dom *sdom;
> +        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
> +        printk("\tdomain: %d\n", sdom->dom->domain_id);
> +
> +        list_for_each( iter_svc, &sdom->vcpu )
> +        {
> +            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
> +            rt_dump_vcpu(ops, svc);
> +        }
> +    }
> +
> +    printk("\n");
> +}
> +
> +/*
> + * update deadline and budget when deadline is in the past,
> + * it needs to be updated to the deadline of the current period
> + */
> +static void
> +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
> +{
> +    s_time_t diff = now - svc->cur_deadline;
> +
> +    if ( diff >= 0 )
> +    {
> +        /* now may be several periods later */
> +        long count = ( diff/svc->period ) + 1;
> +        svc->cur_deadline += count * svc->period;
> +        svc->cur_budget = svc->budget;

In the common case, don't we expect diff/svc->period to be a small 
number, like 0 or 1?  If so, since division and multiplication are so 
expensive, it might make more sense to make this a while() loop:

  while ( now - svc->cur_deadline > 0 )
  {
   svc->cur_deadline += svc->period;
   count++;
  }
  if ( count ) {
   svc->cur_budget = svc->budget;
   [tracing code]
  }

And similarly for the other 64-bit division Dario was asking about below?

I probably wouldn't  make this a precondition of going in.

> +
> +        /* TRACE */
> +        {
> +            struct {
> +                unsigned dom:16,vcpu:16;
> +                unsigned cur_budget_lo, cur_budget_hi;
> +            } d;
> +            d.dom = svc->vcpu->domain->domain_id;
> +            d.vcpu = svc->vcpu->vcpu_id;
> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
> +                      sizeof(d),
> +                      (unsigned char *) &d);
> +        }
> +
> +        return;
> +    }
> +}
> +
> +static inline void
> +__runq_remove(struct rt_vcpu *svc)
> +{
> +    if ( __vcpu_on_runq(svc) )
> +        list_del_init(&svc->runq_elem);
> +}
> +
> +/*
> + * Insert svc into the RunQ according to EDF: vcpus with smaller deadlines
> + * go first.
> + */
> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    spinlock_t *schedule_lock;
> +
> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
> +    ASSERT( spin_is_locked(schedule_lock) );
> +
> +    ASSERT( !__vcpu_on_runq(svc) );
> +
> +    /* svc still has budget */
> +    if ( svc->cur_budget > 0 )
> +    {
> +        list_for_each(iter, runq)
> +        {
> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +         }
> +        list_add_tail(&svc->runq_elem, iter);
> +     }
> +    else
> +    {
> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
> +    }

OK, this thing with the "flag vcpu" seems a bit weird.  Why not just 
have two queues -- a runq and a depletedq.  You don't need to have 
another function; you just add it to depleted_runq rather than runq in 
__runq_insert().  Then you don't have to have this "cur_budget==0" 
stuff.  The only extra code you'd have is (I think) in __repl_update().

> +}
> +
> +/*
> + * Init/Free related code
> + */
> +static int
> +rt_init(struct scheduler *ops)
> +{
> +    struct rt_private *prv = xzalloc(struct rt_private);
> +
> +    printk("Initializing RT scheduler\n" \
> +           " WARNING: This is experimental software in development.\n" \
> +           " Use at your own risk.\n");
> +
> +    if ( prv == NULL )
> +        return -ENOMEM;
> +
> +    spin_lock_init(&prv->lock);
> +    INIT_LIST_HEAD(&prv->sdom);
> +    INIT_LIST_HEAD(&prv->runq);
> +
> +    prv->flag_vcpu = xzalloc(struct rt_vcpu);
> +    prv->flag_vcpu->cur_budget = 0;
> +    prv->flag_vcpu->sdom = NULL; /* distinguish this vcpu with others */
> +    list_add(&prv->flag_vcpu->runq_elem, &prv->runq);
> +
> +    cpumask_clear(&prv->cpus);
> +    cpumask_clear(&prv->tickled);
> +
> +    ops->sched_data = prv;
> +
> +    printtime();
> +    printk("\n");
> +
> +    return 0;
> +}
> +
> +static void
> +rt_deinit(const struct scheduler *ops)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("\n");
> +    xfree(prv->flag_vcpu);
> +    xfree(prv);
> +}
> +
> +/*
> + * Point per_cpu spinlock to the global system lock;
> + * All cpus share the same global system lock
> + */
> +static void *
> +rt_alloc_pdata(const struct scheduler *ops, int cpu)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    cpumask_set_cpu(cpu, &prv->cpus);
> +
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +
> +    printtime();
> +    printk("%s total cpus: %d", __func__, cpumask_weight(&prv->cpus));
> +    /* a non-NULL return value indicates success to schedule.c */
> +    return (void *)1;
> +}
> +
> +static void
> +rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    cpumask_clear_cpu(cpu, &prv->cpus);
> +}
> +
> +static void *
> +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    sdom = xzalloc(struct rt_dom);
> +    if ( sdom == NULL )
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&sdom->vcpu);
> +    INIT_LIST_HEAD(&sdom->sdom_elem);
> +    sdom->dom = dom;
> +
> +    /* spinlock here to insert the dom */
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +
> +    return sdom;
> +}
> +
> +static void
> +rt_free_domdata(const struct scheduler *ops, void *data)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom = data;
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_del_init(&sdom->sdom_elem);
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +    xfree(data);
> +}
> +
> +static int
> +rt_dom_init(const struct scheduler *ops, struct domain *dom)
> +{
> +    struct rt_dom *sdom;
> +
> +    /* IDLE Domain does not link on rt_private */
> +    if ( is_idle_domain(dom) )
> +        return 0;
> +
> +    sdom = rt_alloc_domdata(ops, dom);
> +    if ( sdom == NULL )
> +    {
> +        printk("%s, failed\n", __func__);
> +        return -ENOMEM;
> +    }
> +    dom->sched_priv = sdom;
> +
> +    return 0;
> +}
> +
> +static void
> +rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
> +{
> +    rt_free_domdata(ops, RT_DOM(dom));
> +}
> +
> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL )
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;
> +
> +    svc->period = RT_DS_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) )
> +        svc->budget = RT_DS_DEFAULT_BUDGET;
> +
> +    rt_update_helper(now, svc);
> +
> +    /* Debug only: dump new vcpu's info */
> +    rt_dump_vcpu(ops, svc);

Having these rt_dump_vcpu() things all over the place is a non-starter.  
You're going to have to take all these out except the one in rt_dump().

> +
> +    return svc;
> +}
> +
> +static void
> +rt_free_vdata(const struct scheduler *ops, void *priv)
> +{
> +    struct rt_vcpu *svc = priv;
> +
> +    /* Debug only: dump freed vcpu's info */
> +    rt_dump_vcpu(ops, svc);
> +    xfree(svc);
> +}
> +
> +/*
> + * This function is called in sched_move_domain() in schedule.c
> + * when moving a domain to a new cpupool.
> + * It inserts the vcpus of the moving domain into the scheduler's RunQ in
> + * the destination cpupool, and inserts each rt_vcpu svc into the
> + * scheduler-specific vcpu list of the dom
> + */
> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    rt_dump_vcpu(ops, svc);
> +
> +    /* do not add the idle vcpu to the dom's vcpu list */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
> +    if ( !__vcpu_on_runq(svc) && vcpu_runnable(vc) && !vc->is_running )
> +        __runq_insert(ops, svc);
> +
> +    /* add rt_vcpu svc to scheduler-specific vcpu list of the dom */
> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);
> +}
> +
> +/*
> + * Remove rt_vcpu svc from the old scheduler in source cpupool; and
> + * Remove rt_vcpu svc from scheduler-specific vcpu list of the dom
> + */
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;
> +
> +    rt_dump_vcpu(ops, svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( __vcpu_on_runq(svc) )
> +        __runq_remove(svc);
> +
> +    if ( !is_idle_vcpu(vc) )
> +        list_del_init(&svc->sdom_elem);
> +}
> +
> +/*
> + * Pick a valid CPU for the vcpu vc
> + * A valid CPU of a vcpu is the intersection of the vcpu's affinity
> + * and available cpus
> + */
> +static int
> +rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    cpumask_t cpus;
> +    cpumask_t *online;
> +    int cpu;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, &prv->cpus, online);
> +    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
> +
> +    cpu = cpumask_test_cpu(vc->processor, &cpus)
> +            ? vc->processor
> +            : cpumask_cycle(vc->processor, &cpus);
> +    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +
> +    return cpu;
> +}
> +
> +/*
> + * Burn budget in nanosecond granularity
> + */
> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
> +{
> +    s_time_t delta;
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) )
> +        return;
> +
> +    rt_update_helper(now, svc);
> +
> +    /* do not burn budget when the vcpu misses its deadline */
> +    if ( now >= svc->cur_deadline )
> +        return;
> +
> +    /* burn at nanoseconds level */
> +    delta = now - svc->last_start;
> +    /*
> +     * delta < 0 only happens in nested virtualization;
> +     * TODO: how should we handle delta < 0 in a better way?

I think what I did in credit2 was just

if(delta < 0) delta = 0;

What you're doing here basically takes away an entire budget when the 
time goes backwards for whatever reason.  Much better, it seems to me, 
to just give the vcpu some "free" time and deal with it. :-)

> +     */
> +    if ( delta < 0 )
> +    {
> +        printk("%s, ATTENTION: now is behind last_start! delta = %ld",
> +                __func__, delta);
> +        rt_dump_vcpu(ops, svc);
> +        svc->last_start = now;
> +        svc->cur_budget = 0;
> +        return;
> +    }
> +
> +    if ( svc->cur_budget == 0 )
> +        return;
> +
> +    svc->cur_budget -= delta;
> +    if ( svc->cur_budget < 0 )
> +        svc->cur_budget = 0;
> +
> +    /* TRACE */
> +    {
> +        struct {
> +            unsigned dom:16, vcpu:16;
> +            unsigned cur_budget_lo;
> +            unsigned cur_budget_hi;
> +            int delta;
> +        } d;
> +        d.dom = svc->vcpu->domain->domain_id;
> +        d.vcpu = svc->vcpu->vcpu_id;
> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +        d.delta = delta;
> +        trace_var(TRC_RT_BUDGET_BURN, 1,
> +                  sizeof(d),
> +                  (unsigned char *) &d);
> +    }
> +}
> +
> +/*
> + * RunQ is sorted. Pick the first one within cpumask. If none, return NULL
> + * lock is grabbed before calling this function
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq)
> +    {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* flag vcpu */
> +        if(iter_svc->sdom == NULL)
> +            break;
> +
> +        /* mask cpu_hard_affinity & cpupool & priv->cpus */
> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
> +        ASSERT( iter_svc->cur_budget > 0 );
> +
> +        svc = iter_svc;
> +        break;
> +    }
> +
> +    /* TRACE */
> +    {
> +        if( svc != NULL )
> +        {
> +            struct {
> +                unsigned dom:16, vcpu:16;
> +                unsigned cur_deadline_lo, cur_deadline_hi;
> +                unsigned cur_budget_lo, cur_budget_hi;
> +            } d;
> +            d.dom = svc->vcpu->domain->domain_id;
> +            d.vcpu = svc->vcpu->vcpu_id;
> +            d.cur_deadline_lo = (unsigned) svc->cur_deadline;
> +            d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +            trace_var(TRC_RT_RUNQ_PICK, 1,
> +                      sizeof(d),
> +                      (unsigned char *) &d);
> +        }
> +        else
> +            trace_var(TRC_RT_RUNQ_PICK, 1, 0, NULL);
> +    }
> +
> +    return svc;
> +}
> +
> +/*
> + * Update vcpus' budgets and keep the runq sorted by inserting each modified vcpu back into the runq
> + * lock is grabbed before calling this function
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    list_for_each_safe(iter, tmp, runq)
> +    {
> +        svc = __runq_elem(iter);
> +
> +        /* not update flag_vcpu's budget */
> +        if(svc->sdom == NULL)
> +            continue;
> +
> +        rt_update_helper(now, svc);
> +        /* reinsert the vcpu if its deadline is updated */
> +        if ( now >= 0 )

Uum, when is this ever not going to be >= 0?  The comment here seems 
completely inaccurate...

Also, it seems like you could make this a bit more efficient by pulling 
the check into this loop itself, rather than putting it in the helper 
function.  Since the queue is sorted by deadline, you could stop 
processing once you reach one for which now < cur_deadline, since you 
know all subsequent ones will be even later than this one.

Of course, that wouldn't take care of the depleted ones, but if those 
were already on a separate queue, you'd be OK. :-)

Right, past time for me to go home... I've given a quick scan over the 
other things and nothing jumped out at me, but I'll come back to it 
again tomorrow and see how we fare.

Overall, the code was pretty clean, and easy for me to read -- very much 
like credit1 and credit2, so thanks. :-)

  -George


* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-08 18:44   ` George Dunlap
@ 2014-09-09  9:42     ` Dario Faggioli
  2014-09-09 11:31       ` George Dunlap
  2014-09-09 12:25       ` Meng Xu
  2014-09-09 12:46     ` Meng Xu
  1 sibling, 2 replies; 31+ messages in thread
From: Dario Faggioli @ 2014-09-09  9:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: ian.campbell, xisisu, stefano.stabellini, lu, ian.jackson,
	xen-devel, ptxlinh, xumengpanda, Meng Xu, JBeulich, chaowang,
	lichong659, dgolomb



On Mon, 2014-09-08 at 19:44 +0100, George Dunlap wrote:
> On 09/07/2014 08:40 PM, Meng Xu wrote:

> > +/*
> > + * update deadline and budget when deadline is in the past,
> > + * it need to be updated to the deadline of the current period
> > + */
> > +static void
> > +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
> > +{
> > +    s_time_t diff = now - svc->cur_deadline;
> > +
> > +    if ( diff >= 0 )
> > +    {
> > +        /* now can be later for several periods */
> > +        long count = ( diff/svc->period ) + 1;
> > +        svc->cur_deadline += count * svc->period;
> > +        svc->cur_budget = svc->budget;
> 
> In the common case, don't we expect diff/svc->period to be a small 
> number, like 0 or 1?  
>
In general, yes. The only exception is when cur_deadline is set for the
first time. In that case, now can be arbitrary large and cur_deadline
will always be 0, so quite a few iterations may be required, possibly
taking longer than the div and the mult.

That is not a hot path anyway, so either approach would be fine in that
case. For all the other occurrences, the while{} approach is an absolute
win-win, IMO.

> If so, since division and multiplication are so 
> expensive, it might make more sense to make this a while() loop:
> 
>   while (now - svc_cur_deadline > 0 )
>   {
>    svc->cur_deadline += svc->period;
>    count++;
>   }
>   if ( count ) {
>    svc->cur_budget = svc->budget;
>    [tracing code]
>   }
> 
> And similarly for the other 64-bit division Dario was asking about below?
> 
Hehe, this is, I think, the third or fourth time I've said I'd like this to
be turned into a while! :-D

If it were me doing this, I'd go for something like this:

  static void
  rt_update_helper(s_time_t now, struct rt_vcpu *svc)
  {
      if ( svc->cur_deadline > now )
          return;

      do
          svc->cur_deadline += svc->period;
      while ( svc->cur_deadline <= now );
      svc->cur_budget = svc->budget;

      [tracing]
  }

Or even just the do {} while (and below), and have the check at the call
sites (as George is also saying below).

> I probably wouldn't  make this a precondition of going in.
> 
No, I'm not strict about this either, we can do it later. I don't think
it's a big or too disruptive change, though. :-)

> > +
> > +        /* TRACE */
> > +        {
> > +            struct {
> > +                unsigned dom:16,vcpu:16;
> > +                unsigned cur_budget_lo, cur_budget_hi;
> > +            } d;
> > +            d.dom = svc->vcpu->domain->domain_id;
> > +            d.vcpu = svc->vcpu->vcpu_id;
> > +            d.cur_budget_lo = (unsigned) svc->cur_budget;
> > +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> > +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
> > +                      sizeof(d),
> > +                      (unsigned char *) &d);
> > +        }
> > +
> > +        return;
> > +    }
> > +}
> > +
> > +static inline void
> > +__runq_remove(struct rt_vcpu *svc)
> > +{
> > +    if ( __vcpu_on_runq(svc) )
> > +        list_del_init(&svc->runq_elem);
> > +}
> > +
> > +/*
> > + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
> > + * goes first.
> > + */
> > +static void
> > +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> > +{
> > +    struct rt_private *prv = RT_PRIV(ops);
> > +    struct list_head *runq = RUNQ(ops);
>
Oh, BTW, George, what do you think about these? The case, I mean. Since
now they're static inlines, I've been telling Meng to turn the function
names lower case.

This is, of course, a minor thing, but since we're saying they are not
major issues... :-)

> Overall, the code was pretty clean, and easy for me to read -- very much 
> like credit1 and credit2, so thanks. :-)
> 
Yep, indeed!

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-09  9:42     ` Dario Faggioli
@ 2014-09-09 11:31       ` George Dunlap
  2014-09-09 12:52         ` Meng Xu
  2014-09-09 12:25       ` Meng Xu
  1 sibling, 1 reply; 31+ messages in thread
From: George Dunlap @ 2014-09-09 11:31 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Chenyang Lu,
	Ian Jackson, xen-devel, Linh Thi Xuan Phan, Meng Xu, Meng Xu,
	Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

On Tue, Sep 9, 2014 at 10:42 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Mon, 2014-09-08 at 19:44 +0100, George Dunlap wrote:
>> On 09/07/2014 08:40 PM, Meng Xu wrote:
>
>> > +/*
>> > + * update deadline and budget when deadline is in the past,
>> > + * it need to be updated to the deadline of the current period
>> > + */
>> > +static void
>> > +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>> > +{
>> > +    s_time_t diff = now - svc->cur_deadline;
>> > +
>> > +    if ( diff >= 0 )
>> > +    {
>> > +        /* now can be later for several periods */
>> > +        long count = ( diff/svc->period ) + 1;
>> > +        svc->cur_deadline += count * svc->period;
>> > +        svc->cur_budget = svc->budget;
>>
>> In the common case, don't we expect diff/svc->period to be a small
>> number, like 0 or 1?
>>
> In general, yes. The only exception is when cur_deadline is set for the
> first time. In that case, now can be arbitrarily large and cur_deadline
> will always be 0, so quite a few iterations may be required, possibly
> taking longer than the div and the mult.

Right, well we should be able to special-case zero.  Is there any
reason, if cur_deadline == 0, not to just set cur_deadline=now +
svc->period?  I can see a reason why after skipping several periods
you'd want the future periods "lined up with" previous periods.  But
is there a need to have all the periods lined up from the beginning of
time?

>> And similarly for the other 64-bit division Dario was asking about below?
>>
> Hehe, this is, I think, the third or fourth time I've said I'd like this to
> be turned into a while! :-D

Well, if you've asked for it several times, we should probably make it
a precondition of going in then.

> If it were me doing this, I'd go for something like this:
>
>   static void
>   rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>   {
>       if ( svc->cur_deadline > now )
>           return;
>
>       do
>           svc->cur_deadline += svc->period;
>       while ( svc->cur_deadline <= now );
>       svc->cur_budget = svc->budget;
>
>       [tracing]
>   }

Yes, that looks even cleaner. :-)

>> > +{
>> > +    struct rt_private *prv = RT_PRIV(ops);
>> > +    struct list_head *runq = RUNQ(ops);
>>
> Oh, BTW, George, what do you think about these? The case, I mean. Since
> now they're static inlines, I've been telling Meng to turn the function
> names lower case.
>
> This is, of course, a minor thing, but since we're saying they are not
> major issues... :-)

Yes, static inlines need to be lower case.

 -George

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-09  9:42     ` Dario Faggioli
  2014-09-09 11:31       ` George Dunlap
@ 2014-09-09 12:25       ` Meng Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-09 12:25 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Chenyang Lu, Ian Jackson, xen-devel, Linh Thi Xuan Phan, Meng Xu,
	Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

Hi George and Dario,

2014-09-09 5:42 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> On Mon, 2014-09-08 at 19:44 +0100, George Dunlap wrote:
>> On 09/07/2014 08:40 PM, Meng Xu wrote:
>
>> > +/*
>> > + * update deadline and budget when deadline is in the past,
>> > + * it need to be updated to the deadline of the current period
>> > + */
>> > +static void
>> > +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>> > +{
>> > +    s_time_t diff = now - svc->cur_deadline;
>> > +
>> > +    if ( diff >= 0 )
>> > +    {
>> > +        /* now can be later for several periods */
>> > +        long count = ( diff/svc->period ) + 1;
>> > +        svc->cur_deadline += count * svc->period;
>> > +        svc->cur_budget = svc->budget;
>>
>> In the common case, don't we expect diff/svc->period to be a small
>> number, like 0 or 1?
>>
> In general, yes. The only exception is when cur_deadline is set for the
> first time. In that case, now can be arbitrarily large and cur_deadline
> will always be 0, so quite a few iterations may be required, possibly
> taking longer than the div and the mult.
>
> That is not a hot path anyway, so either approach would be fine in that
> case. For all the other occurrences, the while{} approach is an absolute
> win-win, IMO.
>
>> If so, since division and multiplication are so
>> expensive, it might make more sense to make this a while() loop:
>>
>>   while ( now - svc->cur_deadline > 0 )
>>   {
>>    svc->cur_deadline += svc->period;
>>    count++;
>>   }
>>   if ( count ) {
>>    svc->cur_budget = svc->budget;
>>    [tracing code]
>>   }
>>
>> And similarly for the other 64-bit division Dario was asking about below?
>>
> Hehe, this is, I think, the third or fourth time I've said I'd like this to
> be turned into a while! :-D
>
> If it were me doing this, I'd go for something like this:
>
>   static void
>   rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>   {
>       if ( svc->cur_deadline > now )
>           return;
>
>       do
>           svc->cur_deadline += svc->period;
>       while ( svc->cur_deadline <= now );
>       svc->cur_budget = svc->budget;
>
>       [tracing]
>   }
>
> Or even just the do {} while (and below), and have the check at the call
> sites (as George is also saying below).

I see the point and will change them in the next version. Thank you
very much! :-)

>
>> I probably wouldn't  make this a precondition of going in.
>>
> No, I'm not strict about this either, we can do it later. I don't think
> it's a big or a too disruptive change, though. :-)
>
>> > +
>> > +        /* TRACE */
>> > +        {
>> > +            struct {
>> > +                unsigned dom:16,vcpu:16;
>> > +                unsigned cur_budget_lo, cur_budget_hi;
>> > +            } d;
>> > +            d.dom = svc->vcpu->domain->domain_id;
>> > +            d.vcpu = svc->vcpu->vcpu_id;
>> > +            d.cur_budget_lo = (unsigned) svc->cur_budget;
>> > +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> > +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
>> > +                      sizeof(d),
>> > +                      (unsigned char *) &d);
>> > +        }
>> > +
>> > +        return;
>> > +    }
>> > +}
>> > +
>> > +static inline void
>> > +__runq_remove(struct rt_vcpu *svc)
>> > +{
>> > +    if ( __vcpu_on_runq(svc) )
>> > +        list_del_init(&svc->runq_elem);
>> > +}
>> > +
>> > +/*
>> > + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
>> > + * goes first.
>> > + */
>> > +static void
>> > +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>> > +{
>> > +    struct rt_private *prv = RT_PRIV(ops);
>> > +    struct list_head *runq = RUNQ(ops);
>>
> Oh, BTW, George, what do you think about these? The case, I mean. Since
> now they're static inlines, I've been telling Meng to turn the function
> names lower case.
>
> This is, of course, a minor thing, but since we're saying they are not
> major issues... :-)
>
>> Overall, the code was pretty clean, and easy for me to read -- very much
>> like credit1 and credit2, so thanks. :-)
>>
> Yep, indeed!

Yes, it is. :-)


Thank you very much for your helpful comments and advice! I will
incorporate them in the next version.

Best,

Meng
-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-08 18:44   ` George Dunlap
  2014-09-09  9:42     ` Dario Faggioli
@ 2014-09-09 12:46     ` Meng Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-09 12:46 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Chenyang Lu,
	Dario Faggioli, Ian Jackson, xen-devel, Linh Thi Xuan Phan,
	Meng Xu, Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

2014-09-08 14:44 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> On 09/07/2014 08:40 PM, Meng Xu wrote:
>>
>> This scheduler follows the Preemptive Global Earliest Deadline First
>> (EDF) theory in real-time field.
>> At any scheduling point, the VCPU with earlier deadline has higher
>> priority. The scheduler always picks the highest priority VCPU to run on a
>> feasible PCPU.
>> A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> idle or has a lower-priority VCPU running on it.)
>>
>> Each VCPU has a dedicated period and budget.
>> The deadline of a VCPU is at the end of each of its periods;
>> A VCPU has its budget replenished at the beginning of each of its periods;
>> While scheduled, a VCPU burns its budget.
>> The VCPU needs to finish its budget before its deadline in each period;
>> The VCPU discards its unused budget at the end of each of its periods.
>> If a VCPU runs out of budget in a period, it has to wait until next
>> period.
>>
>> Each VCPU is implemented as a deferable server.
>> When a VCPU has a task running on it, its budget is continuously burned;
>> When a VCPU has no task but with budget left, its budget is preserved.
>>
>> Queue scheme: A global runqueue for each CPU pool.
>> The runqueue holds all runnable VCPUs.
>> VCPUs in the runqueue are divided into two parts: with and without budget.
>> At the first part, VCPUs are sorted based on EDF priority scheme.
>>
>> Note: cpumask and cpupool is supported.
>>
>> This is an experimental scheduler.
>>
>> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
>> Signed-off-by: Sisu Xi <xisisu@gmail.com>
>> ---
>>   xen/common/Makefile         |    1 +
>>   xen/common/sched_rt.c       | 1146
>> +++++++++++++++++++++++++++++++++++++++++++
>>   xen/common/schedule.c       |    1 +
>>   xen/include/public/domctl.h |    6 +
>>   xen/include/public/trace.h  |    1 +
>>   xen/include/xen/sched-if.h  |    1 +
>>   6 files changed, 1156 insertions(+)
>>   create mode 100644 xen/common/sched_rt.c
>>
>> diff --git a/xen/common/Makefile b/xen/common/Makefile
>> index 3683ae3..5a23aa4 100644
>> --- a/xen/common/Makefile
>> +++ b/xen/common/Makefile
>> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>>   obj-y += sched_credit2.o
>>   obj-y += sched_sedf.o
>>   obj-y += sched_arinc653.o
>> +obj-y += sched_rt.o
>>   obj-y += schedule.o
>>   obj-y += shutdown.o
>>   obj-y += softirq.o
>> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
>> new file mode 100644
>> index 0000000..412f8b1
>> --- /dev/null
>> +++ b/xen/common/sched_rt.c
>> @@ -0,0 +1,1146 @@
>>
>> +/******************************************************************************
>> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
>> + * EDF scheduling is a real-time scheduling algorithm used in embedded
>> field.
>> + *
>> + * by Sisu Xi, 2013, Washington University in Saint Louis
>> + * and Meng Xu, 2014, University of Pennsylvania
>> + *
>> + * based on the code of credit Scheduler
>> + */
>> +
>> +#include <xen/config.h>
>> +#include <xen/init.h>
>> +#include <xen/lib.h>
>> +#include <xen/sched.h>
>> +#include <xen/domain.h>
>> +#include <xen/delay.h>
>> +#include <xen/event.h>
>> +#include <xen/time.h>
>> +#include <xen/perfc.h>
>> +#include <xen/sched-if.h>
>> +#include <xen/softirq.h>
>> +#include <asm/atomic.h>
>> +#include <xen/errno.h>
>> +#include <xen/trace.h>
>> +#include <xen/cpu.h>
>> +#include <xen/keyhandler.h>
>> +#include <xen/trace.h>
>> +#include <xen/guest_access.h>
>> +
>> +/*
>> + * TODO:
>> + *
>> + * Migration compensation and resist like credit2 to better use cache;
>> + * Lock Holder Problem, using yield?
>> + * Self switch problem: VCPUs of the same domain may preempt each other;
>> + */
>> +
>> +/*
>> + * Design:
>> + *
>> + * This scheduler follows the Preemptive Global Earliest Deadline First
>> (EDF)
>> + * theory in real-time field.
>> + * At any scheduling point, the VCPU with earlier deadline has higher
>> priority.
>> + * The scheduler always picks highest priority VCPU to run on a feasible
>> PCPU.
>> + * A PCPU is feasible if the VCPU can run on this PCPU and (the PCPU is
>> idle or
>> + * has a lower-priority VCPU running on it.)
>> + *
>> + * Each VCPU has a dedicated period and budget.
>> + * The deadline of a VCPU is at the end of each of its periods;
>> + * A VCPU has its budget replenished at the beginning of each of its
>> periods;
>> + * While scheduled, a VCPU burns its budget.
>> + * The VCPU needs to finish its budget before its deadline in each
>> period;
>> + * The VCPU discards its unused budget at the end of each of its periods.
>> + * If a VCPU runs out of budget in a period, it has to wait until next
>> period.
>> + *
>> + * Each VCPU is implemented as a deferable server.
>> + * When a VCPU has a task running on it, its budget is continuously
>> burned;
>> + * When a VCPU has no task but with budget left, its budget is preserved.
>> + *
>> + * Queue scheme: A global runqueue for each CPU pool.
>> + * The runqueue holds all runnable VCPUs.
>> + * VCPUs in the runqueue are divided into two parts:
>> + * with and without remaining budget.
>> + * At the first part, VCPUs are sorted based on EDF priority scheme.
>> + *
>> + * Note: cpumask and cpupool is supported.
>> + */
>> +
>> +/*
>> + * Locking:
>> + * A global system lock is used to protect the RunQ.
>> + * The global lock is referenced by schedule_data.schedule_lock
>> + * from all physical cpus.
>> + *
>> + * The lock is already grabbed when calling wake/sleep/schedule/
>> functions
>> + * in schedule.c
>> + *
>> + * The functions that involve the RunQ and need to grab locks are:
>> + *    vcpu_insert, vcpu_remove, context_saved, __runq_insert
>> + */
>> +
>> +
>> +/*
>> + * Default parameters:
>> + * Period and budget in default is 10 and 4 ms, respectively
>> + */
>> +#define RT_DS_DEFAULT_PERIOD     (MICROSECS(10000))
>> +#define RT_DS_DEFAULT_BUDGET     (MICROSECS(4000))
>> +
>> +/*
>> + * Flags
>> + */
>> +/*
>> + * RT_scheduled: Is this vcpu either running on, or context-switching
>> off,
>> + * a physical cpu?
>> + * + Accessed only with Runqueue lock held.
>> + * + Set when chosen as next in rt_schedule().
>> + * + Cleared after context switch has been saved in rt_context_saved()
>> + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we
>> should
>> + *   set RT_delayed_runq_add
>> + * + Checked to be false in runq_insert.
>> + */
>> +#define __RT_scheduled            1
>> +#define RT_scheduled (1<<__RT_scheduled)
>> +/*
>> + * RT_delayed_runq_add: Do we need to add this to the Runqueue once it's
>> + * done being context-switched out?
>> + * + Set when scheduling out in rt_schedule() if prev is runable
>> + * + Set in rt_vcpu_wake if it finds RT_scheduled set
>> + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
>> + *   clears the bit.
>> + */
>> +#define __RT_delayed_runq_add     2
>> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
>> +
>> +/*
>> + * Debug only. Used to printout debug information
>> + */
>> +#define printtime()\
>> +        ({s_time_t now = NOW(); \
>> +          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
>> +          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
>> +
>> +/*
>> + * rt tracing events ("only" 512 available!). Check
>> + * include/public/trace.h for more details.
>> + */
>> +#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
>> +#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
>> +#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
>> +#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
>> +#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
>> +#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
>> +
>> +/*
>> + * System-wide private data, including a global RunQueue
>> + * Global lock is referenced by schedule_data.schedule_lock from all
>> + * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
>> + */
>> +struct rt_private {
>> +    spinlock_t lock;           /* The global coarse grand lock */
>> +    struct list_head sdom;     /* list of availalbe domains, used for
>> dump */
>> +    struct list_head runq;     /* Ordered list of runnable VMs */
>> +    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
>> +    cpumask_t cpus;            /* cpumask_t of available physical cpus */
>> +    cpumask_t tickled;         /* cpus been tickled */
>> +};
>> +
>> +/*
>> + * Virtual CPU
>> + */
>> +struct rt_vcpu {
>> +    struct list_head runq_elem; /* On the runqueue list */
>> +    struct list_head sdom_elem; /* On the domain VCPU list */
>> +
>> +    /* Up-pointers */
>> +    struct rt_dom *sdom;
>> +    struct vcpu *vcpu;
>> +
>> +    /* VCPU parameters, in nanoseconds */
>> +    s_time_t period;
>> +    s_time_t budget;
>> +
>> +    /* VCPU current information, in nanoseconds */
>> +    s_time_t cur_budget;        /* current budget */
>> +    s_time_t last_start;        /* last start time */
>> +    s_time_t cur_deadline;      /* current deadline for EDF */
>> +
>> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
>> +};
>> +
>> +/*
>> + * Domain
>> + */
>> +struct rt_dom {
>> +    struct list_head vcpu;      /* link its VCPUs */
>> +    struct list_head sdom_elem; /* link list on rt_priv */
>> +    struct domain *dom;         /* pointer to upper domain */
>> +};
>> +
>> +/*
>> + * Useful inline functions
>> + */
>> +static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
>> +{
>> +    return ops->sched_data;
>> +}
>> +
>> +static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
>> +{
>> +    return vcpu->sched_priv;
>> +}
>> +
>> +static inline struct rt_dom *RT_DOM(const struct domain *dom)
>> +{
>> +    return dom->sched_priv;
>> +}
>> +
>> +static inline struct list_head *RUNQ(const struct scheduler *ops)
>> +{
>> +    return &RT_PRIV(ops)->runq;
>> +}
>> +
>> +/*
>> + * RunQueue helper functions
>> + */
>> +static int
>> +__vcpu_on_runq(const struct rt_vcpu *svc)
>> +{
>> +   return !list_empty(&svc->runq_elem);
>> +}
>> +
>> +static struct rt_vcpu *
>> +__runq_elem(struct list_head *elem)
>> +{
>> +    return list_entry(elem, struct rt_vcpu, runq_elem);
>> +}
>> +
>> +/*
>> + * Debug related code, dump vcpu/cpu information
>> + */
>> +static void
>> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    char cpustr[1024];
>> +    cpumask_t *cpupool_mask;
>> +
>> +    ASSERT(svc != NULL);
>> +    /* flag vcpu */
>> +    if( svc->sdom == NULL )
>> +        return;
>> +
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr),
>> svc->vcpu->cpu_hard_affinity);
>> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
>> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
>> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>> +            svc->vcpu->domain->domain_id,
>> +            svc->vcpu->vcpu_id,
>> +            svc->vcpu->processor,
>> +            svc->period,
>> +            svc->budget,
>> +            svc->cur_budget,
>> +            svc->cur_deadline,
>> +            svc->last_start,
>> +            __vcpu_on_runq(svc),
>> +            vcpu_runnable(svc->vcpu),
>> +            cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
>> +    printk("cpupool=%s ", cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
>> +    printk("prv->cpus=%s\n", cpustr);
>> +
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            unsigned dom:16,vcpu:16;
>> +            unsigned processor;
>> +            unsigned cur_budget_lo, cur_budget_hi;
>> +            unsigned cur_deadline_lo, cur_deadline_hi;
>> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
>> +        } d;
>> +        d.dom = svc->vcpu->domain->domain_id;
>> +        d.vcpu = svc->vcpu->vcpu_id;
>> +        d.processor = svc->vcpu->processor;
>> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
>> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
>> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
>> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
>> +        trace_var(TRC_RT_VCPU_DUMP, 1,
>> +                  sizeof(d),
>> +                  (unsigned char *)&d);
>> +    }
>> +}
>> +
>> +static void
>> +rt_dump_pcpu(const struct scheduler *ops, int cpu)
>> +{
>> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
>> +
>> +    printtime();
>> +    rt_dump_vcpu(ops, svc);
>> +}
>> +
>> +/*
>> + * should not need lock here. only showing stuff
>> + */
>
>
> This isn't true -- you're walking both the runqueue and the lists of domains
> and vcpus, each of which may change under your feet.

I see. So even when I only read (and never write) the runqueue, I
still need to hold the lock. I can add the lock in these dumps.

>
>> +
>> +        /* TRACE */
>> +        {
>> +            struct {
>> +                unsigned dom:16,vcpu:16;
>> +                unsigned cur_budget_lo, cur_budget_hi;
>> +            } d;
>> +            d.dom = svc->vcpu->domain->domain_id;
>> +            d.vcpu = svc->vcpu->vcpu_id;
>> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
>> +                      sizeof(d),
>> +                      (unsigned char *) &d);
>> +        }
>> +
>> +        return;
>> +    }
>> +}
>> +
>> +static inline void
>> +__runq_remove(struct rt_vcpu *svc)
>> +{
>> +    if ( __vcpu_on_runq(svc) )
>> +        list_del_init(&svc->runq_elem);
>> +}
>> +
>> +/*
>> + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
>> + * goes first.
>> + */
>> +static void
>> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    spinlock_t *schedule_lock;
>> +
>> +    schedule_lock = per_cpu(schedule_data,
>> svc->vcpu->processor).schedule_lock;
>> +    ASSERT( spin_is_locked(schedule_lock) );
>> +
>> +    ASSERT( !__vcpu_on_runq(svc) );
>> +
>> +    /* svc still has budget */
>> +    if ( svc->cur_budget > 0 )
>> +    {
>> +        list_for_each(iter, runq)
>> +        {
>> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
>> +            if ( iter_svc->cur_budget == 0 ||
>> +                 svc->cur_deadline <= iter_svc->cur_deadline )
>> +                    break;
>> +         }
>> +        list_add_tail(&svc->runq_elem, iter);
>> +     }
>> +    else
>> +    {
>> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
>> +    }
>
>
> OK, this thing with the "flag vcpu" seems a bit weird.  Why not just have
> two queues -- a runq and a depletedq.  You don't need to have another
> function; you just add it to depleted_runq rather than runq in
> __runq_insert().  Then you don't have to have this "cur_budget==0" stuff.
> The only extra code you'd have is (I think) in __repl_update().

I may need to add some other code, like a static inline function
DEPLETEDQ() to get the depletedq from struct rt_private, and the
DepletedQ's helper functions, like __vcpu_on_depletedq, etc. But
this code is not big, so yes, I will change it to two queues in the
next version.

>> +
>> +/*
>> + * Burn budget in nanosecond granularity
>> + */
>> +static void
>> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t
>> now)
>> +{
>> +    s_time_t delta;
>> +
>> +    /* don't burn budget for idle VCPU */
>> +    if ( is_idle_vcpu(svc->vcpu) )
>> +        return;
>> +
>> +    rt_update_helper(now, svc);
>> +
>> +    /* not burn budget when vcpu miss deadline */
>> +    if ( now >= svc->cur_deadline )
>> +        return;
>> +
>> +    /* burn at nanoseconds level */
>> +    delta = now - svc->last_start;
>> +    /*
>> +     * delta < 0 only happens in nested virtualization;
>> +     * TODO: how should we handle delta < 0 in a better way?
>
>
> I think what I did in credit2 was just
>
> if(delta < 0) delta = 0;
>
> What you're doing here basically takes away an entire budget when the time
> goes backwards for whatever reason.  Much better, it seems to me, to just
> give the vcpu some "free" time and deal with it. :-)

I can remove svc->cur_budget = 0 so that the vcpu's budget is not set
to 0. If this vcpu has some budget left in this period and has higher
priority, it should be able to run.
So I will remove svc->cur_budget = 0.

>> +
>> +/*
>> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to
>> runq
>> + * lock is grabbed before calling this function
>> + */
>> +static void
>> +__repl_update(const struct scheduler *ops, s_time_t now)
>> +{
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    struct list_head *tmp;
>> +    struct rt_vcpu *svc = NULL;
>> +
>> +    list_for_each_safe(iter, tmp, runq)
>> +    {
>> +        svc = __runq_elem(iter);
>> +
>> +        /* not update flag_vcpu's budget */
>> +        if(svc->sdom == NULL)
>> +            continue;
>> +
>> +        rt_update_helper(now, svc);
>> +        /* reinsert the vcpu if its deadline is updated */
>> +        if ( now >= 0 )
>
>
> Uum, when is this ever not going to be >= 0?  The comment here seems
> completely inaccurate...

My bad. This is incorrect. :-( It should be diff (which is
now-svc->cur_deadline) >= 0. Sorry. Will change in the next patch.

>
> Also, it seems like you could make this a bit more efficient by pulling the
> check into this loop itself, rather than putting it in the helper function.
> Since the queue is sorted by deadline, you could stop processing once you
> reach one for which now < cur_deadline, since you know all subsequent ones
> will be even later than this one.
>
> Of course, that wouldn't take care of the depleted ones, but if those were
> already on a separate queue, you'd be OK. :-)

Sure! Will do that.

>
> Right, past time for me to go home... I've given a quick scan over the other
> things and nothing jumped out at me, but I'll come back to it again tomorrow
> and see how we fare.

Thank you so much for your comments and time! I really appreciate it
and will tackle these comments in the next version asap.

Thanks!

Meng



-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-09 11:31       ` George Dunlap
@ 2014-09-09 12:52         ` Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-09 12:52 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Chenyang Lu,
	Dario Faggioli, Ian Jackson, xen-devel, Linh Thi Xuan Phan,
	Meng Xu, Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

2014-09-09 7:31 GMT-04:00 George Dunlap <George.Dunlap@eu.citrix.com>:
> On Tue, Sep 9, 2014 at 10:42 AM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
>> On Mon, 2014-09-08 at 19:44 +0100, George Dunlap wrote:
>>> On 09/07/2014 08:40 PM, Meng Xu wrote:
>>
>>> > +/*
>>> > + * update deadline and budget when deadline is in the past,
>>> > + * it need to be updated to the deadline of the current period
>>> > + */
>>> > +static void
>>> > +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>>> > +{
>>> > +    s_time_t diff = now - svc->cur_deadline;
>>> > +
>>> > +    if ( diff >= 0 )
>>> > +    {
>>> > +        /* now can be later for several periods */
>>> > +        long count = ( diff/svc->period ) + 1;
>>> > +        svc->cur_deadline += count * svc->period;
>>> > +        svc->cur_budget = svc->budget;
>>>
>>> In the common case, don't we expect diff/svc->period to be a small
>>> number, like 0 or 1?
>>>
>> In general, yes. The only exception is when cur_deadline is set for the
>> first time. In that case, now can be arbitrarily large and cur_deadline
>> will always be 0, so quite a few iterations may be required, possibly
>> taking longer than the div and the mult.
>
> Right, well we should be able to special-case zero.  Is there any
> reason, if cur_deadline == 0, not to just set cur_deadline=now +
> svc->period?  I can see a reason why after skipping several periods
> you'd want the future periods "lined up with" previous periods.  But
> is there a need to have all the periods lined up from the beginning of
> time?
>

Actually, there is no need to line up all vcpus from the beginning of
time. This just makes the scheduler more deterministic, since every time
we boot the system we know all vcpus are lined up with the beginning of
time. When a vcpu is created, its cur_deadline can be now +
svc->period.

I'm personally fine with either way. (I very slightly prefer the
lined-up way because it is more deterministic, but only slightly. :-))

>>> And similarly for the other 64-bit division Dario was asking about below?
>>>
>> Hehe, this is, I think, the third or fourth time I've said I'd like this to
>> be turned into a while! :-D
>
> Well, if you've asked for it several times, we should probably make it
> a precondition of going in then.

I will modify this for sure in the next version. I didn't realize this
had been stressed so many times. Sorry, Dario, for bothering you. :-(

>
>> If it were me doing this, I'd go for something like this:
>>
>>   static void
>>   rt_update_helper(s_time_t now, struct rt_vcpu *svc)
>>   {
>>       if ( svc->cur_deadline > now )
>>           return;
>>
>>       do
>>           svc->cur_deadline += svc->period;
>>       while ( svc->cur_deadline <= now );
>>       svc->cur_budget = svc->budget;
>>
>>       [tracing]
>>   }
>
> Yes, that looks even cleaner. :-)
>
>>> > +{
>>> > +    struct rt_private *prv = RT_PRIV(ops);
>>> > +    struct list_head *runq = RUNQ(ops);
>>>
>> Oh, BTW, George, what do you think about these? The case, I mean. Since
>> now they're  static inlines, I've been telling Meng to turn the function
>> names lower case.
>>
>> This is, of course, a minor thing, but since we're saying the are not
>> major issues... :-)
>
> Yes, static inlines need to be lower case.

Roger, will change them to lower case then.

Thanks,

Meng


-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: [PATCH v2 3/4] libxl: add rt scheduler
  2014-09-08 15:19   ` George Dunlap
@ 2014-09-09 12:59     ` Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-09 12:59 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Chenyang Lu,
	Dario Faggioli, Ian Jackson, xen-devel, Linh Thi Xuan Phan,
	Meng Xu, Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

2014-09-08 11:19 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> On 09/07/2014 08:41 PM, Meng Xu wrote:
>>
>> Add libxl functions to set/get domain's parameters for rt scheduler
>> Note: VCPU's information (period, budget) is in microsecond (us).
>>
>> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
>> Signed-off-by: Sisu Xi <xisisu@gmail.com>
>> ---
>>   tools/libxl/libxl.c         |   75
>> +++++++++++++++++++++++++++++++++++++++++++
>>   tools/libxl/libxl.h         |    1 +
>>   tools/libxl/libxl_types.idl |    2 ++
>>   3 files changed, 78 insertions(+)
>>
>> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
>> index 2ae5fca..6840c92 100644
>> --- a/tools/libxl/libxl.c
>> +++ b/tools/libxl/libxl.c
>> @@ -5155,6 +5155,75 @@ static int sched_sedf_domain_set(libxl__gc *gc,
>> uint32_t domid,
>>       return 0;
>>   }
>>   +static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
>> +                               libxl_domain_sched_params *scinfo)
>> +{
>> +    struct xen_domctl_sched_rt sdom;
>> +    int rc;
>> +
>> +    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
>> +    if (rc != 0) {
>> +        LOGE(ERROR, "getting domain sched rt");
>> +        return ERROR_FAIL;
>> +    }
>> +
>> +    libxl_domain_sched_params_init(scinfo);
>> +
>> +    scinfo->sched = LIBXL_SCHEDULER_RT_DS;
>> +    scinfo->period = sdom.period;
>> +    scinfo->budget = sdom.budget;
>> +
>> +    return 0;
>> +}
>> +
>> +#define SCHED_RT_DS_VCPU_PERIOD_UINT_MAX    4294967295U /* 2^32 - 1 us */
>> +#define SCHED_RT_DS_VCPU_BUDGET_UINT_MAX
>> SCHED_RT_DS_VCPU_PERIOD_UINT_MAX
>
>
> I think what Dario was looking for was this:
>
> #define SCHED_RT_DS_VCPU_PERIOD_MAX UINT_MAX
>
> I.e., use the already-defined #defines with meaningful names (line
> UINT_MAX), and avoid open-coding (i.e., typing out a "magic" number, like
> 429....U).

Ah, I see. I misunderstood. :-( Thank you very much, George, for the
clarification! :-)

>
>> +
>> +static int sched_rt_domain_set(libxl__gc *gc, uint32_t domid,
>> +                               const libxl_domain_sched_params *scinfo)
>> +{
>> +    struct xen_domctl_sched_rt sdom;
>> +    int rc;
>> +
>> +    rc = xc_sched_rt_domain_get(CTX->xch, domid, &sdom);
>
>
> You need to check the return value here and bail out on an error.

Right, will do.

>
>> +
>> +    if (scinfo->period != LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT) {
>> +        if (scinfo->period < 1 ||
>> +            scinfo->period > SCHED_RT_DS_VCPU_PERIOD_UINT_MAX) {
>
>
> ...but this isn't right anyway, right?  scinfo->period is a signed integer.
> You shouldn't be comparing it to an unsigned int; and this can never be
> false anyway, because even if it's automatically cast to be unsigned, the
> type isn't big enough to be bigger than UINT_MAX anyway.
>
> If period is allowed to be anything up to INT_MAX, then there's no need to
> check the upper bound.  Checking to make sure it's >= 1 should be
> sufficient.  Then you can just get rid of the #defines above.

I see and will change it as you suggested.

>
>> +            LOG(ERROR, "VCPU period is not set or out of range, "
>> +                       "valid values are within range from 0 to %u",
>> +                       SCHED_RT_DS_VCPU_PERIOD_UINT_MAX);
>> +            return ERROR_INVAL;
>> +        }
>> +        sdom.period = scinfo->period;
>> +    }
>> +
>> +    if (scinfo->budget != LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT) {
>> +        if (scinfo->budget < 1 ||
>> +            scinfo->budget > SCHED_RT_DS_VCPU_BUDGET_UINT_MAX) {
>
>
> Same here.

Will change, Thanks!

>
>
>> +            LOG(ERROR, "VCPU budget is not set or out of range, "
>> +                       "valid values are within range from 0 to %u",
>> +                       SCHED_RT_DS_VCPU_BUDGET_UINT_MAX);
>> +            return ERROR_INVAL;
>> +        }
>> +        sdom.budget = scinfo->budget;
>> +    }
>> +
>> +    if (sdom.budget > sdom.period) {
>> +        LOG(ERROR, "VCPU budget is larger than VCPU period, "
>> +                   "VCPU budget should be no larger than VCPU period");
>> +        return ERROR_INVAL;
>> +    }
>> +
>> +    rc = xc_sched_rt_domain_set(CTX->xch, domid, &sdom);
>> +    if (rc < 0) {
>> +        LOGE(ERROR, "setting domain sched rt");
>> +        return ERROR_FAIL;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
>>                                     const libxl_domain_sched_params
>> *scinfo)
>>   {
>> @@ -5178,6 +5247,9 @@ int libxl_domain_sched_params_set(libxl_ctx *ctx,
>> uint32_t domid,
>>       case LIBXL_SCHEDULER_ARINC653:
>>           ret=sched_arinc653_domain_set(gc, domid, scinfo);
>>           break;
>> +    case LIBXL_SCHEDULER_RT_DS:
>> +        ret=sched_rt_domain_set(gc, domid, scinfo);
>> +        break;
>>       default:
>>           LOG(ERROR, "Unknown scheduler");
>>           ret=ERROR_INVAL;
>> @@ -5208,6 +5280,9 @@ int libxl_domain_sched_params_get(libxl_ctx *ctx,
>> uint32_t domid,
>>       case LIBXL_SCHEDULER_CREDIT2:
>>           ret=sched_credit2_domain_get(gc, domid, scinfo);
>>           break;
>> +    case LIBXL_SCHEDULER_RT_DS:
>> +        ret=sched_rt_domain_get(gc, domid, scinfo);
>> +        break;
>>       default:
>>           LOG(ERROR, "Unknown scheduler");
>>           ret=ERROR_INVAL;
>> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
>> index 460207b..dbe736c 100644
>> --- a/tools/libxl/libxl.h
>> +++ b/tools/libxl/libxl.h
>> @@ -1280,6 +1280,7 @@ int libxl_sched_credit_params_set(libxl_ctx *ctx,
>> uint32_t poolid,
>>   #define LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT     -1
>>   #define LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT   -1
>>   #define LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT -1
>> +#define LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT     -1
>>     int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
>>                                     libxl_domain_sched_params *params);
>> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
>> index 931c9e9..72f24fe 100644
>> --- a/tools/libxl/libxl_types.idl
>> +++ b/tools/libxl/libxl_types.idl
>> @@ -153,6 +153,7 @@ libxl_scheduler = Enumeration("scheduler", [
>>       (5, "credit"),
>>       (6, "credit2"),
>>       (7, "arinc653"),
>> +    (8, "rt_ds"),
>
>
> rtds
>

Roger, will change every rt_ds to rtds then. :-P

Thanks,

Meng

-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: [PATCH v2 4/4] xl: introduce rt scheduler
  2014-09-08 16:06   ` George Dunlap
  2014-09-08 16:16     ` Dario Faggioli
@ 2014-09-09 13:14     ` Meng Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-09 13:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Chenyang Lu,
	Dario Faggioli, Ian Jackson, xen-devel, Linh Thi Xuan Phan,
	Meng Xu, Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

2014-09-08 12:06 GMT-04:00 George Dunlap <george.dunlap@eu.citrix.com>:
> On 09/07/2014 08:41 PM, Meng Xu wrote:
>>
>> Add xl command for rt scheduler
>> Note: VCPU's parameter (period, budget) is in microsecond (us).
>>
>> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
>> Signed-off-by: Sisu Xi <xisisu@gmail.com>
>> ---
>>   docs/man/xl.pod.1         |   34 +++++++++++++
>>   tools/libxl/xl.h          |    1 +
>>   tools/libxl/xl_cmdimpl.c  |  119
>> +++++++++++++++++++++++++++++++++++++++++++++
>>   tools/libxl/xl_cmdtable.c |    8 +++
>>   4 files changed, 162 insertions(+)
>>
>> diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
>> index 9d1c2a5..c2532cb 100644
>> --- a/docs/man/xl.pod.1
>> +++ b/docs/man/xl.pod.1
>> @@ -1035,6 +1035,40 @@ Restrict output to domains in the specified
>> cpupool.
>>     =back
>>   +=item B<sched-rt> [I<OPTIONS>]
>
>
> sched-rtds, I think.

OK. Then the command we provide will be "xl sched-rtds". I will modify them.

>
>>   int main_domid(int argc, char **argv);
>>   int main_domname(int argc, char **argv);
>>   int main_rename(int argc, char **argv);
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index e6b9615..92037b1 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -5212,6 +5212,47 @@ static int sched_sedf_domain_output(
>>       return 0;
>>   }
>>   +static int sched_rt_domain_output(
>> +    int domid)
>> +{
>> +    char *domname;
>> +    libxl_domain_sched_params scinfo;
>> +    int rc = 0;
>> +
>> +    if (domid < 0) {
>> +        printf("%-33s %4s %9s %9s\n", "Name", "ID", "Period", "Budget");
>> +        return 0;
>> +    }
>> +
>> +    libxl_domain_sched_params_init(&scinfo);
>> +    rc = sched_domain_get(LIBXL_SCHEDULER_RT_DS, domid, &scinfo);
>
>
> Hmm, the other callers of sched_domain_get() don't call
> libxl_domain_sched_params_init(); but reading through libxl.h looks like
> that's actually a mistake:
>
>  * ...the user must
>  * always call the "init" function before using a type, even if the
>  * variable is simply being passed by reference as an out parameter
>  * to a libxl function.
>
> Meng, would you be willing to put on your "to-do list" to send a follow-up
> patch to clean this up?
>

Sure! I'm happy to do that! Noted and will do after finishing the next
version of the rt scheduler stuff. :-)

> I think what should probably actually be done is that sched_domain_get()
> should call libxl_domain_sched_params_init() before calling
> libxl_domain_sched_params_get().  But I'm sure IanJ will have opinions on
> that.
>
>> +    if (rc)
>> +        goto out;
>> +
>> +    domname = libxl_domid_to_name(ctx, domid);
>> +    printf("%-33s %4d %9d %9d\n",
>> +        domname,
>> +        domid,
>> +        scinfo.period,
>> +        scinfo.budget);
>> +    free(domname);
>> +
>> +out:
>> +    libxl_domain_sched_params_dispose(&scinfo);
>> +    return rc;
>> +}
>> +
>> +static int sched_rt_pool_output(uint32_t poolid)
>> +{
>> +    char *poolname;
>> +
>> +    poolname = libxl_cpupoolid_to_name(ctx, poolid);
>> +    printf("Cpupool %s: sched=EDF\n", poolname);
>
>
> Should we change this to "RTDS"?

Maybe yes, if we want to distinguish RTDS from other RT schedulers
with different server mechanisms. (I will change it to RTDS if no one
objects.)

>
>> +
>> +    free(poolname);
>> +    return 0;
>> +}
>> +
>>   static int sched_default_pool_output(uint32_t poolid)
>>   {
>>       char *poolname;
>> @@ -5579,6 +5620,84 @@ int main_sched_sedf(int argc, char **argv)
>>       return 0;
>>   }
>>   +/*
>> + * <nothing>            : List all domain paramters and sched params
>> + * -d [domid]           : List domain params for domain
>> + * -d [domid] [params]  : Set domain params for domain
>> + */
>> +int main_sched_rt(int argc, char **argv)
>> +{
>> +    const char *dom = NULL;
>> +    const char *cpupool = NULL;
>> +    int period = 10, opt_p = 0; /* period is in microsecond */
>> +    int budget = 4, opt_b = 0; /* budget is in microsecond */
>
>
> We might as well make opt_p and opt_b  of type "bool".
>
> Why are you setting the values for period and budget here?  It looks like
> they're either never used (if either one or both are not set on the command
> line), or they're clobbered (when both are set).
>
> If gcc doesn't complain, just leave them uninitialized.  If it does
> complain, then just initialize them to 0 -- that will make sure that it
> returns an error if there ever *is* a path which doesn't actually set the
> value.

OK. Will leave them uninitialized.

>
>
>> +    int opt, rc;
>> +    static struct option opts[] = {
>> +        {"domain", 1, 0, 'd'},
>> +        {"period", 1, 0, 'p'},
>> +        {"budget", 1, 0, 'b'},
>> +        {"cpupool", 1, 0, 'c'},
>> +        COMMON_LONG_OPTS,
>> +        {0, 0, 0, 0}
>> +    };
>> +
>> +    SWITCH_FOREACH_OPT(opt, "d:p:b:c:h", opts, "sched-rt", 0) {
>> +    case 'd':
>> +        dom = optarg;
>> +        break;
>> +    case 'p':
>> +        period = strtol(optarg, NULL, 10);
>> +        opt_p = 1;
>> +        break;
>> +    case 'b':
>> +        budget = strtol(optarg, NULL, 10);
>> +        opt_b = 1;
>> +        break;
>> +    case 'c':
>> +        cpupool = optarg;
>> +        break;
>> +    }
>> +
>> +    if (cpupool && (dom || opt_p || opt_b)) {
>> +        fprintf(stderr, "Specifying a cpupool is not allowed with other
>> options.\n");
>> +        return 1;
>> +    }
>> +    if (!dom && (opt_p || opt_b)) {
>> +        fprintf(stderr, "Must specify a domain.\n");
>> +        return 1;
>> +    }
>> +    if ((opt_p || opt_b) && (opt_p + opt_b != 2)) {
>
>
> Maybe, "if (opt_p != opt_b)"?

This is better! :-)

>
>
>> +        fprintf(stderr, "Must specify period and budget\n");
>> +        return 1;
>> +    }
>> +
>> +    if (!dom) { /* list all domain's rt scheduler info */
>> +        return -sched_domain_output(LIBXL_SCHEDULER_RT_DS,
>> +                                    sched_rt_domain_output,
>> +                                    sched_rt_pool_output,
>> +                                    cpupool);
>> +    } else {
>> +        uint32_t domid = find_domain(dom);
>> +        if (!opt_p && !opt_b) { /* output rt scheduler info */
>> +            sched_rt_domain_output(-1);
>> +            return -sched_rt_domain_output(domid);
>> +        } else { /* set rt scheduler paramaters */
>> +            libxl_domain_sched_params scinfo;
>> +            libxl_domain_sched_params_init(&scinfo);
>> +            scinfo.sched = LIBXL_SCHEDULER_RT_DS;
>> +            scinfo.period = period;
>> +            scinfo.budget = budget;
>> +
>> +            rc = sched_domain_set(domid, &scinfo);
>> +            libxl_domain_sched_params_dispose(&scinfo);
>> +            if (rc)
>> +                return -rc;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   int main_domid(int argc, char **argv)
>>   {
>>       uint32_t domid;
>> diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
>> index 7b7fa92..0c0e06e 100644
>> --- a/tools/libxl/xl_cmdtable.c
>> +++ b/tools/libxl/xl_cmdtable.c
>> @@ -277,6 +277,14 @@ struct cmd_spec cmd_table[] = {
>>         "                               --period/--slice)\n"
>>         "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
>>       },
>> +    { "sched-rt",
>
>
> sched-rtds
>
> Right, starting to get close. :-)
>

Thank you so much for your helpful comments! :-)

Best,

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
  2014-09-08 14:32   ` George Dunlap
  2014-09-08 18:44   ` George Dunlap
@ 2014-09-09 16:57   ` Dario Faggioli
  2014-09-09 18:21     ` Meng Xu
  2 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2014-09-09 16:57 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap, lu,
	ian.jackson, xen-devel, ptxlinh, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb


On Sun, 2014-09-07 at 15:40 -0400, Meng Xu wrote:
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..412f8b1

> +/*
> + * Debug only. Used to printout debug information
> + */
> +#define printtime()\
> +        ({s_time_t now = NOW(); \
> +          printk("%u : %3ld.%3ldus : %-19s\n",smp_processor_id(),\
> +          now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
> +
You probably don't need this. As I said yesterday, you can keep it in an
out-of-series debug commit/patch.

> +/*
> + * rt tracing events ("only" 512 available!). Check
> + * include/public/trace.h for more details.
> + */
> +#define TRC_RT_TICKLE           TRC_SCHED_CLASS_EVT(RT, 1)
> +#define TRC_RT_RUNQ_PICK        TRC_SCHED_CLASS_EVT(RT, 2)
> +#define TRC_RT_BUDGET_BURN      TRC_SCHED_CLASS_EVT(RT, 3)
> +#define TRC_RT_BUDGET_REPLENISH TRC_SCHED_CLASS_EVT(RT, 4)
> +#define TRC_RT_SCHED_TASKLET    TRC_SCHED_CLASS_EVT(RT, 5)
> +#define TRC_RT_VCPU_DUMP        TRC_SCHED_CLASS_EVT(RT, 6)
>
Ditto about the uselessness of TRC_RT_VCPU_DUMP.

Also, as already said, RTDS here and everywhere else.

> +
> +/*
> + * Systme-wide private data, include a global RunQueue
> + * Global lock is referenced by schedule_data.schedule_lock from all 
> + * physical cpus. It can be grabbed via vcpu_schedule_lock_irq()
> + */
> +struct rt_private {
> +    spinlock_t lock;           /* The global coarse grand lock */
> +    struct list_head sdom;     /* list of availalbe domains, used for dump */
> +    struct list_head runq;     /* Ordered list of runnable VMs */
                                                     runnable vcpus ?

> +    struct rt_vcpu *flag_vcpu; /* position of the first depleted vcpu */
> +    cpumask_t cpus;            /* cpumask_t of available physical cpus */
> +    cpumask_t tickled;         /* cpus been tickled */
> +};

> +/*
> + * Debug related code, dump vcpu/cpu information
> + */
> +static void
> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +    char cpustr[1024];
> +    cpumask_t *cpupool_mask;
> +
> +    ASSERT(svc != NULL);
> +    /* flag vcpu */
> +    if( svc->sdom == NULL )
> +        return;
> +
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>
How does this come up in the console? Should we break it with a '\n'
somewhere? It looks rather long...

> +            svc->vcpu->domain->domain_id,
> +            svc->vcpu->vcpu_id,
> +            svc->vcpu->processor,
> +            svc->period,
> +            svc->budget,
> +            svc->cur_budget,
> +            svc->cur_deadline,
> +            svc->last_start,
> +            __vcpu_on_runq(svc),
> +            vcpu_runnable(svc->vcpu),
> +            cpustr);
> +    memset(cpustr, 0, sizeof(cpustr));
> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
> +    printk("cpupool=%s ", cpustr);
> +    memset(cpustr, 0, sizeof(cpustr));
> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
> +    printk("prv->cpus=%s\n", cpustr);
> +    
> +    /* TRACE */
> +    {
> +        struct {
> +            unsigned dom:16,vcpu:16;
> +            unsigned processor;
> +            unsigned cur_budget_lo, cur_budget_hi;
> +            unsigned cur_deadline_lo, cur_deadline_hi;
> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
> +        } d;
> +        d.dom = svc->vcpu->domain->domain_id;
> +        d.vcpu = svc->vcpu->vcpu_id;
> +        d.processor = svc->vcpu->processor;
> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
> +        trace_var(TRC_RT_VCPU_DUMP, 1,
> +                  sizeof(d),
> +                  (unsigned char *)&d);
> +    }
> +}
> +

> +/*
> + * update deadline and budget when deadline is in the past,
> + * it need to be updated to the deadline of the current period 
> + */
> +static void
> +rt_update_helper(s_time_t now, struct rt_vcpu *svc)
> +{
>
While you're reworking this function, I'd also consider a different name
like 'rt_update_deadline', or 'rt_update_bandwidth', or something else
(it's the _helper part I don't like).

> +    s_time_t diff = now - svc->cur_deadline;
> +
> +    if ( diff >= 0 ) 
> +    {
> +        /* now can be later for several periods */
> +        long count = ( diff/svc->period ) + 1;
> +        svc->cur_deadline += count * svc->period;
> +        svc->cur_budget = svc->budget;
> +
> +        /* TRACE */
> +        {
> +            struct {
> +                unsigned dom:16,vcpu:16;
> +                unsigned cur_budget_lo, cur_budget_hi;
> +            } d;
> +            d.dom = svc->vcpu->domain->domain_id;
> +            d.vcpu = svc->vcpu->vcpu_id;
> +            d.cur_budget_lo = (unsigned) svc->cur_budget;
> +            d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +            trace_var(TRC_RT_BUDGET_REPLENISH, 1,
> +                      sizeof(d),
> +                      (unsigned char *) &d);
> +        }
> +
> +        return;
> +    }
> +}
> +
> +static inline void
> +__runq_remove(struct rt_vcpu *svc)
> +{
> +    if ( __vcpu_on_runq(svc) )
> +        list_del_init(&svc->runq_elem);
> +}
> +
> +/*
> + * Insert svc in the RunQ according to EDF: vcpus with smaller deadlines
> + * goes first.
      go

And if it was me that wrote 'goes', apologies for that. :-D

> + */
> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    spinlock_t *schedule_lock;
> +    
>
This empty line above seems to be empty but, looking more carefully,
it actually contains 4 spaces, doesn't it?

If that's the case, avoid doing this, i.e., make sure that empty lines
are actually empty. :-D

Looking at each patch with `git show' should highlight occurrences of
this phenomenon, as well as of any trailing white space, by marking
them in red.

> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
> +    ASSERT( spin_is_locked(schedule_lock) );
> +    
As of now, the only lock around is prv->lock, isn't it? So this
per_cpu(xxx) is a complex way to get to prv->lock, or am I missing
something?

In credit, the pre-inited set of locks are actually used "as they are",
while in credit2, there is some remapping going on, but there is more
than one lock anyway. That's why you find things like the above in those
two schedulers. Here, you should not need anything like that, (as you do
everywhere else) just go ahead and use prv->lock.

Of course, that does not mean you don't need the lock remapping in
rt_alloc_pdata(). That code looks ok to me, just adapt this bit above,
as, like this, it makes things harder to understand.

Or am I overlooking something?

> +    ASSERT( !__vcpu_on_runq(svc) );
> +
> +    /* svc still has budget */
> +    if ( svc->cur_budget > 0 ) 
> +    {
> +        list_for_each(iter, runq) 
> +        {
> +            struct rt_vcpu * iter_svc = __runq_elem(iter);
> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +         }
> +        list_add_tail(&svc->runq_elem, iter);
> +     }
> +    else 
> +    {
> +        list_add(&svc->runq_elem, &prv->flag_vcpu->runq_elem);
> +    }
> +}
> +
I agree with George about the queue splitting.

> +static void
> +rt_deinit(const struct scheduler *ops)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("\n");
>
As said, when removing all the calls to rt_dump_vcpu, also remove both
the definition and all these calls to printtime(); they're of very little
value, IMO.

> +    xfree(prv->flag_vcpu);
> +    xfree(prv);
> +}

> +static void *
> +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    sdom = xzalloc(struct rt_dom);
> +    if ( sdom == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
>
Just `return NULL', the printk() is pretty useless. Failures like this
will be identified without the need for it.

> +    }
> +
> +    INIT_LIST_HEAD(&sdom->vcpu);
> +    INIT_LIST_HEAD(&sdom->sdom_elem);
> +    sdom->dom = dom;
> +
> +    /* spinlock here to insert the dom */
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +
> +    return sdom;
> +}

> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;
> +
> +    svc->period = RT_DS_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) )
> +        svc->budget = RT_DS_DEFAULT_BUDGET;
> +
> +    rt_update_helper(now, svc);
> +
And one more point in favour of pulling the check out of the helper. In
fact, in this case (independently of whether you want to keep the division,
because it's the first time we set the deadline, or use the while loop),
you don't need to check if the deadline is in the past... You already
know it is!! :-D

That would mean you could just call rt_update_helper() without further
checking, neither here nor inside the helper. Faster, but that does not
matter much in this case. Cleaner, and that _always_ matters. :-)

> +    /* Debug only: dump new vcpu's info */
> +    rt_dump_vcpu(ops, svc);
> +
> +    return svc;
> +}

> +/*
> + * Burn budget in nanosecond granularity
> + */
> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) 
> +{
>
burn_budget()? I mean, why the trailing 's'?

(yes, this is a very minor thing.)

> +    s_time_t delta;
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) 
> +        return;
> +
> +    rt_update_helper(now, svc);
> +
> +    /* not burn budget when vcpu miss deadline */
> +    if ( now >= svc->cur_deadline )
> +        return;
> +
How can this be true?

Unless I'm missing something, in rt_update_helper(), if the deadline is
behind now, you move it ahead of it (and replenish the budget). Here you
check again whether the deadline is behind now, which should not be
possible, as you just took care of that... Isn't it so?

Considering both mine and George's suggestion, if you rework the helper
and move the check out of it, then this one is fine (and you just call
the helper if the condition is verified). If you don't want to do that,
then I guess you can have the helper returning 0|1 depending on whether
or not the update happened, and use such value here, for deciding
whether to bail or not.

I think I'd prefer the former (pulling the check out of the helper).

> +    /* burn at nanoseconds level */
> +    delta = now - svc->last_start;
> +    /* 
> +     * delta < 0 only happens in nested virtualization;
> +     * TODO: how should we handle delta < 0 in a better way? 
> +     */
> +    if ( delta < 0 ) 
> +    {
> +        printk("%s, ATTENTION: now is behind last_start! delta = %ld",
> +                __func__, delta);
> +        rt_dump_vcpu(ops, svc);
> +        svc->last_start = now;
> +        svc->cur_budget = 0;
> +        return;
> +    }
> +
> +    if ( svc->cur_budget == 0 ) 
> +        return;
> +
> +    svc->cur_budget -= delta;
> +    if ( svc->cur_budget < 0 ) 
> +        svc->cur_budget = 0;
> +
> +    /* TRACE */
> +    {
> +        struct {
> +            unsigned dom:16, vcpu:16;
> +            unsigned cur_budget_lo;
> +            unsigned cur_budget_hi;
> +            int delta;
> +        } d;
> +        d.dom = svc->vcpu->domain->domain_id;
> +        d.vcpu = svc->vcpu->vcpu_id;
> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
> +        d.delta = delta;
> +        trace_var(TRC_RT_BUDGET_BURN, 1,
> +                  sizeof(d),
> +                  (unsigned char *) &d);
> +    }
> +}
> +
> +/* 
> + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
                                            cpumask_t *mask

would be better, I think.

> +/*
> + * Update vcpu's budget and sort runq by insert the modifed vcpu back to runq
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    list_for_each_safe(iter, tmp, runq) 
> +    {
> +        svc = __runq_elem(iter);
> +
> +        /* not update flag_vcpu's budget */
> +        if(svc->sdom == NULL)
> +            continue;
> +
> +        rt_update_helper(now, svc);
> +        /* reinsert the vcpu if its deadline is updated */
> +        if ( now >= 0 )
> +        {
>
This is wrong, and I saw you noticed this already.

> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret = { .migrated = 0 };
> +
> +    /* clear ticked bit now that we've been scheduled */
> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> +        cpumask_clear_cpu(cpu, &prv->tickled);
> +
Is the test important? cpumask operations may be quite expensive, and I
think always clearing is better than always testing and then, sometimes
(rather often, I think), clearing.

I'm open to other views on this, though. :-)

> +    /* burn_budget would return for IDLE VCPU */
> +    burn_budgets(ops, scurr, now);
> +
> +    __repl_update(ops, now);

> +/*
> + * Pick a vcpu on a cpu to kick out to place the running candidate
>
Rather than 'Pick a vcpu on a cpu to kick out...', I'd say 'Pick a cpu
where to run a vcpu, possibly kicking out the vcpu running there'.

Right. For this round, I tried, while looking at the patch, as hard as I
could to concentrate on the algorithm, and on how the Xen scheduling
framework is being used here. As a result, I confirm my previous
impression that this code is in a fair state and that, as an
experimental and in-development feature, it could well be checked in
soon (as far as comments are addressed, of course :-D ).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-09 16:57   ` Dario Faggioli
@ 2014-09-09 18:21     ` Meng Xu
  2014-09-11  8:44       ` Dario Faggioli
  0 siblings, 1 reply; 31+ messages in thread
From: Meng Xu @ 2014-09-09 18:21 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Chenyang Lu, Ian Jackson, xen-devel, Linh Thi Xuan Phan, Meng Xu,
	Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

Hi Dario,

Thank you very much for your comments! I will just comment on the
points that need clarification, and will address all of the comments.

>> +/*
>> + * Debug related code, dump vcpu/cpu information
>> + */
>> +static void
>> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    char cpustr[1024];
>> +    cpumask_t *cpupool_mask;
>> +
>> +    ASSERT(svc != NULL);
>> +    /* flag vcpu */
>> +    if( svc->sdom == NULL )
>> +        return;
>> +
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
>> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
>> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
>> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>>
> How does this come up in the console? Should we break it with a '\n'
> somewhere? It looks rather long...

Some information is not so useful here, such as the period and budget
of the vcpu, which can be displayed by using the tool stack. I can
remove some of them to make this line shorter. I will remove
svc->budget, svc->period and prv->cpus.

>> +            svc->vcpu->domain->domain_id,
>> +            svc->vcpu->vcpu_id,
>> +            svc->vcpu->processor,
>> +            svc->period,
>> +            svc->budget,
>> +            svc->cur_budget,
>> +            svc->cur_deadline,
>> +            svc->last_start,
>> +            __vcpu_on_runq(svc),
>> +            vcpu_runnable(svc->vcpu),
>> +            cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpupool_mask = cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool);
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_mask);
>> +    printk("cpupool=%s ", cpustr);
>> +    memset(cpustr, 0, sizeof(cpustr));
>> +    cpumask_scnprintf(cpustr, sizeof(cpustr), &prv->cpus);
>> +    printk("prv->cpus=%s\n", cpustr);
>> +
>> +    /* TRACE */
>> +    {
>> +        struct {
>> +            unsigned dom:16,vcpu:16;
>> +            unsigned processor;
>> +            unsigned cur_budget_lo, cur_budget_hi;
>> +            unsigned cur_deadline_lo, cur_deadline_hi;
>> +            unsigned is_vcpu_on_runq:16,is_vcpu_runnable:16;
>> +        } d;
>> +        d.dom = svc->vcpu->domain->domain_id;
>> +        d.vcpu = svc->vcpu->vcpu_id;
>> +        d.processor = svc->vcpu->processor;
>> +        d.cur_budget_lo = (unsigned) svc->cur_budget;
>> +        d.cur_budget_hi = (unsigned) (svc->cur_budget >> 32);
>> +        d.cur_deadline_lo = (unsigned) svc->cur_deadline;
>> +        d.cur_deadline_hi = (unsigned) (svc->cur_deadline >> 32);
>> +        d.is_vcpu_on_runq = __vcpu_on_runq(svc);
>> +        d.is_vcpu_runnable = vcpu_runnable(svc->vcpu);
>> +        trace_var(TRC_RT_VCPU_DUMP, 1,
>> +                  sizeof(d),
>> +                  (unsigned char *)&d);
>> +    }
>> +}
>> +

>> + */
>> +static void
>> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
>> +{
>> +    struct rt_private *prv = RT_PRIV(ops);
>> +    struct list_head *runq = RUNQ(ops);
>> +    struct list_head *iter;
>> +    spinlock_t *schedule_lock;
>> +
> This empty line above seems to be actually empty, but looking more
> carefully, it does contain 4 spaces, doesn't it?
>
> If that's the case, avoid doing this, i.e., make sure that empty lines
> are actually empty. :-D
>
> Looking at each patch with `git show' should highlight occurrences of
> this phenomenon, as well  as of any trailing white space, by marking
> them in red.
>
>> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
>> +    ASSERT( spin_is_locked(schedule_lock) );
>> +
> As of now, the only lock around is prv->lock, isn't it? So this
> per_cpu(xxx) is a complex way to get to prv->lock, or am I missing
> something.

Yes. It's the only lock right now. When I split the RunQ into two
queues, RunQ and DepletedQ, I can still use one lock (but probably two
locks would be more efficient?)

>
> In credit, the pre-inited set of locks are actually used "as they are",
> while in credit2, there is some remapping going on, but there is more
> than one lock anyway. That's why you find things like the above in those
> two schedulers. Here, you should not need anything like that, (as you do
> everywhere else) just go ahead and use prv->lock.
>
> Of course, that does not mean you don't need the lock remapping in
> rt_alloc_pdata(). That code looks ok to me, just adapt this bit above,
> as, like this, it makes things harder to understand.
>
> Or am I overlooking something?

I think you didn't overlook anything. I will refer to credit2 to see
how it is using multiple locks, since it's likely we will have two
locks here.

>> +/*
>> + * Burn budget in nanosecond granularity
>> + */
>> +static void
>> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now)
>> +{
>>
> burn_budget()? I mean, why the trailing 's'?
>
> (yes, this is a very minor thing.)
>
>> +    s_time_t delta;
>> +
>> +    /* don't burn budget for idle VCPU */
>> +    if ( is_idle_vcpu(svc->vcpu) )
>> +        return;
>> +
>> +    rt_update_helper(now, svc);
>> +
>> +    /* not burn budget when vcpu miss deadline */
>> +    if ( now >= svc->cur_deadline )
>> +        return;
>> +
> How can this be true?

You are right! After rt_update_helper(), this check can never be true.
Will change it as you said and preferred. :-)

>
> Unless I'm missing something, in rt_update_helper(), if the deadline is
> behind now, you move it ahead of it (and replenish the budget). Here you
> check again whether the deadline is behind now, which should not be
> possible, as you just took care of that... Isn't it so?
>
> Considering both mine and George's suggestion, if you rework the helper
> and move the check out of it, then this one is fine (and you just call
> the helper if the condition is verified). If you don't want to do that,
> then I guess you can have the helper returning 0|1 depending on whether
> or not the update happened, and use such value here, for deciding
> whether to bail or not.
>
> I think I'd prefer the former (pulling the check out of the helper).
>


>> +/*
>> + * schedule function for rt scheduler.
>> + * The lock is already grabbed in schedule.c, no need to lock here
>> + */
>> +static struct task_slice
>> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
>> +{
>> +    const int cpu = smp_processor_id();
>> +    struct rt_private * prv = RT_PRIV(ops);
>> +    struct rt_vcpu * const scurr = RT_VCPU(current);
>> +    struct rt_vcpu * snext = NULL;
>> +    struct task_slice ret = { .migrated = 0 };
>> +
>> +    /* clear ticked bit now that we've been scheduled */
>> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
>> +        cpumask_clear_cpu(cpu, &prv->tickled);
>> +
> Is the test important? cpumask operations may be quite expensive, and I
> think always clearing is better than always testing and sometimes
> (rather often, I think) clearing.
>
> I'm open to other views on this, though. :-)

I think we can just clear it, unless clearing is much more expensive
than testing. If there are no objections, I will just clear it in the next version.

> Right. For this round, I tried, while looking at the patch, as hard as I
> could to concentrate on the algorithm, and on how the Xen scheduling
> framework is being used here. As a result, I confirm my previous
> impression that this code is in a fair state and that, as an
> experimental and in-development feature, it could well be checked in
> soon (as far as comments are addressed, of course :-D ).
>

I will address all of these comments this week and try my best to
release the next version over the weekend. (Well, if not, it should be
early next week. There are many simple things to change and modify. :-) )

Thank you very much!

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-09 18:21     ` Meng Xu
@ 2014-09-11  8:44       ` Dario Faggioli
  2014-09-11 13:49         ` Meng Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2014-09-11  8:44 UTC (permalink / raw)
  To: Meng Xu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Chenyang Lu, Ian Jackson, xen-devel, Linh Thi Xuan Phan, Meng Xu,
	Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb



On Tue, 2014-09-09 at 14:21 -0400, Meng Xu wrote:
> >> +/*
> >> + * Debug related code, dump vcpu/cpu information
> >> + */
> >> +static void
> >> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
> >> +{
> >> +    struct rt_private *prv = RT_PRIV(ops);
> >> +    char cpustr[1024];
> >> +    cpumask_t *cpupool_mask;
> >> +
> >> +    ASSERT(svc != NULL);
> >> +    /* flag vcpu */
> >> +    if( svc->sdom == NULL )
> >> +        return;
> >> +
> >> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
> >> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
> >> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
> >> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
> >>
> > How does this come up in the console? Should we break it with a '\n'
> > somewhere? It looks rather long...
> 
> Some information is not so useful here, such as the period and budget
> of the vcpu, which can be displayed by using the tool stack. I can
> remove some of them to make this line shorter. I will remove
> svc->budget, svc->period and prv->cpus.
> 
Well, as you wish... A '\n' (and perhaps some more formatting with
'\t'-s, etch) would be fine too, IMO.

> >> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
> >> +    ASSERT( spin_is_locked(schedule_lock) );
> >> +
> > As of now, the only lock around is prv->lock, isn't it? So this
> > per_cpu(xxx) is a complex way to get to prv->lock, or am I missing
> > something.
> 
> Yes. It's the only lock right now. When I split the RunQ to two
> queues: RunQ, DepletedQ, I can still use one lock, (but probably two
> locks are more efficient?)
> 
> >
> > In credit, the pre-inited set of locks are actually used "as they are",
> > while in credit2, there is some remapping going on, but there is more
> > than one lock anyway. That's why you find things like the above in those
> > two schedulers. Here, you should not need anything like that, (as you do
> > everywhere else) just go ahead and use prv->lock.
> >
> > Of course, that does not mean you don't need the lock remapping in
> > rt_alloc_pdata(). That code looks ok to me, just adapt this bit above,
> > as, like this, it makes things harder to understand.
> >
> > Or am I overlooking something?
> 
> I think you didn't overlook anything. I will refer to credit2 to see
> how it is using multiple locks, since it's likely we will have two
> locks here.
> 
I don't think you do. I mentioned credit2 only to make it clear why
notation like the one above is required there, and to highlight that it
is _not_ required in your case.

Even if you start using 2 queues, one for runnable and one for depleted
vcpus, access to both can well be serialized by the same lock. In fact,
in quite a few places, you'd need moving vcpus from one queue to the
other, i.e., you'd be forced to take both of the locks anyway.

I do think that using separate queues may improve scalability, and
adopting a different locking strategy could make that happen, but I just
won't do that right now, at this point of the release cycle. For now,
the two queue approach will "just" make the code easier to read,
understand and hack, which is already something really important,
especially for an experimental feature.

So, IMO, just replace the line above with a simple "&prv->lock" and get
done with it, without adding any more locks, or changing the locking
logic.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/4] xen: add real time scheduler rt
  2014-09-11  8:44       ` Dario Faggioli
@ 2014-09-11 13:49         ` Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-09-11 13:49 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Chenyang Lu, Ian Jackson, xen-devel, Linh Thi Xuan Phan, Meng Xu,
	Jan Beulich, Chao Wang, Chong Li, Dagaen Golomb

2014-09-11 4:44 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:
> On Tue, 2014-09-09 at 14:21 -0400, Meng Xu wrote:
>> >> +/*
>> >> + * Debug related code, dump vcpu/cpu information
>> >> + */
>> >> +static void
>> >> +rt_dump_vcpu(const struct scheduler *ops, const struct rt_vcpu *svc)
>> >> +{
>> >> +    struct rt_private *prv = RT_PRIV(ops);
>> >> +    char cpustr[1024];
>> >> +    cpumask_t *cpupool_mask;
>> >> +
>> >> +    ASSERT(svc != NULL);
>> >> +    /* flag vcpu */
>> >> +    if( svc->sdom == NULL )
>> >> +        return;
>> >> +
>> >> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
>> >> +    printk("[%5d.%-2u] cpu %u, (%"PRI_stime", %"PRI_stime"),"
>> >> +           " cur_b=%"PRI_stime" cur_d=%"PRI_stime" last_start=%"PRI_stime
>> >> +           " onR=%d runnable=%d cpu_hard_affinity=%s ",
>> >>
>> > How does this come up in the console? Should we break it with a '\n'
>> > somewhere? It looks rather long...
>>
>> Some information is not so useful here, such as the period and budget
>> of the vcpu, which can be displayed by using the tool stack. I can
>> remove some of them to make this line shorter. I will remove
>> svc->budget, svc->period and prv->cpus.
>>
> Well, as you wish... A '\n' (and perhaps some more formatting with
> '\t'-s, etch) would be fine too, IMO.

Got it! Thanks!

>
>> >> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
>> >> +    ASSERT( spin_is_locked(schedule_lock) );
>> >> +
>> > As of now, the only lock around is prv->lock, isn't it? So this
>> > per_cpu(xxx) is a complex way to get to prv->lock, or am I missing
>> > something.
>>
>> Yes. It's the only lock right now. When I split the RunQ to two
>> queues: RunQ, DepletedQ, I can still use one lock, (but probably two
>> locks are more efficient?)
>>
>> >
>> > In credit, the pre-inited set of locks are actually used "as they are",
>> > while in credit2, there is some remapping going on, but there is more
>> > than one lock anyway. That's why you find things like the above in those
>> > two schedulers. Here, you should not need anything like that, (as you do
>> > everywhere else) just go ahead and use prv->lock.
>> >
>> > Of course, that does not mean you don't need the lock remapping in
>> > rt_alloc_pdata(). That code looks ok to me, just adapt this bit above,
>> > as, like this, it makes things harder to understand.
>> >
>> > Or am I overlooking something?
>>
>> I think you didn't overlook anything. I will refer to credit2 to see
>> how it is using multiple locks, since it's likely we will have two
>> locks here.
>>
> I don't think you do. I mentioned credit2 only to make it clear why
> notation like the one above is required there, and to highlight that it
> is _not_ required in your case.
>
> Even if you start using 2 queues, one for runnable and one for depleted
> vcpus, access to both can well be serialized by the same lock. In fact,
> in quite a few places, you'd need moving vcpus from one queue to the
> other, i.e., you'd be forced to take both of the locks anyway.
>
> I do think that using separate queues may improve scalability, and
> adopting a different locking strategy could make that happen, but I just
> won't do that right now, at this point of the release cycle. For now,
> the two queue approach will "just" make the code easier to read,
> understand and hack, which is already something really important,
> especially for an experimental feature.
>
> So, IMO, just replace the line above with a simple "&prv->lock" and get
> done with it, without adding any more locks, or changing the locking
> logic.
>

I agree with you totally. Sure! I will use one lock then. :-)

Best,

Meng

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Introduce rt real-time scheduler for Xen
@ 2014-08-24 22:58 Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-08-24 22:58 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, JBeulich, chaowang,
	lichong659, dgolomb

Hi all,

This series of patches adds the rt real-time scheduler to Xen.

In summary, it supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display each VCPU's parameters of each domain;
3) Supports CPU Pool

Compared with the set of patches in version RFC v2, this set of patches has the following improvements:
    a) added an rt-scheduler-specific TRACE facility
    b) a more efficient RunQ implementation that avoids scanning the whole RunQ when inserting a vcpu without budget.
    c) a bug fix for cpupool support.

-----------------------------------------------------------------------------------------------------------------------------
TODO:
    a) Burn budget at a finer granularity than 1 ms; [medium]
    b) Use a separate timer per vcpu for each vcpu's budget replenishment, instead of scanning the full runqueue every now and then; [medium]
    c) Handle time stolen from domU by the hypervisor. On a machine with many sockets and lots of cores, the spin-lock for the global RunQ used in the rt scheduler could eat up time from domU, which could leave domU with less budget than it requires. [not sure about difficulty right now] (Thanks to Konrad Rzeszutek for pointing this out at the Xen Summit. :-))

Plan:
    We will work on TODO a) and b) and try to finish these two items before September 10th. (We will also tackle the comments raised in the review of this set of patches.)

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory from the real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods, and discards any unused budget at the end of each of its periods. If a VCPU runs out of budget in a period, it has to wait until the next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism implemented for each VCPU.
The mechanism of deciding the priority of VCPUs at each scheduling point is based on the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with earliest deadline has highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without remaining budget.
At each part, VCPUs are sorted based on GEDF priority scheme.

Scheduling quanta: 1 ms.

If you are interested in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf), which will be published at EMSOFT14. This paper covers the following details:
    a) Design of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc.
    c) Comparison of this rt scheduler and the credit scheduler in terms of real-time performance.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0  10000  10000
Domain-0                             0    1  20000  20000
Domain-0                             0    2  30000  30000
Domain-0                             0    3  10000  10000
litmus1                              1    0  10000   4000
litmus1                              1    1  10000   4000

//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20000 -b 10000

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0  10000   4000
litmus1                              1    1  20000  10000

// list cpupool information
xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0   

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit 
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0 

-----------------------------------------------------------------------------------------------------------------------------
This set of patches was tested by running the above scenario with cpu-intensive tasks inside each guest domain. We manually checked that each domain gets its required resources without being interfered with by other domains; we also manually checked that the scheduling sequence of vcpus follows the Earliest Deadline First scheduling policy.

Any comment, question, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

[PATCH v1 1/4] xen: add real time scheduler rt
[PATCH v1 2/4] libxc: add rt scheduler
[PATCH v1 3/4] libxl: add rt scheduler
[PATCH v1 4/4] xl: introduce rt scheduler

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Introduce rt real-time scheduler for Xen
@ 2014-07-29  1:52 Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-07-29  1:52 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, JBeulich, lichong659, dgolomb

Hi all,

This series of patches adds the rt real-time scheduler to Xen.

In summary, it supports:
1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
2) Assign/display each VCPU's parameters of each domain;
3) Supports CPU Pool

Based on the review comments/suggestions on version 1, version 2 has the following improvements:
    a) Changed the interface for getting/setting a vcpu's parameters from statically allocating a large array to dynamically allocating memory based on the number of vcpus of a domain. (This is a major change from v1 to v2, because many of the comments on v1 were about the code related to this functionality.)
    b) Changed the time unit at the user interface from 1 ms to 1 us; changed the type of a VCPU's period and budget from uint16 to s_time_t.
    c) Changed the code style, rearranged the patch order, and added comments to better explain the code.
    d) Domain 0 is no longer treated as a special domain. Domain 0's VCPUs are now handled the same as domUs' VCPUs: they have the same default parameter values and are scheduled like domUs' VCPUs.
    e) Added more ASSERT()s, e.g., in __runq_insert() in sched_rt.c

-----------------------------------------------------------------------------------------------------------------------------
TODO:
    a) Add TRACE() in sched_rt.c functions. [easy]
       We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in the credit2 scheduler, to the rt scheduler, to allow debugging via tracing.
    b) Split the runnable and depleted (= no budget left) VCPU queues. [easy]
    c) Deal with budget overrun in the algorithm; [medium]
    d) Try using timers for replenishment, instead of scanning the full runqueue every now and then; [medium]
    e) Reconsider rt_vcpu_insert() and rt_vcpu_remove() for cpu pool support.
    f) Methods of improving the performance of the rt scheduler. [future work]
       VCPUs of the same domain may preempt each other under the preemptive global EDF scheduling policy. This self-switch issue brings no benefit to the domain but introduces more overhead. When this situation happens, we can simply promote the priority of the currently running lower-priority VCPU and let it borrow budget from higher-priority VCPUs to avoid such self-switches.

Plan: 
    TODO a) and b) are expected in RFC v3; (2 weeks)
    TODO c), d) and e) are expected in RFC v4, v5; (3-4 weeks)
    TODO f) will be delayed after this scheduler is upstreamed because the improvement will make the scheduler not a pure global EDF scheduler.

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory from the real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods, and discards any unused budget at the end of each of its periods. If a VCPU runs out of budget in a period, it has to wait until the next period.
The mechanism of how to burn a VCPU's budget depends on the server mechanism implemented for each VCPU.
The mechanism of deciding the priority of VCPUs at each scheduling point is based on the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but with budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with earliest deadline has highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: with and without remaining budget.
At each part, VCPUs are sorted based on GEDF priority scheme.

Scheduling quanta: 1 ms.

If you are interested in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf) in EMSOFT14. This paper covers the following details:
    a) Design of this scheduler;
    b) Measurement of the implementation overhead, e.g., scheduler overhead, context switch overhead, etc.
    c) Comparison of this rt scheduler and the credit scheduler in terms of real-time performance.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0  10000  10000
Domain-0                             0    1  20000  20000
Domain-0                             0    2  30000  30000
Domain-0                             0    3  10000  10000
litmus1                              1    0  10000   4000
litmus1                              1    1  10000   4000

//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20000 -b 10000

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0  10000   4000
litmus1                              1    1  20000  10000

// list cpupool information
xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0

-----------------------------------------------------------------------------------------------------------------------------
[PATCH RFC v2 1/4] xen: add real time scheduler rt
[PATCH RFC v2 2/4] libxc: add rt scheduler
[PATCH RFC v2 3/4] libxl: add rt scheduler
[PATCH RFC v2 4/4] xl: introduce rt scheduler
-----------------------------------------------------------------------------------------------------------------------------
Thanks to Dario, Wei, Ian, Andrew, George, and Konrad for your valuable comments and suggestions!

Any comments, questions, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Introduce rt real-time scheduler for Xen
  2014-07-11  4:49 Meng Xu
  2014-07-11 10:50 ` Wei Liu
@ 2014-07-11 16:19 ` Dario Faggioli
  1 sibling, 0 replies; 31+ messages in thread
From: Dario Faggioli @ 2014-07-11 16:19 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb



On ven, 2014-07-11 at 00:49 -0400, Meng Xu wrote:
> This series of patches adds an rt real-time scheduler to Xen.
> 
Hey Meng, Sisu!

Nice to see you here on xen-devel with this nice code-drop! :-P

> In summary, it supports:
> 1) Preemptive Global Earliest Deadline First scheduling policy by using a global RunQ for the scheduler;
> 2) Assign/display each VCPU's parameters of each domain;
> 3) Supports CPU Pool
> 
Great, thanks for making the effort of extracting this from your code
base and submitting it here. :-)

Having looked at the series carefully, I think it's a nice piece of work
already. There are quite a few modifications and cleanups to do, and I
think there's room for quite a bit of improvement, but I really like the
fact that all the features are basically there already.

In particular, proper SMP support, per-VCPU scheduling parameters, and a
sane and theoretically sound budgeting scheme are what we're missing in
SEDF[*], and we need these things badly!

[*] Josh's RFC is improving this, but only wrt the latter (a sane
scheduling algorithm).

> -----------------------------------------------------------------------------------------------------------------------------
> One scenario to show the functionality of this rt scheduler is as follows:
> //list each vcpu's parameters of each domain in cpu pools using rt scheduler
> #xl sched-rt
> Cpupool Pool-0: sched=EDF
> Name                                ID VCPU Period Budget
> Domain-0                             0    0     10     10
> Domain-0                             0    1     20     20
> Domain-0                             0    2     30     30
> Domain-0                             0    3     10     10
> litmus1                              1    0     10      4
> litmus1                              1    1     10      4
> 
> [...]
>
Thanks for showing this also.

> -----------------------------------------------------------------------------------------------------------------------------
> The differences between this new rt real-time scheduler and the sedf scheduler are as follows:
> 1) the rt scheduler supports global EDF scheduling, while sedf only supports partitioned scheduling. Thanks to VCPU affinity masks, the rt scheduler can also be used for partitioned scheduling, by pinning each VCPU's cpumask to a specific cpu.
>
Which is the biggest and most important difference. In fact, although
the implementation of this scheduler can be improved (AFAICT) wrt this
aspect too, adding SMP support to SEDF would be much, much harder...

> 2) the rt scheduler supports setting and getting the parameters of each VCPU of a domain. A domain can have multiple VCPUs with different parameters, and the rt scheduler lets the user get/set the parameters of each VCPU of a specific domain (the sedf scheduler does not support this yet);
> 3) rt scheduler supports cpupool.
>
Right. Well, to be fair, SEDF supports cpupools as well. :-)

> 4) the rt scheduler uses a deferrable server to burn/replenish a VCPU's budget, while sedf uses a constant bandwidth server. These are just two options for implementing a global EDF real-time scheduler, and both options' real-time performance has already been proven in academia.
> 
So, can you put some links to some of your works on top of RT-Xen, which
is where this scheduler comes from? Or, if that's not possible, at
least the titles?

I really don't expect people to jump on research papers, but I've
seen a few, and the experimental sections were nice to read and quite
useful.

> -----------------------------------------------------------------------------------------------------------------------------
> TODO:
>
Allow me to add a few items here, in some sort of priority order (mine,
at least):

  *) Deal with budget overrun in the algorithm [medium]
  *) Split runnable and depleted (=no budget left) VCPU queues [easy]
> 1) Improve the code for getting/setting each VCPU's parameters. [easy]
>     Right now, it creates an array with LIBXL_XEN_LEGACY_MAX_VCPUS (i.e., 32) elements to bounce all VCPUs' parameters of a domain between the toolstack and Xen. It is unnecessary for this array to have LIBXL_XEN_LEGACY_MAX_VCPUS elements.
>     The plan is to first get the exact number of VCPUs of a domain and then create an array with exactly that many elements to bounce between the toolstack and Xen.
> 2) Provide microsecond time precision in the xl interface instead of millisecond precision. [easy]
>     Right now, the rt scheduler lets the user specify each VCPU's parameters (period, budget) in milliseconds (i.e., ms). In some real-time applications, users may want to specify VCPUs' parameters in microseconds (i.e., us). The next step is to let the user specify VCPUs' parameters in microseconds, and to account time in microseconds (or nanoseconds) in the Xen rt scheduler as well.
>
  *) Subject Dom0 to the EDF+DS scheduling, as all other domains [easy]
      We can discuss what default Dom0 parameters should be, but we
      certainly want it to be scheduled as all other domains, and not
      getting too much of a special treatment.

> 3) Add Xen trace support to the rt scheduler. [easy]
>     We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in the credit2 scheduler, to allow debugging via tracing.
>
  *) Try using timers for replenishment, instead of scanning the full
     runqueue every now and then [medium]

> 4) A method for improving the performance of the rt scheduler. [future work]
>     VCPUs of the same domain may preempt each other under the preemptive global EDF scheduling policy. This self-switching brings no benefit to the domain but introduces more overhead. When it happens, we can simply raise the priority of the currently running lower-priority VCPU and let it borrow budget from higher-priority VCPUs, avoiding such self-switching.
> 
> Timeline of implementing the TODOs:
> We plan to finish TODOs 1), 2) and 3) within 3-4 weeks (or earlier).
> Because TODO 4) would make the scheduling policy not pure GEDF (people who want real GEDF may not be happy with this), we look forward to hearing people's opinions.
>
That one is definitely something we can concentrate on later.

> -----------------------------------------------------------------------------------------------------------------------------
> Special huge thanks to Dario Faggioli for his helpful and detailed comments on the preview version of this rt scheduler. :-)
> 
:-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: Introduce rt real-time scheduler for Xen
  2014-07-11 11:06   ` Dario Faggioli
@ 2014-07-11 16:14     ` Meng Xu
  0 siblings, 0 replies; 31+ messages in thread
From: Meng Xu @ 2014-07-11 16:14 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Wei Liu, Ian Campbell, Sisu Xi, Stefano Stabellini,
	George Dunlap, Ian Jackson, xen-devel, Meng Xu, Chong Li,
	Dagaen Golomb



Hi Wei and Dario,

2014-07-11 7:06 GMT-04:00 Dario Faggioli <dario.faggioli@citrix.com>:

> On ven, 2014-07-11 at 11:50 +0100, Wei Liu wrote:
> > On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
> > [...]
> > >
> > > [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> > > [PATCH RFC v1 2/4] xl for rt scheduler
> > > [PATCH RFC v1 3/4] libxl for rt scheduler
> > > [PATCH RFC v1 4/4] libxc for rt scheduler
> > >
> >
> > I have some general comments on how you arrange these patches.
> >
> > At a glance at the titles and code, you should do them in the order 1,
> > 4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
> > libxc depends on the hypervisor. You will break bisection with the current
> > ordering.
> >
> Yep, I agree with Wei.
>
> > And we normally write titles like
> >   xen: add rt scheduler
> >   libxl: introduce rt scheduler
> >   xl: XXXX
> >   etc.
> > starting with the component name, separated by a colon.
> >
> Indeed we do, and this helps quite a bit.
>
> > Last but not least, you need to CC relevant maintainers. You can find
> > out maintainers with scripts/get_maintainers.pl.
> >
> Yes, but this one, Meng almost got it right, I think.
>
> Basically, Meng, you're missing the hypervisor maintainers (at least for
> patch 1). :-)
>
>
Thank you very much for your advice!
I will modify them in the next version of the scheduler: 1) rearrange the
patch order; 2) change the commit titles; 3) CC all relevant maintainers.

Meng


* Re: Introduce rt real-time scheduler for Xen
  2014-07-11 10:50 ` Wei Liu
@ 2014-07-11 11:06   ` Dario Faggioli
  2014-07-11 16:14     ` Meng Xu
  0 siblings, 1 reply; 31+ messages in thread
From: Dario Faggioli @ 2014-07-11 11:06 UTC (permalink / raw)
  To: Wei Liu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb



On ven, 2014-07-11 at 11:50 +0100, Wei Liu wrote:
> On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
> [...]
> > 
> > [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> > [PATCH RFC v1 2/4] xl for rt scheduler
> > [PATCH RFC v1 3/4] libxl for rt scheduler
> > [PATCH RFC v1 4/4] libxc for rt scheduler
> > 
> 
> I have some general comments on how you arrange these patches.
> 
> At a glance at the titles and code, you should do them in the order 1,
> 4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
> libxc depends on the hypervisor. You will break bisection with the current
> ordering.
> 
Yep, I agree with Wei.

> And we normally write titles like
>   xen: add rt scheduler
>   libxl: introduce rt scheduler
>   xl: XXXX
>   etc.
> starting with the component name, separated by a colon.
> 
Indeed we do, and this helps quite a bit.

> Last but not least, you need to CC relevant maintainers. You can find
> out maintainers with scripts/get_maintainers.pl.
> 
Yes, but this one, Meng almost got it right, I think.

Basically, Meng, you're missing the hypervisor maintainers (at least for
patch 1). :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: Introduce rt real-time scheduler for Xen
  2014-07-11  4:49 Meng Xu
@ 2014-07-11 10:50 ` Wei Liu
  2014-07-11 11:06   ` Dario Faggioli
  2014-07-11 16:19 ` Dario Faggioli
  1 sibling, 1 reply; 31+ messages in thread
From: Wei Liu @ 2014-07-11 10:50 UTC (permalink / raw)
  To: Meng Xu
  Cc: wei.liu2, ian.campbell, xisisu, stefano.stabellini,
	george.dunlap, dario.faggioli, ian.jackson, xen-devel,
	xumengpanda, lichong659, dgolomb

On Fri, Jul 11, 2014 at 12:49:54AM -0400, Meng Xu wrote:
[...]
> 
> [PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
> [PATCH RFC v1 2/4] xl for rt scheduler
> [PATCH RFC v1 3/4] libxl for rt scheduler
> [PATCH RFC v1 4/4] libxc for rt scheduler
> 

I have some general comments on how you arrange these patches.

At a glance at the titles and code, you should do them in the order 1,
4, 3 and 2. Apparently xl depends on libxl, libxl depends on libxc, and
libxc depends on the hypervisor. You will break bisection with the current
ordering.

And we normally write titles like
  xen: add rt scheduler
  libxl: introduce rt scheduler
  xl: XXXX
  etc.
starting with the component name, separated by a colon.

Last but not least, you need to CC relevant maintainers. You can find
out maintainers with scripts/get_maintainers.pl.

Wei.

> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania
> 

* Introduce rt real-time scheduler for Xen
@ 2014-07-11  4:49 Meng Xu
  2014-07-11 10:50 ` Wei Liu
  2014-07-11 16:19 ` Dario Faggioli
  0 siblings, 2 replies; 31+ messages in thread
From: Meng Xu @ 2014-07-11  4:49 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	dario.faggioli, ian.jackson, xumengpanda, lichong659, dgolomb

This series of patches adds an rt real-time scheduler to Xen.

In summary, it supports:
1) a preemptive Global Earliest Deadline First scheduling policy, using a global RunQ for the scheduler;
2) assigning/displaying the parameters of each VCPU of each domain;
3) CPU pools.

The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) scheme from real-time scheduling theory.
Each VCPU has a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods, and discards any unused budget at the end of each period. If a VCPU runs out of budget within a period, it has to wait until the next period.
How a VCPU's budget is burned depends on the server mechanism implemented for that VCPU.
The priority of VCPUs at each scheduling point is decided according to the preemptive GEDF scheduling scheme.

Server mechanism: each VCPU is implemented as a deferrable server.
While a VCPU has a task running on it, its budget is continuously burned;
while a VCPU has no task but still has budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with the earliest deadline has the highest priority.

Queue scheme: a global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: those with and those without remaining budget.
Within each part, VCPUs are sorted by the GEDF priority scheme.

Scheduling quantum: 1 ms; budget accounting, however, is done in microseconds.

-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
//list each vcpu's parameters of each domain in cpu pools using rt scheduler
#xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0     10     10
Domain-0                             0    1     20     20
Domain-0                             0    2     30     30
Domain-0                             0    3     10     10
litmus1                              1    0     10      4
litmus1                              1    1     10      4



//set the parameters of the vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20 -b 10

//domain litmus1's vcpu 1's parameters are changed, display each VCPU's parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0     10      4
litmus1                              1    1     20     10

// list cpupool information
# xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

//create a cpupool test
#xl cpupool-cpu-remove Pool-0 11
#xl cpupool-cpu-remove Pool-0 10
#xl cpupool-create name=\"test\" sched=\"credit\"
#xl cpupool-cpu-add test 11
#xl cpupool-cpu-add test 10
#xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0

//migrate litmus1 from cpupool Pool-0 to cpupool test.
#xl cpupool-migrate litmus1 test

//now litmus1 is in cpupool test
# xl sched-credit
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0

-----------------------------------------------------------------------------------------------------------------------------
The differences between this new rt real-time scheduler and the sedf scheduler are as follows:
1) The rt scheduler supports global EDF scheduling, while sedf only supports partitioned scheduling. Thanks to VCPU affinity masks, the rt scheduler can also be used for partitioned scheduling, by pinning each VCPU's cpumask to a specific cpu.
2) The rt scheduler supports setting and getting the parameters of each VCPU of a domain. A domain can have multiple VCPUs with different parameters, and the rt scheduler lets the user get/set the parameters of each VCPU of a specific domain (the sedf scheduler does not support this yet).
3) The rt scheduler supports cpupools.
4) The rt scheduler uses a deferrable server to burn/replenish a VCPU's budget, while sedf uses a constant bandwidth server. These are just two options for implementing a global EDF real-time scheduler, and both options' real-time performance has already been proven in academia.

(Briefly speaking, the functionality that the *SEDF* scheduler plans to implement and improve in a future release is already supported by this rt scheduler.)
(Although it is unnecessary to implement two server mechanisms, we can simply modify the two functions that burn and replenish VCPUs' budgets to incorporate the CBS or another server mechanism into this rt scheduler.)

-----------------------------------------------------------------------------------------------------------------------------
TODO:
1) Improve the code for getting/setting each VCPU's parameters. [easy]
    Right now, it creates an array with LIBXL_XEN_LEGACY_MAX_VCPUS (i.e., 32) elements to bounce all VCPUs' parameters of a domain between the toolstack and Xen. It is unnecessary for this array to have LIBXL_XEN_LEGACY_MAX_VCPUS elements.
    The plan is to first get the exact number of VCPUs of a domain and then create an array with exactly that many elements to bounce between the toolstack and Xen.
2) Provide microsecond time precision in the xl interface instead of millisecond precision. [easy]
    Right now, the rt scheduler lets the user specify each VCPU's parameters (period, budget) in milliseconds (i.e., ms). In some real-time applications, users may want to specify VCPUs' parameters in microseconds (i.e., us). The next step is to let the user specify VCPUs' parameters in microseconds, and to account time in microseconds (or nanoseconds) in the Xen rt scheduler as well.
3) Add Xen trace support to the rt scheduler. [easy]
    We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in the credit2 scheduler, to allow debugging via tracing.
4) A method for improving the performance of the rt scheduler. [future work]
    VCPUs of the same domain may preempt each other under the preemptive global EDF scheduling policy. This self-switching brings no benefit to the domain but introduces more overhead. When it happens, we can simply raise the priority of the currently running lower-priority VCPU and let it borrow budget from higher-priority VCPUs, avoiding such self-switching.

Timeline of implementing the TODOs:
We plan to finish TODOs 1), 2) and 3) within 3-4 weeks (or earlier).
Because TODO 4) would make the scheduling policy not pure GEDF (people who want real GEDF may not be happy with this), we look forward to hearing people's opinions.

-----------------------------------------------------------------------------------------------------------------------------
Special huge thanks to Dario Faggioli for his helpful and detailed comments on the preview version of this rt scheduler. :-)

Any comments, questions, and concerns are more than welcome! :-)

Thank you very much!

Meng

[PATCH RFC v1 1/4] rt: Add rt scheduler to hypervisor
[PATCH RFC v1 2/4] xl for rt scheduler
[PATCH RFC v1 3/4] libxl for rt scheduler
[PATCH RFC v1 4/4] libxc for rt scheduler

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


end of thread, other threads:[~2014-09-11 13:49 UTC | newest]

Thread overview: 31+ messages
2014-09-07 19:40 Introduce rt real-time scheduler for Xen Meng Xu
2014-09-07 19:40 ` [PATCH v2 1/4] xen: add real time scheduler rt Meng Xu
2014-09-08 14:32   ` George Dunlap
2014-09-08 18:44   ` George Dunlap
2014-09-09  9:42     ` Dario Faggioli
2014-09-09 11:31       ` George Dunlap
2014-09-09 12:52         ` Meng Xu
2014-09-09 12:25       ` Meng Xu
2014-09-09 12:46     ` Meng Xu
2014-09-09 16:57   ` Dario Faggioli
2014-09-09 18:21     ` Meng Xu
2014-09-11  8:44       ` Dario Faggioli
2014-09-11 13:49         ` Meng Xu
2014-09-07 19:40 ` [PATCH v2 2/4] libxc: add rt scheduler Meng Xu
2014-09-08 14:38   ` George Dunlap
2014-09-08 14:50   ` Ian Campbell
2014-09-08 14:53   ` Dario Faggioli
2014-09-07 19:41 ` [PATCH v2 3/4] libxl: " Meng Xu
2014-09-08 15:19   ` George Dunlap
2014-09-09 12:59     ` Meng Xu
2014-09-07 19:41 ` [PATCH v2 4/4] xl: introduce " Meng Xu
2014-09-08 16:06   ` George Dunlap
2014-09-08 16:16     ` Dario Faggioli
2014-09-09 13:14     ` Meng Xu
  -- strict thread matches above, loose matches on Subject: below --
2014-08-24 22:58 Introduce rt real-time scheduler for Xen Meng Xu
2014-07-29  1:52 Meng Xu
2014-07-11  4:49 Meng Xu
2014-07-11 10:50 ` Wei Liu
2014-07-11 11:06   ` Dario Faggioli
2014-07-11 16:14     ` Meng Xu
2014-07-11 16:19 ` Dario Faggioli
