* Introduce rt real-time scheduler for Xen
@ 2014-07-29  1:52 Meng Xu
  2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-29  1:52 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, JBeulich, lichong659, dgolomb

Hi all,

This series of patches adds an rt real-time scheduler to Xen.

In summary, it supports:
1) Preemptive Global Earliest Deadline First scheduling, using a global RunQ for the scheduler;
2) Assigning/displaying each VCPU's parameters for each domain;
3) CPU pools.

Based on the review comments/suggestions on version 1, version 2 has the following improvements:
    a) Changed the interface for getting/setting a VCPU's parameters from statically allocating a large array to dynamically allocating memory based on the number of VCPUs of a domain. (This is a major change from v1 to v2, because many comments on v1 were about the code related to this functionality.)
    b) Changed the time unit at the user interface from 1ms to 1us; changed the type of a VCPU's period and budget from uint16 to s_time_t.
    c) Cleaned up the code style, rearranged the patch order, and added comments to better explain the code.
    d) Domain 0 is no longer treated as a special domain. Its VCPUs are now handled the same way as domUs' VCPUs: they have the same default parameters and are scheduled just like domUs' VCPUs.
    e) Added more ASSERT()s, e.g., in __runq_insert() in sched_rt.c.

-----------------------------------------------------------------------------------------------------------------------------
TODO:
    a) Add TRACE() in sched_rt.c functions. [easy]
       We will add a few xentrace tracepoints, like TRC_CSCHED2_RUNQ_POS in the credit2 scheduler, to the rt scheduler, to support debugging via tracing.
    b) Split the runnable and depleted (i.e., no budget left) VCPU queues. [easy]
    c) Deal with budget overrun in the algorithm. [medium]
    d) Try using timers for replenishment, instead of scanning the full runqueue every now and then. [medium]
    e) Reconsider rt_vcpu_insert() and rt_vcpu_remove() for cpu pool support.
    f) Improve the performance of the rt scheduler. [future work]
       VCPUs of the same domain may preempt each other under the preemptive global EDF scheduling policy. Such self-switching brings no benefit to the domain but introduces extra overhead. When this situation happens, we can simply promote the priority of the currently running lower-priority VCPU and let it borrow budget from higher-priority VCPUs to avoid the self-switch.

Plan: 
    TODO a) and b) are expected in RFC v3; (2 weeks)
    TODO c), d) and e) are expected in RFC v4, v5; (3-4 weeks)
    TODO f) will be deferred until after this scheduler is upstreamed, because the improvement would make the scheduler no longer a pure global EDF scheduler.

-----------------------------------------------------------------------------------------------------------------------------
The design of this rt scheduler is as follows:
This rt scheduler follows the Preemptive Global Earliest Deadline First (GEDF) theory from the real-time field.
Each VCPU can have a dedicated period and budget. While scheduled, a VCPU burns its budget. Each VCPU has its budget replenished at the beginning of each of its periods, and discards any unused budget at the end of each period. If a VCPU runs out of budget within a period, it has to wait until the next period.
The mechanism of how a VCPU's budget is burned depends on the server mechanism implemented for each VCPU.
The priority of VCPUs at each scheduling point is decided by the Preemptive Global Earliest Deadline First scheduling scheme.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task to run but still has budget left, its budget is preserved.

Priority scheme: Global Earliest Deadline First (EDF).
At any scheduling point, the VCPU with the earliest deadline has the highest priority.

Queue scheme: A global runqueue for each CPU pool.
The runqueue holds all runnable VCPUs.
VCPUs in the runqueue are divided into two parts: those with and those without remaining budget.
Within each part, VCPUs are sorted by the GEDF priority scheme.

Scheduling quantum: 1 ms.
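
To make the replenishment rule concrete, here is a rough C sketch of what happens at each scheduling point (a simplified model of __repl_update()/burn_budgets() in patch 1/4; the type and function names below are illustrative only, not the actual Xen ones):

    typedef long long s_time_t;     /* time in nanoseconds, as in Xen */

    struct sketch_vcpu {
        s_time_t period;            /* replenishment period */
        s_time_t budget;            /* budget per period */
        s_time_t cur_deadline;      /* end of the current period */
        s_time_t cur_budget;        /* budget left in the current period */
    };

    /* If the current deadline has passed, advance it by a whole number of
     * periods and refill the budget; unused budget is discarded. */
    static void repl_update(struct sketch_vcpu *v, s_time_t now)
    {
        if ( now >= v->cur_deadline )
        {
            s_time_t missed = (now - v->cur_deadline) / v->period + 1;
            v->cur_deadline += missed * v->period;
            v->cur_budget = v->budget;
        }
    }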

If you are interested in the details of the design and evaluation of this rt scheduler, please refer to our paper "Real-Time Multi-Core Virtual Machine Scheduling in Xen" (http://www.cis.upenn.edu/~mengxu/emsoft14/emsoft14.pdf) at EMSOFT'14. The paper covers:
    a) the design of this scheduler;
    b) measurements of the implementation overheads, e.g., scheduler overhead, context-switch overhead, etc.;
    c) a comparison of this rt scheduler and the credit scheduler in terms of real-time performance.
-----------------------------------------------------------------------------------------------------------------------------
One scenario to show the functionality of this rt scheduler is as follows:
// list each VCPU's parameters of each domain in cpupools using the rt scheduler
# xl sched-rt
Cpupool Pool-0: sched=EDF
Name                                ID VCPU Period Budget
Domain-0                             0    0  10000  10000
Domain-0                             0    1  20000  20000
Domain-0                             0    2  30000  30000
Domain-0                             0    3  10000  10000
litmus1                              1    0  10000   4000
litmus1                              1    1  10000   4000

// set the parameters of vcpu 1 of domain litmus1:
# xl sched-rt -d litmus1 -v 1 -p 20000 -b 10000

// domain litmus1's vcpu 1's parameters are changed; display this domain's VCPU parameters separately:
# xl sched-rt -d litmus1
Name                                ID VCPU Period Budget
litmus1                              1    0  10000   4000
litmus1                              1    1  20000  10000
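// vcpu 1 of litmus1 now reserves budget/period = 10000/20000 = 50% of one PCPU in each 20000us period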

// list cpupool information
# xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              12        rt       y          2

// create a cpupool test
# xl cpupool-cpu-remove Pool-0 11
# xl cpupool-cpu-remove Pool-0 10
# xl cpupool-create name=\"test\" sched=\"credit\"
# xl cpupool-cpu-add test 11
# xl cpupool-cpu-add test 10
# xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0              10        rt       y          2
test                 2    credit       y          0

// migrate litmus1 from cpupool Pool-0 to cpupool test
# xl cpupool-migrate litmus1 test

// now litmus1 is in cpupool test
# xl sched-credit
Cpupool test: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
litmus1                              1    256    0

-----------------------------------------------------------------------------------------------------------------------------
[PATCH RFC v2 1/4] xen: add real time scheduler rt
[PATCH RFC v2 2/4] libxc: add rt scheduler
[PATCH RFC v2 3/4] libxl: add rt scheduler
[PATCH RFC v2 4/4] xl: introduce rt scheduler
-----------------------------------------------------------------------------------------------------------------------------
Thanks to Dario, Wei, Ian, Andrew, George, and Konrad for your valuable comments and suggestions!

Any comments, questions, and concerns are more than welcome! :-)

Thank you very much!

Best,

Meng

---
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29  1:52 Introduce rt real-time scheduler for Xen Meng Xu
@ 2014-07-29  1:52 ` Meng Xu
  2014-07-29  6:52   ` Jan Beulich
  2014-07-29 10:36   ` [PATCH RFC " Andrew Cooper
  2014-07-29  1:53 ` [PATCH RFC v2 2/4] libxc: add rt scheduler Meng Xu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-29  1:52 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, Meng Xu, JBeulich, lichong659, dgolomb

This scheduler follows the preemptive Global EDF theory from the real-time field.
Each VCPU can have a dedicated period and budget.
While scheduled, a VCPU burns its budget.
A VCPU has its budget replenished at the beginning of each of its periods;
the VCPU discards its unused budget at the end of each of its periods.
If a VCPU runs out of budget in a period, it has to wait until the next period.
The mechanism of how a VCPU's budget is burned depends on the server mechanism
implemented for each VCPU.

Server mechanism: a VCPU is implemented as a deferrable server.
When a VCPU is scheduled to execute on a PCPU, its budget is continuously
burned.

Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
At any scheduling point, the VCPU with the earliest deadline has the highest
priority.

Queue scheme: A global Runqueue for each CPU pool.
The Runqueue holds all runnable VCPUs.
VCPUs in the Runqueue are divided into two parts: those with and without budget.
Within each part, VCPUs are sorted by the gEDF priority scheme.

Scheduling quantum: 1 ms.

Note: cpumask and cpupool are supported.

This is still in the development phase.

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 xen/common/Makefile         |    1 +
 xen/common/sched_rt.c       | 1058 +++++++++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c       |    4 +-
 xen/include/public/domctl.h |   28 +-
 xen/include/xen/sched-if.h  |    1 +
 5 files changed, 1089 insertions(+), 3 deletions(-)
 create mode 100644 xen/common/sched_rt.c

diff --git a/xen/common/Makefile b/xen/common/Makefile
index 3683ae3..5a23aa4 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -26,6 +26,7 @@ obj-y += sched_credit.o
 obj-y += sched_credit2.o
 obj-y += sched_sedf.o
 obj-y += sched_arinc653.o
+obj-y += sched_rt.o
 obj-y += schedule.o
 obj-y += shutdown.o
 obj-y += softirq.o
diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
new file mode 100644
index 0000000..6cfbb8a
--- /dev/null
+++ b/xen/common/sched_rt.c
@@ -0,0 +1,1058 @@
+/******************************************************************************
+ * Preemptive Global Earliest Deadline First (EDF) scheduler for Xen
+ * EDF scheduling is one of the most popular real-time scheduling
+ * algorithms used in the embedded field.
+ *
+ * by Sisu Xi, 2013, Washington University in Saint Louis
+ * and Meng Xu, 2014, University of Pennsylvania
+ *
+ * based on the code of the credit scheduler
+ */
+
+#include <xen/config.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/delay.h>
+#include <xen/event.h>
+#include <xen/time.h>
+#include <xen/perfc.h>
+#include <xen/sched-if.h>
+#include <xen/softirq.h>
+#include <asm/atomic.h>
+#include <xen/errno.h>
+#include <xen/trace.h>
+#include <xen/cpu.h>
+#include <xen/keyhandler.h>
+#include <xen/guest_access.h>
+
+/*
+ * TODO:
+ *
+ * Migration compensation and resistance, like credit2, to make better use of the cache;
+ * Lock holder problem, using yield?
+ * Self-switch problem: VCPUs of the same domain may preempt each other;
+ */
+
+/*
+ * Design:
+ *
+ * This scheduler follows the Preemptive Global EDF theory from the real-time field.
+ * Each VCPU can have a dedicated period and budget.
+ * While scheduled, a VCPU burns its budget.
+ * A VCPU has its budget replenished at the beginning of each of its periods;
+ * the VCPU discards its unused budget at the end of each of its periods.
+ * If a VCPU runs out of budget in a period, it has to wait until the next period.
+ * The mechanism of how a VCPU's budget is burned depends on the server mechanism
+ * implemented for each VCPU.
+ *
+ * Server mechanism: a VCPU is implemented as a deferrable server.
+ * When a VCPU has a task running on it, its budget is continuously burned;
+ * When a VCPU has no task but with budget left, its budget is preserved.
+ *
+ * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
+ * At any scheduling point, the VCPU with the earliest deadline has the highest priority.
+ *
+ * Queue scheme: A global runqueue for each CPU pool. 
+ * The runqueue holds all runnable VCPUs. 
+ * VCPUs in the runqueue are divided into two parts: those with and without remaining budget.
+ * Within each part, VCPUs are sorted by the EDF priority scheme.
+ *
+ * Scheduling quantum: 1 ms; budget accounting, however, is done at microsecond granularity.
+ *
+ * Note: cpumask and cpupool are supported.
+ */
+
+/*
+ * Locking:
+ * Just like credit2, a global system lock is used to protect the RunQ.
+ * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
+ *
+ * The lock is already grabbed when calling the wake/sleep/schedule functions in schedule.c
+ *
+ * The functions that involve the RunQ and need to grab the lock are:
+ *    dump, vcpu_insert, vcpu_remove, context_saved
+ */
+
+
+/*
+ * Default parameters: period and budget default to 10 and 4 ms, respectively
+ */
+#define RT_DEFAULT_PERIOD     (MICROSECS(10))
+#define RT_DEFAULT_BUDGET     (MICROSECS(4))
+
+/*
+ * Useful macros
+ */
+#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
+#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
+#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
+#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)
+
+/*
+ * Flags
+ */
+/* RT_scheduled: Is this vcpu either running on, or context-switching off,
+ * a physical cpu?
+ * + Accessed only with Runqueue lock held.
+ * + Set when chosen as next in rt_schedule().
+ * + Cleared after context switch has been saved in rt_context_saved()
+ * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we should
+ *   set RT_delayed_runq_add
+ * + Checked to be false in runq_insert.
+ */
+#define __RT_scheduled            1
+#define RT_scheduled (1<<__RT_scheduled)
+/* RT_delayed_runq_add: Do we need to add this to the Runqueue once it is done
+ * being context switched out?
+ * + Set when scheduling out in rt_schedule() if prev is runnable
+ * + Set in rt_vcpu_wake if it finds RT_scheduled set
+ * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
+ *   clears the bit.
+ *
+ */
+#define __RT_delayed_runq_add     2
+#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
+
+/*
+ * Debug only. Used to print out debug information
+ */
+#define printtime()\
+        ({s_time_t now = NOW(); \
+          printk("%u : %3ld.%3ldus : %-19s ",\
+          smp_processor_id(), now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
+
+/*
+ * System-wide private data, including a global RunQueue
+ * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
+ * It can be grabbed via vcpu_schedule_lock_irq()
+ */
+struct rt_private {
+    spinlock_t lock;        /* The global coarse-grained lock */
+    struct list_head sdom;  /* list of available domains, used for dump */
+    struct list_head runq;  /* Ordered list of runnable VMs */
+    cpumask_t cpus;         /* cpumask_t of available physical cpus */
+    cpumask_t tickled;      /* another cpu in the queue already tickled this one */
+};
+
+/*
+ * Virtual CPU
+ */
+struct rt_vcpu {
+    struct list_head runq_elem; /* On the runqueue list */
+    struct list_head sdom_elem; /* On the domain VCPU list */
+
+    /* Up-pointers */
+    struct rt_dom *sdom;
+    struct vcpu *vcpu;
+
+    /* VCPU parameters, in microseconds */
+    s_time_t period;
+    s_time_t budget;
+
+    /* Current VCPU information, in nanoseconds */
+    long cur_budget;             /* current budget */
+    s_time_t last_start;        /* last start time */
+    s_time_t cur_deadline;      /* current deadline for EDF */
+
+    unsigned flags;             /* marks __RT_scheduled, etc. */
+};
+
+/*
+ * Domain
+ */
+struct rt_dom {
+    struct list_head vcpu;      /* link its VCPUs */
+    struct list_head sdom_elem; /* link list on rt_priv */
+    struct domain *dom;         /* pointer to upper domain */
+};
+
+/*
+ * RunQueue helper functions
+ */
+static int
+__vcpu_on_runq(struct rt_vcpu *svc)
+{
+   return !list_empty(&svc->runq_elem);
+}
+
+static struct rt_vcpu *
+__runq_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct rt_vcpu, runq_elem);
+}
+
+/*
+ * Debug related code, dump vcpu/cpu information
+ */
+static void
+rt_dump_vcpu(struct rt_vcpu *svc)
+{
+    char *cpustr = keyhandler_scratch;
+    if ( svc == NULL ) 
+    {
+        printk("NULL!\n");
+        return;
+    }
+    cpumask_scnprintf(cpustr, sizeof(keyhandler_scratch), svc->vcpu->cpu_hard_affinity);
+    printk("[%5d.%-2d] cpu %d, (%"PRId64", %"PRId64"), cur_b=%"PRId64" cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s ",
+            svc->vcpu->domain->domain_id,
+            svc->vcpu->vcpu_id,
+            svc->vcpu->processor,
+            svc->period,
+            svc->budget,
+            svc->cur_budget,
+            svc->cur_deadline,
+            svc->last_start,
+            __vcpu_on_runq(svc),
+            vcpu_runnable(svc->vcpu),
+            cpustr);
+    memset(cpustr, 0, sizeof(keyhandler_scratch));
+    cpumask_scnprintf(cpustr, sizeof(keyhandler_scratch), cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));
+    printk("cpupool=%s\n", cpustr);
+}
+
+static void
+rt_dump_pcpu(const struct scheduler *ops, int cpu)
+{
+    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
+
+    printtime();
+    rt_dump_vcpu(svc);
+}
+
+/*
+ * Should not need the lock here; we are only showing stuff
+ */
+static void
+rt_dump(const struct scheduler *ops)
+{
+    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
+    struct rt_private *prv = RT_PRIV(ops);
+    struct rt_vcpu *svc;
+    int cpu = 0;
+    int loop = 0;
+
+    printtime();
+    printk("Priority Scheme: EDF\n");
+
+    printk("PCPU info: \n");
+    for_each_cpu(cpu, &prv->cpus) 
+        rt_dump_pcpu(ops, cpu);
+
+    printk("Global RunQueue info: \n");
+    loop = 0;
+    runq = RUNQ(ops);
+    list_for_each( iter, runq ) 
+    {
+        svc = __runq_elem(iter);
+        printk("\t%3d: ", ++loop);
+        rt_dump_vcpu(svc);
+    }
+
+    printk("Domain info: \n");
+    loop = 0;
+    list_for_each( iter_sdom, &prv->sdom ) 
+    {
+        struct rt_dom *sdom;
+        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
+        printk("\tdomain: %d\n", sdom->dom->domain_id);
+
+        list_for_each( iter_svc, &sdom->vcpu ) 
+        {
+            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
+            printk("\t\t%3d: ", ++loop);
+            rt_dump_vcpu(svc);
+        }
+    }
+
+    printk("\n");
+}
+
+static inline void
+__runq_remove(struct rt_vcpu *svc)
+{
+    if ( __vcpu_on_runq(svc) )
+        list_del_init(&svc->runq_elem);
+}
+
+/*
+ * Insert a vcpu into the RunQ based on the vcpus' deadlines:
+ * EDF scheduling policy: a vcpu with a smaller deadline has higher priority;
+ * The vcpu svc to be inserted will be inserted just before the very first
+ * vcpu iter_svc in the Runqueue whose deadline is equal to or larger than
+ * svc's deadline.
+ */
+static void
+__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    spinlock_t *schedule_lock;
+    
+    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
+    ASSERT( spin_is_locked(schedule_lock) );
+    
+    /* Debug only */
+    if ( __vcpu_on_runq(svc) )
+    {
+        rt_dump(ops);
+    }
+    ASSERT( !__vcpu_on_runq(svc) );
+
+    list_for_each(iter, runq) 
+    {
+        struct rt_vcpu * iter_svc = __runq_elem(iter);
+
+        /* svc still has budget */
+        if ( svc->cur_budget > 0 ) 
+        {
+            if ( iter_svc->cur_budget == 0 ||
+                 svc->cur_deadline <= iter_svc->cur_deadline )
+                    break;
+        } 
+        else 
+        { /* svc has no budget */
+            if ( iter_svc->cur_budget == 0 &&
+                 svc->cur_deadline <= iter_svc->cur_deadline )
+                    break;
+        }
+    }
+
+    list_add_tail(&svc->runq_elem, iter);
+}
+
+/*
+ * Init/Free related code
+ */
+static int
+rt_init(struct scheduler *ops)
+{
+    struct rt_private *prv = xzalloc(struct rt_private);
+
+    if ( prv == NULL )
+        return -ENOMEM;
+    ops->sched_data = prv;
+    spin_lock_init(&prv->lock);
+    INIT_LIST_HEAD(&prv->sdom);
+    INIT_LIST_HEAD(&prv->runq);
+
+    printtime();
+    printk("\n");
+
+    return 0;
+}
+
+static void
+rt_deinit(const struct scheduler *ops)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+
+    printtime();
+    printk("\n");
+    xfree(prv);
+}
+
+/* 
+ * point the per_cpu spinlock to the global system lock; all cpus share the same global system lock
+ */
+static void *
+rt_alloc_pdata(const struct scheduler *ops, int cpu)
+{
+    struct rt_private *prv = RT_PRIV(ops);
+
+    cpumask_set_cpu(cpu, &prv->cpus);
+
+    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
+
+    printtime();
+    printk("%s total cpus: %d", __FUNCTION__, cpumask_weight(&prv->cpus));
+    /* same as credit2, not a bogus pointer */
+    return (void *)1;
+}
+
+static void
+rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    cpumask_clear_cpu(cpu, &prv->cpus);
+    printtime();
+    printk("%s cpu=%d\n", __FUNCTION__, cpu);
+}
+
+static void *
+rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
+{
+    unsigned long flags;
+    struct rt_dom *sdom;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    sdom = xzalloc(struct rt_dom);
+    if ( sdom == NULL ) 
+    {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&sdom->vcpu);
+    INIT_LIST_HEAD(&sdom->sdom_elem);
+    sdom->dom = dom;
+
+    /* spinlock here to insert the dom */
+    spin_lock_irqsave(&prv->lock, flags);
+    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
+    spin_unlock_irqrestore(&prv->lock, flags);
+
+    return (void *)sdom;
+}
+
+static void
+rt_free_domdata(const struct scheduler *ops, void *data)
+{
+    unsigned long flags;
+    struct rt_dom *sdom = data;
+    struct rt_private *prv = RT_PRIV(ops);
+
+    printtime();
+    printk("dom=%d\n", sdom->dom->domain_id);
+
+    spin_lock_irqsave(&prv->lock, flags);
+    list_del_init(&sdom->sdom_elem);
+    spin_unlock_irqrestore(&prv->lock, flags);
+    xfree(data);
+}
+
+static int
+rt_dom_init(const struct scheduler *ops, struct domain *dom)
+{
+    struct rt_dom *sdom;
+
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    /* IDLE Domain does not link on rt_private */
+    if ( is_idle_domain(dom) ) 
+        return 0;
+
+    sdom = rt_alloc_domdata(ops, dom);
+    if ( sdom == NULL ) 
+    {
+        printk("%s, failed\n", __func__);
+        return -ENOMEM;
+    }
+    dom->sched_priv = sdom;
+
+    return 0;
+}
+
+static void
+rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
+{
+    printtime();
+    printk("dom=%d\n", dom->domain_id);
+
+    rt_free_domdata(ops, RT_DOM(dom));
+}
+
+static void *
+rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
+{
+    struct rt_vcpu *svc;
+    s_time_t now = NOW();
+    long count;
+
+    /* Allocate per-VCPU info */
+    svc = xzalloc(struct rt_vcpu);
+    if ( svc == NULL ) 
+    {
+        printk("%s, xzalloc failed\n", __func__);
+        return NULL;
+    }
+
+    INIT_LIST_HEAD(&svc->runq_elem);
+    INIT_LIST_HEAD(&svc->sdom_elem);
+    svc->flags = 0U;
+    svc->sdom = dd;
+    svc->vcpu = vc;
+    svc->last_start = 0;            /* init last_start is 0 */
+
+    svc->period = RT_DEFAULT_PERIOD;
+    if ( !is_idle_vcpu(vc) )
+        svc->budget = RT_DEFAULT_BUDGET;
+
+    count = (now/MICROSECS(svc->period)) + 1;
+    /* sync all VCPU's start time to 0 */
+    svc->cur_deadline += count * MICROSECS(svc->period);
+
+    svc->cur_budget = svc->budget*1000; /* budget is in microseconds; cur_budget in nanoseconds */
+    /* Debug only: dump new vcpu's info */
+    printtime();
+    rt_dump_vcpu(svc);
+
+    return svc;
+}
+
+static void
+rt_free_vdata(const struct scheduler *ops, void *priv)
+{
+    struct rt_vcpu *svc = priv;
+
+    /* Debug only: dump freed vcpu's info */
+    printtime();
+    rt_dump_vcpu(svc);
+    xfree(svc);
+}
+
+/*
+ * TODO: Do we need to add vc to the new Runqueue?
+ * This function is called in sched_move_domain() in schedule.c
+ * When moving a domain to a new cpupool,
+ * we may have to add vc to the Runqueue of the new cpupool
+ */
+static void
+rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu *svc = RT_VCPU(vc);
+
+    /* Debug only: dump info of vcpu to insert */
+    printtime();
+    rt_dump_vcpu(svc);
+
+    /* do not add the idle vcpu to the dom vcpu list */
+    if ( is_idle_vcpu(vc) )
+        return;
+
+    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
+}
+
+/*
+ * TODO: same as rt_vcpu_insert()
+ */
+static void
+rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    struct rt_dom * const sdom = svc->sdom;
+
+    printtime();
+    rt_dump_vcpu(svc);
+
+    BUG_ON( sdom == NULL );
+    BUG_ON( __vcpu_on_runq(svc) );
+
+    if ( !is_idle_vcpu(vc) ) 
+        list_del_init(&svc->sdom_elem);
+}
+
+/* 
+ * Pick a valid CPU for the vcpu vc
+ * The valid CPUs of a vcpu are the intersection of the vcpu's affinity and the available cpus
+ */
+static int
+rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
+{
+    cpumask_t cpus;
+    cpumask_t *online;
+    int cpu;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, &prv->cpus, online);
+    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
+
+    cpu = cpumask_test_cpu(vc->processor, &cpus)
+            ? vc->processor 
+            : cpumask_cycle(vc->processor, &cpus);
+    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+
+    return cpu;
+}
+
+/*
+ * Burn budget at microsecond level. 
+ */
+static void
+burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) 
+{
+    s_time_t delta;
+    long count = 0;
+
+    /* don't burn budget for idle VCPU */
+    if ( is_idle_vcpu(svc->vcpu) ) 
+    {
+        return;
+    }
+
+    /* first time called for this svc, update last_start */
+    if ( svc->last_start == 0 ) 
+    {
+        svc->last_start = now;
+        return;
+    }
+
+    /*
+     * update deadline info: when the deadline is in the past,
+     * it needs to be updated to the deadline of the current period,
+     * and the budget replenished
+     */
+    delta = now - svc->cur_deadline;
+    if ( delta >= 0 ) 
+    {
+        count = ( delta/MICROSECS(svc->period) ) + 1;
+        svc->cur_deadline += count * MICROSECS(svc->period);
+        svc->cur_budget = svc->budget * 1000;
+        return;
+    }
+
+    /* burn at nanoseconds level */
+    delta = now - svc->last_start;
+    /* 
+     * delta < 0 only happens in nested virtualization;
+     * TODO: how should we handle delta < 0 in a better way? */
+    if ( delta < 0 ) 
+    {
+        printk("%s, ATTENTION: now is behind last_start! delta = %ld for ",
+                __func__, delta);
+        rt_dump_vcpu(svc);
+        svc->last_start = now;  /* update last_start */
+        svc->cur_budget = 0;   /* FIXME: should we recover like this? */
+        return;
+    }
+
+    if ( svc->cur_budget == 0 ) 
+        return;
+
+    svc->cur_budget -= delta;
+    if ( svc->cur_budget < 0 ) 
+        svc->cur_budget = 0;
+}
+
+/* 
+ * The RunQ is sorted. Pick the first vcpu within the cpumask. If there is none, return NULL.
+ * The lock is grabbed before calling this function
+ */
+static struct rt_vcpu *
+__runq_pick(const struct scheduler *ops, cpumask_t mask)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct rt_vcpu *svc = NULL;
+    struct rt_vcpu *iter_svc = NULL;
+    cpumask_t cpu_common;
+    cpumask_t *online;
+    struct rt_private * prv = RT_PRIV(ops);
+
+    list_for_each(iter, runq) 
+    {
+        iter_svc = __runq_elem(iter);
+
+        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
+        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
+        cpumask_and(&cpu_common, online, &prv->cpus);
+        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
+        cpumask_and(&cpu_common, &mask, &cpu_common);
+        if ( cpumask_empty(&cpu_common) )
+            continue;
+
+        if ( iter_svc->cur_budget <= 0 )
+            continue;
+
+        svc = iter_svc;
+        break;
+    }
+
+    return svc;
+}
+
+/*
+ * Update vcpus' budgets and keep the runq sorted by inserting each modified vcpu back into the runq
+ * The lock is grabbed before calling this function
+ */
+static void
+__repl_update(const struct scheduler *ops, s_time_t now)
+{
+    struct list_head *runq = RUNQ(ops);
+    struct list_head *iter;
+    struct list_head *tmp;
+    struct rt_vcpu *svc = NULL;
+
+    s_time_t diff;
+    long count;
+
+    list_for_each_safe(iter, tmp, runq) 
+    {
+        svc = __runq_elem(iter);
+
+        diff = now - svc->cur_deadline;
+        if ( diff > 0 ) 
+        {
+            count = (diff/MICROSECS(svc->period)) + 1;
+            svc->cur_deadline += count * MICROSECS(svc->period);
+            svc->cur_budget = svc->budget * 1000;
+            __runq_remove(svc);
+            __runq_insert(ops, svc);
+        }
+    }
+}
+
+/* 
+ * schedule function for rt scheduler.
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static struct task_slice
+rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
+{
+    const int cpu = smp_processor_id();
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * const scurr = RT_VCPU(current);
+    struct rt_vcpu * snext = NULL;
+    struct task_slice ret = { .migrated = 0 };
+
+    /* clear ticked bit now that we've been scheduled */
+    if ( cpumask_test_cpu(cpu, &prv->tickled) )
+        cpumask_clear_cpu(cpu, &prv->tickled);
+
+    /* burn_budget would return for IDLE VCPU */
+    burn_budgets(ops, scurr, now);
+
+    __repl_update(ops, now);
+
+    if ( tasklet_work_scheduled ) 
+    {
+        snext = RT_VCPU(idle_vcpu[cpu]);
+    } 
+    else 
+    {
+        cpumask_t cur_cpu;
+        cpumask_clear(&cur_cpu);
+        cpumask_set_cpu(cpu, &cur_cpu);
+        snext = __runq_pick(ops, cur_cpu);
+        if ( snext == NULL )
+            snext = RT_VCPU(idle_vcpu[cpu]);
+
+        /* if scurr has higher priority and budget, still pick scurr */
+        if ( !is_idle_vcpu(current) &&
+             vcpu_runnable(current) &&
+             scurr->cur_budget > 0 &&
+             ( is_idle_vcpu(snext->vcpu) ||
+               scurr->cur_deadline <= snext->cur_deadline ) ) 
+            snext = scurr;
+    }
+
+    if ( snext != scurr &&
+         !is_idle_vcpu(current) &&
+         vcpu_runnable(current) )
+        set_bit(__RT_delayed_runq_add, &scurr->flags);
+    
+
+    snext->last_start = now;
+    if ( !is_idle_vcpu(snext->vcpu) ) 
+    {
+        if ( snext != scurr ) 
+        {
+            __runq_remove(snext);
+            set_bit(__RT_scheduled, &snext->flags);
+        }
+        if ( snext->vcpu->processor != cpu ) 
+        {
+            snext->vcpu->processor = cpu;
+            ret.migrated = 1;
+        }
+    }
+
+    ret.time = MILLISECS(1); /* sched quantum */
+    ret.task = snext->vcpu;
+
+    return ret;
+}
+
+/*
+ * Remove VCPU from RunQ
+ * The lock is already grabbed in schedule.c, no need to lock here 
+ */
+static void
+rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( curr_on_cpu(vc->processor) == vc ) 
+    {
+        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
+        return;
+    }
+
+    if ( __vcpu_on_runq(svc) ) 
+    {
+        __runq_remove(svc);
+        printk("%s: vcpu should not on runq in vcpu_sleep()\n", __FUNCTION__);
+        BUG();
+    }
+
+    clear_bit(__RT_delayed_runq_add, &svc->flags);
+}
+
+/*
+ * Pick a cpu for the waking candidate, kicking out the vcpu running there if needed.
+ * Called by wake() and context_saved()
+ * We have a runnable candidate here; the kick logic is:
+ * Among all the cpus that are within the candidate's cpu affinity:
+ * 1) if the candidate's previous cpu is idle, kick it. This could benefit cache hits
+ * 2) if there is any idle pcpu, kick it
+ * 3) now all pcpus are busy; among all the running vcpus, pick the one with the
+ *    lowest priority; if the candidate has higher priority, kick that cpu.
+ *
+ * TODO:
+ * 1) what if these two vcpus belong to the same domain?
+ *    replacing a vcpu of the same domain introduces more overhead
+ *
+ * lock is grabbed before calling this function 
+ */
+static void
+runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
+{
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * latest_deadline_vcpu = NULL;    /* lowest priority scheduled */
+    struct rt_vcpu * iter_svc;
+    struct vcpu * iter_vc;
+    int cpu = 0;
+    cpumask_t not_tickled;
+    cpumask_t *online;
+
+    if ( new == NULL || is_idle_vcpu(new->vcpu) ) 
+        return;
+
+    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
+    cpumask_and(&not_tickled, online, &prv->cpus);
+    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
+    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
+
+    /* 1) if new's previous cpu is idle, kick it for cache benefit */
+    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) 
+    {
+        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
+        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
+        return;
+    }
+
+    /* 2) if there are any idle pcpu, kick it */
+    /* The same loop also find the one with lowest priority */
+    for_each_cpu(cpu, &not_tickled) 
+    {
+        iter_vc = curr_on_cpu(cpu);
+        if ( is_idle_vcpu(iter_vc) ) 
+        {
+            cpumask_set_cpu(cpu, &prv->tickled);
+            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
+            return;
+        }
+        iter_svc = RT_VCPU(iter_vc);
+        if ( latest_deadline_vcpu == NULL || 
+             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
+            latest_deadline_vcpu = iter_svc;
+    }
+
+    /* 3) candidate has higher priority, kick out the lowest-priority vcpu */
+    if ( latest_deadline_vcpu != NULL && new->cur_deadline < latest_deadline_vcpu->cur_deadline ) 
+    {
+        cpumask_set_cpu(latest_deadline_vcpu->vcpu->processor, &prv->tickled);
+        cpu_raise_softirq(latest_deadline_vcpu->vcpu->processor, SCHEDULE_SOFTIRQ);
+    }
+    return;
+}
+
+/* 
+ * Should always wake up a runnable vcpu and put it back on the RunQ.
+ * Check priority to decide whether to raise an interrupt
+ * The lock is already grabbed in schedule.c, no need to lock here
+ * TODO: what if these two vcpus belong to the same domain?
+ */
+static void
+rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * const svc = RT_VCPU(vc);
+    s_time_t diff;
+    s_time_t now = NOW();
+    long count = 0;
+    struct rt_private * prv = RT_PRIV(ops);
+    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
+
+    BUG_ON( is_idle_vcpu(vc) );
+
+    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) 
+        return;
+
+    /* on RunQ, just update info is ok */
+    if ( unlikely(__vcpu_on_runq(svc)) ) 
+        return;
+
+    /* If the context hasn't been saved for this vcpu yet, we can't put it on
+     * the Runqueue. Instead, we set a flag so that it will be put on the Runqueue
+     * after the context has been saved. */
+    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) 
+    {
+        set_bit(__RT_delayed_runq_add, &svc->flags);
+        return;
+    }
+
+    /* update deadline info */
+    diff = now - svc->cur_deadline;
+    if ( diff >= 0 ) 
+    {
+        count = ( diff/MICROSECS(svc->period) ) + 1;
+        svc->cur_deadline += count * MICROSECS(svc->period);
+        svc->cur_budget = svc->budget * 1000;
+    }
+
+    __runq_insert(ops, svc);
+    __repl_update(ops, now);
+    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
+    runq_tickle(ops, snext);
+
+    return;
+}
+
+/* 
+ * scurr has finished context switch, insert it back to the RunQ,
+ * and then pick the highest priority vcpu from runq to run 
+ */
+static void
+rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
+{
+    struct rt_vcpu * svc = RT_VCPU(vc);
+    struct rt_vcpu * snext = NULL;
+    struct rt_private * prv = RT_PRIV(ops);
+    spinlock_t *lock = vcpu_schedule_lock_irq(vc);
+
+    clear_bit(__RT_scheduled, &svc->flags);
+    /* do not insert the idle vcpu into the runq */
+    if ( is_idle_vcpu(vc) ) 
+        goto out;
+
+    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
+         likely(vcpu_runnable(vc)) ) 
+    {
+        __runq_insert(ops, svc);
+        __repl_update(ops, NOW());
+        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
+        runq_tickle(ops, snext);
+    }
+out:
+    vcpu_schedule_unlock_irq(lock, vc);
+}
+
+/*
+ * set/get each vcpu info of each domain
+ */
+static int
+rt_dom_cntl(
+    const struct scheduler *ops, 
+    struct domain *d, 
+    struct xen_domctl_scheduler_op *op)
+{
+    xen_domctl_sched_rt_params_t *local_sched;
+    struct rt_dom * const sdom = RT_DOM(d);
+    struct list_head *iter;
+    int vcpu_index = 0;
+    int rc = -EINVAL;
+
+    switch ( op->cmd )
+    {
+    case XEN_DOMCTL_SCHEDOP_getnumvcpus:
+        op->u.rt.nr_vcpus = 0;
+        list_for_each( iter, &sdom->vcpu ) 
+            vcpu_index++;
+        op->u.rt.nr_vcpus = vcpu_index;
+        rc = 0;
+        break;
+    case XEN_DOMCTL_SCHEDOP_getinfo:
+        /* for debug use: whenever Dom0's parameters are adjusted, do a global dump */
+        if ( d->domain_id == 0 ) 
+            rt_dump(ops);
+
+        op->u.rt.nr_vcpus = 0;
+        list_for_each( iter, &sdom->vcpu ) 
+            vcpu_index++;
+        op->u.rt.nr_vcpus = vcpu_index;
+        local_sched = xzalloc_array(xen_domctl_sched_rt_params_t, vcpu_index);
+        if ( local_sched == NULL )
+        {
+            rc = -ENOMEM;
+            break;
+        }
+        vcpu_index = 0;
+        list_for_each( iter, &sdom->vcpu ) 
+        {
+            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
+
+            local_sched[vcpu_index].budget = svc->budget;
+            local_sched[vcpu_index].period = svc->period;
+            local_sched[vcpu_index].index = vcpu_index;
+            vcpu_index++;
+        }
+        if ( copy_to_guest(op->u.rt.vcpu, local_sched, vcpu_index) != 0 )
+        {
+            xfree(local_sched);
+            rc = -EFAULT;
+            break;
+        }
+        xfree(local_sched);
+        rc = 0;
+        break;
+    case XEN_DOMCTL_SCHEDOP_putinfo:
+        list_for_each( iter, &sdom->vcpu ) 
+        {
+            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
+
+            /* adjust per VCPU parameter */
+            if ( op->u.rt.vcpu_index == svc->vcpu->vcpu_id ) 
+            { 
+                vcpu_index = op->u.rt.vcpu_index;
+
+                if ( vcpu_index < 0 ) 
+                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
+                            vcpu_index);
+                else
+                    printk("XEN_DOMCTL_SCHEDOP_putinfo: "
+                            "vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
+                            vcpu_index, op->u.rt.period, op->u.rt.budget);
+
+                svc->period = op->u.rt.period;
+                svc->budget = op->u.rt.budget;
+
+                break;
+            }
+        }
+        rc = 0;
+        break;
+    }
+
+    return rc;
+}
+
+static struct rt_private _rt_priv;
+
+const struct scheduler sched_rt_def = {
+    .name           = "SMP RT Scheduler",
+    .opt_name       = "rt",
+    .sched_id       = XEN_SCHEDULER_RT_DS,
+    .sched_data     = &_rt_priv,
+
+    .dump_cpu_state = rt_dump_pcpu,
+    .dump_settings  = rt_dump,
+    .init           = rt_init,
+    .deinit         = rt_deinit,
+    .alloc_pdata    = rt_alloc_pdata,
+    .free_pdata     = rt_free_pdata,
+    .alloc_domdata  = rt_alloc_domdata,
+    .free_domdata   = rt_free_domdata,
+    .init_domain    = rt_dom_init,
+    .destroy_domain = rt_dom_destroy,
+    .alloc_vdata    = rt_alloc_vdata,
+    .free_vdata     = rt_free_vdata,
+    .insert_vcpu    = rt_vcpu_insert,
+    .remove_vcpu    = rt_vcpu_remove,
+
+    .adjust         = rt_dom_cntl,
+
+    .pick_cpu       = rt_cpu_pick,
+    .do_schedule    = rt_schedule,
+    .sleep          = rt_vcpu_sleep,
+    .wake           = rt_vcpu_wake,
+    .context_saved  = rt_context_saved,
+};
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e9eb0bc..f2df400 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
     &sched_sedf_def,
     &sched_credit_def,
     &sched_credit2_def,
+    &sched_rt_def,
     &sched_arinc653_def,
 };
 
@@ -1092,7 +1093,8 @@ long sched_adjust(struct domain *d, struct xen_domctl_scheduler_op *op)
 
     if ( (op->sched_id != DOM2OP(d)->sched_id) ||
          ((op->cmd != XEN_DOMCTL_SCHEDOP_putinfo) &&
-          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo)) )
+          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo) &&
+          (op->cmd != XEN_DOMCTL_SCHEDOP_getnumvcpus)) )
         return -EINVAL;
 
     /* NB: the pluggable scheduler code needs to take care
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 5b11bbf..8d4b973 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
 typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
 
+/*
+ * This structure is used to pass per-VCPU parameters of the rt scheduler
+ * from a privileged domain to Xen
+ */
+struct xen_domctl_sched_rt_params {
+    /* get vcpus' info */
+    int64_t period; /* s_time_t type */
+    int64_t budget;
+    int     index;
+};
+typedef struct xen_domctl_sched_rt_params xen_domctl_sched_rt_params_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_sched_rt_params_t);
 
 /* XEN_DOMCTL_scheduler_op */
 /* Scheduler types. */
@@ -346,9 +358,12 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
 #define XEN_SCHEDULER_CREDIT   5
 #define XEN_SCHEDULER_CREDIT2  6
 #define XEN_SCHEDULER_ARINC653 7
+#define XEN_SCHEDULER_RT_DS    8
+
 /* Set or get info? */
-#define XEN_DOMCTL_SCHEDOP_putinfo 0
-#define XEN_DOMCTL_SCHEDOP_getinfo 1
+#define XEN_DOMCTL_SCHEDOP_putinfo      0
+#define XEN_DOMCTL_SCHEDOP_getinfo      1
+#define XEN_DOMCTL_SCHEDOP_getnumvcpus   2
 struct xen_domctl_scheduler_op {
     uint32_t sched_id;  /* XEN_SCHEDULER_* */
     uint32_t cmd;       /* XEN_DOMCTL_SCHEDOP_* */
@@ -367,6 +382,15 @@ struct xen_domctl_scheduler_op {
         struct xen_domctl_sched_credit2 {
             uint16_t weight;
         } credit2;
+        struct xen_domctl_sched_rt{
+            /* get vcpus' params */
+            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) vcpu;
+            uint16_t nr_vcpus;
+            /* set one vcpu's params */
+            uint16_t vcpu_index;
+            int64_t period;
+            int64_t budget;
+        } rt;
     } u;
 };
 typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index 4164dff..bcbe234 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -169,6 +169,7 @@ extern const struct scheduler sched_sedf_def;
 extern const struct scheduler sched_credit_def;
 extern const struct scheduler sched_credit2_def;
 extern const struct scheduler sched_arinc653_def;
+extern const struct scheduler sched_rt_def;
 
 
 struct cpupool
-- 
1.7.9.5


* [PATCH RFC v2 2/4] libxc: add rt scheduler
  2014-07-29  1:52 Introduce rt real-time scheduler for Xen Meng Xu
  2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
@ 2014-07-29  1:53 ` Meng Xu
  2014-07-29  1:53 ` [PATCH RFC v2 3/4] libxl: " Meng Xu
  2014-07-29  1:53 ` [PATCH RFC v2 4/4] xl: introduce " Meng Xu
  3 siblings, 0 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-29  1:53 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, Meng Xu, JBeulich, lichong659, dgolomb

Add xc_sched_rt_* functions to interact with Xen to set/get a domain's
parameters for the rt scheduler.
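
For illustration, a toolstack caller might use these functions roughly as follows (a sketch only: error handling is elided, xch/domid are assumed to be an open xc_interface handle and a valid domain id, and period/budget are in the 1us unit of this interface):

    uint16_t num_vcpus;
    struct xen_domctl_sched_rt_params *sdom;

    /* how many VCPUs does the domain have? */
    xc_sched_rt_domain_get_num_vcpus(xch, domid, &num_vcpus);

    /* fetch the per-VCPU parameters */
    sdom = calloc(num_vcpus, sizeof(*sdom));
    xc_sched_rt_domain_get(xch, domid, sdom, num_vcpus);

    /* set VCPU 1 to period=20000us, budget=10000us */
    sdom[1].index  = 1;
    sdom[1].period = 20000;
    sdom[1].budget = 10000;
    xc_sched_rt_domain_set(xch, domid, &sdom[1]);

    free(sdom);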

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 tools/libxc/Makefile  |    1 +
 tools/libxc/xc_rt.c   |   89 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h |   12 +++++++
 3 files changed, 102 insertions(+)
 create mode 100644 tools/libxc/xc_rt.c

diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile
index 22eef8e..c2b02a4 100644
--- a/tools/libxc/Makefile
+++ b/tools/libxc/Makefile
@@ -20,6 +20,7 @@ CTRL_SRCS-y       += xc_sedf.c
 CTRL_SRCS-y       += xc_csched.c
 CTRL_SRCS-y       += xc_csched2.c
 CTRL_SRCS-y       += xc_arinc653.c
+CTRL_SRCS-y       += xc_rt.c
 CTRL_SRCS-y       += xc_tbuf.c
 CTRL_SRCS-y       += xc_pm.c
 CTRL_SRCS-y       += xc_cpu_hotplug.c
diff --git a/tools/libxc/xc_rt.c b/tools/libxc/xc_rt.c
new file mode 100644
index 0000000..74a1c3e
--- /dev/null
+++ b/tools/libxc/xc_rt.c
@@ -0,0 +1,89 @@
+/****************************************************************************
+ *
+ *        File: xc_rt.c
+ *      Author: Sisu Xi 
+ *              Meng Xu
+ *
+ * Description: XC Interface to the rt scheduler
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include "xc_private.h"
+
+int xc_sched_rt_domain_set(xc_interface *xch,
+                           uint32_t domid,
+                           struct xen_domctl_sched_rt_params *sdom)
+{
+    int rc;
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_putinfo;
+    domctl.u.scheduler_op.u.rt.vcpu_index = sdom->index;
+    domctl.u.scheduler_op.u.rt.period = sdom->period;
+    domctl.u.scheduler_op.u.rt.budget = sdom->budget;
+
+    rc = do_domctl(xch, &domctl);
+
+    return rc;
+}
+
+int xc_sched_rt_domain_get(xc_interface *xch,
+                           uint32_t domid,
+                           struct xen_domctl_sched_rt_params *sdom,
+                           uint16_t num_vcpus)
+{
+    int rc;
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BOUNCE(sdom, 
+        sizeof(*sdom) * num_vcpus, 
+        XC_HYPERCALL_BUFFER_BOUNCE_OUT);
+
+    if ( xc_hypercall_bounce_pre(xch, sdom) )
+        return -1;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getinfo;
+    set_xen_guest_handle(domctl.u.scheduler_op.u.rt.vcpu, sdom);
+
+    rc = do_domctl(xch, &domctl);
+
+    xc_hypercall_bounce_post(xch, sdom);
+
+    return rc;
+}
+
+int xc_sched_rt_domain_get_num_vcpus(xc_interface *xch,
+                                     uint32_t domid,
+                                     uint16_t *num_vcpus)
+{
+    int rc;
+    DECLARE_DOMCTL;
+
+    domctl.cmd = XEN_DOMCTL_scheduler_op;
+    domctl.domain = (domid_t) domid;
+    domctl.u.scheduler_op.sched_id = XEN_SCHEDULER_RT_DS;
+    domctl.u.scheduler_op.cmd = XEN_DOMCTL_SCHEDOP_getnumvcpus;
+
+    rc = do_domctl(xch, &domctl);
+
+    *num_vcpus = domctl.u.scheduler_op.u.rt.nr_vcpus;
+    return rc;
+}
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 5beb846..ca38a4f 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -875,6 +875,18 @@ int xc_sched_credit2_domain_get(xc_interface *xch,
                                uint32_t domid,
                                struct xen_domctl_sched_credit2 *sdom);
 
+int xc_sched_rt_domain_set(xc_interface *xch,
+                          uint32_t domid,
+                          struct xen_domctl_sched_rt_params *sdom);
+int xc_sched_rt_domain_get(xc_interface *xch,
+                          uint32_t domid,
+                          struct xen_domctl_sched_rt_params *sdom,
+                          uint16_t num_vcpus);
+
+int xc_sched_rt_domain_get_num_vcpus(xc_interface *xch,
+                                    uint32_t domid,
+                                    uint16_t *num_vcpus);
+
 int
 xc_sched_arinc653_schedule_set(
     xc_interface *xch,
-- 
1.7.9.5


* [PATCH RFC v2 3/4] libxl: add rt scheduler
  2014-07-29  1:52 Introduce rt real-time scheduler for Xen Meng Xu
  2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
  2014-07-29  1:53 ` [PATCH RFC v2 2/4] libxc: add rt scheduler Meng Xu
@ 2014-07-29  1:53 ` Meng Xu
  2014-07-29  1:53 ` [PATCH RFC v2 4/4] xl: introduce " Meng Xu
  3 siblings, 0 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-29  1:53 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, Meng Xu, JBeulich, lichong659, dgolomb

Add libxl functions to set/get a domain's parameters for the rt scheduler
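
For illustration, the resulting interface can be used roughly as follows (a sketch only: error handling is elided, and an initialized libxl_ctx *ctx and valid domid are assumed):

    libxl_domain_sched_params scinfo;

    /* fetch the per-VCPU parameters of all VCPUs of the domain */
    libxl_domain_sched_params_get(ctx, domid, &scinfo);

    /* change VCPU 1 to period=20000us, budget=10000us and push it back */
    scinfo.sched         = LIBXL_SCHEDULER_RT;
    scinfo.rt.vcpu_index = 1;
    scinfo.rt.period     = 20000;
    scinfo.rt.budget     = 10000;
    libxl_domain_sched_params_set(ctx, domid, &scinfo);

    libxl_domain_sched_params_dispose(&scinfo);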

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 tools/libxl/libxl.c         |  139 +++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/libxl.h         |    7 +++
 tools/libxl/libxl_types.idl |   15 +++++
 3 files changed, 161 insertions(+)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 3526539..440e8df31 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -5154,6 +5154,139 @@ static int sched_sedf_domain_set(libxl__gc *gc, uint32_t domid,
     return 0;
 }
 
+static int sched_rt_domain_get(libxl__gc *gc, uint32_t domid,
+                               libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt_params* sdom;
+    uint16_t num_vcpus;
+    int rc, i;
+
+    rc = xc_sched_rt_domain_get_num_vcpus(CTX->xch, domid, &num_vcpus);
+    if (rc != 0) {
+        LOGE(ERROR, "getting num_vcpus of domain sched rt");
+        return ERROR_FAIL;
+    }
+    
+    /* FIXME: can malloc be used in libxl? seems it was used in the file */
+    sdom = (struct xen_domctl_sched_rt_params *)
+            malloc( sizeof(struct xen_domctl_sched_rt_params) * num_vcpus );
+    if ( !sdom ){
+        LOGE(ERROR, "Allocate sdom array fails\n");
+        return ERROR_INVAL;
+    }
+
+    rc = xc_sched_rt_domain_get(CTX->xch, domid, sdom, num_vcpus);
+    if (rc != 0) {
+        LOGE(ERROR, "getting domain sched rt");
+        free(sdom);
+        return ERROR_FAIL;
+    }
+
+    /* FIXME: how to guarantee libxl_*_dispose be called exactly once? */
+    libxl_domain_sched_params_init(scinfo);
+    
+    scinfo->rt.num_vcpus = num_vcpus;
+    scinfo->sched = LIBXL_SCHEDULER_RT;
+    /* FIXME: can malloc be used in libxl? seems it was used in the file */
+    scinfo->rt.vcpus = (libxl_vcpu *)
+                       malloc( sizeof(libxl_vcpu) * scinfo->rt.num_vcpus );
+    if ( !scinfo->rt.vcpus ){
+        LOGE(ERROR, "Failed to allocate the libxl_vcpu array\n");
+        free(sdom);
+        return ERROR_INVAL;
+    }
+    for( i = 0; i < num_vcpus; ++i)
+    {
+        scinfo->rt.vcpus[i].period = sdom[i].period;
+        scinfo->rt.vcpus[i].budget = sdom[i].budget;
+        scinfo->rt.vcpus[i].index = sdom[i].index;
+    }
+    
+    free(sdom);
+    return 0;
+}
+
+#define SCHED_RT_VCPU_PERIOD_MAX    31536000000000 /* one year in microseconds */
+#define SCHED_RT_VCPU_BUDGET_MAX    SCHED_RT_VCPU_PERIOD_MAX
+
+/*
+ * Sanity check of the scinfo parameters
+ * return 0 if all values are valid
+ * return 1 if one param is default value
+ * return 2 if the target vcpu's index, period or budget is out of range
+ */
+static int sched_rt_domain_set_validate_params(libxl__gc *gc,
+                                               const libxl_domain_sched_params *scinfo,
+                                               const uint16_t num_vcpus)
+{
+    int vcpu_index = scinfo->rt.vcpu_index;
+
+    if (vcpu_index == LIBXL_DOMAIN_SCHED_PARAM_VCPU_DEFAULT ||
+        scinfo->rt.period == LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT ||
+        scinfo->rt.budget == LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT)
+    {
+        return 1;
+    }
+
+    if (vcpu_index < 0 || vcpu_index >= num_vcpus)
+    {
+        LOG(ERROR, "VCPU index is not set or is out of range, "
+                    "valid values are within the range from 0 to %d", num_vcpus - 1);
+        return 2;
+    }
+
+    if (scinfo->rt.period < 1 ||
+        scinfo->rt.period > SCHED_RT_VCPU_PERIOD_MAX)
+    {
+        LOG(ERROR, "VCPU period is not set or is out of range, "
+                    "valid values are within the range from 1 to %lu", SCHED_RT_VCPU_PERIOD_MAX);
+        return 2;
+    }
+
+    if (scinfo->rt.budget < 1 ||
+        scinfo->rt.budget > SCHED_RT_VCPU_BUDGET_MAX)
+    {
+        LOG(ERROR, "VCPU budget is not set or is out of range, "
+                    "valid values are within the range from 1 to %lu", SCHED_RT_VCPU_BUDGET_MAX);
+        return 2;
+    }
+
+    return 0;
+
+}
+
+static int sched_rt_domain_set(libxl__gc *gc, uint32_t domid,
+                               const libxl_domain_sched_params *scinfo)
+{
+    struct xen_domctl_sched_rt_params sdom;
+    uint16_t num_vcpus;
+    int rc;
+ 
+    rc = xc_sched_rt_domain_get_num_vcpus(CTX->xch, domid, &num_vcpus);
+    if (rc != 0) {
+        LOGE(ERROR, "getting domain sched rt");
+        return ERROR_FAIL;
+    }
+    
+    rc = sched_rt_domain_set_validate_params(gc, scinfo, num_vcpus);
+    if (rc == 2)
+        return ERROR_INVAL;
+    if (rc == 1)
+        return 0;
+    if (rc == 0)
+    {
+        sdom.index = scinfo->rt.vcpu_index;
+        sdom.period = scinfo->rt.period;
+        sdom.budget = scinfo->rt.budget;
+    }
+
+    rc = xc_sched_rt_domain_set(CTX->xch, domid, &sdom);
+    if ( rc < 0 ) {
+        LOGE(ERROR, "setting domain sched rt");
+        return ERROR_FAIL;
+    }
+
+    return 0;
+}
+
 int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
                                   const libxl_domain_sched_params *scinfo)
 {
@@ -5177,6 +5310,9 @@ int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_ARINC653:
         ret=sched_arinc653_domain_set(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT:
+        ret=sched_rt_domain_set(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
@@ -5207,6 +5343,9 @@ int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
     case LIBXL_SCHEDULER_CREDIT2:
         ret=sched_credit2_domain_get(gc, domid, scinfo);
         break;
+    case LIBXL_SCHEDULER_RT:
+        ret=sched_rt_domain_get(gc, domid, scinfo);
+        break;
     default:
         LOG(ERROR, "Unknown scheduler");
         ret=ERROR_INVAL;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index 5ae6532..366eaf0 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -1241,6 +1241,13 @@ int libxl_sched_credit_params_set(libxl_ctx *ctx, uint32_t poolid,
 #define LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT   -1
 #define LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT -1
 
+#define LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT     -1
+#define LIBXL_DOMAIN_SCHED_PARAM_VCPU_DEFAULT       -1
+#define LIBXL_DOMAIN_SCHED_PARAM_NUM_VCPUS_DEFAULT  -1
+#define LIBXL_DOMAIN_SCHED_PARAM_VCPU_INDEX_DEFAULT -1
+/* Consistent with XEN_LEGACY_MAX_VCPUS in xen/arch-x86/xen.h */
+#define LIBXL_XEN_LEGACY_MAX_VCPUS                  32
+
 int libxl_domain_sched_params_get(libxl_ctx *ctx, uint32_t domid,
                                   libxl_domain_sched_params *params);
 int libxl_domain_sched_params_set(libxl_ctx *ctx, uint32_t domid,
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index a412f9c..340eb4c 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -153,6 +153,7 @@ libxl_scheduler = Enumeration("scheduler", [
     (5, "credit"),
     (6, "credit2"),
     (7, "arinc653"),
+    (8, "rt"),
     ])
 
 # Consistent with SHUTDOWN_* in sched.h (apart from UNKNOWN)
@@ -303,6 +304,19 @@ libxl_domain_restore_params = Struct("domain_restore_params", [
     ("checkpointed_stream", integer),
     ])
 
+libxl_rt_vcpu = Struct("vcpu",[
+    ("period",       uint64, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT'}),
+    ("budget",       uint64, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
+    ("index",        integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_VCPU_INDEX_DEFAULT'}),
+    ])
+
+libxl_domain_sched_rt_params = Struct("domain_sched_rt_params",[
+    ("vcpus",        Array(libxl_rt_vcpu, "num_vcpus")),
+    ("period",       uint64, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_PERIOD_DEFAULT'}),
+    ("budget",       uint64, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
+    ("vcpu_index",   integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_VCPU_INDEX_DEFAULT'}),
+    ])
+
 libxl_domain_sched_params = Struct("domain_sched_params",[
     ("sched",        libxl_scheduler),
     ("weight",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_WEIGHT_DEFAULT'}),
@@ -311,6 +325,7 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("slice",        integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_SLICE_DEFAULT'}),
     ("latency",      integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_LATENCY_DEFAULT'}),
     ("extratime",    integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_EXTRATIME_DEFAULT'}),
+    ("rt",           libxl_domain_sched_rt_params),
     ])
 
 libxl_domain_build_info = Struct("domain_build_info",[
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH RFC v2 4/4] xl: introduce rt scheduler
  2014-07-29  1:52 Introduce rt real-time scheduler for Xen Meng Xu
                   ` (2 preceding siblings ...)
  2014-07-29  1:53 ` [PATCH RFC v2 3/4] libxl: " Meng Xu
@ 2014-07-29  1:53 ` Meng Xu
  3 siblings, 0 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-29  1:53 UTC (permalink / raw)
  To: xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, Meng Xu, JBeulich, lichong659, dgolomb

Add xl command for rt scheduler

Signed-off-by: Sisu Xi <xisisu@gmail.com>
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
---
 docs/man/xl.pod.1         |   40 ++++++++++++++
 tools/libxl/xl.h          |    1 +
 tools/libxl/xl_cmdimpl.c  |  131 +++++++++++++++++++++++++++++++++++++++++++++
 tools/libxl/xl_cmdtable.c |    9 ++++
 4 files changed, 181 insertions(+)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index 30bd4bf..42aeedc 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -1019,6 +1019,46 @@ Restrict output to domains in the specified cpupool.
 
 =back
 
+=item B<sched-rt> [I<OPTIONS>]
+
+Set or get rt (Real Time) scheduler parameters. This rt scheduler applies the
+Preemptive Global Earliest Deadline First real-time scheduling algorithm to
+schedule VCPUs in the system. Each VCPU has a dedicated period and budget.
+While scheduled, a VCPU burns its budget.
+A VCPU has its budget replenished at the beginning of each of its periods;
+the VCPU discards its unused budget at the end of each of its periods.
+
+B<OPTIONS>
+
+=over 4
+
+=item B<-d DOMAIN>, B<--domain=DOMAIN>
+
+Specify domain for which scheduler parameters are to be modified or retrieved.
+Mandatory for modifying scheduler parameters.
+
+=item B<-v VCPU>, B<--vcpu=VCPU>
+
+Specify the index of the VCPU whose parameters will be set.
+A domain can have multiple VCPUs; each VCPU has a unique index within its
+domain. When setting a domain's parameters, the parameters of each of its
+VCPUs need to be set individually.
+
+=item B<-p PERIOD>, B<--period=PERIOD>
+
+A VCPU has its budget replenished at the beginning of every period.
+The time unit is the microsecond.
+
+=item B<-b BUDGET>, B<--budget=BUDGET>
+
+A VCPU is allowed to run for at most BUDGET amount of time in each period.
+The time unit is the microsecond.
+
+=item B<-c CPUPOOL>, B<--cpupool=CPUPOOL>
+
+Restrict output to domains in the specified cpupool.
+
+=back
+
 =back
 
 =head1 CPUPOOLS COMMANDS
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
index 10a2e66..51b634a 100644
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -67,6 +67,7 @@ int main_memset(int argc, char **argv);
 int main_sched_credit(int argc, char **argv);
 int main_sched_credit2(int argc, char **argv);
 int main_sched_sedf(int argc, char **argv);
+int main_sched_rt(int argc, char **argv);
 int main_domid(int argc, char **argv);
 int main_domname(int argc, char **argv);
 int main_rename(int argc, char **argv);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 01bce2f..eb0039f 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -5100,6 +5100,52 @@ static int sched_sedf_domain_output(
     return 0;
 }
 
+
+static int sched_rt_domain_output(
+    int domid)
+{
+    char *domname;
+    libxl_domain_sched_params scinfo;
+    int rc = 0, i;
+
+    if (domid < 0) {
+        printf("%-33s %4s %4s %9s %9s\n", "Name", "ID", "VCPU", "Period", "Budget");
+        return 0;
+    }
+
+    libxl_domain_sched_params_init(&scinfo);
+    rc = sched_domain_get(LIBXL_SCHEDULER_RT, domid, &scinfo);
+    if (rc)
+        goto out;
+
+    domname = libxl_domid_to_name(ctx, domid);
+    for( i = 0; i < scinfo.rt.num_vcpus; i++ )
+    {
+        printf("%-33s %4d %4d %9"PRIu64" %9"PRIu64"\n",
+            domname,
+            domid,
+            scinfo.rt.vcpus[i].index,
+            scinfo.rt.vcpus[i].period,
+            scinfo.rt.vcpus[i].budget);
+    }
+    free(domname);
+
+out:
+    libxl_domain_sched_params_dispose(&scinfo);
+    return rc;
+}
+
+static int sched_rt_pool_output(uint32_t poolid)
+{
+    char *poolname;
+
+    poolname = libxl_cpupoolid_to_name(ctx, poolid);
+    printf("Cpupool %s: sched=EDF\n", poolname);
+
+    free(poolname);
+    return 0;
+}
+
 static int sched_default_pool_output(uint32_t poolid)
 {
     char *poolname;
@@ -5467,6 +5513,91 @@ int main_sched_sedf(int argc, char **argv)
     return 0;
 }
 
+/*
+ * <nothing>            : List all domain parameters and sched params
+ * -d [domid]           : List domain params for domain
+ * -d [domid] [params]  : Set domain params for domain
+ */
+int main_sched_rt(int argc, char **argv)
+{
+    const char *dom = NULL;
+    const char *cpupool = NULL;
+    int period = 10, opt_p = 0;
+    int budget = 4, opt_b = 0;
+    int vcpu_index = 0, opt_v = 0;
+    int opt, rc;
+    static struct option opts[] = {
+        {"domain", 1, 0, 'd'},
+        {"period", 1, 0, 'p'},
+        {"budget", 1, 0, 'b'},
+        {"vcpu", 1, 0, 'v'},
+        {"cpupool", 1, 0, 'c'},
+        COMMON_LONG_OPTS,
+        {0, 0, 0, 0}
+    };
+
+    SWITCH_FOREACH_OPT(opt, "d:p:b:v:c:h", opts, "sched-rt", 0) {
+    case 'd':
+        dom = optarg;
+        break;
+    case 'p':
+        period = strtol(optarg, NULL, 10);
+        opt_p = 1;
+        break;
+    case 'b':
+        budget = strtol(optarg, NULL, 10);
+        opt_b = 1;
+        break;
+    case 'v':
+        vcpu_index = strtol(optarg, NULL, 10);
+        opt_v = 1;
+        break;
+    case 'c':
+        cpupool = optarg;
+        break;
+    }
+
+    if (cpupool && (dom || opt_p || opt_b || opt_v)) {
+        fprintf(stderr, "Specifying a cpupool is not allowed with other options.\n");
+        return 1;
+    }
+    if (!dom && (opt_p || opt_b || opt_v)) {
+        fprintf(stderr, "Must specify a domain.\n");
+        return 1;
+    }
+    if ( (opt_v || opt_p || opt_b) && (opt_p + opt_b + opt_v != 3) ) {
+        fprintf(stderr, "Must specify vcpu, period, budget\n");
+        return 1;
+    }
+    
+    if (!dom) { /* list all domain's rt scheduler info */
+        return -sched_domain_output(LIBXL_SCHEDULER_RT,
+                                    sched_rt_domain_output,
+                                    sched_rt_pool_output,
+                                    cpupool);
+    } else {
+        uint32_t domid = find_domain(dom);
+        if (!opt_p && !opt_b && !opt_v) { /* output rt scheduler info */
+            sched_rt_domain_output(-1);
+            return -sched_rt_domain_output(domid);
+        } else { /* set rt scheduler parameters */
+            libxl_domain_sched_params scinfo;
+            libxl_domain_sched_params_init(&scinfo);
+            scinfo.sched = LIBXL_SCHEDULER_RT;
+            scinfo.rt.vcpu_index = vcpu_index;
+            scinfo.rt.period = period;
+            scinfo.rt.budget = budget;
+
+            rc = sched_domain_set(domid, &scinfo);
+            libxl_domain_sched_params_dispose(&scinfo);
+            if (rc)
+                return -rc;
+        }
+    }
+
+    return 0;
+}
+
 int main_domid(int argc, char **argv)
 {
     uint32_t domid;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 4279b9f..00500cb 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -277,6 +277,15 @@ struct cmd_spec cmd_table[] = {
       "                               --period/--slice)\n"
       "-c CPUPOOL, --cpupool=CPUPOOL  Restrict output to CPUPOOL"
     },
+    { "sched-rt",
+      &main_sched_rt, 0, 1,
+      "Get/set rt scheduler parameters",
+      "[-d <Domain> [-v[=VCPU]] [-p[=PERIOD]] [-b[=BUDGET]]]",
+      "-d DOMAIN, --domain=DOMAIN     Domain to modify\n"
+      "-v VCPU,   --vcpu=VCPU         VCPU\n"
+      "-p PERIOD, --period=PERIOD     Period (us)\n"
+      "-b BUDGET, --budget=BUDGET     Budget (us)\n"
+    },
     { "domid",
       &main_domid, 0, 0,
       "Convert a domain name to domain id",
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
@ 2014-07-29  6:52   ` Jan Beulich
  2014-07-29  9:42     ` Dario Faggioli
  2014-07-30 15:55     ` [RFC " Meng Xu
  2014-07-29 10:36   ` [PATCH RFC " Andrew Cooper
  1 sibling, 2 replies; 13+ messages in thread
From: Jan Beulich @ 2014-07-29  6:52 UTC (permalink / raw)
  To: Meng Xu
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, lichong659, dgolomb

>>> On 29.07.14 at 03:52, <mengxu@cis.upenn.edu> wrote:
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
>      &sched_sedf_def,
>      &sched_credit_def,
>      &sched_credit2_def,
> +    &sched_rt_def,
>      &sched_arinc653_def,
>  };

Is the insertion as other than last item (as one would expect for a
new addition) intentional? Not that I think this matters much, but
I'm still curious.

> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
>  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>  
> +/*
> + * This structure is used to pass to rt scheduler from a 
> + * privileged domain to Xen
> + */
> +struct xen_domctl_sched_rt_params {
> +    /* get vcpus' info */
> +    int64_t period; /* s_time_t type */
> +    int64_t budget;
> +    int     index;

Are all these really meaningfully signed quantities?

Also, you need to pad the structure to a multiple of 8 bytes, or
its layout will differ between 32- and 64-bit (tool stack) callers.
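
One possible layout, assuming the fields can indeed become unsigned
(illustrative only, the exact field order is up to you):

    struct xen_domctl_sched_rt_params {
        uint64_t period;   /* converted to/from s_time_t inside Xen */
        uint64_t budget;
        uint32_t index;
        uint32_t pad;      /* explicit padding, so sizeof() is the same
                              for 32- and 64-bit callers */
    };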

> @@ -367,6 +382,15 @@ struct xen_domctl_scheduler_op {
>          struct xen_domctl_sched_credit2 {
>              uint16_t weight;
>          } credit2;
> +        struct xen_domctl_sched_rt{
> +            /* get vcpus' params */
> +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) vcpu;
> +            uint16_t nr_vcpus;
> +            /* set one vcpu's params */
> +            uint16_t vcpu_index;
> +            int64_t period;
> +            int64_t budget;
> +        } rt;

Mostly the same comments here, just that the padding here needs to
go in the middle.
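
I.e. something along these lines (sketch):

    struct xen_domctl_sched_rt {
        /* get vcpus' params */
        XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) vcpu;
        uint16_t nr_vcpus;
        /* set one vcpu's params */
        uint16_t vcpu_index;
        uint32_t pad;       /* explicit padding before the 64-bit members */
        int64_t period;
        int64_t budget;
    } rt;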

Jan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29  6:52   ` Jan Beulich
@ 2014-07-29  9:42     ` Dario Faggioli
  2014-07-30 15:55     ` [RFC " Meng Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Dario Faggioli @ 2014-07-29  9:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, xumengpanda, Meng Xu, lichong659,
	dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 1395 bytes --]

On mar, 2014-07-29 at 07:52 +0100, Jan Beulich wrote:
> >>> On 29.07.14 at 03:52, <mengxu@cis.upenn.edu> wrote:

> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
> >  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
> >  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
> >  
> > +/*
> > + * This structure is used to pass to rt scheduler from a 
> > + * privileged domain to Xen
> > + */
> > +struct xen_domctl_sched_rt_params {
> > +    /* get vcpus' info */
> > +    int64_t period; /* s_time_t type */
> > +    int64_t budget;
> > +    int     index;
> 
> Are all these really meaningfully signed quantities?
> 
The scheduler internal variable where the actual budget is kept updated
needs, IMO, to be signed (s_time_t is the best choice). However, as far
as the interface is concerned, I think Jan is right, being able to
specify a negative budget and period is not that useful! :-P

I think you can go for unsigned quantities here, while s_time_t is
certainly ok inside the scheduler.
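
I.e., roughly (sketch):

    /* public interface (domctl): negative values are meaningless here */
    uint64_t period;
    uint64_t budget;

    /* scheduler internals: signed, since budget accounting may
       transiently go negative before being clamped */
    s_time_t cur_budget;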

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
  2014-07-29  6:52   ` Jan Beulich
@ 2014-07-29 10:36   ` Andrew Cooper
  2014-08-02 15:31     ` Meng Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2014-07-29 10:36 UTC (permalink / raw)
  To: Meng Xu, xen-devel
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xumengpanda, JBeulich, lichong659, dgolomb

On 29/07/14 02:52, Meng Xu wrote:
> This scheduler follows the pre-emptive Global EDF theory in real-time field.
> Each VCPU can have a dedicated period and budget.
> While scheduled, a VCPU burns its budget.
> A VCPU has its budget replenished at the beginning of each of its periods;
> The VCPU discards its unused budget at the end of each of its periods.
> If a VCPU runs out of budget in a period, it has to wait until next period.
> The mechanism of how to burn a VCPU's budget depends on the server mechanism
> implemented for each VCPU.
>
> Server mechanism: a VCPU is implemented as a deferable server.
> When a VCPU is scheduled to execute on a PCPU, its budget is continuously
> burned.
>
> Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> At any scheduling point, the VCPU with earliest deadline has highest
> priority.
>
> Queue scheme: A global Runqueue for each CPU pool.
> The Runqueue holds all runnable VCPUs.
> VCPUs in the Runqueue are divided into two parts: with and without budget.
> At each part, VCPUs are sorted based on gEDF priority scheme.
>
> Scheduling quantum: 1 ms;
>
> Note: cpumask and cpupool is supported.
>
> This is still in the development phase.
>
> Signed-off-by: Sisu Xi <xisisu@gmail.com>
> Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> ---
>  xen/common/Makefile         |    1 +
>  xen/common/sched_rt.c       | 1058 +++++++++++++++++++++++++++++++++++++++++++
>  xen/common/schedule.c       |    4 +-
>  xen/include/public/domctl.h |   28 +-
>  xen/include/xen/sched-if.h  |    1 +
>  5 files changed, 1089 insertions(+), 3 deletions(-)
>  create mode 100644 xen/common/sched_rt.c
>
> diff --git a/xen/common/Makefile b/xen/common/Makefile
> index 3683ae3..5a23aa4 100644
> --- a/xen/common/Makefile
> +++ b/xen/common/Makefile
> @@ -26,6 +26,7 @@ obj-y += sched_credit.o
>  obj-y += sched_credit2.o
>  obj-y += sched_sedf.o
>  obj-y += sched_arinc653.o
> +obj-y += sched_rt.o
>  obj-y += schedule.o
>  obj-y += shutdown.o
>  obj-y += softirq.o
> diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> new file mode 100644
> index 0000000..6cfbb8a
> --- /dev/null
> +++ b/xen/common/sched_rt.c
> @@ -0,0 +1,1058 @@
> +/******************************************************************************
> + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
> + * EDF scheduling is one of the most popular real-time scheduling algorithms
> + * used in the embedded field.
> + *
> + * by Sisu Xi, 2013, Washington University in Saint Louis
> + * and Meng Xu, 2014, University of Pennsylvania
> + *
> + * based on the code of credit Scheduler
> + */
> +
> +#include <xen/config.h>
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +#include <xen/sched.h>
> +#include <xen/domain.h>
> +#include <xen/delay.h>
> +#include <xen/event.h>
> +#include <xen/time.h>
> +#include <xen/perfc.h>
> +#include <xen/sched-if.h>
> +#include <xen/softirq.h>
> +#include <asm/atomic.h>
> +#include <xen/errno.h>
> +#include <xen/trace.h>
> +#include <xen/cpu.h>
> +#include <xen/keyhandler.h>
> +#include <xen/trace.h>
> +#include <xen/guest_access.h>
> +
> +/*
> + * TODO:
> + *
> + * Migration compensation and resist like credit2 to better use cache;
> + * Lock Holder Problem, using yield?
> + * Self switch problem: VCPUs of the same domain may preempt each other;
> + */
> +
> +/*
> + * Design:
> + *
> + * This scheduler follows the Preemptive Global EDF theory in real-time field.
> + * Each VCPU can have a dedicated period and budget. 
> + * While scheduled, a VCPU burns its budget.
> + * A VCPU has its budget replenished at the beginning of each of its periods;
> + * The VCPU discards its unused budget at the end of each of its periods.
> + * If a VCPU runs out of budget in a period, it has to wait until next period.
> + * The mechanism of how to burn a VCPU's budget depends on the server mechanism
> + * implemented for each VCPU.
> + *
> + * Server mechanism: a VCPU is implemented as a deferable server.
> + * When a VCPU has a task running on it, its budget is continuously burned;
> + * When a VCPU has no task but with budget left, its budget is preserved.
> + *
> + * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> + * At any scheduling point, the VCPU with earliest deadline has highest priority.
> + *
> + * Queue scheme: A global runqueue for each CPU pool. 
> + * The runqueue holds all runnable VCPUs. 
> + * VCPUs in the runqueue are divided into two parts: with and without remaining budget. 
> + * At each part, VCPUs are sorted based on EDF priority scheme.
> + *
> + * Scheduling quantum: 1 ms; but budget accounting is done at microsecond granularity.
> + *
> + * Note: cpumask and cpupool are supported.
> + */
> +
> +/*
> + * Locking:
> + * Just like credit2, a global system lock is used to protect the RunQ.
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + *
> + * The lock is already grabbed when calling wake/sleep/schedule/ functions in schedule.c
> + *
> + * The functions that involve the RunQ and need to grab locks are:
> + *    dump, vcpu_insert, vcpu_remove, context_saved,
> + */
> +
> +
> +/*
> + * Default parameters: the default period and budget are 10 and 4 ms, respectively
> + */
> +#define RT_DEFAULT_PERIOD     (MICROSECS(10))
> +#define RT_DEFAULT_BUDGET     (MICROSECS(4))
> +
> +/*
> + * Useful macros
> + */
> +#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
> +#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
> +#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
> +#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)

I know you are copying the prevailing style, but these macros are nasty.

They would be perfectly fine as static inlines with some real types...

static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
{
    return ops->sched_data;
}

... which allow for rather more useful compiler errors in the case that
they get accidentally misused.
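
The same pattern applies to the other macros, e.g. (sketch):

    static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
    {
        return vcpu->sched_priv;
    }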

> +
> +/*
> + * Flags
> + */
> +/* RT_scheduled: Is this vcpu either running on, or context-switching off,
> + * a physical cpu?
> + * + Accessed only with Runqueue lock held.
> + * + Set when chosen as next in rt_schedule().
> + * + Cleared after context switch has been saved in rt_context_saved()
> + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if we should
> + *   set RT_delayed_runq_add
> + * + Checked to be false in runq_insert.
> + */
> +#define __RT_scheduled            1
> +#define RT_scheduled (1<<__RT_scheduled)
> +/* RT_delayed_runq_add: Do we need to add this to the Runqueue once it's done
> + * being context switching out?
> + * + Set when scheduling out in rt_schedule() if prev is runnable
> + * + Set in rt_vcpu_wake if it finds RT_scheduled set
> + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue and
> + *   clears the bit.
> + *
> + */
> +#define __RT_delayed_runq_add     2
> +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
> +
> +/*
> + * Debug only. Used to printout debug information
> + */
> +#define printtime()\
> +        ({s_time_t now = NOW(); \
> +          printk("%u : %3ld.%3ldus : %-19s ",\
> +          smp_processor_id(), now/MICROSECS(1), now%MICROSECS(1)/1000, __func__);} )
> +
> +/*
> + * System-wide private data, including a global RunQueue
> + * The global lock is referenced by schedule_data.schedule_lock from all physical cpus.
> + * It can be grabbed via vcpu_schedule_lock_irq()
> + */
> +struct rt_private {
> +    spinlock_t lock;        /* The global coarse-grained lock */
> +    struct list_head sdom;  /* list of available domains, used for dump */
> +    struct list_head runq;  /* Ordered list of runnable VMs */
> +    cpumask_t cpus;         /* cpumask_t of available physical cpus */
> +    cpumask_t tickled;      /* another cpu in the queue already ticked this one */
> +};
> +
> +/*
> + * Virtual CPU
> + */
> +struct rt_vcpu {
> +    struct list_head runq_elem; /* On the runqueue list */
> +    struct list_head sdom_elem; /* On the domain VCPU list */
> +
> +    /* Up-pointers */
> +    struct rt_dom *sdom;
> +    struct vcpu *vcpu;
> +
> +    /* VCPU parameters, in milliseconds */
> +    s_time_t period;
> +    s_time_t budget;
> +
> +    /* VCPU current information in nanoseconds */
> +    long cur_budget;             /* current budget */
> +    s_time_t last_start;        /* last start time */
> +    s_time_t cur_deadline;      /* current deadline for EDF */
> +
> +    unsigned flags;             /* mark __RT_scheduled, etc.. */
> +};
> +
> +/*
> + * Domain
> + */
> +struct rt_dom {
> +    struct list_head vcpu;      /* link its VCPUs */
> +    struct list_head sdom_elem; /* link list on rt_priv */
> +    struct domain *dom;         /* pointer to upper domain */
> +};
> +
> +/*
> + * RunQueue helper functions
> + */
> +static int
> +__vcpu_on_runq(struct rt_vcpu *svc)
> +{
> +   return !list_empty(&svc->runq_elem);
> +}
> +
> +static struct rt_vcpu *
> +__runq_elem(struct list_head *elem)
> +{
> +    return list_entry(elem, struct rt_vcpu, runq_elem);
> +}
> +
> +/*
> + * Debug related code, dump vcpu/cpu information
> + */
> +static void
> +rt_dump_vcpu(struct rt_vcpu *svc)

const struct rt_vcpu *svc, for added safety.  (In the past we have had a
dump function which accidentally clobbered some of the state it was
supposed to be reading)

> +{
> +    char *cpustr = keyhandler_scratch;

Xen style - newline between declarations and code.

The keyhandler_scratch is only safe to use in keyhandlers, yet your dump
functions are based on scheduling operations.  You risk concurrent
access with other dump functions and with keyhandlers.

> +    if ( svc == NULL ) 
> +    {
> +        printk("NULL!\n");
> +        return;
> +    }

This is quite a useless message on its own.  Is it reasonable for svc to
ever be NULL here?

> +    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);

Buggy sizeof.  sizeof(cpustr) is the size of a pointer, where I suspect you
mean sizeof(keyhandler_scratch), which is 1024.
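
A local buffer sidesteps both issues (sketch):

    char cpustr[1024];  /* private to this function, no clash with keyhandlers */

    cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);

where sizeof(cpustr) now really is the buffer size.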

> +    printk("[%5d.%-2d] cpu %d, (%"PRId64", %"PRId64"), cur_b=%"PRId64" cur_d=%"PRId64" last_start=%"PRId64" onR=%d runnable=%d cpu_hard_affinity=%s ",
> +            svc->vcpu->domain->domain_id,
> +            svc->vcpu->vcpu_id,
> +            svc->vcpu->processor,

vcpu_id and processor are both unsigned quantities, so should be
formatted with %u rather than %d.

> +            svc->period,
> +            svc->budget,
> +            svc->cur_budget,
> +            svc->cur_deadline,
> +            svc->last_start,
> +            __vcpu_on_runq(svc),
> +            vcpu_runnable(svc->vcpu),
> +            cpustr);
> +    memset(cpustr, 0, sizeof(char)*1024);

This memset is quite pointless, and buggy if keyhandler_scratch changes
length.  cpumask_scnprintf() properly NUL-terminates its string.

> +    cpumask_scnprintf(cpustr, sizeof(cpustr), cpupool_scheduler_cpumask(svc->vcpu->domain->cpupool));

Same buggy sizeof.

> +    printk("cpupool=%s\n", cpustr);
> +}
> +
> +static void
> +rt_dump_pcpu(const struct scheduler *ops, int cpu)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(curr_on_cpu(cpu));
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +}
> +
> +/*
> + * should not need a lock here; only showing stuff
> + */
> +static void
> +rt_dump(const struct scheduler *ops)
> +{
> +    struct list_head *iter_sdom, *iter_svc, *runq, *iter;
> +    struct rt_private *prv = RT_PRIV(ops);
> +    struct rt_vcpu *svc;
> +    int cpu = 0;
> +    int loop = 0;

Both of these should be unsigned.

> +
> +    printtime();
> +    printk("Priority Scheme: EDF\n");
> +
> +    printk("PCPU info: \n");

Please remove these trailing spaces before newlines.  All they do is
waste space in the console ring.

> +    for_each_cpu(cpu, &prv->cpus) 
> +        rt_dump_pcpu(ops, cpu);
> +
> +    printk("Global RunQueue info: \n");
> +    loop = 0;
> +    runq = RUNQ(ops);
> +    list_for_each( iter, runq ) 
> +    {
> +        svc = __runq_elem(iter);
> +        printk("\t%3d: ", ++loop);
> +        rt_dump_vcpu(svc);
> +    }
> +
> +    printk("Domain info: \n");
> +    loop = 0;
> +    list_for_each( iter_sdom, &prv->sdom ) 
> +    {
> +        struct rt_dom *sdom;
> +        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
> +        printk("\tdomain: %d\n", sdom->dom->domain_id);
> +
> +        list_for_each( iter_svc, &sdom->vcpu ) 
> +        {
> +            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
> +            printk("\t\t%3d: ", ++loop);
> +            rt_dump_vcpu(svc);
> +        }
> +    }
> +
> +    printk("\n");
> +}
> +
> +static inline void
> +__runq_remove(struct rt_vcpu *svc)
> +{
> +    if ( __vcpu_on_runq(svc) )
> +        list_del_init(&svc->runq_elem);
> +}
> +
> +/*
> + * Insert a vcpu in the RunQ based on vcpus' deadline: 
> + * EDF schedule policy: vcpu with smaller deadline has higher priority;
> + * The vcpu svc will be inserted just before the very first
> + * vcpu iter_svc in the Runqueue whose deadline is equal to or larger than
> + * svc's deadline.
> + */
> +static void
> +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    spinlock_t *schedule_lock;
> +    
> +    schedule_lock = per_cpu(schedule_data, svc->vcpu->processor).schedule_lock;
> +    ASSERT( spin_is_locked(schedule_lock) );
> +    
> +    /* Debug only */
> +    if ( __vcpu_on_runq(svc) )
> +    {
> +        rt_dump(ops);
> +    }
> +    ASSERT( !__vcpu_on_runq(svc) );
> +
> +    list_for_each(iter, runq) 
> +    {
> +        struct rt_vcpu * iter_svc = __runq_elem(iter);
> +
> +        /* svc still has budget */
> +        if ( svc->cur_budget > 0 ) 
> +        {
> +            if ( iter_svc->cur_budget == 0 ||
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        } 
> +        else 
> +        { /* svc has no budget */
> +            if ( iter_svc->cur_budget == 0 &&
> +                 svc->cur_deadline <= iter_svc->cur_deadline )
> +                    break;
> +        }
> +    }
> +
> +    list_add_tail(&svc->runq_elem, iter);
> +}
> +
> +/*
> + * Init/Free related code
> + */
> +static int
> +rt_init(struct scheduler *ops)
> +{
> +    struct rt_private *prv = xzalloc(struct rt_private);
> +
> +    if ( prv == NULL )
> +        return -ENOMEM;

Newline in here.

> +    ops->sched_data = prv;

Is it safe to set ops->sched_data with a half constructed rt_private?  I
suspect this wants to be the very last (non-debug) action in this function.
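
I.e. (sketch, debug printing left out):

    static int
    rt_init(struct scheduler *ops)
    {
        struct rt_private *prv = xzalloc(struct rt_private);

        if ( prv == NULL )
            return -ENOMEM;

        spin_lock_init(&prv->lock);
        INIT_LIST_HEAD(&prv->sdom);
        INIT_LIST_HEAD(&prv->runq);

        /* Publish only once fully constructed. */
        ops->sched_data = prv;

        return 0;
    }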

> +    spin_lock_init(&prv->lock);
> +    INIT_LIST_HEAD(&prv->sdom);
> +    INIT_LIST_HEAD(&prv->runq);
> +
> +    printtime();
> +    printk("\n");
> +
> +    return 0;
> +}
> +
> +static void
> +rt_deinit(const struct scheduler *ops)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("\n");
> +    xfree(prv);
> +}
> +
> +/* 
> + * point per_cpu spinlock to the global system lock; all cpus share the same global system lock
> + */
> +static void *
> +rt_alloc_pdata(const struct scheduler *ops, int cpu)
> +{
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    cpumask_set_cpu(cpu, &prv->cpus);
> +
> +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> +
> +    printtime();
> +    printk("%s total cpus: %d", __FUNCTION__, cpumask_weight(&prv->cpus));

__FUNCTION__ is a gccism.  __func__ is a standard way of doing the same.

> +    /* same as credit2, not a bogus pointer */
> +    return (void *)1;
> +}
> +
> +static void
> +rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    cpumask_clear_cpu(cpu, &prv->cpus);
> +    printtime();
> +    printk("%s cpu=%d\n", __FUNCTION__, cpu);
> +}
> +
> +static void *
> +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    sdom = xzalloc(struct rt_dom);
> +    if ( sdom == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&sdom->vcpu);
> +    INIT_LIST_HEAD(&sdom->sdom_elem);
> +    sdom->dom = dom;
> +
> +    /* spinlock here to insert the dom */
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +
> +    return (void *)sdom;

Bogus cast.
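
I.e. simply

    return sdom;

as the conversion from an object pointer to void * is implicit in C.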

> +}
> +
> +static void
> +rt_free_domdata(const struct scheduler *ops, void *data)
> +{
> +    unsigned long flags;
> +    struct rt_dom *sdom = data;
> +    struct rt_private *prv = RT_PRIV(ops);
> +
> +    printtime();
> +    printk("dom=%d\n", sdom->dom->domain_id);
> +
> +    spin_lock_irqsave(&prv->lock, flags);
> +    list_del_init(&sdom->sdom_elem);
> +    spin_unlock_irqrestore(&prv->lock, flags);
> +    xfree(data);
> +}
> +
> +static int
> +rt_dom_init(const struct scheduler *ops, struct domain *dom)
> +{
> +    struct rt_dom *sdom;
> +
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    /* IDLE Domain does not link on rt_private */
> +    if ( is_idle_domain(dom) ) 
> +        return 0;
> +
> +    sdom = rt_alloc_domdata(ops, dom);
> +    if ( sdom == NULL ) 
> +    {
> +        printk("%s, failed\n", __func__);
> +        return -ENOMEM;
> +    }
> +    dom->sched_priv = sdom;
> +
> +    return 0;
> +}
> +
> +static void
> +rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
> +{
> +    printtime();
> +    printk("dom=%d\n", dom->domain_id);
> +
> +    rt_free_domdata(ops, RT_DOM(dom));
> +}
> +
> +static void *
> +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> +{
> +    struct rt_vcpu *svc;
> +    s_time_t now = NOW();
> +    long count;
> +
> +    /* Allocate per-VCPU info */
> +    svc = xzalloc(struct rt_vcpu);
> +    if ( svc == NULL ) 
> +    {
> +        printk("%s, xzalloc failed\n", __func__);
> +        return NULL;
> +    }
> +
> +    INIT_LIST_HEAD(&svc->runq_elem);
> +    INIT_LIST_HEAD(&svc->sdom_elem);
> +    svc->flags = 0U;
> +    svc->sdom = dd;
> +    svc->vcpu = vc;
> +    svc->last_start = 0;            /* init last_start is 0 */
> +
> +    svc->period = RT_DEFAULT_PERIOD;
> +    if ( !is_idle_vcpu(vc) )
> +        svc->budget = RT_DEFAULT_BUDGET;
> +
> +    count = (now/MICROSECS(svc->period)) + 1;
> +    /* sync all VCPU's start time to 0 */
> +    svc->cur_deadline += count * MICROSECS(svc->period);
> +
> +    svc->cur_budget = svc->budget*1000; /* counting in microseconds level */
> +    /* Debug only: dump new vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    return svc;
> +}
> +
> +static void
> +rt_free_vdata(const struct scheduler *ops, void *priv)
> +{
> +    struct rt_vcpu *svc = priv;
> +
> +    /* Debug only: dump freed vcpu's info */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +    xfree(svc);
> +}
> +
> +/*
> + * TODO: Do we need to add vc to the new Runqueue?
> + * This function is called in sched_move_domain() in schedule.c
> + * When move a domain to a new cpupool, 
> + * may have to add vc to the Runqueue of the new cpupool
> + */
> +static void
> +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu *svc = RT_VCPU(vc);
> +
> +    /* Debug only: dump info of vcpu to insert */
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    /* do not add idle vcpus to the dom vcpu list */
> +    if ( is_idle_vcpu(vc) )
> +        return;
> +
> +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom vcpu list */
> +}
> +
> +/*
> + * TODO: same as rt_vcpu_insert()
> + */
> +static void
> +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    struct rt_dom * const sdom = svc->sdom;
> +
> +    printtime();
> +    rt_dump_vcpu(svc);
> +
> +    BUG_ON( sdom == NULL );
> +    BUG_ON( __vcpu_on_runq(svc) );
> +
> +    if ( !is_idle_vcpu(vc) ) 
> +        list_del_init(&svc->sdom_elem);
> +}
> +
> +/* 
> + * Pick a valid CPU for the vcpu vc
> + * A vcpu's valid CPUs are the intersection of its affinity and the available cpus
> + */
> +static int
> +rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    cpumask_t cpus;
> +    cpumask_t *online;
> +    int cpu;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> +    cpumask_and(&cpus, &prv->cpus, online);
> +    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
> +
> +    cpu = cpumask_test_cpu(vc->processor, &cpus)
> +            ? vc->processor 
> +            : cpumask_cycle(vc->processor, &cpus);
> +    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +
> +    return cpu;
> +}
> +
> +/*
> + * Burn budget at microsecond level. 
> + */
> +static void
> +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t now) 
> +{
> +    s_time_t delta;
> +    long count = 0;
> +
> +    /* don't burn budget for idle VCPU */
> +    if ( is_idle_vcpu(svc->vcpu) ) 
> +    {
> +        return;
> +    }
> +
> +    /* first time called for this svc, update last_start */
> +    if ( svc->last_start == 0 ) 
> +    {
> +        svc->last_start = now;
> +        return;
> +    }
> +
> +    /*
> +     * update deadline info: When deadline is in the past,
> +     * it need to be updated to the deadline of the current period,
> +     * and replenish the budget 
> +     */
> +    delta = now - svc->cur_deadline;
> +    if ( delta >= 0 ) 
> +    {
> +        count = ( delta/MICROSECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MICROSECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +        return;
> +    }
> +
> +    /* burn at nanoseconds level */
> +    delta = now - svc->last_start;
> +    /* 
> +     * delta < 0 only happens in nested virtualization;
> +     * TODO: how should we handle delta < 0 in a better way? */
> +    if ( delta < 0 ) 
> +    {
> +        printk("%s, ATTENTION: now is behind last_start! delta = %ld for ",
> +                __func__, delta);
> +        rt_dump_vcpu(svc);
> +        svc->last_start = now;  /* update last_start */
> +        svc->cur_budget = 0;   /* FIXME: should we recover like this? */
> +        return;
> +    }
> +
> +    if ( svc->cur_budget == 0 ) 
> +        return;
> +
> +    svc->cur_budget -= delta;
> +    if ( svc->cur_budget < 0 ) 
> +        svc->cur_budget = 0;
> +}
> +
> +/* 
> + * RunQ is sorted. Pick the first vcpu within the cpumask; if none, return NULL
> + * lock is grabbed before calling this function 
> + */
> +static struct rt_vcpu *
> +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct rt_vcpu *svc = NULL;
> +    struct rt_vcpu *iter_svc = NULL;
> +    cpumask_t cpu_common;
> +    cpumask_t *online;
> +    struct rt_private * prv = RT_PRIV(ops);
> +
> +    list_for_each(iter, runq) 
> +    {
> +        iter_svc = __runq_elem(iter);
> +
> +        /* mask is intersection of cpu_hard_affinity and cpupool and priv->cpus */
> +        online = cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> +        cpumask_and(&cpu_common, online, &prv->cpus);
> +        cpumask_and(&cpu_common, &cpu_common, iter_svc->vcpu->cpu_hard_affinity);
> +        cpumask_and(&cpu_common, &mask, &cpu_common);
> +        if ( cpumask_empty(&cpu_common) )
> +            continue;
> +
> +        if ( iter_svc->cur_budget <= 0 )
> +            continue;
> +
> +        svc = iter_svc;
> +        break;
> +    }
> +
> +    return svc;
> +}
> +
> +/*
> + * Update vcpus' budgets and keep the runq sorted by re-inserting modified vcpus
> + * lock is grabbed before calling this function 
> + */
> +static void
> +__repl_update(const struct scheduler *ops, s_time_t now)
> +{
> +    struct list_head *runq = RUNQ(ops);
> +    struct list_head *iter;
> +    struct list_head *tmp;
> +    struct rt_vcpu *svc = NULL;
> +
> +    s_time_t diff;
> +    long count;
> +
> +    list_for_each_safe(iter, tmp, runq) 
> +    {
> +        svc = __runq_elem(iter);
> +
> +        diff = now - svc->cur_deadline;
> +        if ( diff > 0 ) 
> +        {
> +            count = (diff/MICROSECS(svc->period)) + 1;
> +            svc->cur_deadline += count * MICROSECS(svc->period);
> +            svc->cur_budget = svc->budget * 1000;
> +            __runq_remove(svc);
> +            __runq_insert(ops, svc);
> +        }
> +    }
> +}
> +
> +/* 
> + * schedule function for rt scheduler.
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static struct task_slice
> +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t tasklet_work_scheduled)
> +{
> +    const int cpu = smp_processor_id();
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * const scurr = RT_VCPU(current);
> +    struct rt_vcpu * snext = NULL;
> +    struct task_slice ret = { .migrated = 0 };
> +
> +    /* clear ticked bit now that we've been scheduled */
> +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> +        cpumask_clear_cpu(cpu, &prv->tickled);
> +
> +    /* burn_budget would return for IDLE VCPU */
> +    burn_budgets(ops, scurr, now);
> +
> +    __repl_update(ops, now);
> +
> +    if ( tasklet_work_scheduled ) 
> +    {
> +        snext = RT_VCPU(idle_vcpu[cpu]);
> +    } 
> +    else 
> +    {
> +        cpumask_t cur_cpu;
> +        cpumask_clear(&cur_cpu);
> +        cpumask_set_cpu(cpu, &cur_cpu);
> +        snext = __runq_pick(ops, cur_cpu);
> +        if ( snext == NULL )
> +            snext = RT_VCPU(idle_vcpu[cpu]);
> +
> +        /* if scurr has higher priority and budget, still pick scurr */
> +        if ( !is_idle_vcpu(current) &&
> +             vcpu_runnable(current) &&
> +             scurr->cur_budget > 0 &&
> +             ( is_idle_vcpu(snext->vcpu) ||
> +               scurr->cur_deadline <= snext->cur_deadline ) ) 
> +            snext = scurr;
> +    }
> +
> +    if ( snext != scurr &&
> +         !is_idle_vcpu(current) &&
> +         vcpu_runnable(current) )
> +        set_bit(__RT_delayed_runq_add, &scurr->flags);
> +    
> +
> +    snext->last_start = now;
> +    if ( !is_idle_vcpu(snext->vcpu) ) 
> +    {
> +        if ( snext != scurr ) 
> +        {
> +            __runq_remove(snext);
> +            set_bit(__RT_scheduled, &snext->flags);
> +        }
> +        if ( snext->vcpu->processor != cpu ) 
> +        {
> +            snext->vcpu->processor = cpu;
> +            ret.migrated = 1;
> +        }
> +    }
> +
> +    ret.time = MILLISECS(1); /* sched quantum */
> +    ret.task = snext->vcpu;
> +
> +    return ret;
> +}
> +
> +/*
> + * Remove VCPU from RunQ
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + */
> +static void
> +rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( curr_on_cpu(vc->processor) == vc ) 
> +    {
> +        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    if ( __vcpu_on_runq(svc) ) 
> +    {
> +        __runq_remove(svc);
> +        printk("%s: vcpu should not on runq in vcpu_sleep()\n", __FUNCTION__);
> +        BUG();
> +    }
> +
> +    clear_bit(__RT_delayed_runq_add, &svc->flags);
> +}
> +
> +/*
> + * Pick a vcpu on a cpu to kick out, to make room for the running candidate
> + * Called by wake() and context_saved()
> + * We have a running candidate here, the kick logic is:
> + * Among all the cpus that are within the cpu affinity
> + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> + * 2) if there are any idle vcpu, kick it.
> + * 3) now all pcpus are busy, among all the running vcpus, pick lowest priority one
> + *    if snext has higher priority, kick it.
> + *
> + * TODO:
> + * 1) what if these two vcpus belong to the same domain?
> + *    replacing a vcpu of the same domain introduces more overhead
> + *
> + * lock is grabbed before calling this function 
> + */
> +static void
> +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> +{
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * latest_deadline_vcpu = NULL;    /* lowest priority scheduled */
> +    struct rt_vcpu * iter_svc;
> +    struct vcpu * iter_vc;
> +    int cpu = 0;
> +    cpumask_t not_tickled;
> +    cpumask_t *online;
> +
> +    if ( new == NULL || is_idle_vcpu(new->vcpu) ) 
> +        return;
> +
> +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> +    cpumask_and(&not_tickled, online, &prv->cpus);
> +    cpumask_and(&not_tickled, &not_tickled, new->vcpu->cpu_hard_affinity);
> +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> +
> +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) ) 
> +    {
> +        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
> +        return;
> +    }
> +
> +    /* 2) if there are any idle pcpu, kick it */
> +    /* The same loop also find the one with lowest priority */
> +    for_each_cpu(cpu, &not_tickled) 
> +    {
> +        iter_vc = curr_on_cpu(cpu);
> +        if ( is_idle_vcpu(iter_vc) ) 
> +        {
> +            cpumask_set_cpu(cpu, &prv->tickled);
> +            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> +            return;
> +        }
> +        iter_svc = RT_VCPU(iter_vc);
> +        if ( latest_deadline_vcpu == NULL || 
> +             iter_svc->cur_deadline > latest_deadline_vcpu->cur_deadline )
> +            latest_deadline_vcpu = iter_svc;
> +    }
> +
> +    /* 3) candidate has higher priority, kick out the lowest priority vcpu */
> +    if ( latest_deadline_vcpu != NULL && new->cur_deadline < latest_deadline_vcpu->cur_deadline ) 
> +    {
> +        cpumask_set_cpu(latest_deadline_vcpu->vcpu->processor, &prv->tickled);
> +        cpu_raise_softirq(latest_deadline_vcpu->vcpu->processor, SCHEDULE_SOFTIRQ);
> +    }
> +    return;
> +}
> +
> +/* 
> + * Should always wake up a runnable vcpu and put it back on the RunQ.
> + * Check priority to decide whether to raise a scheduling softirq
> + * The lock is already grabbed in schedule.c, no need to lock here 
> + * TODO: what if these two vcpus belong to the same domain?
> + */
> +static void
> +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * const svc = RT_VCPU(vc);
> +    s_time_t diff;
> +    s_time_t now = NOW();
> +    long count = 0;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> +
> +    BUG_ON( is_idle_vcpu(vc) );
> +
> +    if ( unlikely(curr_on_cpu(vc->processor) == vc) ) 
> +        return;
> +
> +    /* on RunQ, just update info is ok */
> +    if ( unlikely(__vcpu_on_runq(svc)) ) 
> +        return;
> +
> +    /* If context hasn't been saved for this vcpu yet, we can't put it on
> +     * the Runqueue. Instead, we set a flag so that it will be put on the Runqueue
> +     * after the context has been saved. */
> +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) ) 
> +    {
> +        set_bit(__RT_delayed_runq_add, &svc->flags);
> +        return;
> +    }
> +
> +    /* update deadline info */
> +    diff = now - svc->cur_deadline;
> +    if ( diff >= 0 ) 
> +    {
> +        count = ( diff/MICROSECS(svc->period) ) + 1;
> +        svc->cur_deadline += count * MICROSECS(svc->period);
> +        svc->cur_budget = svc->budget * 1000;
> +    }
> +
> +    __runq_insert(ops, svc);
> +    __repl_update(ops, now);
> +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL valid cpus */
> +    runq_tickle(ops, snext);
> +
> +    return;
> +}
> +
> +/* 
> + * scurr has finished context switch, insert it back to the RunQ,
> + * and then pick the highest priority vcpu from runq to run 
> + */
> +static void
> +rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> +{
> +    struct rt_vcpu * svc = RT_VCPU(vc);
> +    struct rt_vcpu * snext = NULL;
> +    struct rt_private * prv = RT_PRIV(ops);
> +    spinlock_t *lock = vcpu_schedule_lock_irq(vc);
> +
> +    clear_bit(__RT_scheduled, &svc->flags);
> +    /* do not insert idle vcpus into the runq */
> +    if ( is_idle_vcpu(vc) ) 
> +        goto out;
> +
> +    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) && 
> +         likely(vcpu_runnable(vc)) ) 
> +    {
> +        __runq_insert(ops, svc);
> +        __repl_update(ops, NOW());
> +        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL cpus */
> +        runq_tickle(ops, snext);
> +    }
> +out:
> +    vcpu_schedule_unlock_irq(lock, vc);
> +}
> +
> +/*
> + * set/get each vcpu info of each domain
> + */
> +static int
> +rt_dom_cntl(
> +    const struct scheduler *ops, 
> +    struct domain *d, 
> +    struct xen_domctl_scheduler_op *op)
> +{
> +    xen_domctl_sched_rt_params_t *local_sched;
> +    struct rt_dom * const sdom = RT_DOM(d);
> +    struct list_head *iter;
> +    int vcpu_index = 0;
> +    int rc = -EINVAL;
> +
> +    switch ( op->cmd )
> +    {
> +    case XEN_DOMCTL_SCHEDOP_getnumvcpus:
> +        op->u.rt.nr_vcpus = 0;
> +        list_for_each( iter, &sdom->vcpu ) 
> +            vcpu_index++;
> +        op->u.rt.nr_vcpus = vcpu_index;
> +        rc = 0;
> +        break;
> +    case XEN_DOMCTL_SCHEDOP_getinfo:
> +        /* for debug use, whenever adjust Dom0 parameter, do global dump */
> +        if ( d->domain_id == 0 ) 
> +            rt_dump(ops);
> +
> +        op->u.rt.nr_vcpus = 0;
> +        list_for_each( iter, &sdom->vcpu ) 
> +            vcpu_index++;
> +        op->u.rt.nr_vcpus = vcpu_index;
> +        local_sched = xzalloc_array(xen_domctl_sched_rt_params_t, vcpu_index);
> +        vcpu_index = 0;
> +        list_for_each( iter, &sdom->vcpu ) 
> +        {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            local_sched[vcpu_index].budget = svc->budget;
> +            local_sched[vcpu_index].period = svc->period;
> +            local_sched[vcpu_index].index = vcpu_index;
> +            vcpu_index++;
> +        }
> +        copy_to_guest(op->u.rt.vcpu, local_sched, vcpu_index);

This will clobber guest heap if vcpu_index is greater than the allocated
space.  You also unconditionally leak local_sched, but there is no need
for an allocation in the first place.
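
Copying each element as you walk the list avoids the allocation entirely,
and lets you bound the copy by the space the caller advertised
(untested sketch; error handling illustrative):

    vcpu_index = 0;
    list_for_each( iter, &sdom->vcpu )
    {
        struct rt_vcpu *svc = list_entry(iter, struct rt_vcpu, sdom_elem);
        xen_domctl_sched_rt_params_t local = {
            .period = svc->period,
            .budget = svc->budget,
            .index  = vcpu_index,
        };

        if ( vcpu_index >= op->u.rt.nr_vcpus )
            return -EINVAL;  /* guest buffer too small */
        if ( copy_to_guest_offset(op->u.rt.vcpu, vcpu_index, &local, 1) )
            return -EFAULT;

        vcpu_index++;
    }
    op->u.rt.nr_vcpus = vcpu_index;

Note this assumes op->u.rt.nr_vcpus holds, on input, the number of elements
the caller allocated; the hypercall interface would need to guarantee that.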

> +        rc = 0;
> +        break;
> +    case XEN_DOMCTL_SCHEDOP_putinfo:
> +        list_for_each( iter, &sdom->vcpu ) 
> +        {
> +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu, sdom_elem);
> +
> +            /* adjust per VCPU parameter */
> +            if ( op->u.rt.vcpu_index == svc->vcpu->vcpu_id ) 
> +            { 
> +                vcpu_index = op->u.rt.vcpu_index;
> +
> +                if ( vcpu_index < 0 ) 
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: vcpu_index=%d\n",
> +                            vcpu_index);
> +                else
> +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: "
> +                            "vcpu_index=%d, period=%"PRId64", budget=%"PRId64"\n",
> +                            vcpu_index, op->u.rt.period, op->u.rt.budget);
> +
> +                svc->period = op->u.rt.period;
> +                svc->budget = op->u.rt.budget;
> +
> +                break;
> +            }
> +        }
> +        rc = 0;
> +        break;
> +    }
> +
> +    return rc;
> +}
> +
> +static struct rt_private _rt_priv;
> +
> +const struct scheduler sched_rt_def = {
> +    .name           = "SMP RT Scheduler",
> +    .opt_name       = "rt",

Should these names reflect RT_DS as opposed to simply RT?

> +    .sched_id       = XEN_SCHEDULER_RT_DS,
> +    .sched_data     = &_rt_priv,
> +
> +    .dump_cpu_state = rt_dump_pcpu,
> +    .dump_settings  = rt_dump,
> +    .init           = rt_init,
> +    .deinit         = rt_deinit,
> +    .alloc_pdata    = rt_alloc_pdata,
> +    .free_pdata     = rt_free_pdata,
> +    .alloc_domdata  = rt_alloc_domdata,
> +    .free_domdata   = rt_free_domdata,
> +    .init_domain    = rt_dom_init,
> +    .destroy_domain = rt_dom_destroy,
> +    .alloc_vdata    = rt_alloc_vdata,
> +    .free_vdata     = rt_free_vdata,
> +    .insert_vcpu    = rt_vcpu_insert,
> +    .remove_vcpu    = rt_vcpu_remove,
> +
> +    .adjust         = rt_dom_cntl,
> +
> +    .pick_cpu       = rt_cpu_pick,
> +    .do_schedule    = rt_schedule,
> +    .sleep          = rt_vcpu_sleep,
> +    .wake           = rt_vcpu_wake,
> +    .context_saved  = rt_context_saved,
> +};
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index e9eb0bc..f2df400 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
>      &sched_sedf_def,
>      &sched_credit_def,
>      &sched_credit2_def,
> +    &sched_rt_def,
>      &sched_arinc653_def,
>  };
>  
> @@ -1092,7 +1093,8 @@ long sched_adjust(struct domain *d, struct xen_domctl_scheduler_op *op)
>  
>      if ( (op->sched_id != DOM2OP(d)->sched_id) ||
>           ((op->cmd != XEN_DOMCTL_SCHEDOP_putinfo) &&
> -          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo)) )
> +          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo) &&
> +          (op->cmd != XEN_DOMCTL_SCHEDOP_getnumvcpus)) )
>          return -EINVAL;
>  
>      /* NB: the pluggable scheduler code needs to take care
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> index 5b11bbf..8d4b973 100644
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
>  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
>  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>  
> +/*
> + * This structure is used to pass to rt scheduler from a 
> + * privileged domain to Xen
> + */
> +struct xen_domctl_sched_rt_params {
> +    /* get vcpus' info */
> +    int64_t period; /* s_time_t type */
> +    int64_t budget;
> +    int     index;

Index is clearly an unsigned quantity.  For alignment and compatibility,
uint64_t would make the most sense.  Alternatively, uint32_t and an
explicit uint32_t pad field.

~Andrew

> +};
> +typedef struct xen_domctl_sched_rt_params xen_domctl_sched_rt_params_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_sched_rt_params_t);
>  
>  /* XEN_DOMCTL_scheduler_op */
>  /* Scheduler types. */
> @@ -346,9 +358,12 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>  #define XEN_SCHEDULER_CREDIT   5
>  #define XEN_SCHEDULER_CREDIT2  6
>  #define XEN_SCHEDULER_ARINC653 7
> +#define XEN_SCHEDULER_RT_DS    8
> +
>  /* Set or get info? */
> -#define XEN_DOMCTL_SCHEDOP_putinfo 0
> -#define XEN_DOMCTL_SCHEDOP_getinfo 1
> +#define XEN_DOMCTL_SCHEDOP_putinfo      0
> +#define XEN_DOMCTL_SCHEDOP_getinfo      1
> +#define XEN_DOMCTL_SCHEDOP_getnumvcpus   2
>  struct xen_domctl_scheduler_op {
>      uint32_t sched_id;  /* XEN_SCHEDULER_* */
>      uint32_t cmd;       /* XEN_DOMCTL_SCHEDOP_* */
> @@ -367,6 +382,15 @@ struct xen_domctl_scheduler_op {
>          struct xen_domctl_sched_credit2 {
>              uint16_t weight;
>          } credit2;
> +        struct xen_domctl_sched_rt{
> +            /* get vcpus' params */
> +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) vcpu;
> +            uint16_t nr_vcpus;
> +            /* set one vcpu's params */
> +            uint16_t vcpu_index;
> +            int64_t period;
> +            int64_t budget;
> +        } rt;
>      } u;
>  };
>  typedef struct xen_domctl_scheduler_op xen_domctl_scheduler_op_t;
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> index 4164dff..bcbe234 100644
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -169,6 +169,7 @@ extern const struct scheduler sched_sedf_def;
>  extern const struct scheduler sched_credit_def;
>  extern const struct scheduler sched_credit2_def;
>  extern const struct scheduler sched_arinc653_def;
> +extern const struct scheduler sched_rt_def;
>  
>  
>  struct cpupool

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29  6:52   ` Jan Beulich
  2014-07-29  9:42     ` Dario Faggioli
@ 2014-07-30 15:55     ` Meng Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Meng Xu @ 2014-07-30 15:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: ian.campbell, xisisu, stefano.stabellini, george.dunlap,
	ian.jackson, xen-devel, Meng Xu, lichong659, dgolomb


[-- Attachment #1.1: Type: text/plain, Size: 2386 bytes --]

Hi Jan,

Jan Beulich <JBeulich@suse.com> wrote on Tuesday, 29 July 2014:

> >>> On 29.07.14 at 03:52, <mengxu@cis.upenn.edu> wrote:
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
> >      &sched_sedf_def,
> >      &sched_credit_def,
> >      &sched_credit2_def,
> > +    &sched_rt_def,
> >      &sched_arinc653_def,
> >  };
>
> Is the insertion as other than last item (as one would expect for a
> new addition) intentional? Not that I think this matters much, but
> I'm still curious.
>
>
It's not intentional. I can move it to the last item.


> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
> >  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
> >  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
> >
> > +/*
> > + * This structure is used to pass to rt scheduler from a
> > + * privileged domain to Xen
> > + */
> > +struct xen_domctl_sched_rt_params {
> > +    /* get vcpus' info */
> > +    int64_t period; /* s_time_t type */
> > +    int64_t budget;
> > +    int     index;
>
> Are all these really meaningfully signed quantities?
>
>
Dario explained this. I will change it to an unsigned 64-bit type. We need
signed values inside Xen, but not in the tool stack.



> Also, you need to pad the structure to a multiple of 8 bytes, or
> its layout will differ between 32- and 64-bit (tool stack) callers.


> > @@ -367,6 +382,15 @@ struct xen_domctl_scheduler_op {
> >          struct xen_domctl_sched_credit2 {
> >              uint16_t weight;
> >          } credit2;
> > +        struct xen_domctl_sched_rt{
> > +            /* get vcpus' params */
> > +            XEN_GUEST_HANDLE_64(xen_domctl_sched_rt_params_t) vcpu;
> > +            uint16_t nr_vcpus;
> > +            /* set one vcpu's params */
> > +            uint16_t vcpu_index;
> > +            int64_t period;
> > +            int64_t budget;
> > +        } rt;
>
> Mostly the same comments here, just that the padding here needs to
> go in the middle.
>

Roger. Will pad it.  Thank you very much for your comments and suggestions!

Meng


-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-07-29 10:36   ` [PATCH RFC " Andrew Cooper
@ 2014-08-02 15:31     ` Meng Xu
  2014-08-04 10:23       ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2014-08-02 15:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Jan Beulich, Chong Li,
	Dagaen Golomb


Hi Andrew,

Thank you very much for your comments!

2014-07-29 18:36 GMT+08:00 Andrew Cooper <andrew.cooper3@citrix.com>:

> On 29/07/14 02:52, Meng Xu wrote:
> > This scheduler follows the pre-emptive Global EDF theory in real-time
> field.
> > Each VCPU can have a dedicated period and budget.
> > While scheduled, a VCPU burns its budget.
> > A VCPU has its budget replenished at the beginning of each of its
> periods;
> > The VCPU discards its unused budget at the end of each of its periods.
> > If a VCPU runs out of budget in a period, it has to wait until next
> period.
> > The mechanism of how to burn a VCPU's budget depends on the server
> mechanism
> > implemented for each VCPU.
> >
> > Server mechanism: a VCPU is implemented as a deferable server.
> > When a VCPU is scheduled to execute on a PCPU, its budget is continuously
> > burned.
> >
> > Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> > At any scheduling point, the VCPU with earliest deadline has highest
> > priority.
> >
> > Queue scheme: A global Runqueue for each CPU pool.
> > The Runqueue holds all runnable VCPUs.
> > VCPUs in the Runqueue are divided into two parts: with and without
> budget.
> > At each part, VCPUs are sorted based on gEDF priority scheme.
> >
> > Scheduling quantum: 1 ms;
> >
> > Note: cpumask and cpupool is supported.
> >
> > This is still in the development phase.
> >
> > Signed-off-by: Sisu Xi <xisisu@gmail.com>
> > Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> > ---
> >  xen/common/Makefile         |    1 +
> >  xen/common/sched_rt.c       | 1058
> +++++++++++++++++++++++++++++++++++++++++++
> >  xen/common/schedule.c       |    4 +-
> >  xen/include/public/domctl.h |   28 +-
> >  xen/include/xen/sched-if.h  |    1 +
> >  5 files changed, 1089 insertions(+), 3 deletions(-)
> >  create mode 100644 xen/common/sched_rt.c
> >
> > diff --git a/xen/common/Makefile b/xen/common/Makefile
> > index 3683ae3..5a23aa4 100644
> > --- a/xen/common/Makefile
> > +++ b/xen/common/Makefile
> > @@ -26,6 +26,7 @@ obj-y += sched_credit.o
> >  obj-y += sched_credit2.o
> >  obj-y += sched_sedf.o
> >  obj-y += sched_arinc653.o
> > +obj-y += sched_rt.o
> >  obj-y += schedule.o
> >  obj-y += shutdown.o
> >  obj-y += softirq.o
> > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> > new file mode 100644
> > index 0000000..6cfbb8a
> > --- /dev/null
> > +++ b/xen/common/sched_rt.c
> > @@ -0,0 +1,1058 @@
> >
> +/******************************************************************************
> > + * Preemptive Global Earliest Deadline First  (EDF) scheduler for Xen
> > + * EDF scheduling is one of most popular real-time scheduling algorithm
> used in
> > + * embedded field.
> > + *
> > + * by Sisu Xi, 2013, Washington University in Saint Louis
> > + * and Meng Xu, 2014, University of Pennsylvania
> > + *
> > + * based on the code of credit Scheduler
> > + */
> > +
> > +#include <xen/config.h>
> > +#include <xen/init.h>
> > +#include <xen/lib.h>
> > +#include <xen/sched.h>
> > +#include <xen/domain.h>
> > +#include <xen/delay.h>
> > +#include <xen/event.h>
> > +#include <xen/time.h>
> > +#include <xen/perfc.h>
> > +#include <xen/sched-if.h>
> > +#include <xen/softirq.h>
> > +#include <asm/atomic.h>
> > +#include <xen/errno.h>
> > +#include <xen/trace.h>
> > +#include <xen/cpu.h>
> > +#include <xen/keyhandler.h>
> > +#include <xen/trace.h>
> > +#include <xen/guest_access.h>
> > +
> > +/*
> > + * TODO:
> > + *
> > + * Migration compensation and resist like credit2 to better use cache;
> > + * Lock Holder Problem, using yield?
> > + * Self switch problem: VCPUs of the same domain may preempt each other;
> > + */
> > +
> > +/*
> > + * Design:
> > + *
> > + * This scheduler follows the Preemptive Global EDF theory in real-time
> field.
> > + * Each VCPU can have a dedicated period and budget.
> > + * While scheduled, a VCPU burns its budget.
> > + * A VCPU has its budget replenished at the beginning of each of its
> periods;
> > + * The VCPU discards its unused budget at the end of each of its
> periods.
> > + * If a VCPU runs out of budget in a period, it has to wait until next
> period.
> > + * The mechanism of how to burn a VCPU's budget depends on the server
> mechanism
> > + * implemented for each VCPU.
> > + *
> > + * Server mechanism: a VCPU is implemented as a deferable server.
> > + * When a VCPU has a task running on it, its budget is continuously
> burned;
> > + * When a VCPU has no task but with budget left, its budget is
> preserved.
> > + *
> > + * Priority scheme: Preemptive Global Earliest Deadline First (gEDF).
> > + * At any scheduling point, the VCPU with earliest deadline has highest
> priority.
> > + *
> > + * Queue scheme: A global runqueue for each CPU pool.
> > + * The runqueue holds all runnable VCPUs.
> > + * VCPUs in the runqueue are divided into two parts: with and without
> remaining budget.
> > + * At each part, VCPUs are sorted based on EDF priority scheme.
> > + *
> > + * Scheduling quanta: 1 ms; but accounting the budget is in microsecond.
> > + *
> > + * Note: cpumask and cpupool is supported.
> > + */
> > +
> > +/*
> > + * Locking:
> > + * Just like credit2, a global system lock is used to protect the RunQ.
> > + * The global lock is referenced by schedule_data.schedule_lock from
> all physical cpus.
> > + *
> > + * The lock is already grabbed when calling wake/sleep/schedule/
> functions in schedule.c
> > + *
> > + * The functions involes RunQ and needs to grab locks are:
> > + *    dump, vcpu_insert, vcpu_remove, context_saved,
> > + */
> > +
> > +
> > +/*
> > + * Default parameters: Period and budget in default is 10 and 4 ms,
> respectively
> > + */
> > +#define RT_DEFAULT_PERIOD     (MICROSECS(10))
> > +#define RT_DEFAULT_BUDGET     (MICROSECS(4))
> > +
> > +/*
> > + * Useful macros
> > + */
> > +#define RT_PRIV(_ops)     ((struct rt_private *)((_ops)->sched_data))
> > +#define RT_VCPU(_vcpu)    ((struct rt_vcpu *)(_vcpu)->sched_priv)
> > +#define RT_DOM(_dom)      ((struct rt_dom *)(_dom)->sched_priv)
> > +#define RUNQ(_ops)              (&RT_PRIV(_ops)->runq)
>
> I know you are copying the prevailing style, but these macros are nasty.
>
> They would be perfectly fine as static inlines with some real types...
>
> static inline struct rt_private *RT_PRIV(const struct scheduler *ops)
> {
>     return ops->sched_data;
> }
>
> ... which allow for rather more useful compiler errors in the case that
> they get accidentally misused.
>

This is a good suggestion and I have modified it for the next version.

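For instance, the remaining helpers could follow the same pattern (a
sketch using the types from the patch):

    static inline struct rt_vcpu *RT_VCPU(const struct vcpu *vcpu)
    {
        return vcpu->sched_priv;
    }

    static inline struct rt_dom *RT_DOM(const struct domain *dom)
    {
        return dom->sched_priv;
    }

    static inline struct list_head *RUNQ(const struct scheduler *ops)
    {
        return &RT_PRIV(ops)->runq;
    }
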
>
> > +
> > +/*
> > + * Flags
> > + */
> > +/* RT_scheduled: Is this vcpu either running on, or context-switching
> off,
> > + * a phyiscal cpu?
> > + * + Accessed only with Runqueue lock held.
> > + * + Set when chosen as next in rt_schedule().
> > + * + Cleared after context switch has been saved in rt_context_saved()
> > + * + Checked in vcpu_wake to see if we can add to the Runqueue, or if
> we should
> > + *   set RT_delayed_runq_add
> > + * + Checked to be false in runq_insert.
> > + */
> > +#define __RT_scheduled            1
> > +#define RT_scheduled (1<<__RT_scheduled)
> > +/* RT_delayed_runq_add: Do we need to add this to the Runqueueu once
> it'd done
> > + * being context switching out?
> > + * + Set when scheduling out in rt_schedule() if prev is runable
> > + * + Set in rt_vcpu_wake if it finds RT_scheduled set
> > + * + Read in rt_context_saved(). If set, it adds prev to the Runqueue
> and
> > + *   clears the bit.
> > + *
> > + */
> > +#define __RT_delayed_runq_add     2
> > +#define RT_delayed_runq_add (1<<__RT_delayed_runq_add)
> > +
> > +/*
> > + * Debug only. Used to printout debug information
> > + */
> > +#define printtime()\
> > +        ({s_time_t now = NOW(); \
> > +          printk("%u : %3ld.%3ldus : %-19s ",\
> > +          smp_processor_id(), now/MICROSECS(1), now%MICROSECS(1)/1000,
> __func__);} )
> > +
> > +/*
> > + * Systme-wide private data, include a global RunQueue
> > + * The global lock is referenced by schedule_data.schedule_lock from
> all physical cpus.
> > + * It can be grabbed via vcpu_schedule_lock_irq()
> > + */
> > +struct rt_private {
> > +    spinlock_t lock;        /* The global coarse grand lock */
> > +    struct list_head sdom;  /* list of availalbe domains, used for dump
> */
> > +    struct list_head runq;  /* Ordered list of runnable VMs */
> > +    cpumask_t cpus;         /* cpumask_t of available physical cpus */
> > +    cpumask_t tickled;      /* another cpu in the queue already ticked
> this one */
> > +};
> > +
> > +/*
> > + * Virtual CPU
> > + */
> > +struct rt_vcpu {
> > +    struct list_head runq_elem; /* On the runqueue list */
> > +    struct list_head sdom_elem; /* On the domain VCPU list */
> > +
> > +    /* Up-pointers */
> > +    struct rt_dom *sdom;
> > +    struct vcpu *vcpu;
> > +
> > +    /* VCPU parameters, in milliseconds */
> > +    s_time_t period;
> > +    s_time_t budget;
> > +
> > +    /* VCPU current infomation in nanosecond */
> > +    long cur_budget;             /* current budget */
> > +    s_time_t last_start;        /* last start time */
> > +    s_time_t cur_deadline;      /* current deadline for EDF */
> > +
> > +    unsigned flags;             /* mark __RT_scheduled, etc.. */
> > +};
> > +
> > +/*
> > + * Domain
> > + */
> > +struct rt_dom {
> > +    struct list_head vcpu;      /* link its VCPUs */
> > +    struct list_head sdom_elem; /* link list on rt_priv */
> > +    struct domain *dom;         /* pointer to upper domain */
> > +};
> > +
> > +/*
> > + * RunQueue helper functions
> > + */
> > +static int
> > +__vcpu_on_runq(struct rt_vcpu *svc)
> > +{
> > +   return !list_empty(&svc->runq_elem);
> > +}
> > +
> > +static struct rt_vcpu *
> > +__runq_elem(struct list_head *elem)
> > +{
> > +    return list_entry(elem, struct rt_vcpu, runq_elem);
> > +}
> > +
> > +/*
> > + * Debug related code, dump vcpu/cpu information
> > + */
> > +static void
> > +rt_dump_vcpu(struct rt_vcpu *svc)
>
> const struct rc_vcpu *svc, for added safety.  (In the past we have had a
> dump function which accidentally clobbered some of the state it was
> supposed to be reading)
>
> > +{
> > +    char *cpustr = keyhandler_scratch;
>
> Xen style - newline between declarations and code.
>
> The keyhandler_scratch is only safe to use in keyhandlers, yet your dump
> functions are based on scheduling operations.  You risk concurrent
> access with other dump functions and with keyhandlers.
>

I see. I changed cpustr to a char[1024] array. This should solve the issue.



>
> > +    if ( svc == NULL )
> > +    {
> > +        printk("NULL!\n");
> > +        return;
> > +    }
>
> This is quite a useless message on its own.  Is it reasonable for svc to
> ever be NULL here?
>
>
No, it should not be NULL. I changed it to an ASSERT() for the next version.



> > +    cpumask_scnprintf(cpustr, sizeof(cpustr),
> svc->vcpu->cpu_hard_affinity);
>
> Buggy sizeof.  sizeof(cpustr) is 4, where I suspect you mean
> sizeof(keyscratch_handler) which is 1024.
>

Sorry! :-( Corrected it.
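
Taken together, the beginning of the dump helper would then look
something like this (a sketch combining the changes above):

    static void
    rt_dump_vcpu(const struct rt_vcpu *svc)
    {
        char cpustr[1024];   /* private buffer, safe outside keyhandlers */

        ASSERT( svc != NULL );
        cpumask_scnprintf(cpustr, sizeof(cpustr), svc->vcpu->cpu_hard_affinity);
        /* ... print the vcpu state ... */
    }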

>
>
> > +    for_each_cpu(cpu, &prv->cpus)
> > +        rt_dump_pcpu(ops, cpu);
> > +
> > +    printk("Global RunQueue info: \n");
> > +    loop = 0;
> > +    runq = RUNQ(ops);
> > +    list_for_each( iter, runq )
> > +    {
> > +        svc = __runq_elem(iter);
> > +        printk("\t%3d: ", ++loop);
> > +        rt_dump_vcpu(svc);
> > +    }
> > +
> > +    printk("Domain info: \n");
> > +    loop = 0;
> > +    list_for_each( iter_sdom, &prv->sdom )
> > +    {
> > +        struct rt_dom *sdom;
> > +        sdom = list_entry(iter_sdom, struct rt_dom, sdom_elem);
> > +        printk("\tdomain: %d\n", sdom->dom->domain_id);
> > +
> > +        list_for_each( iter_svc, &sdom->vcpu )
> > +        {
> > +            svc = list_entry(iter_svc, struct rt_vcpu, sdom_elem);
> > +            printk("\t\t%3d: ", ++loop);
> > +            rt_dump_vcpu(svc);
> > +        }
> > +    }
> > +
> > +    printk("\n");
> > +}
> > +
> > +static inline void
> > +__runq_remove(struct rt_vcpu *svc)
> > +{
> > +    if ( __vcpu_on_runq(svc) )
> > +        list_del_init(&svc->runq_elem);
> > +}
> > +
> > +/*
> > + * Insert a vcpu in the RunQ based on vcpus' deadline:
> > + * EDF schedule policy: vcpu with smaller deadline has higher priority;
> > + * The vcpu svc to be inserted will be inserted just before the very
> first
> > + * vcpu iter_svc in the Runqueue whose deadline is equal or larger than
> > + * svc's deadline.
> > + */
> > +static void
> > +__runq_insert(const struct scheduler *ops, struct rt_vcpu *svc)
> > +{
> > +    struct list_head *runq = RUNQ(ops);
> > +    struct list_head *iter;
> > +    spinlock_t *schedule_lock;
> > +
> > +    schedule_lock = per_cpu(schedule_data,
> svc->vcpu->processor).schedule_lock;
> > +    ASSERT( spin_is_locked(schedule_lock) );
> > +
> > +    /* Debug only */
> > +    if ( __vcpu_on_runq(svc) )
> > +    {
> > +        rt_dump(ops);
> > +    }
> > +    ASSERT( !__vcpu_on_runq(svc) );
> > +
> > +    list_for_each(iter, runq)
> > +    {
> > +        struct rt_vcpu * iter_svc = __runq_elem(iter);
> > +
> > +        /* svc still has budget */
> > +        if ( svc->cur_budget > 0 )
> > +        {
> > +            if ( iter_svc->cur_budget == 0 ||
> > +                 svc->cur_deadline <= iter_svc->cur_deadline )
> > +                    break;
> > +        }
> > +        else
> > +        { /* svc has no budget */
> > +            if ( iter_svc->cur_budget == 0 &&
> > +                 svc->cur_deadline <= iter_svc->cur_deadline )
> > +                    break;
> > +        }
> > +    }
> > +
> > +    list_add_tail(&svc->runq_elem, iter);
> > +}
> > +
> > +/*
> > + * Init/Free related code
> > + */
> > +static int
> > +rt_init(struct scheduler *ops)
> > +{
> > +    struct rt_private *prv = xzalloc(struct rt_private);
> > +
> > +    if ( prv == NULL )
> > +        return -ENOMEM;
>
> Newline in here.
>

Changed, thanks!

>
> > +    ops->sched_data = prv;
>
> Is it safe to set ops->sched_data with a half constructed rt_private?  I
> suspect this wants to be the very last (non-debug) action in this function.
>

I think it should be fine. I double-checked the _init function in
sched_credit2.c. It has a similar operation: it first assigns prv to
ops->sched_data and then sets the values of prv.
Of course, I can switch them. But I'm not sure that really matters. :-)



>
> > +    spin_lock_init(&prv->lock);
> > +    INIT_LIST_HEAD(&prv->sdom);
> > +    INIT_LIST_HEAD(&prv->runq);
> > +
> > +    printtime();
> > +    printk("\n");
> > +
> > +    return 0;
> > +}
> > +
> > +static void
> > +rt_deinit(const struct scheduler *ops)
> > +{
> > +    struct rt_private *prv = RT_PRIV(ops);
> > +
> > +    printtime();
> > +    printk("\n");
> > +    xfree(prv);
> > +}
> > +
> > +/*
> > + * point per_cpu spinlock to the global system lock; all cpu have same
> global system lock
> > + */
> > +static void *
> > +rt_alloc_pdata(const struct scheduler *ops, int cpu)
> > +{
> > +    struct rt_private *prv = RT_PRIV(ops);
> > +
> > +    cpumask_set_cpu(cpu, &prv->cpus);
> > +
> > +    per_cpu(schedule_data, cpu).schedule_lock = &prv->lock;
> > +
> > +    printtime();
> > +    printk("%s total cpus: %d", __FUNCTION__,
> cpumask_weight(&prv->cpus));
>
> __FUNCTION__ is a gccism.  __func__ is a standard way of doing the same.
>

Changed. Thanks!

>
> > +    /* same as credit2, not a bogus pointer */
> > +    return (void *)1;
> > +}
> > +
> > +static void
> > +rt_free_pdata(const struct scheduler *ops, void *pcpu, int cpu)
> > +{
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +    cpumask_clear_cpu(cpu, &prv->cpus);
> > +    printtime();
> > +    printk("%s cpu=%d\n", __FUNCTION__, cpu);
> > +}
> > +
> > +static void *
> > +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> > +{
> > +    unsigned long flags;
> > +    struct rt_dom *sdom;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +
> > +    printtime();
> > +    printk("dom=%d\n", dom->domain_id);
> > +
> > +    sdom = xzalloc(struct rt_dom);
> > +    if ( sdom == NULL )
> > +    {
> > +        printk("%s, xzalloc failed\n", __func__);
> > +        return NULL;
> > +    }
> > +
> > +    INIT_LIST_HEAD(&sdom->vcpu);
> > +    INIT_LIST_HEAD(&sdom->sdom_elem);
> > +    sdom->dom = dom;
> > +
> > +    /* spinlock here to insert the dom */
> > +    spin_lock_irqsave(&prv->lock, flags);
> > +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> > +    spin_unlock_irqrestore(&prv->lock, flags);
> > +
> > +    return (void *)sdom;
>
> Bogus cast.
>

I think we have to cast it to void * because the function's definition
requires the return type to be void *. In addition, the credit2 scheduler
does the same cast in its _alloc_domdata function. So I guess this should
be fine?



>
> > +}
> > +
> > +static void
> > +rt_free_domdata(const struct scheduler *ops, void *data)
> > +{
> > +    unsigned long flags;
> > +    struct rt_dom *sdom = data;
> > +    struct rt_private *prv = RT_PRIV(ops);
> > +
> > +    printtime();
> > +    printk("dom=%d\n", sdom->dom->domain_id);
> > +
> > +    spin_lock_irqsave(&prv->lock, flags);
> > +    list_del_init(&sdom->sdom_elem);
> > +    spin_unlock_irqrestore(&prv->lock, flags);
> > +    xfree(data);
> > +}
> > +
> > +static int
> > +rt_dom_init(const struct scheduler *ops, struct domain *dom)
> > +{
> > +    struct rt_dom *sdom;
> > +
> > +    printtime();
> > +    printk("dom=%d\n", dom->domain_id);
> > +
> > +    /* IDLE Domain does not link on rt_private */
> > +    if ( is_idle_domain(dom) )
> > +        return 0;
> > +
> > +    sdom = rt_alloc_domdata(ops, dom);
> > +    if ( sdom == NULL )
> > +    {
> > +        printk("%s, failed\n", __func__);
> > +        return -ENOMEM;
> > +    }
> > +    dom->sched_priv = sdom;
> > +
> > +    return 0;
> > +}
> > +
> > +static void
> > +rt_dom_destroy(const struct scheduler *ops, struct domain *dom)
> > +{
> > +    printtime();
> > +    printk("dom=%d\n", dom->domain_id);
> > +
> > +    rt_free_domdata(ops, RT_DOM(dom));
> > +}
> > +
> > +static void *
> > +rt_alloc_vdata(const struct scheduler *ops, struct vcpu *vc, void *dd)
> > +{
> > +    struct rt_vcpu *svc;
> > +    s_time_t now = NOW();
> > +    long count;
> > +
> > +    /* Allocate per-VCPU info */
> > +    svc = xzalloc(struct rt_vcpu);
> > +    if ( svc == NULL )
> > +    {
> > +        printk("%s, xzalloc failed\n", __func__);
> > +        return NULL;
> > +    }
> > +
> > +    INIT_LIST_HEAD(&svc->runq_elem);
> > +    INIT_LIST_HEAD(&svc->sdom_elem);
> > +    svc->flags = 0U;
> > +    svc->sdom = dd;
> > +    svc->vcpu = vc;
> > +    svc->last_start = 0;            /* init last_start is 0 */
> > +
> > +    svc->period = RT_DEFAULT_PERIOD;
> > +    if ( !is_idle_vcpu(vc) )
> > +        svc->budget = RT_DEFAULT_BUDGET;
> > +
> > +    count = (now/MICROSECS(svc->period)) + 1;
> > +    /* sync all VCPU's start time to 0 */
> > +    svc->cur_deadline += count * MICROSECS(svc->period);
> > +
> > +    svc->cur_budget = svc->budget*1000; /* counting in microseconds
> level */
> > +    /* Debug only: dump new vcpu's info */
> > +    printtime();
> > +    rt_dump_vcpu(svc);
> > +
> > +    return svc;
> > +}
> > +
> > +static void
> > +rt_free_vdata(const struct scheduler *ops, void *priv)
> > +{
> > +    struct rt_vcpu *svc = priv;
> > +
> > +    /* Debug only: dump freed vcpu's info */
> > +    printtime();
> > +    rt_dump_vcpu(svc);
> > +    xfree(svc);
> > +}
> > +
> > +/*
> > + * TODO: Do we need to add vc to the new Runqueue?
> > + * This function is called in sched_move_domain() in schedule.c
> > + * When move a domain to a new cpupool,
> > + * may have to add vc to the Runqueue of the new cpupool
> > + */
> > +static void
> > +rt_vcpu_insert(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu *svc = RT_VCPU(vc);
> > +
> > +    /* Debug only: dump info of vcpu to insert */
> > +    printtime();
> > +    rt_dump_vcpu(svc);
> > +
> > +    /* not addlocate idle vcpu to dom vcpu list */
> > +    if ( is_idle_vcpu(vc) )
> > +        return;
> > +
> > +    list_add_tail(&svc->sdom_elem, &svc->sdom->vcpu);   /* add to dom
> vcpu list */
> > +}
> > +
> > +/*
> > + * TODO: same as rt_vcpu_insert()
> > + */
> > +static void
> > +rt_vcpu_remove(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu * const svc = RT_VCPU(vc);
> > +    struct rt_dom * const sdom = svc->sdom;
> > +
> > +    printtime();
> > +    rt_dump_vcpu(svc);
> > +
> > +    BUG_ON( sdom == NULL );
> > +    BUG_ON( __vcpu_on_runq(svc) );
> > +
> > +    if ( !is_idle_vcpu(vc) )
> > +        list_del_init(&svc->sdom_elem);
> > +}
> > +
> > +/*
> > + * Pick a valid CPU for the vcpu vc
> > + * Valid CPU of a vcpu is intesection of vcpu's affinity and available
> cpus
> > + */
> > +static int
> > +rt_cpu_pick(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    cpumask_t cpus;
> > +    cpumask_t *online;
> > +    int cpu;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +
> > +    online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > +    cpumask_and(&cpus, &prv->cpus, online);
> > +    cpumask_and(&cpus, &cpus, vc->cpu_hard_affinity);
> > +
> > +    cpu = cpumask_test_cpu(vc->processor, &cpus)
> > +            ? vc->processor
> > +            : cpumask_cycle(vc->processor, &cpus);
> > +    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > +
> > +    return cpu;
> > +}
> > +
> > +/*
> > + * Burn budget at microsecond level.
> > + */
> > +static void
> > +burn_budgets(const struct scheduler *ops, struct rt_vcpu *svc, s_time_t
> now)
> > +{
> > +    s_time_t delta;
> > +    long count = 0;
> > +
> > +    /* don't burn budget for idle VCPU */
> > +    if ( is_idle_vcpu(svc->vcpu) )
> > +    {
> > +        return;
> > +    }
> > +
> > +    /* first time called for this svc, update last_start */
> > +    if ( svc->last_start == 0 )
> > +    {
> > +        svc->last_start = now;
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * update deadline info: When deadline is in the past,
> > +     * it need to be updated to the deadline of the current period,
> > +     * and replenish the budget
> > +     */
> > +    delta = now - svc->cur_deadline;
> > +    if ( delta >= 0 )
> > +    {
> > +        count = ( delta/MICROSECS(svc->period) ) + 1;
> > +        svc->cur_deadline += count * MICROSECS(svc->period);
> > +        svc->cur_budget = svc->budget * 1000;
> > +        return;
> > +    }
> > +
> > +    /* burn at nanoseconds level */
> > +    delta = now - svc->last_start;
> > +    /*
> > +     * delta < 0 only happens in nested virtualization;
> > +     * TODO: how should we handle delta < 0 in a better way? */
> > +    if ( delta < 0 )
> > +    {
> > +        printk("%s, ATTENTION: now is behind last_start! delta = %ld
> for ",
> > +                __func__, delta);
> > +        rt_dump_vcpu(svc);
> > +        svc->last_start = now;  /* update last_start */
> > +        svc->cur_budget = 0;   /* FIXME: should we recover like this? */
> > +        return;
> > +    }
> > +
> > +    if ( svc->cur_budget == 0 )
> > +        return;
> > +
> > +    svc->cur_budget -= delta;
> > +    if ( svc->cur_budget < 0 )
> > +        svc->cur_budget = 0;
> > +}
> > +
> > +/*
> > + * RunQ is sorted. Pick first one within cpumask. If no one, return NULL
> > + * lock is grabbed before calling this function
> > + */
> > +static struct rt_vcpu *
> > +__runq_pick(const struct scheduler *ops, cpumask_t mask)
> > +{
> > +    struct list_head *runq = RUNQ(ops);
> > +    struct list_head *iter;
> > +    struct rt_vcpu *svc = NULL;
> > +    struct rt_vcpu *iter_svc = NULL;
> > +    cpumask_t cpu_common;
> > +    cpumask_t *online;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +
> > +    list_for_each(iter, runq)
> > +    {
> > +        iter_svc = __runq_elem(iter);
> > +
> > +        /* mask is intersection of cpu_hard_affinity and cpupool and
> priv->cpus */
> > +        online =
> cpupool_scheduler_cpumask(iter_svc->vcpu->domain->cpupool);
> > +        cpumask_and(&cpu_common, online, &prv->cpus);
> > +        cpumask_and(&cpu_common, &cpu_common,
> iter_svc->vcpu->cpu_hard_affinity);
> > +        cpumask_and(&cpu_common, &mask, &cpu_common);
> > +        if ( cpumask_empty(&cpu_common) )
> > +            continue;
> > +
> > +        if ( iter_svc->cur_budget <= 0 )
> > +            continue;
> > +
> > +        svc = iter_svc;
> > +        break;
> > +    }
> > +
> > +    return svc;
> > +}
> > +
> > +/*
> > + * Update vcpu's budget and sort runq by insert the modifed vcpu back
> to runq
> > + * lock is grabbed before calling this function
> > + */
> > +static void
> > +__repl_update(const struct scheduler *ops, s_time_t now)
> > +{
> > +    struct list_head *runq = RUNQ(ops);
> > +    struct list_head *iter;
> > +    struct list_head *tmp;
> > +    struct rt_vcpu *svc = NULL;
> > +
> > +    s_time_t diff;
> > +    long count;
> > +
> > +    list_for_each_safe(iter, tmp, runq)
> > +    {
> > +        svc = __runq_elem(iter);
> > +
> > +        diff = now - svc->cur_deadline;
> > +        if ( diff > 0 )
> > +        {
> > +            count = (diff/MICROSECS(svc->period)) + 1;
> > +            svc->cur_deadline += count * MICROSECS(svc->period);
> > +            svc->cur_budget = svc->budget * 1000;
> > +            __runq_remove(svc);
> > +            __runq_insert(ops, svc);
> > +        }
> > +    }
> > +}
> > +
> > +/*
> > + * schedule function for rt scheduler.
> > + * The lock is already grabbed in schedule.c, no need to lock here
> > + */
> > +static struct task_slice
> > +rt_schedule(const struct scheduler *ops, s_time_t now, bool_t
> tasklet_work_scheduled)
> > +{
> > +    const int cpu = smp_processor_id();
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +    struct rt_vcpu * const scurr = RT_VCPU(current);
> > +    struct rt_vcpu * snext = NULL;
> > +    struct task_slice ret = { .migrated = 0 };
> > +
> > +    /* clear ticked bit now that we've been scheduled */
> > +    if ( cpumask_test_cpu(cpu, &prv->tickled) )
> > +        cpumask_clear_cpu(cpu, &prv->tickled);
> > +
> > +    /* burn_budget would return for IDLE VCPU */
> > +    burn_budgets(ops, scurr, now);
> > +
> > +    __repl_update(ops, now);
> > +
> > +    if ( tasklet_work_scheduled )
> > +    {
> > +        snext = RT_VCPU(idle_vcpu[cpu]);
> > +    }
> > +    else
> > +    {
> > +        cpumask_t cur_cpu;
> > +        cpumask_clear(&cur_cpu);
> > +        cpumask_set_cpu(cpu, &cur_cpu);
> > +        snext = __runq_pick(ops, cur_cpu);
> > +        if ( snext == NULL )
> > +            snext = RT_VCPU(idle_vcpu[cpu]);
> > +
> > +        /* if scurr has higher priority and budget, still pick scurr */
> > +        if ( !is_idle_vcpu(current) &&
> > +             vcpu_runnable(current) &&
> > +             scurr->cur_budget > 0 &&
> > +             ( is_idle_vcpu(snext->vcpu) ||
> > +               scurr->cur_deadline <= snext->cur_deadline ) )
> > +            snext = scurr;
> > +    }
> > +
> > +    if ( snext != scurr &&
> > +         !is_idle_vcpu(current) &&
> > +         vcpu_runnable(current) )
> > +        set_bit(__RT_delayed_runq_add, &scurr->flags);
> > +
> > +
> > +    snext->last_start = now;
> > +    if ( !is_idle_vcpu(snext->vcpu) )
> > +    {
> > +        if ( snext != scurr )
> > +        {
> > +            __runq_remove(snext);
> > +            set_bit(__RT_scheduled, &snext->flags);
> > +        }
> > +        if ( snext->vcpu->processor != cpu )
> > +        {
> > +            snext->vcpu->processor = cpu;
> > +            ret.migrated = 1;
> > +        }
> > +    }
> > +
> > +    ret.time = MILLISECS(1); /* sched quantum */
> > +    ret.task = snext->vcpu;
> > +
> > +    return ret;
> > +}
> > +
> > +/*
> > + * Remove VCPU from RunQ
> > + * The lock is already grabbed in schedule.c, no need to lock here
> > + */
> > +static void
> > +rt_vcpu_sleep(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu * const svc = RT_VCPU(vc);
> > +
> > +    BUG_ON( is_idle_vcpu(vc) );
> > +
> > +    if ( curr_on_cpu(vc->processor) == vc )
> > +    {
> > +        cpu_raise_softirq(vc->processor, SCHEDULE_SOFTIRQ);
> > +        return;
> > +    }
> > +
> > +    if ( __vcpu_on_runq(svc) )
> > +    {
> > +        __runq_remove(svc);
> > +        printk("%s: vcpu should not on runq in vcpu_sleep()\n",
> __FUNCTION__);
> > +        BUG();
> > +    }
> > +
> > +    clear_bit(__RT_delayed_runq_add, &svc->flags);
> > +}
> > +
> > +/*
> > + * Pick a vcpu on a cpu to kick out to place the running candidate
> > + * Called by wake() and context_saved()
> > + * We have a running candidate here, the kick logic is:
> > + * Among all the cpus that are within the cpu affinity
> > + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> > + * 2) if there are any idle vcpu, kick it.
> > + * 3) now all pcpus are busy, among all the running vcpus, pick lowest
> priority one
> > + *    if snext has higher priority, kick it.
> > + *
> > + * TODO:
> > + * 1) what if these two vcpus belongs to the same domain?
> > + *    replace a vcpu belonging to the same domain introduces more
> overhead
> > + *
> > + * lock is grabbed before calling this function
> > + */
> > +static void
> > +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> > +{
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +    struct rt_vcpu * latest_deadline_vcpu = NULL;    /* lowest priority
> scheduled */
> > +    struct rt_vcpu * iter_svc;
> > +    struct vcpu * iter_vc;
> > +    int cpu = 0;
> > +    cpumask_t not_tickled;
> > +    cpumask_t *online;
> > +
> > +    if ( new == NULL || is_idle_vcpu(new->vcpu) )
> > +        return;
> > +
> > +    online = cpupool_scheduler_cpumask(new->vcpu->domain->cpupool);
> > +    cpumask_and(&not_tickled, online, &prv->cpus);
> > +    cpumask_and(&not_tickled, &not_tickled,
> new->vcpu->cpu_hard_affinity);
> > +    cpumask_andnot(&not_tickled, &not_tickled, &prv->tickled);
> > +
> > +    /* 1) if new's previous cpu is idle, kick it for cache benefit */
> > +    if ( is_idle_vcpu(curr_on_cpu(new->vcpu->processor)) )
> > +    {
> > +        cpumask_set_cpu(new->vcpu->processor, &prv->tickled);
> > +        cpu_raise_softirq(new->vcpu->processor, SCHEDULE_SOFTIRQ);
> > +        return;
> > +    }
> > +
> > +    /* 2) if there are any idle pcpu, kick it */
> > +    /* The same loop also find the one with lowest priority */
> > +    for_each_cpu(cpu, &not_tickled)
> > +    {
> > +        iter_vc = curr_on_cpu(cpu);
> > +        if ( is_idle_vcpu(iter_vc) )
> > +        {
> > +            cpumask_set_cpu(cpu, &prv->tickled);
> > +            cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
> > +            return;
> > +        }
> > +        iter_svc = RT_VCPU(iter_vc);
> > +        if ( latest_deadline_vcpu == NULL ||
> > +             iter_svc->cur_deadline >
> latest_deadline_vcpu->cur_deadline )
> > +            latest_deadline_vcpu = iter_svc;
> > +    }
> > +
> > +    /* 3) candicate has higher priority, kick out the lowest priority
> vcpu */
> > +    if ( latest_deadline_vcpu != NULL && new->cur_deadline <
> latest_deadline_vcpu->cur_deadline )
> > +    {
> > +        cpumask_set_cpu(latest_deadline_vcpu->vcpu->processor,
> &prv->tickled);
> > +        cpu_raise_softirq(latest_deadline_vcpu->vcpu->processor,
> SCHEDULE_SOFTIRQ);
> > +    }
> > +    return;
> > +}
> > +
> > +/*
> > + * Should always wake up runnable vcpu, put it back to RunQ.
> > + * Check priority to raise interrupt
> > + * The lock is already grabbed in schedule.c, no need to lock here
> > + * TODO: what if these two vcpus belongs to the same domain?
> > + */
> > +static void
> > +rt_vcpu_wake(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu * const svc = RT_VCPU(vc);
> > +    s_time_t diff;
> > +    s_time_t now = NOW();
> > +    long count = 0;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +    struct rt_vcpu * snext = NULL;        /* highest priority on RunQ */
> > +
> > +    BUG_ON( is_idle_vcpu(vc) );
> > +
> > +    if ( unlikely(curr_on_cpu(vc->processor) == vc) )
> > +        return;
> > +
> > +    /* on RunQ, just update info is ok */
> > +    if ( unlikely(__vcpu_on_runq(svc)) )
> > +        return;
> > +
> > +    /* If context hasn't been saved for this vcpu yet, we can't put it
> on
> > +     * the Runqueue. Instead, we set a flag so that it will be put on
> the Runqueue
> > +     * After the context has been saved. */
> > +    if ( unlikely(test_bit(__RT_scheduled, &svc->flags)) )
> > +    {
> > +        set_bit(__RT_delayed_runq_add, &svc->flags);
> > +        return;
> > +    }
> > +
> > +    /* update deadline info */
> > +    diff = now - svc->cur_deadline;
> > +    if ( diff >= 0 )
> > +    {
> > +        count = ( diff/MICROSECS(svc->period) ) + 1;
> > +        svc->cur_deadline += count * MICROSECS(svc->period);
> > +        svc->cur_budget = svc->budget * 1000;
> > +    }
> > +
> > +    __runq_insert(ops, svc);
> > +    __repl_update(ops, now);
> > +    snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL
> valid cpus */
> > +    runq_tickle(ops, snext);
> > +
> > +    return;
> > +}
> > +
> > +/*
> > + * scurr has finished context switch, insert it back to the RunQ,
> > + * and then pick the highest priority vcpu from runq to run
> > + */
> > +static void
> > +rt_context_saved(const struct scheduler *ops, struct vcpu *vc)
> > +{
> > +    struct rt_vcpu * svc = RT_VCPU(vc);
> > +    struct rt_vcpu * snext = NULL;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +    spinlock_t *lock = vcpu_schedule_lock_irq(vc);
> > +
> > +    clear_bit(__RT_scheduled, &svc->flags);
> > +    /* not insert idle vcpu to runq */
> > +    if ( is_idle_vcpu(vc) )
> > +        goto out;
> > +
> > +    if ( test_and_clear_bit(__RT_delayed_runq_add, &svc->flags) &&
> > +         likely(vcpu_runnable(vc)) )
> > +    {
> > +        __runq_insert(ops, svc);
> > +        __repl_update(ops, NOW());
> > +        snext = __runq_pick(ops, prv->cpus);    /* pick snext from ALL
> cpus */
> > +        runq_tickle(ops, snext);
> > +    }
> > +out:
> > +    vcpu_schedule_unlock_irq(lock, vc);
> > +}
> > +
> > +/*
> > + * set/get each vcpu info of each domain
> > + */
> > +static int
> > +rt_dom_cntl(
> > +    const struct scheduler *ops,
> > +    struct domain *d,
> > +    struct xen_domctl_scheduler_op *op)
> > +{
> > +    xen_domctl_sched_rt_params_t *local_sched;
> > +    struct rt_dom * const sdom = RT_DOM(d);
> > +    struct list_head *iter;
> > +    int vcpu_index = 0;
> > +    int rc = -EINVAL;
> > +
> > +    switch ( op->cmd )
> > +    {
> > +    case XEN_DOMCTL_SCHEDOP_getnumvcpus:
> > +        op->u.rt.nr_vcpus = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +            vcpu_index++;
> > +        op->u.rt.nr_vcpus = vcpu_index;
> > +        rc = 0;
> > +        break;
> > +    case XEN_DOMCTL_SCHEDOP_getinfo:
> > +        /* for debug use, whenever adjust Dom0 parameter, do global
> dump */
> > +        if ( d->domain_id == 0 )
> > +            rt_dump(ops);
> > +
> > +        op->u.rt.nr_vcpus = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +            vcpu_index++;
> > +        op->u.rt.nr_vcpus = vcpu_index;
> > +        local_sched = xzalloc_array(xen_domctl_sched_rt_params_t,
> vcpu_index);
> > +        vcpu_index = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +        {
> > +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu,
> sdom_elem);
> > +
> > +            local_sched[vcpu_index].budget = svc->budget;
> > +            local_sched[vcpu_index].period = svc->period;
> > +            local_sched[vcpu_index].index = vcpu_index;
> > +            vcpu_index++;
> > +        }
> > +        copy_to_guest(op->u.rt.vcpu, local_sched, vcpu_index);
>
> This will clobber guest heap if vcpu_index is greater than the allocated
> space.


This is a good point! I will pass the size of the array to the hypervisor
and check that the number of elements in the array is not smaller than the
number of vcpus.


>  You also unconditionally leak local_sched, but there is no need
> for an allocation in the first place.
>
I will add the xfree() after copy_to_guest().
I have a question: how can I avoid allocating local_sched?



> > +        rc = 0;
> > +        break;
> > +    case XEN_DOMCTL_SCHEDOP_putinfo:
> > +        list_for_each( iter, &sdom->vcpu )
> > +        {
> > +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu,
> sdom_elem);
> > +
> > +            /* adjust per VCPU parameter */
> > +            if ( op->u.rt.vcpu_index == svc->vcpu->vcpu_id )
> > +            {
> > +                vcpu_index = op->u.rt.vcpu_index;
> > +
> > +                if ( vcpu_index < 0 )
> > +                    printk("XEN_DOMCTL_SCHEDOP_putinfo:
> vcpu_index=%d\n",
> > +                            vcpu_index);
> > +                else
> > +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: "
> > +                            "vcpu_index=%d, period=%"PRId64",
> budget=%"PRId64"\n",
> > +                            vcpu_index, op->u.rt.period,
> op->u.rt.budget);
> > +
> > +                svc->period = op->u.rt.period;
> > +                svc->budget = op->u.rt.budget;
> > +
> > +                break;
> > +            }
> > +        }
> > +        rc = 0;
> > +        break;
> > +    }
> > +
> > +    return rc;
> > +}
> > +
> > +static struct rt_private _rt_priv;
> > +
> > +const struct scheduler sched_rt_def = {
> > +    .name           = "SMP RT Scheduler",
> > +    .opt_name       = "rt",
>
> Should these names reflect RT_DS as opposed to simply RT?
>

DS (Deferrable Server) is just one kind of server mechanism for global
Earliest Deadline First scheduling. We can add other server mechanisms in
the same file, sched_rt.c, to extend this real-time scheduler, but we don't
want to change the user's interface when we add them.

The .opt_name is part of the user's interface for choosing the rt
scheduler. If we change it to rt_ds now, we will have to change it back to
rt once more server mechanisms are implemented, and users will then have to
change their configuration (i.e., the command-line value sched=) to the new
name. Because this could affect the users' interface, I think it's better
to use rt here. What do you think?



>
> > +    .sched_id       = XEN_SCHEDULER_RT_DS,
> > +    .sched_data     = &_rt_priv,
> > +
> > +    .dump_cpu_state = rt_dump_pcpu,
> > +    .dump_settings  = rt_dump,
> > +    .init           = rt_init,
> > +    .deinit         = rt_deinit,
> > +    .alloc_pdata    = rt_alloc_pdata,
> > +    .free_pdata     = rt_free_pdata,
> > +    .alloc_domdata  = rt_alloc_domdata,
> > +    .free_domdata   = rt_free_domdata,
> > +    .init_domain    = rt_dom_init,
> > +    .destroy_domain = rt_dom_destroy,
> > +    .alloc_vdata    = rt_alloc_vdata,
> > +    .free_vdata     = rt_free_vdata,
> > +    .insert_vcpu    = rt_vcpu_insert,
> > +    .remove_vcpu    = rt_vcpu_remove,
> > +
> > +    .adjust         = rt_dom_cntl,
> > +
> > +    .pick_cpu       = rt_cpu_pick,
> > +    .do_schedule    = rt_schedule,
> > +    .sleep          = rt_vcpu_sleep,
> > +    .wake           = rt_vcpu_wake,
> > +    .context_saved  = rt_context_saved,
> > +};
> > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> > index e9eb0bc..f2df400 100644
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
> >      &sched_sedf_def,
> >      &sched_credit_def,
> >      &sched_credit2_def,
> > +    &sched_rt_def,
> >      &sched_arinc653_def,
> >  };
> >
> > @@ -1092,7 +1093,8 @@ long sched_adjust(struct domain *d, struct
> xen_domctl_scheduler_op *op)
> >
> >      if ( (op->sched_id != DOM2OP(d)->sched_id) ||
> >           ((op->cmd != XEN_DOMCTL_SCHEDOP_putinfo) &&
> > -          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo)) )
> > +          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo) &&
> > +          (op->cmd != XEN_DOMCTL_SCHEDOP_getnumvcpus)) )
> >          return -EINVAL;
> >
> >      /* NB: the pluggable scheduler code needs to take care
> > diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> > index 5b11bbf..8d4b973 100644
> > --- a/xen/include/public/domctl.h
> > +++ b/xen/include/public/domctl.h
> > @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
> >  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
> >  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
> >
> > +/*
> > + * This structure is used to pass to rt scheduler from a
> > + * privileged domain to Xen
> > + */
> > +struct xen_domctl_sched_rt_params {
> > +    /* get vcpus' info */
> > +    int64_t period; /* s_time_t type */
> > +    int64_t budget;
> > +    int     index;
>
> Index is clearly an unsigned quantity.  For alignment and compatibility,
> uint64_t would make the most sense.  Alternatively, uint32_t and an
> explicit uint32_t pad field.
>
Agreed. I have changed it to uint16_t because vcpu_index is uint16_t in the
tool stack.

Thank you very much for your comments and suggestions! Looking forward to
your reply! ;-)

Best,

Meng

-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-08-02 15:31     ` Meng Xu
@ 2014-08-04 10:23       ` Andrew Cooper
  2014-08-07 11:26         ` Meng Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Cooper @ 2014-08-04 10:23 UTC (permalink / raw)
  To: Meng Xu, George Dunlap
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, Ian Jackson,
	xen-devel, Meng Xu, Jan Beulich, Chong Li, Dagaen Golomb


On 02/08/14 16:31, Meng Xu wrote:
>
>     > +/*
>     > + * Init/Free related code
>     > + */
>     > +static int
>     > +rt_init(struct scheduler *ops)
>     > +{
>     > +    struct rt_private *prv = xzalloc(struct rt_private);
>     > +
>     > +    if ( prv == NULL )
>     > +        return -ENOMEM;
>
>     Newline in here.
>
>
> ​Changed, thanks!
> ​
>
>
>     > +    ops->sched_data = prv;
>
>     Is it safe to set ops->sched_data with a half constructed
>     rt_private?  I
>     suspect this wants to be the very last (non-debug) action in this
>     function.
>
>
> ​I think it should be fine. I double checked the _init function in
> sched_credit2.c. It has the similar operation: first assign prv to
> ops->sched_data and then set the value of prv.
> Of course, I can switch ​them. But I'm not sure if that really matter. :-)

It means that anyone else who peeks at ops->sched_data will see a
partially constructed rt_private object.  This can be a source of subtle
bugs, so it is better to code appropriately, rather than relying on an
assumption that no one would go peeking before this function returns.
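
Concretely, the suggestion is just to move the assignment to the end (a
sketch based on the rt_init() from the patch, with the debug printing
omitted):

    static int
    rt_init(struct scheduler *ops)
    {
        struct rt_private *prv = xzalloc(struct rt_private);

        if ( prv == NULL )
            return -ENOMEM;

        spin_lock_init(&prv->lock);
        INIT_LIST_HEAD(&prv->sdom);
        INIT_LIST_HEAD(&prv->runq);

        /* Publish sched_data only once rt_private is fully constructed. */
        ops->sched_data = prv;

        return 0;
    }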

>
> > +static void *
> > +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> > +{
> > +    unsigned long flags;
> > +    struct rt_dom *sdom;
> > +    struct rt_private * prv = RT_PRIV(ops);
> > +
> > +    printtime();
> > +    printk("dom=%d\n", dom->domain_id);
> > +
> > +    sdom = xzalloc(struct rt_dom);
> > +    if ( sdom == NULL )
> > +    {
> > +        printk("%s, xzalloc failed\n", __func__);
> > +        return NULL;
> > +    }
> > +
> > +    INIT_LIST_HEAD(&sdom->vcpu);
> > +    INIT_LIST_HEAD(&sdom->sdom_elem);
> > +    sdom->dom = dom;
> > +
> > +    /* spinlock here to insert the dom */
> > +    spin_lock_irqsave(&prv->lock, flags);
> > +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> > +    spin_unlock_irqrestore(&prv->lock, flags);
> > +
> > +    return (void *)sdom;
>
>     Bogus cast.
>
>
> ​I think we have to cast it to void * ​because the definition of this
> function asks the return type to be void *. In addition, the credit2
> scheduler also did the same cast in  this _alloc_domdata function. So
> I guess this should be fine?
>

In C, all pointers are implicitly castable to/from void*.  This is not
the case in C++.  (which suggests to me that George learnt C++ before C,
or is at least more familiar with C++?)
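
A minimal illustration of the difference (hypothetical variables):

    void *opaque;
    struct rt_dom *sdom = xzalloc(struct rt_dom);

    opaque = sdom;   /* fine in C: implicit conversion to void *   */
    sdom = opaque;   /* fine in C: implicit conversion from void * */
                     /* a C++ compiler would reject the second line */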

The act of using type punning means that you are trying to do something
which the C type system doesn't like.  This in turn requires more
thought from people reading the code to work out whether it is actually
safe or not.  As a result, we strive for as few type overrides as possible.

Please don't fall into the trap of assuming the credit2 code, or indeed
any of the other code in Xen, is perfect.  None of it is, although most
of it is good.

>
> > +/*
> > + * set/get each vcpu info of each domain
> > + */
> > +static int
> > +rt_dom_cntl(
> > +    const struct scheduler *ops,
> > +    struct domain *d,
> > +    struct xen_domctl_scheduler_op *op)
> > +{
> > +    xen_domctl_sched_rt_params_t *local_sched;
> > +    struct rt_dom * const sdom = RT_DOM(d);
> > +    struct list_head *iter;
> > +    int vcpu_index = 0;
> > +    int rc = -EINVAL;
> > +
> > +    switch ( op->cmd )
> > +    {
> > +    case XEN_DOMCTL_SCHEDOP_getnumvcpus:
> > +        op->u.rt.nr_vcpus = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +            vcpu_index++;
> > +        op->u.rt.nr_vcpus = vcpu_index;
> > +        rc = 0;
> > +        break;
> > +    case XEN_DOMCTL_SCHEDOP_getinfo:
> > +        /* for debug use, whenever adjust Dom0 parameter, do global
> dump */
> > +        if ( d->domain_id == 0 )
> > +            rt_dump(ops);
> > +
> > +        op->u.rt.nr_vcpus = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +            vcpu_index++;
> > +        op->u.rt.nr_vcpus = vcpu_index;
> > +        local_sched = xzalloc_array(xen_domctl_sched_rt_params_t,
> vcpu_index);
> > +        vcpu_index = 0;
> > +        list_for_each( iter, &sdom->vcpu )
> > +        {
> > +            struct rt_vcpu * svc = list_entry(iter, struct rt_vcpu,
> sdom_elem);
> > +
> > +            local_sched[vcpu_index].budget = svc->budget;
> > +            local_sched[vcpu_index].period = svc->period;
> > +            local_sched[vcpu_index].index = vcpu_index;
> > +            vcpu_index++;
> > +        }
> > +        copy_to_guest(op->u.rt.vcpu, local_sched, vcpu_index);
>
>     This will clobber guest heap if vcpu_index is greater than the
>     allocated
>     space.
>
>
> ​This is a good point! I will pass the size of the array to the kernel
> and check that the number of the array's elements is not smaller than
> the number of vcpus.
>
>
>      You also unconditionally leak local_sched, but there is no need
>     for an allocation in the first place.
>
> ​I will add the xfree() after copy_to_guest().
> I have a question: how can I avoid allocating the local_sched?

unsigned i = 0;

list_for_each( iter, &sdom->vcpu )
{
    xen_domctl_sched_rt_params_t local_sched;

    if ( i >= op->u.rt.nr_vcpus )
        break;

    /* Fill local_sched from the vcpu at this position in the list */

    if ( copy_to_guest_offset(op->u.rt.vcpu, i, &local_sched, 1) )
    {
        rc = -EFAULT;
        break;
    }

    i++;
}
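
Copying one element at a time like this bounds the writes by
op->u.rt.nr_vcpus and avoids the temporary allocation entirely, which
addresses both of the earlier points at once.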


>
>
>
>     > +        rc = 0;
>     > +        break;
>     > +    case XEN_DOMCTL_SCHEDOP_putinfo:
>     > +        list_for_each( iter, &sdom->vcpu )
>     > +        {
>     > +            struct rt_vcpu * svc = list_entry(iter, struct
>     rt_vcpu, sdom_elem);
>     > +
>     > +            /* adjust per VCPU parameter */
>     > +            if ( op->u.rt.vcpu_index == svc->vcpu->vcpu_id )
>     > +            {
>     > +                vcpu_index = op->u.rt.vcpu_index;
>     > +
>     > +                if ( vcpu_index < 0 )
>     > +                    printk("XEN_DOMCTL_SCHEDOP_putinfo:
>     vcpu_index=%d\n",
>     > +                            vcpu_index);
>     > +                else
>     > +                    printk("XEN_DOMCTL_SCHEDOP_putinfo: "
>     > +                            "vcpu_index=%d, period=%"PRId64",
>     budget=%"PRId64"\n",
>     > +                            vcpu_index, op->u.rt.period,
>     op->u.rt.budget);
>     > +
>     > +                svc->period = op->u.rt.period;
>     > +                svc->budget = op->u.rt.budget;
>     > +
>     > +                break;
>     > +            }
>     > +        }
>     > +        rc = 0;
>     > +        break;
>     > +    }
>     > +
>     > +    return rc;
>     > +}
>     > +
>     > +static struct rt_private _rt_priv;
>     > +
>     > +const struct scheduler sched_rt_def = {
>     > +    .name           = "SMP RT Scheduler",
>     > +    .opt_name       = "rt",
>
>     Should these names reflect RT_DS as opposed to simply RT?
>
>
> ​DS (Deferrable Server) is just one kind of server mechanisms for
> global Earliest Deadline First scheduling. ​We can add other server
> mechanisms in the same file sched_rt.c to extend this Real Time
> scheduler. But we don't want to change/affect user's interface when we
> add more server mechanisms.
> The .opt_name will affect the user's interface when user choose the rt
> scheduler, If we change it to rt_ds, we will have to change it to rt
> again when we have more server mechanisms implemented. Then users will
> have to change their configuration (i.e., the command line value
> sched=) to the new name rt. Because this could potentially affect
> users' interface, I think it's better to use rt here. What do you think?
>

Reusing the rt.c infrastructure is fine, but something other than DS
will necessarily be a different scheduler as far as the tools are
concerned.  The user should expect to have to change their parameters if
they want to use a new scheduler.  The other point of view is that the
user should expect their scheduler not to change under their feet if
they continue using the same parameters as before.

>
>
>
>     > +    .sched_id       = XEN_SCHEDULER_RT_DS,
>     > +    .sched_data     = &_rt_priv,
>     > +
>     > +    .dump_cpu_state = rt_dump_pcpu,
>     > +    .dump_settings  = rt_dump,
>     > +    .init           = rt_init,
>     > +    .deinit         = rt_deinit,
>     > +    .alloc_pdata    = rt_alloc_pdata,
>     > +    .free_pdata     = rt_free_pdata,
>     > +    .alloc_domdata  = rt_alloc_domdata,
>     > +    .free_domdata   = rt_free_domdata,
>     > +    .init_domain    = rt_dom_init,
>     > +    .destroy_domain = rt_dom_destroy,
>     > +    .alloc_vdata    = rt_alloc_vdata,
>     > +    .free_vdata     = rt_free_vdata,
>     > +    .insert_vcpu    = rt_vcpu_insert,
>     > +    .remove_vcpu    = rt_vcpu_remove,
>     > +
>     > +    .adjust         = rt_dom_cntl,
>     > +
>     > +    .pick_cpu       = rt_cpu_pick,
>     > +    .do_schedule    = rt_schedule,
>     > +    .sleep          = rt_vcpu_sleep,
>     > +    .wake           = rt_vcpu_wake,
>     > +    .context_saved  = rt_context_saved,
>     > +};
>     > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
>     > index e9eb0bc..f2df400 100644
>     > --- a/xen/common/schedule.c
>     > +++ b/xen/common/schedule.c
>     > @@ -68,6 +68,7 @@ static const struct scheduler *schedulers[] = {
>     >      &sched_sedf_def,
>     >      &sched_credit_def,
>     >      &sched_credit2_def,
>     > +    &sched_rt_def,
>     >      &sched_arinc653_def,
>     >  };
>     >
>     > @@ -1092,7 +1093,8 @@ long sched_adjust(struct domain *d, struct
>     xen_domctl_scheduler_op *op)
>     >
>     >      if ( (op->sched_id != DOM2OP(d)->sched_id) ||
>     >           ((op->cmd != XEN_DOMCTL_SCHEDOP_putinfo) &&
>     > -          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo)) )
>     > +          (op->cmd != XEN_DOMCTL_SCHEDOP_getinfo) &&
>     > +          (op->cmd != XEN_DOMCTL_SCHEDOP_getnumvcpus)) )
>     >          return -EINVAL;
>     >
>     >      /* NB: the pluggable scheduler code needs to take care
>     > diff --git a/xen/include/public/domctl.h
>     b/xen/include/public/domctl.h
>     > index 5b11bbf..8d4b973 100644
>     > --- a/xen/include/public/domctl.h
>     > +++ b/xen/include/public/domctl.h
>     > @@ -339,6 +339,18 @@ struct xen_domctl_max_vcpus {
>     >  typedef struct xen_domctl_max_vcpus xen_domctl_max_vcpus_t;
>     >  DEFINE_XEN_GUEST_HANDLE(xen_domctl_max_vcpus_t);
>     >
>     > +/*
>     > + * This structure is used to pass to rt scheduler from a
>     > + * privileged domain to Xen
>     > + */
>     > +struct xen_domctl_sched_rt_params {
>     > +    /* get vcpus' info */
>     > +    int64_t period; /* s_time_t type */
>     > +    int64_t budget;
>     > +    int     index;
>
>     Index is clearly an unsigned quantity.  For alignment and
>     compatibility,
>     uint64_t would make the most sense.  Alternatively, uint32_t and an
>     explicit uint32_t pad field.
>
> ​Agree. I have changed it to uint16_t because the vcpu_index is
> uint16_t in the tool stack.
>
> Thank you very much for your comments and suggestions! Looking forward
> to your reply! ;-)
>
> Best,
>
> Meng​

If you make its size a multiple of 8, then Xen doesn't need a compat
shim for converting hypercalls from 32-bit guests.
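
For example (a sketch; the padding field is illustrative):

    struct xen_domctl_sched_rt_params {
        uint64_t period;
        uint64_t budget;
        uint16_t index;
        uint16_t pad[3];   /* 24 bytes total, a multiple of 8 */
    };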

~Andrew


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-08-04 10:23       ` Andrew Cooper
@ 2014-08-07 11:26         ` Meng Xu
  2014-08-07 11:48           ` Andrew Cooper
  0 siblings, 1 reply; 13+ messages in thread
From: Meng Xu @ 2014-08-07 11:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Jan Beulich, Chong Li,
	Dagaen Golomb


Hi Andrew,

Thank you so much for your advice! It's very useful! :-)

I have one simple question about your comments:

>
> > > > +static void *
> > > > +rt_alloc_domdata(const struct scheduler *ops, struct domain *dom)
> > > > +{
> > > > +    unsigned long flags;
> > > > +    struct rt_dom *sdom;
> > > > +
> > > > +    struct rt_private * prv = RT_PRIV(ops);
> > > > +
> > > > +    printtime();
> > > > +    printk("dom=%d\n", dom->domain_id);
> > > > +
> > > > +    sdom = xzalloc(struct rt_dom);
> > > > +    if ( sdom == NULL )
> > > > +    {
> > > > +        printk("%s, xzalloc failed\n", __func__);
> > > > +        return NULL;
> > > > +    }
> > > > +
> > > > +    INIT_LIST_HEAD(&sdom->vcpu);
> > > > +    INIT_LIST_HEAD(&sdom->sdom_elem);
> > > > +    sdom->dom = dom;
> > > > +
> > > > +    /* spinlock here to insert the dom */
> > > > +    spin_lock_irqsave(&prv->lock, flags);
> > > > +    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
> > > > +    spin_unlock_irqrestore(&prv->lock, flags);
> > > > +
> > > > +    return (void *)sdom;
> > >
> > > Bogus cast.
> >
> > I think we have to cast it to void * because the definition of this
> > function asks the return type to be void *. In addition, the credit2
> > scheduler also did the same cast in this _alloc_domdata function. So I
> > guess this should be fine?
>
> In C, all pointers are implicitly castable to/from void*.  This is not the
> case in C++.  (which suggests to me that George learnt C++ before C, or is
> at least more familiar with C++?)
>
> The act of using type punning means that you are trying to do something
> which the C type system doesn't like.  This in turn requires more thought
> from people reading the code to work out whether it is actually safe or
> not.  As a result, we strive for as few type overrides as possible.
>
> Please don't fall into the trap of assuming the credit2 code, or indeed
> any of the other code in Xen, is perfect.  None of it is, although most of
> it is good.
>

I can change "return (void *)sdom;" to "return sdom;".
But I'm not sure what else I should modify.
In sched-if.h, the hook is declared as "void *       (*alloc_domdata)  (const
struct scheduler *, struct domain *);". I think I should not change this
declaration; otherwise the modification would affect the compilation of the
other schedulers.

Thank you so much!

Best,

Meng



-- 


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania


* Re: [PATCH RFC v2 1/4] xen: add real time scheduler rt
  2014-08-07 11:26         ` Meng Xu
@ 2014-08-07 11:48           ` Andrew Cooper
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Cooper @ 2014-08-07 11:48 UTC (permalink / raw)
  To: Meng Xu
  Cc: Ian Campbell, Sisu Xi, Stefano Stabellini, George Dunlap,
	Ian Jackson, xen-devel, Meng Xu, Jan Beulich, Chong Li,
	Dagaen Golomb



On 07/08/14 12:26, Meng Xu wrote:
> [...]
>
> I can change "return (void *)sdom;" to "return sdom;".
> But I'm not sure what else I should modify.
> In sched-if.h, the hook is declared as "void *       (*alloc_domdata)  (const
> struct scheduler *, struct domain *);". I think I should not change this
> declaration; otherwise the modification would affect the compilation of the
> other schedulers.
>
> Thank you so much!
>
> Best,
>
> Meng

return sdom; is perfectly fine with the existing function prototype.  It
is an implicit conversion of a pointer to void *, which is perfectly legal
in C.
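
For reference, the tail of the quoted rt_alloc_domdata() would then read
(a sketch; the earlier lines of the function are elided, unchanged):

    /* spinlock here to insert the dom */
    spin_lock_irqsave(&prv->lock, flags);
    list_add_tail(&sdom->sdom_elem, &(prv->sdom));
    spin_unlock_irqrestore(&prv->lock, flags);

    /* struct rt_dom * converts to void * implicitly; no cast needed */
    return sdom;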

~Andrew


end of thread

Thread overview: 13+ messages
2014-07-29  1:52 Introduce rt real-time scheduler for Xen Meng Xu
2014-07-29  1:52 ` [PATCH RFC v2 1/4] xen: add real time scheduler rt Meng Xu
2014-07-29  6:52   ` Jan Beulich
2014-07-29  9:42     ` Dario Faggioli
2014-07-30 15:55     ` [RFC " Meng Xu
2014-07-29 10:36   ` [PATCH RFC " Andrew Cooper
2014-08-02 15:31     ` Meng Xu
2014-08-04 10:23       ` Andrew Cooper
2014-08-07 11:26         ` Meng Xu
2014-08-07 11:48           ` Andrew Cooper
2014-07-29  1:53 ` [PATCH RFC v2 2/4] libxc: add rt scheduler Meng Xu
2014-07-29  1:53 ` [PATCH RFC v2 3/4] libxl: " Meng Xu
2014-07-29  1:53 ` [PATCH RFC v2 4/4] xl: introduce " Meng Xu
