Xen on ARM IRQ latency and scheduler overhead

From: Stefano Stabellini <sstabellini@kernel.org>
To: xen-devel@lists.xen.org
Cc: george.dunlap@eu.citrix.com, edgar.iglesias@xilinx.com,
	dario.faggioli@citrix.com, sstabellini@kernel.org,
	julien.grall@arm.com
Subject: Xen on ARM IRQ latency and scheduler overhead
Date: Thu, 9 Feb 2017 16:54:37 -0800 (PST)
Message-ID: <alpine.DEB.2.10.1702091603240.20549@sstabellini-ThinkPad-X260>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3018 bytes --]

Hi all,

I have run some IRQ latency measurements on Xen on ARM on a Xilinx
ZynqMP board (four Cortex-A53 cores, GICv2).

Dom0 has 1 vcpu pinned to cpu0, DomU has 1 vcpu pinned to cpu2.
Dom0 is Ubuntu. DomU is an ad-hoc baremetal app to measure interrupt
latency: https://github.com/edgarigl/tbm

I modified the app to use the phys_timer instead of the virt_timer.  You
can build it with:

make CFG=configs/xen-guest-irq-latency.cfg 

I modified Xen to export the phys_timer to guests, see the very hacky
patch attached. This way, the phys_timer interrupt should behave like
any conventional device interrupt assigned to a guest.

These are the results, in nanoseconds:

                        AVG     MIN     MAX     WARM MAX

NODEBUG no WFI          1890    1800    3170    2070
NODEBUG WFI             4850    4810    7030    4980
NODEBUG no WFI credit2  2217    2090    3420    2650
NODEBUG WFI credit2     8080    7890    10320   8300

DEBUG no WFI            2252    2080    3320    2650
DEBUG WFI               6500    6140    8520    8130
DEBUG WFI credit2       8050    7870    10680   8450

DEBUG means Xen DEBUG build.
WARM MAX is the maximum latency, taking out the first few interrupts to
warm the caches.
WFI (Wait For Interrupt) is the ARM and ARM64 sleep instruction, trapped
and emulated by Xen by calling vcpu_block.

As you can see, depending on whether the guest issues a WFI or not while
waiting for interrupts, the results change significantly. Interestingly,
credit2 does worse than credit1 in this area.

Trying to figure out where those 3000-4000ns of difference between the
WFI and non-WFI cases come from, I wrote a patch to zero the latency
introduced by xen/arch/arm/domain.c:schedule_tail. That saves about
1000ns. There are no other arch specific context switch functions worth
optimizing.

We are down to 2000-3000ns. Then, I started investigating the scheduler.
I measured how long it takes to run "vcpu_unblock": 1050ns, which is
significant. I don't know what is causing the remaining 1000-2000ns, but
I bet on another scheduler function. Do you have any suggestions on
which one?
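
To make the accounting above explicit, here is a small C sketch that
recomputes it from the AVG column of the results table. The struct and
function names are made up for illustration; the 1000ns and 1050ns
figures are the ones quoted above, and everything else is plain
subtraction.

```c
#include <assert.h>
#include <stdio.h>

/* Average latencies in ns, copied from the AVG column of the table. */
struct sample {
    const char *config;
    int no_wfi_avg;
    int wfi_avg;
};

static const struct sample samples[] = {
    { "NODEBUG credit1", 1890, 4850 },
    { "NODEBUG credit2", 2217, 8080 },
    { "DEBUG credit1",   2252, 6500 },
};

/* Extra average latency paid when the guest waits in WFI. */
static int wfi_penalty(const struct sample *s)
{
    assert(s->wfi_avg >= s->no_wfi_avg);
    return s->wfi_avg - s->no_wfi_avg;
}

/* Part of a WFI/non-WFI gap not covered by the two measured costs. */
static int unexplained(int gap_ns)
{
    const int schedule_tail_ns = 1000; /* "about 1000ns", from the text */
    const int vcpu_unblock_ns = 1050;  /* measured, from the text */

    return gap_ns - schedule_tail_ns - vcpu_unblock_ns;
}

/* Print the WFI penalty for every configuration that has both numbers. */
static void print_breakdown(void)
{
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("%-16s WFI penalty: %d ns\n",
               samples[i].config, wfi_penalty(&samples[i]));
}
```

Calling print_breakdown() from a main() gives penalties of 2960, 5863
and 4248 ns for the three rows. unexplained(3000) and unexplained(4000)
come out at 950 and 1950 ns, which is the "remaining 1000-2000ns"
mentioned above.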


Assuming that the problem is indeed the scheduler, one workaround that
we could introduce today would be to avoid calling vcpu_block on guest
WFI and call vcpu_yield instead. This change makes things significantly
better:

                                     AVG     MIN     MAX     WARM MAX
DEBUG WFI (yield, no block)          2900    2190    5130    5130
DEBUG WFI (yield, no block) credit2  3514    2280    6180    5430

Is that a reasonable change to make? Would it cause significantly more
power consumption in Xen (because xen/arch/arm/domain.c:idle_loop might
not be called anymore)?


If we wanted to zero the difference between the WFI and non-WFI cases,
would we need a new scheduler? A simple "noop scheduler" that statically
assigns vcpus to pcpus, one by one, until they run out, then returns an
error? Or do we need more extensive modifications to
xen/common/schedule.c? Any other ideas?
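
As a rough illustration of the "noop scheduler" idea, the sketch below
shows only the static vcpu-to-pcpu assignment in self-contained C. It is
not Xen code: struct noop_sched, noop_init, noop_assign and
noop_pick_next are hypothetical names, vcpus are plain integer ids, and
NR_PCPUS is hard-coded to the four cores of the board used here.

```c
#include <assert.h>
#include <errno.h>

#define NR_PCPUS  4      /* assumption: the four Cortex-A53 cores */
#define PCPU_FREE (-1)   /* marks a pcpu with no vcpu assigned */

/* Hypothetical scheduler state: the vcpu id that owns each pcpu. */
struct noop_sched {
    int owner[NR_PCPUS];
};

/* Start with every pcpu unassigned. */
static void noop_init(struct noop_sched *s)
{
    for (int i = 0; i < NR_PCPUS; i++)
        s->owner[i] = PCPU_FREE;
}

/*
 * Give the vcpu the first free pcpu, one by one.
 * Returns the pcpu index, or -ENOSPC once all pcpus are taken.
 */
static int noop_assign(struct noop_sched *s, int vcpu_id)
{
    assert(vcpu_id >= 0);

    for (int i = 0; i < NR_PCPUS; i++) {
        if (s->owner[i] == PCPU_FREE) {
            s->owner[i] = vcpu_id;
            return i;
        }
    }
    return -ENOSPC;
}

/*
 * The scheduling decision is a constant lookup: a pcpu only ever runs
 * its owner. PCPU_FREE means the pcpu has nothing to run and idles.
 */
static int noop_pick_next(const struct noop_sched *s, int pcpu)
{
    assert(pcpu >= 0 && pcpu < NR_PCPUS);
    return s->owner[pcpu];
}
```

With four pcpus, the first four noop_assign() calls return 0 to 3 and
the fifth returns -ENOSPC. Since noop_pick_next() never has to choose
between vcpus, there would be no runqueue to maintain on the wake-up
path, which is the overhead this idea is trying to avoid.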


Thanks,

Stefano

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3239 bytes --]

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 7e43691..f5ff69b 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -663,6 +663,7 @@ void arch_domain_destroy(struct domain *d)
     /* IOMMU page table is shared with P2M, always call
      * iommu_domain_destroy() before p2m_teardown().
      */
+    WRITE_SYSREG32(0, CNTP_CTL_EL0);
     iommu_domain_destroy(d);
     p2m_teardown(d);
     domain_vgic_free(d);
diff --git a/xen/arch/arm/gic.c b/xen/arch/arm/gic.c
index a5348f2..5c8b621 100644
--- a/xen/arch/arm/gic.c
+++ b/xen/arch/arm/gic.c
@@ -47,7 +47,7 @@ static DEFINE_PER_CPU(uint64_t, lr_mask);
 
 static void gic_update_one_lr(struct vcpu *v, int i);
 
-static const struct gic_hw_operations *gic_hw_ops;
+const struct gic_hw_operations *gic_hw_ops;
 
 void register_gic_ops(const struct gic_hw_operations *ops)
 {
diff --git a/xen/arch/arm/irq.c b/xen/arch/arm/irq.c
index dd62ba6..9a4e50d 100644
--- a/xen/arch/arm/irq.c
+++ b/xen/arch/arm/irq.c
@@ -184,6 +184,7 @@ int request_irq(unsigned int irq, unsigned int irqflags,
 }
 
 /* Dispatch an interrupt */
+extern const struct gic_hw_operations *gic_hw_ops;
 void do_IRQ(struct cpu_user_regs *regs, unsigned int irq, int is_fiq)
 {
     struct irq_desc *desc = irq_to_desc(irq);
@@ -202,6 +203,12 @@ void do_IRQ(struct cpu_user_regs *regs, unsigned int irq, int is_fiq)
     irq_enter();
 
     spin_lock(&desc->lock);
+
+    if (irq == 30) {
+        set_bit(_IRQ_GUEST, &desc->status);
+        desc->handler = gic_hw_ops->gic_guest_irq_type;
+    }
+
     desc->handler->ack(desc);
 
     if ( !desc->action )
@@ -224,7 +231,23 @@ void do_IRQ(struct cpu_user_regs *regs, unsigned int irq, int is_fiq)
          * The irq cannot be a PPI, we only support delivery of SPIs to
          * guests.
 	 */
-        vgic_vcpu_inject_spi(info->d, info->virq);
+        if (irq != 30)
+            vgic_vcpu_inject_spi(info->d, info->virq);
+        else {
+            struct domain *d;
+            
+            for_each_domain ( d )
+            {
+                struct pending_irq *p;
+                
+                if (d->domain_id == 0 || is_idle_domain(d))
+                    continue;
+                p = irq_to_pending(d->vcpu[0], 30);
+                p->desc = desc;
+                vgic_vcpu_inject_irq(d->vcpu[0], 30);
+                break;
+            }
+        }
         goto out_no_end;
     }
 
diff --git a/xen/arch/arm/time.c b/xen/arch/arm/time.c
index 7dae28b..0249631 100644
--- a/xen/arch/arm/time.c
+++ b/xen/arch/arm/time.c
@@ -297,9 +300,9 @@ void init_timer_interrupt(void)
     /* Sensible defaults */
     WRITE_SYSREG64(0, CNTVOFF_EL2);     /* No VM-specific offset */
     /* Do not let the VMs program the physical timer, only read the physical counter */
-    WRITE_SYSREG32(CNTHCTL_EL2_EL1PCTEN, CNTHCTL_EL2);
     WRITE_SYSREG32(0, CNTP_CTL_EL0);    /* Physical timer disabled */
     WRITE_SYSREG32(0, CNTHP_CTL_EL2);   /* Hypervisor's timer disabled */
+    WRITE_SYSREG32(CNTHCTL_EL2_EL1PCTEN|CNTHCTL_EL2_EL1PCEN, CNTHCTL_EL2);
     isb();
 
     request_irq(timer_irq[TIMER_HYP_PPI], 0, timer_interrupt,
