* [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements
@ 2010-05-24 20:13 Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access Jan Kiszka
                   ` (15 more replies)
  0 siblings, 16 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Juan Quintela

Not yet for merge (unless I happened to forget to add bugs), just a
Request For Testing (and for review, of course). This series grew beyond
my initial plans and my current testing capabilities. Linux and Win7 are
apparently still fine, but that's all I can say so far.

To summarize contributions to the HPET model:
 - fixed host memory corruptions by guest (patch 1, likely stable stuff)
 - coding style cleanup
 - qdev conversion
 - detangling of RTC from HPET dependencies, specifically via providing
   "feedback" IRQ handlers to ease IRQ coalescing workaround
   implementations
 - support for level-triggered HPET IRQs (untested - does anyone know of
   a user?)
 - up to 32 comparators (configurable via qdev prop)
 - MSI support (configurable via qdev prop)
 - dropped obsolete "info hpet" and "query-hpet"

Yet missing:
 - IRQ coalescing workaround
 - maybe some refactoring to ease compile-time disabling (Juan?)
 - build once (I leave this to Blue Swirl :) )
 - multiple HPET blocks (no urgent need yet)

Please give this hell.

Jan Kiszka (15):
  hpet: Catch out-of-bounds timer access
  hpet: Coding style cleanups and some refactorings
  hpet: Silence warning on write to running main counter
  hpet: Move static timer field initialization
  hpet: Convert to qdev
  hpet: Start/stop timer when HPET_TN_ENABLE is modified
  qemu_irq: Add IRQ handlers with delivery feedback
  x86: Refactor RTC IRQ coalescing workaround
  hpet/rtc: Rework RTC IRQ replacement by HPET
  hpet: Drop static state
  hpet: Add support for level-triggered interrupts
  vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8
  hpet: Make number of timers configurable
  hpet: Add MSI support
  monitor/QMP: Drop info hpet / query-hpet

 QMP/vm-info      |    2 +-
 hw/apic.c        |   63 +++---
 hw/apic.h        |   11 +-
 hw/hpet.c        |  582 +++++++++++++++++++++++++++++++++--------------------
 hw/hpet_emul.h   |   46 +----
 hw/hw.h          |   10 +
 hw/i8259.c       |   20 ++-
 hw/ioapic.c      |   34 ++--
 hw/irq.c         |   38 +++-
 hw/irq.h         |   22 ++-
 hw/mc146818rtc.c |   60 ++----
 hw/mc146818rtc.h |    4 +-
 hw/mips_jazz.c   |    2 +-
 hw/mips_malta.c  |    2 +-
 hw/mips_r4k.c    |    2 +-
 hw/pc.c          |   33 +++-
 hw/pc.h          |    2 +-
 hw/pc_piix.c     |    2 +-
 hw/ppc_prep.c    |    2 +-
 monitor.c        |   33 ---
 20 files changed, 552 insertions(+), 418 deletions(-)


* [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:34   ` [Qemu-devel] " Juan Quintela
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings Jan Kiszka
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Also prevent out-of-bounds write access to the timers but don't spam the
host console if it triggers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 8729fb2..1980906 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -294,7 +294,7 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
     if (index >= 0x100 && index <= 0x3ff) {
         uint8_t timer_id = (addr - 0x100) / 0x20;
         if (timer_id > HPET_NUM_TIMERS - 1) {
-            printf("qemu: timer id out of range\n");
+            DPRINTF("qemu: timer id out of range\n");
             return 0;
         }
         HPETTimer *timer = &s->timer[timer_id];
@@ -383,6 +383,10 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
         DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
         HPETTimer *timer = &s->timer[timer_id];
 
+        if (timer_id > HPET_NUM_TIMERS - 1) {
+            DPRINTF("qemu: timer id out of range\n");
+            return;
+        }
         switch ((addr - 0x100) % 0x20) {
             case HPET_TN_CFG:
                 DPRINTF("qemu: hpet_ram_writel HPET_TN_CFG\n");
-- 
1.6.0.2
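
For readers following the thread, the bounds check this patch adds to the
write path can be sketched in isolation as below. This is only an
illustration under stated assumptions: hpet_offset_to_timer() and the
HPET_TIMER_* macros are hypothetical names that do not exist in hw/hpet.c.
The 0x100 base, 0x20 stride, 0x3ff upper bound and HPET_NUM_TIMERS = 3 are
taken from the code shown in this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define HPET_NUM_TIMERS   3      /* value from hw/hpet_emul.h in this series */
#define HPET_TIMER_BASE   0x100  /* first timer register block (hypothetical name) */
#define HPET_TIMER_STRIDE 0x20   /* size of one timer block (hypothetical name) */
#define HPET_TIMER_END    0x3ff  /* last offset treated as a timer register */

/*
 * Hypothetical helper: map an MMIO offset to a timer index.
 * Returns true and stores the index only if it names an existing timer,
 * so the caller never indexes past the end of the timer array.
 */
static bool hpet_offset_to_timer(uint64_t offset, uint8_t *timer_id)
{
    uint8_t id;

    if (offset < HPET_TIMER_BASE || offset > HPET_TIMER_END) {
        return false;            /* not a timer register at all */
    }
    id = (offset - HPET_TIMER_BASE) / HPET_TIMER_STRIDE;
    if (id > HPET_NUM_TIMERS - 1) {
        return false;            /* guest addressed a timer that does not exist */
    }
    *timer_id = id;
    return true;
}
```

With three timers, offsets 0x100 to 0x15f are accepted and anything from
0x160 upwards is rejected, which is the range a guest could previously use
to write outside the timer array.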


* [Qemu-devel] [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:37   ` [Qemu-devel] " Juan Quintela
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 03/15] hpet: Silence warning on write to running main counter Jan Kiszka
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

This moves the private HPET structures into the C module, simplifies
some helper functions and fixes most coding style issues (the biggest
chunk was improper switch-case indentation). No functional changes.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c      |  413 ++++++++++++++++++++++++++++++-------------------------
 hw/hpet_emul.h |   31 +----
 2 files changed, 226 insertions(+), 218 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 1980906..2836fb0 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -37,21 +37,47 @@
 #define DPRINTF(...)
 #endif
 
+struct HPETState;
+typedef struct HPETTimer {  /* timers */
+    uint8_t tn;             /*timer number*/
+    QEMUTimer *qemu_timer;
+    struct HPETState *state;
+    /* Memory-mapped, software visible timer registers */
+    uint64_t config;        /* configuration/cap */
+    uint64_t cmp;           /* comparator */
+    uint64_t fsb;           /* FSB route, not supported now */
+    /* Hidden register state */
+    uint64_t period;        /* Last value written to comparator */
+    uint8_t wrap_flag;      /* timer pop will indicate wrap for one-shot 32-bit
+                             * mode. Next pop will be actual timer expiration.
+                             */
+} HPETTimer;
+
+typedef struct HPETState {
+    uint64_t hpet_offset;
+    qemu_irq *irqs;
+    HPETTimer timer[HPET_NUM_TIMERS];
+
+    /* Memory-mapped, software visible registers */
+    uint64_t capability;        /* capabilities */
+    uint64_t config;            /* configuration */
+    uint64_t isr;               /* interrupt status reg */
+    uint64_t hpet_counter;      /* main counter */
+} HPETState;
+
 static HPETState *hpet_statep;
 
 uint32_t hpet_in_legacy_mode(void)
 {
-    if (hpet_statep)
-        return hpet_statep->config & HPET_CFG_LEGACY;
-    else
+    if (!hpet_statep) {
         return 0;
+    }
+    return hpet_statep->config & HPET_CFG_LEGACY;
 }
 
 static uint32_t timer_int_route(struct HPETTimer *timer)
 {
-    uint32_t route;
-    route = (timer->config & HPET_TN_INT_ROUTE_MASK) >> HPET_TN_INT_ROUTE_SHIFT;
-    return route;
+    return (timer->config & HPET_TN_INT_ROUTE_MASK) >> HPET_TN_INT_ROUTE_SHIFT;
 }
 
 static uint32_t hpet_enabled(void)
@@ -108,9 +134,7 @@ static int deactivating_bit(uint64_t old, uint64_t new, uint64_t mask)
 
 static uint64_t hpet_get_ticks(void)
 {
-    uint64_t ticks;
-    ticks = ns_to_ticks(qemu_get_clock(vm_clock) + hpet_statep->hpet_offset);
-    return ticks;
+    return ns_to_ticks(qemu_get_clock(vm_clock) + hpet_statep->hpet_offset);
 }
 
 /*
@@ -121,12 +145,14 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current)
 
     if (t->config & HPET_TN_32BIT) {
         uint32_t diff, cmp;
+
         cmp = (uint32_t)t->cmp;
         diff = cmp - (uint32_t)current;
         diff = (int32_t)diff > 0 ? diff : (uint32_t)0;
         return (uint64_t)diff;
     } else {
         uint64_t diff, cmp;
+
         cmp = t->cmp;
         diff = cmp - current;
         diff = (int64_t)diff > 0 ? diff : (uint64_t)0;
@@ -136,7 +162,6 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current)
 
 static void update_irq(struct HPETTimer *timer)
 {
-    qemu_irq irq;
     int route;
 
     if (timer->tn <= 1 && hpet_in_legacy_mode()) {
@@ -144,22 +169,20 @@ static void update_irq(struct HPETTimer *timer)
          * timer0 be routed to IRQ0 in NON-APIC or IRQ2 in the I/O APIC,
          * timer1 be routed to IRQ8 in NON-APIC or IRQ8 in the I/O APIC.
          */
-        if (timer->tn == 0) {
-            irq=timer->state->irqs[0];
-        } else
-            irq=timer->state->irqs[8];
+        route = (timer->tn == 0) ? 0 : 8;
     } else {
-        route=timer_int_route(timer);
-        irq=timer->state->irqs[route];
+        route = timer_int_route(timer);
     }
-    if (timer_enabled(timer) && hpet_enabled()) {
-        qemu_irq_pulse(irq);
+    if (!timer_enabled(timer) || !hpet_enabled()) {
+        return;
     }
+    qemu_irq_pulse(timer->state->irqs[route]);
 }
 
 static void hpet_pre_save(void *opaque)
 {
     HPETState *s = opaque;
+
     /* save current counter value */
     s->hpet_counter = hpet_get_ticks();
 }
@@ -212,7 +235,7 @@ static const VMStateDescription vmstate_hpet = {
  */
 static void hpet_timer(void *opaque)
 {
-    HPETTimer *t = (HPETTimer*)opaque;
+    HPETTimer *t = opaque;
     uint64_t diff;
 
     uint64_t period = t->period;
@@ -220,20 +243,22 @@ static void hpet_timer(void *opaque)
 
     if (timer_is_periodic(t) && period != 0) {
         if (t->config & HPET_TN_32BIT) {
-            while (hpet_time_after(cur_tick, t->cmp))
+            while (hpet_time_after(cur_tick, t->cmp)) {
                 t->cmp = (uint32_t)(t->cmp + t->period);
-        } else
-            while (hpet_time_after64(cur_tick, t->cmp))
+            }
+        } else {
+            while (hpet_time_after64(cur_tick, t->cmp)) {
                 t->cmp += period;
-
+            }
+        }
         diff = hpet_calculate_diff(t, cur_tick);
-        qemu_mod_timer(t->qemu_timer, qemu_get_clock(vm_clock)
-                       + (int64_t)ticks_to_ns(diff));
+        qemu_mod_timer(t->qemu_timer,
+                       qemu_get_clock(vm_clock) + (int64_t)ticks_to_ns(diff));
     } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
         if (t->wrap_flag) {
             diff = hpet_calculate_diff(t, cur_tick);
-            qemu_mod_timer(t->qemu_timer, qemu_get_clock(vm_clock)
-                           + (int64_t)ticks_to_ns(diff));
+            qemu_mod_timer(t->qemu_timer, qemu_get_clock(vm_clock) +
+                           (int64_t)ticks_to_ns(diff));
             t->wrap_flag = 0;
         }
     }
@@ -260,8 +285,8 @@ static void hpet_set_timer(HPETTimer *t)
             t->wrap_flag = 1;
         }
     }
-    qemu_mod_timer(t->qemu_timer, qemu_get_clock(vm_clock)
-                   + (int64_t)ticks_to_ns(diff));
+    qemu_mod_timer(t->qemu_timer,
+                   qemu_get_clock(vm_clock) + (int64_t)ticks_to_ns(diff));
 }
 
 static void hpet_del_timer(HPETTimer *t)
@@ -285,7 +310,7 @@ static uint32_t hpet_ram_readw(void *opaque, target_phys_addr_t addr)
 
 static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
 {
-    HPETState *s = (HPETState *)opaque;
+    HPETState *s = opaque;
     uint64_t cur_tick, index;
 
     DPRINTF("qemu: Enter hpet_ram_readl at %" PRIx64 "\n", addr);
@@ -293,57 +318,60 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
     /*address range of all TN regs*/
     if (index >= 0x100 && index <= 0x3ff) {
         uint8_t timer_id = (addr - 0x100) / 0x20;
+        HPETTimer *timer = &s->timer[timer_id];
+
         if (timer_id > HPET_NUM_TIMERS - 1) {
             DPRINTF("qemu: timer id out of range\n");
             return 0;
         }
-        HPETTimer *timer = &s->timer[timer_id];
 
         switch ((addr - 0x100) % 0x20) {
-            case HPET_TN_CFG:
-                return timer->config;
-            case HPET_TN_CFG + 4: // Interrupt capabilities
-                return timer->config >> 32;
-            case HPET_TN_CMP: // comparator register
-                return timer->cmp;
-            case HPET_TN_CMP + 4:
-                return timer->cmp >> 32;
-            case HPET_TN_ROUTE:
-                return timer->fsb >> 32;
-            default:
-                DPRINTF("qemu: invalid hpet_ram_readl\n");
-                break;
+        case HPET_TN_CFG:
+            return timer->config;
+        case HPET_TN_CFG + 4: // Interrupt capabilities
+            return timer->config >> 32;
+        case HPET_TN_CMP: // comparator register
+            return timer->cmp;
+        case HPET_TN_CMP + 4:
+            return timer->cmp >> 32;
+        case HPET_TN_ROUTE:
+            return timer->fsb >> 32;
+        default:
+            DPRINTF("qemu: invalid hpet_ram_readl\n");
+            break;
         }
     } else {
         switch (index) {
-            case HPET_ID:
-                return s->capability;
-            case HPET_PERIOD:
-                return s->capability >> 32;
-            case HPET_CFG:
-                return s->config;
-            case HPET_CFG + 4:
-                DPRINTF("qemu: invalid HPET_CFG + 4 hpet_ram_readl \n");
-                return 0;
-            case HPET_COUNTER:
-                if (hpet_enabled())
-                    cur_tick = hpet_get_ticks();
-                else
-                    cur_tick = s->hpet_counter;
-                DPRINTF("qemu: reading counter  = %" PRIx64 "\n", cur_tick);
-                return cur_tick;
-            case HPET_COUNTER + 4:
-                if (hpet_enabled())
-                    cur_tick = hpet_get_ticks();
-                else
-                    cur_tick = s->hpet_counter;
-                DPRINTF("qemu: reading counter + 4  = %" PRIx64 "\n", cur_tick);
-                return cur_tick >> 32;
-            case HPET_STATUS:
-                return s->isr;
-            default:
-                DPRINTF("qemu: invalid hpet_ram_readl\n");
-                break;
+        case HPET_ID:
+            return s->capability;
+        case HPET_PERIOD:
+            return s->capability >> 32;
+        case HPET_CFG:
+            return s->config;
+        case HPET_CFG + 4:
+            DPRINTF("qemu: invalid HPET_CFG + 4 hpet_ram_readl \n");
+            return 0;
+        case HPET_COUNTER:
+            if (hpet_enabled()) {
+                cur_tick = hpet_get_ticks();
+            } else {
+                cur_tick = s->hpet_counter;
+            }
+            DPRINTF("qemu: reading counter  = %" PRIx64 "\n", cur_tick);
+            return cur_tick;
+        case HPET_COUNTER + 4:
+            if (hpet_enabled()) {
+                cur_tick = hpet_get_ticks();
+            } else {
+                cur_tick = s->hpet_counter;
+            }
+            DPRINTF("qemu: reading counter + 4  = %" PRIx64 "\n", cur_tick);
+            return cur_tick >> 32;
+        case HPET_STATUS:
+            return s->isr;
+        default:
+            DPRINTF("qemu: invalid hpet_ram_readl\n");
+            break;
         }
     }
     return 0;
@@ -369,7 +397,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                             uint32_t value)
 {
     int i;
-    HPETState *s = (HPETState *)opaque;
+    HPETState *s = opaque;
     uint64_t old_val, new_val, val, index;
 
     DPRINTF("qemu: Enter hpet_ram_writel at %" PRIx64 " = %#x\n", addr, value);
@@ -380,133 +408,137 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
     /*address range of all TN regs*/
     if (index >= 0x100 && index <= 0x3ff) {
         uint8_t timer_id = (addr - 0x100) / 0x20;
-        DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
         HPETTimer *timer = &s->timer[timer_id];
 
+        DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
         if (timer_id > HPET_NUM_TIMERS - 1) {
             DPRINTF("qemu: timer id out of range\n");
             return;
         }
         switch ((addr - 0x100) % 0x20) {
-            case HPET_TN_CFG:
-                DPRINTF("qemu: hpet_ram_writel HPET_TN_CFG\n");
-                val = hpet_fixup_reg(new_val, old_val, HPET_TN_CFG_WRITE_MASK);
-                timer->config = (timer->config & 0xffffffff00000000ULL) | val;
-                if (new_val & HPET_TN_32BIT) {
-                    timer->cmp = (uint32_t)timer->cmp;
-                    timer->period = (uint32_t)timer->period;
-                }
-                if (new_val & HPET_TIMER_TYPE_LEVEL) {
-                    printf("qemu: level-triggered hpet not supported\n");
-                    exit (-1);
-                }
-
-                break;
-            case HPET_TN_CFG + 4: // Interrupt capabilities
-                DPRINTF("qemu: invalid HPET_TN_CFG+4 write\n");
-                break;
-            case HPET_TN_CMP: // comparator register
-                DPRINTF("qemu: hpet_ram_writel HPET_TN_CMP \n");
-                if (timer->config & HPET_TN_32BIT)
-                    new_val = (uint32_t)new_val;
-                if (!timer_is_periodic(timer) ||
-                           (timer->config & HPET_TN_SETVAL))
-                    timer->cmp = (timer->cmp & 0xffffffff00000000ULL)
-                                  | new_val;
-                if (timer_is_periodic(timer)) {
-                    /*
-                     * FIXME: Clamp period to reasonable min value?
-                     * Clamp period to reasonable max value
-                     */
-                    new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
-                    timer->period = (timer->period & 0xffffffff00000000ULL)
-                                     | new_val;
+        case HPET_TN_CFG:
+            DPRINTF("qemu: hpet_ram_writel HPET_TN_CFG\n");
+            val = hpet_fixup_reg(new_val, old_val, HPET_TN_CFG_WRITE_MASK);
+            timer->config = (timer->config & 0xffffffff00000000ULL) | val;
+            if (new_val & HPET_TN_32BIT) {
+                timer->cmp = (uint32_t)timer->cmp;
+                timer->period = (uint32_t)timer->period;
+            }
+            if (new_val & HPET_TN_TYPE_LEVEL) {
+                printf("qemu: level-triggered hpet not supported\n");
+                exit (-1);
+            }
+            break;
+        case HPET_TN_CFG + 4: // Interrupt capabilities
+            DPRINTF("qemu: invalid HPET_TN_CFG+4 write\n");
+            break;
+        case HPET_TN_CMP: // comparator register
+            DPRINTF("qemu: hpet_ram_writel HPET_TN_CMP \n");
+            if (timer->config & HPET_TN_32BIT) {
+                new_val = (uint32_t)new_val;
+            }
+            if (!timer_is_periodic(timer)
+                || (timer->config & HPET_TN_SETVAL)) {
+                timer->cmp = (timer->cmp & 0xffffffff00000000ULL) | new_val;
+            }
+            if (timer_is_periodic(timer)) {
+                /*
+                 * FIXME: Clamp period to reasonable min value?
+                 * Clamp period to reasonable max value
+                 */
+                new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+                timer->period =
+                    (timer->period & 0xffffffff00000000ULL) | new_val;
+            }
+            timer->config &= ~HPET_TN_SETVAL;
+            if (hpet_enabled()) {
+                hpet_set_timer(timer);
+            }
+            break;
+        case HPET_TN_CMP + 4: // comparator register high order
+            DPRINTF("qemu: hpet_ram_writel HPET_TN_CMP + 4\n");
+            if (!timer_is_periodic(timer)
+                || (timer->config & HPET_TN_SETVAL)) {
+                timer->cmp = (timer->cmp & 0xffffffffULL) | new_val << 32;
+            } else {
+                /*
+                 * FIXME: Clamp period to reasonable min value?
+                 * Clamp period to reasonable max value
+                 */
+                new_val &= (timer->config & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
+                timer->period =
+                    (timer->period & 0xffffffffULL) | new_val << 32;
                 }
                 timer->config &= ~HPET_TN_SETVAL;
-                if (hpet_enabled())
+                if (hpet_enabled()) {
                     hpet_set_timer(timer);
-                break;
-            case HPET_TN_CMP + 4: // comparator register high order
-                DPRINTF("qemu: hpet_ram_writel HPET_TN_CMP + 4\n");
-                if (!timer_is_periodic(timer) ||
-                           (timer->config & HPET_TN_SETVAL))
-                    timer->cmp = (timer->cmp & 0xffffffffULL)
-                                  | new_val << 32;
-                else {
-                    /*
-                     * FIXME: Clamp period to reasonable min value?
-                     * Clamp period to reasonable max value
-                     */
-                    new_val &= (timer->config
-                                & HPET_TN_32BIT ? ~0u : ~0ull) >> 1;
-                    timer->period = (timer->period & 0xffffffffULL)
-                                     | new_val << 32;
                 }
-                timer->config &= ~HPET_TN_SETVAL;
-                if (hpet_enabled())
-                    hpet_set_timer(timer);
-                break;
-            case HPET_TN_ROUTE + 4:
-                DPRINTF("qemu: hpet_ram_writel HPET_TN_ROUTE + 4\n");
-                break;
-            default:
-                DPRINTF("qemu: invalid hpet_ram_writel\n");
                 break;
+        case HPET_TN_ROUTE + 4:
+            DPRINTF("qemu: hpet_ram_writel HPET_TN_ROUTE + 4\n");
+            break;
+        default:
+            DPRINTF("qemu: invalid hpet_ram_writel\n");
+            break;
         }
         return;
     } else {
         switch (index) {
-            case HPET_ID:
-                return;
-            case HPET_CFG:
-                val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
-                s->config = (s->config & 0xffffffff00000000ULL) | val;
-                if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
-                    /* Enable main counter and interrupt generation. */
-                    s->hpet_offset = ticks_to_ns(s->hpet_counter)
-                                     - qemu_get_clock(vm_clock);
-                    for (i = 0; i < HPET_NUM_TIMERS; i++)
-                        if ((&s->timer[i])->cmp != ~0ULL)
-                            hpet_set_timer(&s->timer[i]);
-                }
-                else if (deactivating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
-                    /* Halt main counter and disable interrupt generation. */
-                    s->hpet_counter = hpet_get_ticks();
-                    for (i = 0; i < HPET_NUM_TIMERS; i++)
-                        hpet_del_timer(&s->timer[i]);
+        case HPET_ID:
+            return;
+        case HPET_CFG:
+            val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
+            s->config = (s->config & 0xffffffff00000000ULL) | val;
+            if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
+                /* Enable main counter and interrupt generation. */
+                s->hpet_offset =
+                    ticks_to_ns(s->hpet_counter) - qemu_get_clock(vm_clock);
+                for (i = 0; i < HPET_NUM_TIMERS; i++) {
+                    if ((&s->timer[i])->cmp != ~0ULL) {
+                        hpet_set_timer(&s->timer[i]);
+                    }
                 }
-                /* i8254 and RTC are disabled when HPET is in legacy mode */
-                if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
-                    hpet_pit_disable();
-                } else if (deactivating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
-                    hpet_pit_enable();
+            } else if (deactivating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
+                /* Halt main counter and disable interrupt generation. */
+                s->hpet_counter = hpet_get_ticks();
+                for (i = 0; i < HPET_NUM_TIMERS; i++) {
+                    hpet_del_timer(&s->timer[i]);
                 }
-                break;
-            case HPET_CFG + 4:
-                DPRINTF("qemu: invalid HPET_CFG+4 write \n");
-                break;
-            case HPET_STATUS:
-                /* FIXME: need to handle level-triggered interrupts */
-                break;
-            case HPET_COUNTER:
-               if (hpet_enabled())
-                   printf("qemu: Writing counter while HPET enabled!\n");
-               s->hpet_counter = (s->hpet_counter & 0xffffffff00000000ULL)
-                                  | value;
-               DPRINTF("qemu: HPET counter written. ctr = %#x -> %" PRIx64 "\n",
-                        value, s->hpet_counter);
-               break;
-            case HPET_COUNTER + 4:
-               if (hpet_enabled())
-                   printf("qemu: Writing counter while HPET enabled!\n");
-               s->hpet_counter = (s->hpet_counter & 0xffffffffULL)
-                                  | (((uint64_t)value) << 32);
-               DPRINTF("qemu: HPET counter + 4 written. ctr = %#x -> %" PRIx64 "\n",
-                        value, s->hpet_counter);
-               break;
-            default:
-               DPRINTF("qemu: invalid hpet_ram_writel\n");
-               break;
+            }
+            /* i8254 and RTC are disabled when HPET is in legacy mode */
+            if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
+                hpet_pit_disable();
+            } else if (deactivating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
+                hpet_pit_enable();
+            }
+            break;
+        case HPET_CFG + 4:
+            DPRINTF("qemu: invalid HPET_CFG+4 write \n");
+            break;
+        case HPET_STATUS:
+            /* FIXME: need to handle level-triggered interrupts */
+            break;
+        case HPET_COUNTER:
+            if (hpet_enabled()) {
+                printf("qemu: Writing counter while HPET enabled!\n");
+            }
+            s->hpet_counter =
+                (s->hpet_counter & 0xffffffff00000000ULL) | value;
+            DPRINTF("qemu: HPET counter written. ctr = %#x -> %" PRIx64 "\n",
+                    value, s->hpet_counter);
+            break;
+        case HPET_COUNTER + 4:
+            if (hpet_enabled()) {
+                printf("qemu: Writing counter while HPET enabled!\n");
+            }
+            s->hpet_counter =
+                (s->hpet_counter & 0xffffffffULL) | (((uint64_t)value) << 32);
+            DPRINTF("qemu: HPET counter + 4 written. ctr = %#x -> %" PRIx64 "\n",
+                    value, s->hpet_counter);
+            break;
+        default:
+            DPRINTF("qemu: invalid hpet_ram_writel\n");
+            break;
         }
     }
 }
@@ -533,13 +565,15 @@ static CPUWriteMemoryFunc * const hpet_ram_write[] = {
     hpet_ram_writel,
 };
 
-static void hpet_reset(void *opaque) {
+static void hpet_reset(void *opaque)
+{
     HPETState *s = opaque;
     int i;
     static int count = 0;
 
-    for (i=0; i<HPET_NUM_TIMERS; i++) {
+    for (i = 0; i < HPET_NUM_TIMERS; i++) {
         HPETTimer *timer = &s->timer[i];
+
         hpet_del_timer(timer);
         timer->tn = i;
         timer->cmp = ~0ULL;
@@ -557,19 +591,22 @@ static void hpet_reset(void *opaque) {
     s->capability = 0x8086a201ULL;
     s->capability |= ((HPET_CLK_PERIOD) << 32);
     s->config = 0ULL;
-    if (count > 0)
+    if (count > 0) {
         /* we don't enable pit when hpet_reset is first called (by hpet_init)
          * because hpet is taking over for pit here. On subsequent invocations,
          * hpet_reset is called due to system reset. At this point control must
          * be returned to pit until SW reenables hpet.
          */
         hpet_pit_enable();
+    }
     count = 1;
 }
 
 
-void hpet_init(qemu_irq *irq) {
+void hpet_init(qemu_irq *irq)
+{
     int i, iomemtype;
+    HPETTimer *timer;
     HPETState *s;
 
     DPRINTF ("hpet_init\n");
@@ -577,8 +614,8 @@ void hpet_init(qemu_irq *irq) {
     s = qemu_mallocz(sizeof(HPETState));
     hpet_statep = s;
     s->irqs = irq;
-    for (i=0; i<HPET_NUM_TIMERS; i++) {
-        HPETTimer *timer = &s->timer[i];
+    for (i = 0; i < HPET_NUM_TIMERS; i++) {
+        timer = &s->timer[i];
         timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer);
     }
     vmstate_register(-1, &vmstate_hpet, s);
diff --git a/hw/hpet_emul.h b/hw/hpet_emul.h
index cfd95b4..2f5f8ba 100644
--- a/hw/hpet_emul.h
+++ b/hw/hpet_emul.h
@@ -18,7 +18,6 @@
 
 #define FS_PER_NS 1000000
 #define HPET_NUM_TIMERS 3
-#define HPET_TIMER_TYPE_LEVEL 0x002
 
 #define HPET_CFG_ENABLE 0x001
 #define HPET_CFG_LEGACY 0x002
@@ -33,7 +32,7 @@
 #define HPET_TN_ROUTE   0x010
 #define HPET_CFG_WRITE_MASK  0x3
 
-
+#define HPET_TN_TYPE_LEVEL       0x002
 #define HPET_TN_ENABLE           0x004
 #define HPET_TN_PERIODIC         0x008
 #define HPET_TN_PERIODIC_CAP     0x010
@@ -46,34 +45,6 @@
 #define HPET_TN_INT_ROUTE_CAP_SHIFT 32
 #define HPET_TN_CFG_BITS_READONLY_OR_RESERVED 0xffff80b1U
 
-struct HPETState;
-typedef struct HPETTimer {  /* timers */
-    uint8_t tn;             /*timer number*/
-    QEMUTimer *qemu_timer;
-    struct HPETState *state;
-    /* Memory-mapped, software visible timer registers */
-    uint64_t config;        /* configuration/cap */
-    uint64_t cmp;           /* comparator */
-    uint64_t fsb;           /* FSB route, not supported now */
-    /* Hidden register state */
-    uint64_t period;        /* Last value written to comparator */
-    uint8_t wrap_flag;      /* timer pop will indicate wrap for one-shot 32-bit
-                             * mode. Next pop will be actual timer expiration.
-                             */
-} HPETTimer;
-
-typedef struct HPETState {
-    uint64_t hpet_offset;
-    qemu_irq *irqs;
-    HPETTimer timer[HPET_NUM_TIMERS];
-
-    /* Memory-mapped, software visible registers */
-    uint64_t capability;        /* capabilities */
-    uint64_t config;            /* configuration */
-    uint64_t isr;               /* interrupt status reg */
-    uint64_t hpet_counter;      /* main counter */
-} HPETState;
-
 #if defined TARGET_I386
 extern uint32_t hpet_in_legacy_mode(void);
 extern void hpet_init(qemu_irq *irq);
-- 
1.6.0.2
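
The route selection that the refactored update_irq() performs can be shown
as a small stand-alone sketch. Assumptions: hpet_pick_route() is a
hypothetical name, it takes the legacy-mode state as a parameter instead of
calling hpet_in_legacy_mode(), and the mask and shift values are not shown
in this patch; they are assumed here from the HPET specification, which
places the interrupt route field in bits 13:9 of the timer configuration
register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values (not visible in the diff): route field in bits 13:9. */
#define HPET_TN_INT_ROUTE_SHIFT 9
#define HPET_TN_INT_ROUTE_MASK  0x3e00

/* Same mask-and-shift expression as the simplified helper in the patch. */
static uint32_t timer_int_route(uint64_t config)
{
    return (config & HPET_TN_INT_ROUTE_MASK) >> HPET_TN_INT_ROUTE_SHIFT;
}

/*
 * Hypothetical stand-alone version of the route choice in update_irq():
 * in legacy mode timer 0 goes to IRQ 0 and timer 1 to IRQ 8; every other
 * case uses the route programmed into the timer configuration.
 */
static uint32_t hpet_pick_route(uint8_t tn, uint64_t config, bool legacy_mode)
{
    if (tn <= 1 && legacy_mode) {
        return (tn == 0) ? 0 : 8;
    }
    return timer_int_route(config);
}
```

Computing the index first and looking up the qemu_irq once at the end is
what lets the patch drop the local irq variable without changing behavior.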


* [Qemu-devel] [RFT][PATCH 03/15] hpet: Silence warning on write to running main counter
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 04/15] hpet: Move static timer field initialization Jan Kiszka
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Setting the main counter while the HPET is enabled may not be a good
idea on the guest's part, but it is supported and should thus not spam
the host console with warnings.
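
For illustration, the effect of switching from printf to DPRINTF can be
sketched as below. This is only a sketch: the macro body and the helper
write_counter_low() are assumptions made for this example and are not
copied from hw/hpet.c. Only the HPET_DEBUG switch and the masking
expression come from the code touched by this series.

```c
#include <stdio.h>

/* Uncomment to get the diagnostic back, similar to HPET_DEBUG in hw/hpet.c. */
/* #define HPET_DEBUG */

#ifdef HPET_DEBUG
#define DPRINTF(...) printf(__VA_ARGS__)
#else
#define DPRINTF(...) do { } while (0)
#endif

/*
 * Hypothetical stand-in for the HPET_COUNTER write path: the write is
 * always applied, only the diagnostic depends on the debug switch.
 */
static unsigned long long write_counter_low(unsigned long long counter,
                                            unsigned int value, int enabled)
{
    if (enabled) {
        DPRINTF("qemu: Writing counter while HPET enabled!\n");
    }
    /* Replace the low 32 bits, keep the high 32 bits. */
    return (counter & 0xffffffff00000000ULL) | value;
}
```

With HPET_DEBUG undefined, a guest that rewrites the counter in a loop
no longer produces any host-side output, while the register update
itself is unchanged.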

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 2836fb0..bcb160b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -520,7 +520,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             break;
         case HPET_COUNTER:
             if (hpet_enabled()) {
-                printf("qemu: Writing counter while HPET enabled!\n");
+                DPRINTF("qemu: Writing counter while HPET enabled!\n");
             }
             s->hpet_counter =
                 (s->hpet_counter & 0xffffffff00000000ULL) | value;
@@ -529,7 +529,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             break;
         case HPET_COUNTER + 4:
             if (hpet_enabled()) {
-                printf("qemu: Writing counter while HPET enabled!\n");
+                DPRINTF("qemu: Writing counter while HPET enabled!\n");
             }
             s->hpet_counter =
                 (s->hpet_counter & 0xffffffffULL) | (((uint64_t)value) << 32);
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 04/15] hpet: Move static timer field initialization
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (2 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 03/15] hpet: Silence warning on write to running main counter Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev Jan Kiszka
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Properly initialize HPETTimer::tn and HPETTimer::state once during
hpet_init instead of (re-)writing them on every reset.
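
The split between one-time and per-reset initialization can be sketched
as follows. The struct layout is reduced to the fields discussed here,
and sketch_init()/sketch_reset() are hypothetical names for this
example, not the functions in hw/hpet.c.

```c
#include <stdint.h>

#define NUM_TIMERS 3   /* mirrors HPET_NUM_TIMERS from hw/hpet_emul.h */

struct state;

struct timer {
    uint8_t tn;            /* static: timer index, written once */
    struct state *state;   /* static: back-pointer, written once */
    uint64_t cmp;          /* dynamic: restored on every reset */
    uint64_t period;       /* dynamic: restored on every reset */
};

struct state {
    struct timer timer[NUM_TIMERS];
};

/* One-time setup: fields that never change over the device lifetime. */
static void sketch_init(struct state *s)
{
    for (int i = 0; i < NUM_TIMERS; i++) {
        s->timer[i].tn = (uint8_t)i;
        s->timer[i].state = s;
    }
}

/* Reset: only guest-visible and derived state is touched. */
static void sketch_reset(struct state *s)
{
    for (int i = 0; i < NUM_TIMERS; i++) {
        s->timer[i].cmp = ~0ULL;
        s->timer[i].period = 0;
    }
}
```

Keeping the static fields out of the reset path means a reset can never
leave them in an inconsistent state, and the reset handler documents
exactly which state the guest is able to change.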

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index bcb160b..fd7a1fd 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -575,12 +575,10 @@ static void hpet_reset(void *opaque)
         HPETTimer *timer = &s->timer[i];
 
         hpet_del_timer(timer);
-        timer->tn = i;
         timer->cmp = ~0ULL;
         timer->config =  HPET_TN_PERIODIC_CAP | HPET_TN_SIZE_CAP;
         /* advertise availability of ioapic inti2 */
         timer->config |=  0x00000004ULL << 32;
-        timer->state = s;
         timer->period = 0ULL;
         timer->wrap_flag = 0;
     }
@@ -617,6 +615,8 @@ void hpet_init(qemu_irq *irq)
     for (i = 0; i < HPET_NUM_TIMERS; i++) {
         timer = &s->timer[i];
         timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer);
+        timer->tn = i;
+        timer->state = s;
     }
     vmstate_register(-1, &vmstate_hpet, s);
     qemu_register_reset(hpet_reset, s);
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (3 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 04/15] hpet: Move static timer field initialization Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-25  9:37   ` Paul Brook
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 06/15] hpet: Start/stop timer when HPET_TN_ENABLE is modified Jan Kiszka
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Register the HPET as a sysbus device and create it that way. As it can
route its IRQs to any ISA IRQ, we need to connect it to all 24 of them.
Once converted to qdev, we can move the reset handler and the vmstate
registration into its hands as well.
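
The conversion relies on embedding the bus device as a member of the
device state and recovering the state from a pointer to that member.
The sketch below shows the general idea with a generic container-of
macro. It is an assumption about the pattern, not the definition of
FROM_SYSBUS; struct bus_device and struct hpet_like_state are
hypothetical stand-ins for SysBusDevice and HPETState.

```c
#include <stddef.h>

struct bus_device {
    int id;                       /* stand-in for SysBusDevice contents */
};

struct hpet_like_state {
    struct bus_device busdev;     /* embedded bus device */
    unsigned long long config;
};

/* Generic container-of: step back from a member to its enclosing struct. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* What an init or reset callback does first: recover the device state. */
static struct hpet_like_state *state_from_bus(struct bus_device *dev)
{
    return CONTAINER_OF(dev, struct hpet_like_state, busdev);
}
```

Because the framework allocates the whole state structure itself (see
.qdev.size in the patch), the callbacks no longer need a separately
allocated state or an opaque pointer.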

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c      |   43 ++++++++++++++++++++++++++++++-------------
 hw/hpet_emul.h |    3 ++-
 hw/pc.c        |    7 ++++++-
 3 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index fd7a1fd..6974935 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -29,6 +29,7 @@
 #include "console.h"
 #include "qemu-timer.h"
 #include "hpet_emul.h"
+#include "sysbus.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -54,8 +55,9 @@ typedef struct HPETTimer {  /* timers */
 } HPETTimer;
 
 typedef struct HPETState {
+    SysBusDevice busdev;
     uint64_t hpet_offset;
-    qemu_irq *irqs;
+    qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
     HPETTimer timer[HPET_NUM_TIMERS];
 
     /* Memory-mapped, software visible registers */
@@ -565,9 +567,9 @@ static CPUWriteMemoryFunc * const hpet_ram_write[] = {
     hpet_ram_writel,
 };
 
-static void hpet_reset(void *opaque)
+static void hpet_reset(DeviceState *d)
 {
-    HPETState *s = opaque;
+    HPETState *s = FROM_SYSBUS(HPETState, sysbus_from_qdev(d));
     int i;
     static int count = 0;
 
@@ -600,28 +602,43 @@ static void hpet_reset(void *opaque)
     count = 1;
 }
 
-
-void hpet_init(qemu_irq *irq)
+static int hpet_init(SysBusDevice *dev)
 {
+    HPETState *s = FROM_SYSBUS(HPETState, dev);
     int i, iomemtype;
     HPETTimer *timer;
-    HPETState *s;
-
-    DPRINTF ("hpet_init\n");
 
-    s = qemu_mallocz(sizeof(HPETState));
+    assert(!hpet_statep);
     hpet_statep = s;
-    s->irqs = irq;
+    for (i = 0; i < HPET_NUM_IRQ_ROUTES; i++) {
+        sysbus_init_irq(dev, &s->irqs[i]);
+    }
     for (i = 0; i < HPET_NUM_TIMERS; i++) {
         timer = &s->timer[i];
         timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer);
         timer->tn = i;
         timer->state = s;
     }
-    vmstate_register(-1, &vmstate_hpet, s);
-    qemu_register_reset(hpet_reset, s);
+
     /* HPET Area */
     iomemtype = cpu_register_io_memory(hpet_ram_read,
                                        hpet_ram_write, s);
-    cpu_register_physical_memory(HPET_BASE, 0x400, iomemtype);
+    sysbus_init_mmio(dev, 0x400, iomemtype);
+    return 0;
 }
+
+static SysBusDeviceInfo hpet_device_info = {
+    .qdev.name    = "hpet",
+    .qdev.size    = sizeof(HPETState),
+    .qdev.no_user = 1,
+    .qdev.vmsd    = &vmstate_hpet,
+    .qdev.reset   = hpet_reset,
+    .init         = hpet_init,
+};
+
+static void hpet_register_device(void)
+{
+    sysbus_register_withprop(&hpet_device_info);
+}
+
+device_init(hpet_register_device)
diff --git a/hw/hpet_emul.h b/hw/hpet_emul.h
index 2f5f8ba..785f850 100644
--- a/hw/hpet_emul.h
+++ b/hw/hpet_emul.h
@@ -19,6 +19,8 @@
 #define FS_PER_NS 1000000
 #define HPET_NUM_TIMERS 3
 
+#define HPET_NUM_IRQ_ROUTES     32
+
 #define HPET_CFG_ENABLE 0x001
 #define HPET_CFG_LEGACY 0x002
 
@@ -47,7 +49,6 @@
 
 #if defined TARGET_I386
 extern uint32_t hpet_in_legacy_mode(void);
-extern void hpet_init(qemu_irq *irq);
 #endif
 
 #endif
diff --git a/hw/pc.c b/hw/pc.c
index e7f31d3..631b0ae 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -35,6 +35,7 @@
 #include "elf.h"
 #include "multiboot.h"
 #include "mc146818rtc.h"
+#include "sysbus.h"
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -945,7 +946,11 @@ void pc_basic_device_init(qemu_irq *isa_irq,
     pit = pit_init(0x40, isa_reserve_irq(0));
     pcspk_init(pit);
     if (!no_hpet) {
-        hpet_init(isa_irq);
+        DeviceState *hpet = sysbus_create_simple("hpet", HPET_BASE, NULL);
+
+        for (i = 0; i < 24; i++) {
+            sysbus_connect_irq(sysbus_from_qdev(hpet), i, isa_irq[i]);
+        }
     }
 
     for(i = 0; i < MAX_SERIAL_PORTS; i++) {
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 06/15] hpet: Start/stop timer when HPET_TN_ENABLE is modified
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (4 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Jan Kiszka
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

We have to update the qemu timer when the per-timer enable bit is
toggled, just like for HPET_CFG_ENABLE changes.
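
The edge detection used in the hunk below can be sketched like this.
activating_bit() and deactivating_bit() are called by the patch but
defined elsewhere in hw/hpet.c, so the bodies here are an assumption
about their semantics. TN_ENABLE and on_config_write() are hypothetical
names for this example.

```c
#include <stdint.h>

#define TN_ENABLE 0x004   /* assumed to match HPET_TN_ENABLE */

/* True only on a 0 -> 1 transition of the masked bit. */
static int activating_bit(uint64_t old_val, uint64_t new_val, uint64_t mask)
{
    return !(old_val & mask) && (new_val & mask);
}

/* True only on a 1 -> 0 transition of the masked bit. */
static int deactivating_bit(uint64_t old_val, uint64_t new_val, uint64_t mask)
{
    return (old_val & mask) && !(new_val & mask);
}

enum timer_action { TIMER_KEEP, TIMER_ARM, TIMER_CANCEL };

/* Decide what to do with the host timer after a config register write. */
static enum timer_action on_config_write(uint64_t old_val, uint64_t new_val)
{
    if (activating_bit(old_val, new_val, TN_ENABLE)) {
        return TIMER_ARM;       /* the patch calls hpet_set_timer() here */
    } else if (deactivating_bit(old_val, new_val, TN_ENABLE)) {
        return TIMER_CANCEL;    /* the patch calls hpet_del_timer() here */
    }
    return TIMER_KEEP;          /* bit unchanged: leave the timer alone */
}
```

Acting only on transitions avoids re-arming a running timer when the
guest rewrites the register with the enable bit already set.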

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 6974935..041dd84 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -430,6 +430,11 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                 printf("qemu: level-triggered hpet not supported\n");
                 exit (-1);
             }
+            if (activating_bit(old_val, new_val, HPET_TN_ENABLE)) {
+                hpet_set_timer(timer);
+            } else if (deactivating_bit(old_val, new_val, HPET_TN_ENABLE)) {
+                hpet_del_timer(timer);
+            }
             break;
         case HPET_TN_CFG + 4: // Interrupt capabilities
             DPRINTF("qemu: invalid HPET_TN_CFG+4 write\n");
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (5 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 06/15] hpet: Start/stop timer when HPET_TN_ENABLE is modified Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-25  6:07   ` Gleb Natapov
  2010-05-25 19:09   ` [Qemu-devel] " Blue Swirl
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 08/15] x86: Refactor RTC IRQ coalescing workaround Jan Kiszka
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

This allows communicating potential IRQ coalescing during delivery from
the sink back to the source. Targets that support IRQ coalescing
workarounds need to register handlers that return the appropriate
QEMU_IRQ_* code, and they have to propagate the code across all IRQ
redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
apply its workaround. If multiple sinks exist, the source may only
consider an IRQ coalesced if all other sinks either report
QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
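
The multi-sink rule can be sketched as a small reduction over the
per-sink results. The three QEMU_IRQ_* values are the ones this patch
adds to hw/irq.h; combine_irq_results() is a hypothetical helper for
illustration and is not part of the patch.

```c
#include <stddef.h>

#define QEMU_IRQ_DELIVERED      0
#define QEMU_IRQ_COALESCED      (-1)
#define QEMU_IRQ_MASKED         (-2)

/*
 * Merge the return codes of several sinks into one code for the source:
 *  - any delivery wins, the guest will see the interrupt;
 *  - otherwise a coalesced report wins over masked;
 *  - with no sinks, or only masked ones, the result is masked.
 */
static int combine_irq_results(const int *results, size_t n)
{
    int ret = QEMU_IRQ_MASKED;

    for (size_t i = 0; i < n; i++) {
        if (results[i] == QEMU_IRQ_DELIVERED) {
            return QEMU_IRQ_DELIVERED;
        }
        if (results[i] == QEMU_IRQ_COALESCED) {
            ret = QEMU_IRQ_COALESCED;
        }
    }
    return ret;
}
```

A source that sees QEMU_IRQ_COALESCED from this reduction knows that no
sink accepted the event as a new interrupt, which is the condition under
which a reinjection workaround makes sense.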

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/irq.c |   38 +++++++++++++++++++++++++++++---------
 hw/irq.h |   22 +++++++++++++++-------
 2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/hw/irq.c b/hw/irq.c
index 7703f62..db2cce6 100644
--- a/hw/irq.c
+++ b/hw/irq.c
@@ -26,19 +26,27 @@
 
 struct IRQState {
     qemu_irq_handler handler;
+    qemu_irq_fb_handler feedback_handler;
     void *opaque;
     int n;
 };
 
-void qemu_set_irq(qemu_irq irq, int level)
+int qemu_set_irq(qemu_irq irq, int level)
 {
-    if (!irq)
-        return;
-
-    irq->handler(irq->opaque, irq->n, level);
+    if (!irq) {
+        return 0;
+    }
+    if (irq->feedback_handler) {
+        return irq->feedback_handler(irq->opaque, irq->n, level);
+    } else {
+        irq->handler(irq->opaque, irq->n, level);
+        return QEMU_IRQ_DELIVERED;
+    }
 }
 
-qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+static qemu_irq *allocate_irqs(qemu_irq_handler handler,
+                               qemu_irq_fb_handler feedback_handler,
+                               void *opaque, int n)
 {
     qemu_irq *s;
     struct IRQState *p;
@@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
     for (i = 0; i < n; i++) {
         p->handler = handler;
+        p->feedback_handler = feedback_handler;
         p->opaque = opaque;
         p->n = i;
         s[i] = p;
@@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
     return s;
 }
 
+qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+{
+    return allocate_irqs(handler, NULL, opaque, n);
+}
+
+qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
+                                      void *opaque, int n)
+{
+    return allocate_irqs(NULL, handler, opaque, n);
+}
+
 void qemu_free_irqs(qemu_irq *s)
 {
     qemu_free(s[0]);
     qemu_free(s);
 }
 
-static void qemu_notirq(void *opaque, int line, int level)
+static int qemu_notirq(void *opaque, int line, int level)
 {
     struct IRQState *irq = opaque;
 
-    irq->handler(irq->opaque, irq->n, !level);
+    return qemu_set_irq(irq, !level);
 }
 
 qemu_irq qemu_irq_invert(qemu_irq irq)
 {
     /* The default state for IRQs is low, so raise the output now.  */
     qemu_irq_raise(irq);
-    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
+    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
 }
diff --git a/hw/irq.h b/hw/irq.h
index 5daae44..eee03e6 100644
--- a/hw/irq.h
+++ b/hw/irq.h
@@ -3,15 +3,18 @@
 
 /* Generic IRQ/GPIO pin infrastructure.  */
 
-/* FIXME: Rmove one of these.  */
+#define QEMU_IRQ_DELIVERED      0
+#define QEMU_IRQ_COALESCED      (-1)
+#define QEMU_IRQ_MASKED         (-2)
+
 typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
-typedef void SetIRQFunc(void *opaque, int irq_num, int level);
+typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
 
-void qemu_set_irq(qemu_irq irq, int level);
+int qemu_set_irq(qemu_irq irq, int level);
 
-static inline void qemu_irq_raise(qemu_irq irq)
+static inline int qemu_irq_raise(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    return qemu_set_irq(irq, 1);
 }
 
 static inline void qemu_irq_lower(qemu_irq irq)
@@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
     qemu_set_irq(irq, 0);
 }
 
-static inline void qemu_irq_pulse(qemu_irq irq)
+static inline int qemu_irq_pulse(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    int ret;
+
+    ret = qemu_set_irq(irq, 1);
     qemu_set_irq(irq, 0);
+    return ret;
 }
 
 /* Returns an array of N IRQs.  */
 qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
+qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
+                                      void *opaque, int n);
 void qemu_free_irqs(qemu_irq *s);
 
 /* Returns a new IRQ with opposite polarity.  */
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 08/15] x86: Refactor RTC IRQ coalescing workaround
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (6 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET Jan Kiszka
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Make use of the new feedback IRQ handlers and propagate coalesced
deliveries via handler return code from the sink to the source. As a
by-product, this also adds coalescing support to the PIC.
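
Seen from the source side, the workaround reduces to bookkeeping on the
return code. The sketch below models a periodic source and a sink with a
single pending bit. struct tick_source, struct pending_sink and the
functions are hypothetical and only loosely follow the RTC and APIC
changes in this patch.

```c
#define QEMU_IRQ_DELIVERED      0
#define QEMU_IRQ_COALESCED      (-1)
#define QEMU_IRQ_MASKED         (-2)

typedef int (*raise_fn)(void *opaque);

struct tick_source {
    raise_fn raise;              /* returns a QEMU_IRQ_* code */
    void *opaque;
    unsigned int irq_coalesced;  /* ticks still owed to the guest */
};

/* Periodic tick: remember every tick the sink merged into a pending one. */
static void source_tick(struct tick_source *src)
{
    if (src->raise(src->opaque) == QEMU_IRQ_COALESCED) {
        src->irq_coalesced++;
    }
}

/* Catch-up attempt: an owed tick is cleared once it is not coalesced. */
static void source_reinject(struct tick_source *src)
{
    if (src->irq_coalesced != 0 &&
        src->raise(src->opaque) != QEMU_IRQ_COALESCED) {
        src->irq_coalesced--;
    }
}

/* Hypothetical sink: one pending bit, like a single IRR bit. */
struct pending_sink {
    int pending;
};

static int sink_raise(void *opaque)
{
    struct pending_sink *sink = opaque;
    int ret = sink->pending ? QEMU_IRQ_COALESCED : QEMU_IRQ_DELIVERED;

    sink->pending = 1;
    return ret;
}
```

Compared to the removed apic_reset_irq_delivered() and
apic_get_irq_delivered() pair, the result travels with the call itself,
so the source no longer depends on global APIC state.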

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/apic.c        |   61 ++++++++++++++++++++++++++---------------------------
 hw/apic.h        |   10 ++------
 hw/i8259.c       |   20 +++++++++++++++--
 hw/ioapic.c      |   34 ++++++++++++++++++-----------
 hw/mc146818rtc.c |   22 ++++++++-----------
 hw/pc.c          |   17 ++++++++++----
 hw/pc.h          |    2 +-
 hw/pc_piix.c     |    2 +-
 8 files changed, 94 insertions(+), 74 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 9029dad..641825c 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -108,10 +108,8 @@ typedef struct APICState {
 static int apic_io_memory;
 static APICState *local_apics[MAX_APICS + 1];
 static int last_apic_idx = 0;
-static int apic_irq_delivered;
 
-
-static void apic_set_irq(APICState *s, int vector_num, int trigger_mode);
+static int apic_set_irq(APICState *s, int vector_num, int trigger_mode);
 static void apic_update_irq(APICState *s);
 static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
                                       uint8_t dest, uint8_t dest_mode);
@@ -222,12 +220,12 @@ void apic_deliver_pic_intr(CPUState *env, int level)
     }\
 }
 
-static void apic_bus_deliver(const uint32_t *deliver_bitmask,
-                             uint8_t delivery_mode,
-                             uint8_t vector_num, uint8_t polarity,
-                             uint8_t trigger_mode)
+static int apic_bus_deliver(const uint32_t *deliver_bitmask,
+                            uint8_t delivery_mode, uint8_t vector_num,
+                            uint8_t polarity, uint8_t trigger_mode)
 {
     APICState *apic_iter;
+    int ret;
 
     switch (delivery_mode) {
         case APIC_DM_LOWPRI:
@@ -244,11 +242,12 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
                 if (d >= 0) {
                     apic_iter = local_apics[d];
                     if (apic_iter) {
-                        apic_set_irq(apic_iter, vector_num, trigger_mode);
+                        return apic_set_irq(apic_iter, vector_num,
+                                            trigger_mode);
                     }
                 }
             }
-            return;
+            return QEMU_IRQ_MASKED;
 
         case APIC_DM_FIXED:
             break;
@@ -256,40 +255,48 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
         case APIC_DM_SMI:
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_SMI) );
-            return;
+            return QEMU_IRQ_DELIVERED;
 
         case APIC_DM_NMI:
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_NMI) );
-            return;
+            return QEMU_IRQ_DELIVERED;
 
         case APIC_DM_INIT:
             /* normal INIT IPI sent to processors */
             foreach_apic(apic_iter, deliver_bitmask,
                          cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_INIT) );
-            return;
+            return QEMU_IRQ_DELIVERED;
 
         case APIC_DM_EXTINT:
             /* handled in I/O APIC code */
             break;
 
         default:
-            return;
+            return QEMU_IRQ_MASKED;
     }
 
+    ret = QEMU_IRQ_MASKED;
     foreach_apic(apic_iter, deliver_bitmask,
-                 apic_set_irq(apic_iter, vector_num, trigger_mode) );
+        if (ret == QEMU_IRQ_MASKED)
+            ret = QEMU_IRQ_COALESCED;
+        if (apic_set_irq(apic_iter, vector_num,
+                         trigger_mode) == QEMU_IRQ_DELIVERED) {
+            ret = QEMU_IRQ_DELIVERED;
+        }
+    );
+    return ret;
 }
 
-void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
-                      uint8_t delivery_mode, uint8_t vector_num,
-                      uint8_t polarity, uint8_t trigger_mode)
+int apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
+                     uint8_t delivery_mode, uint8_t vector_num,
+                     uint8_t polarity, uint8_t trigger_mode)
 {
     uint32_t deliver_bitmask[MAX_APIC_WORDS];
 
     apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-    apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num, polarity,
-                     trigger_mode);
+    return apic_bus_deliver(deliver_bitmask, delivery_mode, vector_num,
+                            polarity, trigger_mode);
 }
 
 void cpu_set_apic_base(CPUState *env, uint64_t val)
@@ -384,19 +391,10 @@ static void apic_update_irq(APICState *s)
     cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
 }
 
-void apic_reset_irq_delivered(void)
-{
-    apic_irq_delivered = 0;
-}
-
-int apic_get_irq_delivered(void)
-{
-    return apic_irq_delivered;
-}
-
-static void apic_set_irq(APICState *s, int vector_num, int trigger_mode)
+static int apic_set_irq(APICState *s, int vector_num, int trigger_mode)
 {
-    apic_irq_delivered += !get_bit(s->irr, vector_num);
+    int ret = get_bit(s->irr, vector_num) ? QEMU_IRQ_COALESCED
+                                          : QEMU_IRQ_DELIVERED;
 
     set_bit(s->irr, vector_num);
     if (trigger_mode)
@@ -404,6 +402,7 @@ static void apic_set_irq(APICState *s, int vector_num, int trigger_mode)
     else
         reset_bit(s->tmr, vector_num);
     apic_update_irq(s);
+    return ret;
 }
 
 static void apic_eoi(APICState *s)
diff --git a/hw/apic.h b/hw/apic.h
index 132fcab..738d98a 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -2,18 +2,14 @@
 #define APIC_H
 
 typedef struct IOAPICState IOAPICState;
-void apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
-                             uint8_t delivery_mode,
-                             uint8_t vector_num, uint8_t polarity,
-                             uint8_t trigger_mode);
+int apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
+                     uint8_t delivery_mode, uint8_t vector_num,
+                     uint8_t polarity, uint8_t trigger_mode);
 int apic_init(CPUState *env);
 int apic_accept_pic_intr(CPUState *env);
 void apic_deliver_pic_intr(CPUState *env, int level);
 int apic_get_interrupt(CPUState *env);
 qemu_irq *ioapic_init(void);
-void ioapic_set_irq(void *opaque, int vector, int level);
-void apic_reset_irq_delivered(void);
-int apic_get_irq_delivered(void);
 
 int cpu_is_bsp(CPUState *env);
 
diff --git a/hw/i8259.c b/hw/i8259.c
index ea48e0e..c05adf2 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -179,9 +179,12 @@ void pic_update_irq(PicState2 *s)
 int64_t irq_time[16];
 #endif
 
-static void i8259_set_irq(void *opaque, int irq, int level)
+static int i8259_set_irq(void *opaque, int irq, int level)
 {
     PicState2 *s = opaque;
+    PicState *pic;
+    int ret = QEMU_IRQ_DELIVERED;
+    int mask;
 
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
     if (level != irq_level[irq]) {
@@ -200,8 +203,19 @@ static void i8259_set_irq(void *opaque, int irq, int level)
         irq_time[irq] = qemu_get_clock(vm_clock);
     }
 #endif
-    pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+    pic = &s->pics[irq >> 3];
+    irq &= 7;
+    mask = 1 << irq;
+    if (level) {
+        if (pic->imr & mask) {
+            ret = QEMU_IRQ_MASKED;
+        } else if (pic->irr & mask) {
+            ret = QEMU_IRQ_COALESCED;
+        }
+    }
+    pic_set_irq1(pic, irq, level);
     pic_update_irq(s);
+    return ret;
 }
 
 /* acknowledge interrupt 'irq' */
@@ -536,5 +550,5 @@ qemu_irq *i8259_init(qemu_irq parent_irq)
     s->pics[0].pics_state = s;
     s->pics[1].pics_state = s;
     isa_pic = s;
-    return qemu_allocate_irqs(i8259_set_irq, s, 16);
+    return qemu_allocate_feedback_irqs(i8259_set_irq, s, 16);
 }
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 7ad8018..179fe49 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -51,7 +51,7 @@ struct IOAPICState {
     uint64_t ioredtbl[IOAPIC_NUM_PINS];
 };
 
-static void ioapic_service(IOAPICState *s)
+static int ioapic_service(IOAPICState *s)
 {
     uint8_t i;
     uint8_t trig_mode;
@@ -62,12 +62,16 @@ static void ioapic_service(IOAPICState *s)
     uint8_t dest;
     uint8_t dest_mode;
     uint8_t polarity;
+    int ret = QEMU_IRQ_MASKED;
 
     for (i = 0; i < IOAPIC_NUM_PINS; i++) {
         mask = 1 << i;
         if (s->irr & mask) {
             entry = s->ioredtbl[i];
             if (!(entry & IOAPIC_LVT_MASKED)) {
+                if (ret == QEMU_IRQ_MASKED) {
+                    ret = QEMU_IRQ_COALESCED;
+                }
                 trig_mode = ((entry >> 15) & 1);
                 dest = entry >> 56;
                 dest_mode = (entry >> 11) & 1;
@@ -80,33 +84,39 @@ static void ioapic_service(IOAPICState *s)
                 else
                     vector = entry & 0xff;
 
-                apic_deliver_irq(dest, dest_mode, delivery_mode,
-                                 vector, polarity, trig_mode);
+                if (apic_deliver_irq(dest, dest_mode,
+                                     delivery_mode, vector, polarity,
+                                     trig_mode) == QEMU_IRQ_DELIVERED) {
+                    ret = QEMU_IRQ_DELIVERED;
+                }
             }
         }
     }
+    return ret;
 }
 
-void ioapic_set_irq(void *opaque, int vector, int level)
+static int ioapic_set_irq(void *opaque, int vector, int level)
 {
+    int mapped_vector = vector;
     IOAPICState *s = opaque;
+    int ret = QEMU_IRQ_MASKED;
 
     /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
      * to GSI 2.  GSI maps to ioapic 1-1.  This is not
      * the cleanest way of doing it but it should work. */
 
     if (vector == 0)
-        vector = 2;
+        mapped_vector = 2;
 
     if (vector >= 0 && vector < IOAPIC_NUM_PINS) {
-        uint32_t mask = 1 << vector;
-        uint64_t entry = s->ioredtbl[vector];
+        uint32_t mask = 1 << mapped_vector;
+        uint64_t entry = s->ioredtbl[mapped_vector];
 
         if ((entry >> 15) & 1) {
             /* level triggered */
             if (level) {
                 s->irr |= mask;
-                ioapic_service(s);
+                ret = ioapic_service(s);
             } else {
                 s->irr &= ~mask;
             }
@@ -114,10 +124,11 @@ void ioapic_set_irq(void *opaque, int vector, int level)
             /* edge triggered */
             if (level) {
                 s->irr |= mask;
-                ioapic_service(s);
+                ret = ioapic_service(s);
             }
         }
     }
+    return ret;
 }
 
 static uint32_t ioapic_mem_readl(void *opaque, target_phys_addr_t addr)
@@ -230,7 +241,6 @@ static CPUWriteMemoryFunc * const ioapic_mem_write[3] = {
 qemu_irq *ioapic_init(void)
 {
     IOAPICState *s;
-    qemu_irq *irq;
     int io_memory;
 
     s = qemu_mallocz(sizeof(IOAPICState));
@@ -242,7 +252,5 @@ qemu_irq *ioapic_init(void)
 
     vmstate_register(0, &vmstate_ioapic, s);
     qemu_register_reset(ioapic_reset, s);
-    irq = qemu_allocate_irqs(ioapic_set_irq, s, IOAPIC_NUM_PINS);
-
-    return irq;
+    return qemu_allocate_feedback_irqs(ioapic_set_irq, s, IOAPIC_NUM_PINS);
 }
diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c
index 571c593..697f723 100644
--- a/hw/mc146818rtc.c
+++ b/hw/mc146818rtc.c
@@ -25,7 +25,6 @@
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "pc.h"
-#include "apic.h"
 #include "isa.h"
 #include "hpet_emul.h"
 #include "mc146818rtc.h"
@@ -94,7 +93,7 @@ typedef struct RTCState {
     QEMUTimer *second_timer2;
 } RTCState;
 
-static void rtc_irq_raise(qemu_irq irq)
+static int rtc_irq_raise(qemu_irq irq)
 {
     /* When HPET is operating in legacy mode, RTC interrupts are disabled
      * We block qemu_irq_raise, but not qemu_irq_lower, in case legacy
@@ -102,9 +101,11 @@ static void rtc_irq_raise(qemu_irq irq)
      * be lowered in any case
      */
 #if defined TARGET_I386
-    if (!hpet_in_legacy_mode())
+    if (hpet_in_legacy_mode()) {
+        return QEMU_IRQ_MASKED;
+    }
 #endif
-        qemu_irq_raise(irq);
+    return qemu_irq_raise(irq);
 }
 
 static void rtc_set_time(RTCState *s);
@@ -129,10 +130,8 @@ static void rtc_coalesced_timer(void *opaque)
     RTCState *s = opaque;
 
     if (s->irq_coalesced != 0) {
-        apic_reset_irq_delivered();
         s->cmos_data[RTC_REG_C] |= 0xc0;
-        rtc_irq_raise(s->irq);
-        if (apic_get_irq_delivered()) {
+        if (rtc_irq_raise(s->irq) != QEMU_IRQ_COALESCED) {
             s->irq_coalesced--;
         }
     }
@@ -193,9 +192,7 @@ static void rtc_periodic_timer(void *opaque)
         if(rtc_td_hack) {
             if (s->irq_reinject_on_ack_count >= RTC_REINJECT_ON_ACK_COUNT)
                 s->irq_reinject_on_ack_count = 0;		
-            apic_reset_irq_delivered();
-            rtc_irq_raise(s->irq);
-            if (!apic_get_irq_delivered()) {
+            if (rtc_irq_raise(s->irq) == QEMU_IRQ_COALESCED) {
                 s->irq_coalesced++;
                 rtc_coalesced_timer_update(s);
             }
@@ -475,10 +472,9 @@ static uint32_t cmos_ioport_read(void *opaque, uint32_t addr)
             if(s->irq_coalesced &&
                     s->irq_reinject_on_ack_count < RTC_REINJECT_ON_ACK_COUNT) {
                 s->irq_reinject_on_ack_count++;
-                apic_reset_irq_delivered();
-                qemu_irq_raise(s->irq);
-                if (apic_get_irq_delivered())
+                if (qemu_irq_raise(s->irq) != QEMU_IRQ_COALESCED) {
                     s->irq_coalesced--;
+                }
                 break;
             }
 #endif
diff --git a/hw/pc.c b/hw/pc.c
index 631b0ae..ec6c32b 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -67,15 +67,22 @@ struct e820_table {
 
 static struct e820_table e820_table;
 
-void isa_irq_handler(void *opaque, int n, int level)
+int isa_irq_handler(void *opaque, int n, int level)
 {
-    IsaIrqState *isa = (IsaIrqState *)opaque;
+    IsaIrqState *isa = opaque;
+    int ret = QEMU_IRQ_MASKED;
+    int ioapic_ret;
 
     if (n < 16) {
-        qemu_set_irq(isa->i8259[n], level);
+        ret = qemu_set_irq(isa->i8259[n], level);
     }
-    if (isa->ioapic)
-        qemu_set_irq(isa->ioapic[n], level);
+    if (isa->ioapic) {
+        ioapic_ret = qemu_set_irq(isa->ioapic[n], level);
+        if (ioapic_ret == QEMU_IRQ_DELIVERED || ret == QEMU_IRQ_MASKED) {
+            ret = ioapic_ret;
+        }
+    }
+    return ret;
 };
 
 static void ioport80_write(void *opaque, uint32_t addr, uint32_t data)
diff --git a/hw/pc.h b/hw/pc.h
index 73cccef..015412f 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -44,7 +44,7 @@ typedef struct isa_irq_state {
     qemu_irq *ioapic;
 } IsaIrqState;
 
-void isa_irq_handler(void *opaque, int n, int level);
+int isa_irq_handler(void *opaque, int n, int level);
 
 /* i8254.c */
 
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 70f563a..648b607 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -78,7 +78,7 @@ static void pc_init1(ram_addr_t ram_size,
     if (pci_enabled) {
         isa_irq_state->ioapic = ioapic_init();
     }
-    isa_irq = qemu_allocate_irqs(isa_irq_handler, isa_irq_state, 24);
+    isa_irq = qemu_allocate_feedback_irqs(isa_irq_handler, isa_irq_state, 24);
 
     if (pci_enabled) {
         pci_bus = i440fx_init(&i440fx_state, &piix3_devfn, isa_irq, ram_size);
-- 
1.6.0.2

* [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (7 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 08/15] x86: Refactor RTC IRQ coalescing workaround Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-25  9:29   ` Paul Brook
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 10/15] hpet: Drop static state Jan Kiszka
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Allow intercepting the RTC IRQ for the HPET legacy mode. Then push
routing to IRQ8 completely into the HPET. This allows turning
hpet_in_legacy_mode() into a private function. Furthermore, this stops
the RTC from clearing IRQ8 even if the HPET is in control.

This patch comes with a side effect: The RTC timers will no longer be
stopped when there is no IRQ consumer, possibly causing a minor
performance degradation. But as the guest may want to redirect the RTC to
the SCI in that mode, it should normally disable unused IRQ sources
anyway.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c        |   27 ++++++++++++++++++++-------
 hw/hpet_emul.h   |    4 +---
 hw/mc146818rtc.c |   52 ++++++++++++++++------------------------------------
 hw/mc146818rtc.h |    4 +++-
 hw/mips_jazz.c   |    2 +-
 hw/mips_malta.c  |    2 +-
 hw/mips_r4k.c    |    2 +-
 hw/pc.c          |   15 +++++++++------
 hw/ppc_prep.c    |    2 +-
 9 files changed, 53 insertions(+), 57 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 041dd84..12d91e7 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -30,6 +30,7 @@
 #include "qemu-timer.h"
 #include "hpet_emul.h"
 #include "sysbus.h"
+#include "mc146818rtc.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -58,6 +59,7 @@ typedef struct HPETState {
     SysBusDevice busdev;
     uint64_t hpet_offset;
     qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
+    uint8_t rtc_irq_level;
     HPETTimer timer[HPET_NUM_TIMERS];
 
     /* Memory-mapped, software visible registers */
@@ -69,12 +71,9 @@ typedef struct HPETState {
 
 static HPETState *hpet_statep;
 
-uint32_t hpet_in_legacy_mode(void)
+static uint32_t hpet_in_legacy_mode(HPETState *s)
 {
-    if (!hpet_statep) {
-        return 0;
-    }
-    return hpet_statep->config & HPET_CFG_LEGACY;
+    return s->config & HPET_CFG_LEGACY;
 }
 
 static uint32_t timer_int_route(struct HPETTimer *timer)
@@ -166,12 +165,12 @@ static void update_irq(struct HPETTimer *timer)
 {
     int route;
 
-    if (timer->tn <= 1 && hpet_in_legacy_mode()) {
+    if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
         /* if LegacyReplacementRoute bit is set, HPET specification requires
          * timer0 be routed to IRQ0 in NON-APIC or IRQ2 in the I/O APIC,
          * timer1 be routed to IRQ8 in NON-APIC or IRQ8 in the I/O APIC.
          */
-        route = (timer->tn == 0) ? 0 : 8;
+        route = (timer->tn == 0) ? 0 : RTC_ISA_IRQ;
     } else {
         route = timer_int_route(timer);
     }
@@ -515,8 +514,10 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             /* i8254 and RTC are disabled when HPET is in legacy mode */
             if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
                 hpet_pit_disable();
+                qemu_irq_lower(s->irqs[RTC_ISA_IRQ]);
             } else if (deactivating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
                 hpet_pit_enable();
+                qemu_set_irq(s->irqs[RTC_ISA_IRQ], s->rtc_irq_level);
             }
             break;
         case HPET_CFG + 4:
@@ -632,6 +633,18 @@ static int hpet_init(SysBusDevice *dev)
     return 0;
 }
 
+int hpet_handle_rtc_irq(void *opaque, int n, int level)
+{
+    HPETState *s = FROM_SYSBUS(HPETState, opaque);
+
+    s->rtc_irq_level = level;
+    if (hpet_in_legacy_mode(s)) {
+        return QEMU_IRQ_MASKED;
+    } else {
+        return qemu_set_irq(s->irqs[RTC_ISA_IRQ], level);
+    }
+}
+
 static SysBusDeviceInfo hpet_device_info = {
     .qdev.name    = "hpet",
     .qdev.size    = sizeof(HPETState),
diff --git a/hw/hpet_emul.h b/hw/hpet_emul.h
index 785f850..0608e34 100644
--- a/hw/hpet_emul.h
+++ b/hw/hpet_emul.h
@@ -47,8 +47,6 @@
 #define HPET_TN_INT_ROUTE_CAP_SHIFT 32
 #define HPET_TN_CFG_BITS_READONLY_OR_RESERVED 0xffff80b1U
 
-#if defined TARGET_I386
-extern uint32_t hpet_in_legacy_mode(void);
-#endif
+int hpet_handle_rtc_irq(void *opaque, int n, int level);
 
 #endif
diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c
index 697f723..a0f6951 100644
--- a/hw/mc146818rtc.c
+++ b/hw/mc146818rtc.c
@@ -26,7 +26,6 @@
 #include "sysemu.h"
 #include "pc.h"
 #include "isa.h"
-#include "hpet_emul.h"
 #include "mc146818rtc.h"
 
 //#define DEBUG_CMOS
@@ -93,21 +92,6 @@ typedef struct RTCState {
     QEMUTimer *second_timer2;
 } RTCState;
 
-static int rtc_irq_raise(qemu_irq irq)
-{
-    /* When HPET is operating in legacy mode, RTC interrupts are disabled
-     * We block qemu_irq_raise, but not qemu_irq_lower, in case legacy
-     * mode is established while interrupt is raised. We want it to
-     * be lowered in any case
-     */
-#if defined TARGET_I386
-    if (hpet_in_legacy_mode()) {
-        return QEMU_IRQ_MASKED;
-    }
-#endif
-    return qemu_irq_raise(irq);
-}
-
 static void rtc_set_time(RTCState *s);
 static void rtc_copy_date(RTCState *s);
 
@@ -131,7 +115,7 @@ static void rtc_coalesced_timer(void *opaque)
 
     if (s->irq_coalesced != 0) {
         s->cmos_data[RTC_REG_C] |= 0xc0;
-        if (rtc_irq_raise(s->irq) != QEMU_IRQ_COALESCED) {
+        if (qemu_irq_raise(s->irq) != QEMU_IRQ_COALESCED) {
             s->irq_coalesced--;
         }
     }
@@ -144,19 +128,10 @@ static void rtc_timer_update(RTCState *s, int64_t current_time)
 {
     int period_code, period;
     int64_t cur_clock, next_irq_clock;
-    int enable_pie;
 
     period_code = s->cmos_data[RTC_REG_A] & 0x0f;
-#if defined TARGET_I386
-    /* disable periodic timer if hpet is in legacy mode, since interrupts are
-     * disabled anyway.
-     */
-    enable_pie = !hpet_in_legacy_mode();
-#else
-    enable_pie = 1;
-#endif
     if (period_code != 0
-        && (((s->cmos_data[RTC_REG_B] & REG_B_PIE) && enable_pie)
+        && ((s->cmos_data[RTC_REG_B] & REG_B_PIE)
             || ((s->cmos_data[RTC_REG_B] & REG_B_SQWE) && s->sqw_irq))) {
         if (period_code <= 2)
             period_code += 7;
@@ -192,13 +167,13 @@ static void rtc_periodic_timer(void *opaque)
         if(rtc_td_hack) {
             if (s->irq_reinject_on_ack_count >= RTC_REINJECT_ON_ACK_COUNT)
                 s->irq_reinject_on_ack_count = 0;		
-            if (rtc_irq_raise(s->irq) == QEMU_IRQ_COALESCED) {
+            if (qemu_irq_raise(s->irq) == QEMU_IRQ_COALESCED) {
                 s->irq_coalesced++;
                 rtc_coalesced_timer_update(s);
             }
         } else
 #endif
-        rtc_irq_raise(s->irq);
+        qemu_irq_raise(s->irq);
     }
     if (s->cmos_data[RTC_REG_B] & REG_B_SQWE) {
         /* Not square wave at all but we don't want 2048Hz interrupts!
@@ -427,15 +402,15 @@ static void rtc_update_second2(void *opaque)
              s->cmos_data[RTC_HOURS_ALARM] == s->current_tm.tm_hour)) {
 
             s->cmos_data[RTC_REG_C] |= 0xa0;
-            rtc_irq_raise(s->irq);
+            qemu_irq_raise(s->irq);
         }
     }
 
     /* update ended interrupt */
     s->cmos_data[RTC_REG_C] |= REG_C_UF;
     if (s->cmos_data[RTC_REG_B] & REG_B_UIE) {
-      s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
-      rtc_irq_raise(s->irq);
+        s->cmos_data[RTC_REG_C] |= REG_C_IRQF;
+        qemu_irq_raise(s->irq);
     }
 
     /* clear update in progress bit */
@@ -584,9 +559,6 @@ static int rtc_initfn(ISADevice *dev)
 {
     RTCState *s = DO_UPCAST(RTCState, dev, dev);
     int base = 0x70;
-    int isairq = 8;
-
-    isa_init_irq(dev, &s->irq, isairq);
 
     s->cmos_data[RTC_REG_A] = 0x26;
     s->cmos_data[RTC_REG_B] = 0x02;
@@ -616,13 +588,21 @@ static int rtc_initfn(ISADevice *dev)
     return 0;
 }
 
-ISADevice *rtc_init(int base_year)
+ISADevice *rtc_init(int base_year, qemu_irq *intercept_irq)
 {
     ISADevice *dev;
+    RTCState *s;
 
     dev = isa_create("mc146818rtc");
+    s = DO_UPCAST(RTCState, dev, dev);
     qdev_prop_set_int32(&dev->qdev, "base_year", base_year);
     qdev_init_nofail(&dev->qdev);
+    if (intercept_irq) {
+        s->irq = *intercept_irq;
+        qemu_free(intercept_irq);
+    } else {
+        isa_init_irq(dev, &s->irq, RTC_ISA_IRQ);
+    }
     return dev;
 }
 
diff --git a/hw/mc146818rtc.h b/hw/mc146818rtc.h
index 6f46a68..e656d1c 100644
--- a/hw/mc146818rtc.h
+++ b/hw/mc146818rtc.h
@@ -3,7 +3,9 @@
 
 #include "isa.h"
 
-ISADevice *rtc_init(int base_year);
+#define RTC_ISA_IRQ 8
+
+ISADevice *rtc_init(int base_year, qemu_irq *intercept_irq);
 void rtc_set_memory(ISADevice *dev, int addr, int val);
 void rtc_set_date(ISADevice *dev, const struct tm *tm);
 
diff --git a/hw/mips_jazz.c b/hw/mips_jazz.c
index ead3a00..22db7a2 100644
--- a/hw/mips_jazz.c
+++ b/hw/mips_jazz.c
@@ -259,7 +259,7 @@ void mips_jazz_init (ram_addr_t ram_size,
     fdctrl_init_sysbus(rc4030[1], 0, 0x80003000, fds);
 
     /* Real time clock */
-    rtc_init(1980);
+    rtc_init(1980, NULL);
     s_rtc = cpu_register_io_memory(rtc_read, rtc_write, NULL);
     cpu_register_physical_memory(0x80004000, 0x00001000, s_rtc);
 
diff --git a/hw/mips_malta.c b/hw/mips_malta.c
index a8f9d15..23de7f0 100644
--- a/hw/mips_malta.c
+++ b/hw/mips_malta.c
@@ -959,7 +959,7 @@ void mips_malta_init (ram_addr_t ram_size,
     /* Super I/O */
     isa_dev = isa_create_simple("i8042");
  
-    rtc_state = rtc_init(2000);
+    rtc_state = rtc_init(2000, NULL);
     serial_isa_init(0, serial_hds[0]);
     serial_isa_init(1, serial_hds[1]);
     if (parallel_hds[0])
diff --git a/hw/mips_r4k.c b/hw/mips_r4k.c
index f1fcfcd..5a96dea 100644
--- a/hw/mips_r4k.c
+++ b/hw/mips_r4k.c
@@ -267,7 +267,7 @@ void mips_r4k_init (ram_addr_t ram_size,
     isa_bus_new(NULL);
     isa_bus_irqs(i8259);
 
-    rtc_state = rtc_init(2000);
+    rtc_state = rtc_init(2000, NULL);
 
     /* Register 64 KB of ISA IO space at 0x14000000 */
 #ifdef TARGET_WORDS_BIGENDIAN
diff --git a/hw/pc.c b/hw/pc.c
index ec6c32b..13bcecb 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -938,6 +938,7 @@ void pc_basic_device_init(qemu_irq *isa_irq,
     int i;
     DriveInfo *fd[MAX_FD];
     PITState *pit;
+    qemu_irq *rtc_irq = NULL;
     qemu_irq *a20_line;
     ISADevice *i8042;
     qemu_irq *cpu_exit_irq;
@@ -946,19 +947,21 @@ void pc_basic_device_init(qemu_irq *isa_irq,
 
     register_ioport_write(0xf0, 1, 1, ioportF0_write, NULL);
 
-    *rtc_state = rtc_init(2000);
-
-    qemu_register_boot_set(pc_boot_set, *rtc_state);
-
-    pit = pit_init(0x40, isa_reserve_irq(0));
-    pcspk_init(pit);
     if (!no_hpet) {
         DeviceState *hpet = sysbus_create_simple("hpet", HPET_BASE, NULL);
 
         for (i = 0; i < 24; i++) {
             sysbus_connect_irq(sysbus_from_qdev(hpet), i, isa_irq[i]);
         }
+        rtc_irq = qemu_allocate_feedback_irqs(hpet_handle_rtc_irq,
+                                              sysbus_from_qdev(hpet), 1);
     }
+    *rtc_state = rtc_init(2000, rtc_irq);
+
+    qemu_register_boot_set(pc_boot_set, *rtc_state);
+
+    pit = pit_init(0x40, isa_reserve_irq(0));
+    pcspk_init(pit);
 
     for(i = 0; i < MAX_SERIAL_PORTS; i++) {
         if (serial_hds[i]) {
diff --git a/hw/ppc_prep.c b/hw/ppc_prep.c
index 16c9950..bb9e15f 100644
--- a/hw/ppc_prep.c
+++ b/hw/ppc_prep.c
@@ -696,7 +696,7 @@ static void ppc_prep_init (ram_addr_t ram_size,
     pci_vga_init(pci_bus, 0, 0);
     //    openpic = openpic_init(0x00000000, 0xF0000000, 1);
     //    pit = pit_init(0x40, i8259[0]);
-    rtc_init(2000);
+    rtc_init(2000, NULL);
 
     if (serial_hds[0])
         serial_isa_init(0, serial_hds[0]);
-- 
1.6.0.2
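
The interception scheme in this patch can be sketched as a toy model, under the assumption that only the RTC side of IRQ8 matters (the HPET's own timer 1 output in legacy mode is left out). All names below are illustrative and are not QEMU APIs; irq8_level stands for what the interrupt controller sees.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the HPET state relevant to RTC interception. */
struct toy_hpet {
    bool legacy_mode;   /* models HPET_CFG_LEGACY */
    int rtc_irq_level;  /* last level driven by the RTC, always latched */
    int irq8_level;     /* level currently visible on IRQ8 */
};

/*
 * RTC -> HPET: latch the level and forward it only outside legacy mode.
 * Returns true if the level was forwarded, false if it was masked.
 */
bool toy_handle_rtc_irq(struct toy_hpet *s, int level)
{
    assert(s != NULL);
    s->rtc_irq_level = level;
    if (s->legacy_mode) {
        return false;
    }
    s->irq8_level = level;
    return true;
}

/* Guest toggles the legacy replacement bit in the HPET configuration. */
void toy_set_legacy_mode(struct toy_hpet *s, bool enable)
{
    assert(s != NULL);
    if (enable && !s->legacy_mode) {
        /* The RTC loses IRQ8: drop whatever it was driving. */
        s->irq8_level = 0;
    } else if (!enable && s->legacy_mode) {
        /* The RTC regains IRQ8: replay the latched level. */
        s->irq8_level = s->rtc_irq_level;
    }
    s->legacy_mode = enable;
}
```

Because the level is latched even while masked, an RTC interrupt that is raised during legacy mode is not lost; it becomes visible again once the guest leaves legacy mode.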


* [Qemu-devel] [RFT][PATCH 10/15] hpet: Drop static state
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (8 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 11/15] hpet: Add support for level-triggered interrupts Jan Kiszka
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Instead of keeping a static reference around, pass the state to
hpet_enabled and hpet_get_ticks. All callers now have it at hand. This
will eventually allow instantiating the HPET more than once.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |   38 +++++++++++++++++---------------------
 1 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index 12d91e7..b48f44b 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -69,8 +69,6 @@ typedef struct HPETState {
     uint64_t hpet_counter;      /* main counter */
 } HPETState;
 
-static HPETState *hpet_statep;
-
 static uint32_t hpet_in_legacy_mode(HPETState *s)
 {
     return s->config & HPET_CFG_LEGACY;
@@ -81,9 +79,9 @@ static uint32_t timer_int_route(struct HPETTimer *timer)
     return (timer->config & HPET_TN_INT_ROUTE_MASK) >> HPET_TN_INT_ROUTE_SHIFT;
 }
 
-static uint32_t hpet_enabled(void)
+static uint32_t hpet_enabled(HPETState *s)
 {
-    return hpet_statep->config & HPET_CFG_ENABLE;
+    return s->config & HPET_CFG_ENABLE;
 }
 
 static uint32_t timer_is_periodic(HPETTimer *t)
@@ -133,9 +131,9 @@ static int deactivating_bit(uint64_t old, uint64_t new, uint64_t mask)
     return ((old & mask) && !(new & mask));
 }
 
-static uint64_t hpet_get_ticks(void)
+static uint64_t hpet_get_ticks(HPETState *s)
 {
-    return ns_to_ticks(qemu_get_clock(vm_clock) + hpet_statep->hpet_offset);
+    return ns_to_ticks(qemu_get_clock(vm_clock) + s->hpet_offset);
 }
 
 /*
@@ -174,7 +172,7 @@ static void update_irq(struct HPETTimer *timer)
     } else {
         route = timer_int_route(timer);
     }
-    if (!timer_enabled(timer) || !hpet_enabled()) {
+    if (!timer_enabled(timer) || !hpet_enabled(timer->state)) {
         return;
     }
     qemu_irq_pulse(timer->state->irqs[route]);
@@ -185,7 +183,7 @@ static void hpet_pre_save(void *opaque)
     HPETState *s = opaque;
 
     /* save current counter value */
-    s->hpet_counter = hpet_get_ticks();
+    s->hpet_counter = hpet_get_ticks(s);
 }
 
 static int hpet_post_load(void *opaque, int version_id)
@@ -240,7 +238,7 @@ static void hpet_timer(void *opaque)
     uint64_t diff;
 
     uint64_t period = t->period;
-    uint64_t cur_tick = hpet_get_ticks();
+    uint64_t cur_tick = hpet_get_ticks(t->state);
 
     if (timer_is_periodic(t) && period != 0) {
         if (t->config & HPET_TN_32BIT) {
@@ -270,7 +268,7 @@ static void hpet_set_timer(HPETTimer *t)
 {
     uint64_t diff;
     uint32_t wrap_diff;  /* how many ticks until we wrap? */
-    uint64_t cur_tick = hpet_get_ticks();
+    uint64_t cur_tick = hpet_get_ticks(t->state);
 
     /* whenever new timer is being set up, make sure wrap_flag is 0 */
     t->wrap_flag = 0;
@@ -353,16 +351,16 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
             DPRINTF("qemu: invalid HPET_CFG + 4 hpet_ram_readl \n");
             return 0;
         case HPET_COUNTER:
-            if (hpet_enabled()) {
-                cur_tick = hpet_get_ticks();
+            if (hpet_enabled(s)) {
+                cur_tick = hpet_get_ticks(s);
             } else {
                 cur_tick = s->hpet_counter;
             }
             DPRINTF("qemu: reading counter  = %" PRIx64 "\n", cur_tick);
             return cur_tick;
         case HPET_COUNTER + 4:
-            if (hpet_enabled()) {
-                cur_tick = hpet_get_ticks();
+            if (hpet_enabled(s)) {
+                cur_tick = hpet_get_ticks(s);
             } else {
                 cur_tick = s->hpet_counter;
             }
@@ -457,7 +455,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                     (timer->period & 0xffffffff00000000ULL) | new_val;
             }
             timer->config &= ~HPET_TN_SETVAL;
-            if (hpet_enabled()) {
+            if (hpet_enabled(s)) {
                 hpet_set_timer(timer);
             }
             break;
@@ -476,7 +474,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                     (timer->period & 0xffffffffULL) | new_val << 32;
                 }
                 timer->config &= ~HPET_TN_SETVAL;
-                if (hpet_enabled()) {
+                if (hpet_enabled(s)) {
                     hpet_set_timer(timer);
                 }
                 break;
@@ -506,7 +504,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                 }
             } else if (deactivating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
                 /* Halt main counter and disable interrupt generation. */
-                s->hpet_counter = hpet_get_ticks();
+                s->hpet_counter = hpet_get_ticks(s);
                 for (i = 0; i < HPET_NUM_TIMERS; i++) {
                     hpet_del_timer(&s->timer[i]);
                 }
@@ -527,7 +525,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             /* FIXME: need to handle level-triggered interrupts */
             break;
         case HPET_COUNTER:
-            if (hpet_enabled()) {
+            if (hpet_enabled(s)) {
                 DPRINTF("qemu: Writing counter while HPET enabled!\n");
             }
             s->hpet_counter =
@@ -536,7 +534,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                     value, s->hpet_counter);
             break;
         case HPET_COUNTER + 4:
-            if (hpet_enabled()) {
+            if (hpet_enabled(s)) {
                 DPRINTF("qemu: Writing counter while HPET enabled!\n");
             }
             s->hpet_counter =
@@ -614,8 +612,6 @@ static int hpet_init(SysBusDevice *dev)
     int i, iomemtype;
     HPETTimer *timer;
 
-    assert(!hpet_statep);
-    hpet_statep = s;
     for (i = 0; i < HPET_NUM_IRQ_ROUTES; i++) {
         sysbus_init_irq(dev, &s->irqs[i]);
     }
-- 
1.6.0.2
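
To illustrate why passing the state explicitly matters, the following sketch keeps two independent toy instances side by side. It assumes a 10 ns tick (matching the HPET_CLK_PERIOD comment in hpet_emul.h) and takes the current time as a parameter instead of calling qemu_get_clock(); the toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CFG_ENABLE     0x1ULL  /* models HPET_CFG_ENABLE */
#define TOY_NS_PER_TICK    10ULL   /* assumed: 10 ns per HPET tick */

/* Per-instance state; nothing is kept in file-scope variables. */
struct toy_hpet_state {
    uint64_t config;
    uint64_t offset_ns;  /* models hpet_offset */
};

int toy_hpet_enabled(const struct toy_hpet_state *s)
{
    assert(s != NULL);
    return (s->config & TOY_CFG_ENABLE) != 0;
}

/* now_ns stands in for the VM clock reading (assumption). */
uint64_t toy_hpet_get_ticks(const struct toy_hpet_state *s, uint64_t now_ns)
{
    assert(s != NULL);
    return (now_ns + s->offset_ns) / TOY_NS_PER_TICK;
}
```

With a static pointer, the second instance would silently read the first instance's config and offset; with the state passed in, each call is answered from the instance it was made for.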


* [Qemu-devel] [RFT][PATCH 11/15] hpet: Add support for level-triggered interrupts
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (9 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 10/15] hpet: Drop static state Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 12/15] vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8 Jan Kiszka
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

By implementing this feature we can also remove a nasty way to kill qemu
(by trying to enable level-triggered hpet interrupts).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c |   32 ++++++++++++++++++++++----------
 1 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index b48f44b..b272181 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -159,8 +159,10 @@ static inline uint64_t hpet_calculate_diff(HPETTimer *t, uint64_t current)
     }
 }
 
-static void update_irq(struct HPETTimer *timer)
+static void update_irq(struct HPETTimer *timer, int set)
 {
+    uint64_t mask;
+    HPETState *s;
     int route;
 
     if (timer->tn <= 1 && hpet_in_legacy_mode(timer->state)) {
@@ -172,10 +174,18 @@ static void update_irq(struct HPETTimer *timer)
     } else {
         route = timer_int_route(timer);
     }
-    if (!timer_enabled(timer) || !hpet_enabled(timer->state)) {
-        return;
+    s = timer->state;
+    mask = 1 << timer->tn;
+    if (!set || !timer_enabled(timer) || !hpet_enabled(timer->state)) {
+        s->isr &= ~mask;
+        qemu_irq_lower(s->irqs[route]);
+    } else if (timer->config & HPET_TN_TYPE_LEVEL) {
+        s->isr |= mask;
+        qemu_irq_raise(s->irqs[route]);
+    } else {
+        s->isr &= ~mask;
+        qemu_irq_pulse(s->irqs[route]);
     }
-    qemu_irq_pulse(timer->state->irqs[route]);
 }
 
 static void hpet_pre_save(void *opaque)
@@ -261,7 +271,7 @@ static void hpet_timer(void *opaque)
             t->wrap_flag = 0;
         }
     }
-    update_irq(t);
+    update_irq(t, 1);
 }
 
 static void hpet_set_timer(HPETTimer *t)
@@ -291,6 +301,7 @@ static void hpet_set_timer(HPETTimer *t)
 static void hpet_del_timer(HPETTimer *t)
 {
     qemu_del_timer(t->qemu_timer);
+    update_irq(t, 0);
 }
 
 #ifdef HPET_DEBUG
@@ -423,10 +434,6 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                 timer->cmp = (uint32_t)timer->cmp;
                 timer->period = (uint32_t)timer->period;
             }
-            if (new_val & HPET_TN_TYPE_LEVEL) {
-                printf("qemu: level-triggered hpet not supported\n");
-                exit (-1);
-            }
             if (activating_bit(old_val, new_val, HPET_TN_ENABLE)) {
                 hpet_set_timer(timer);
             } else if (deactivating_bit(old_val, new_val, HPET_TN_ENABLE)) {
@@ -522,7 +529,12 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             DPRINTF("qemu: invalid HPET_CFG+4 write \n");
             break;
         case HPET_STATUS:
-            /* FIXME: need to handle level-triggered interrupts */
+            val = new_val & s->isr;
+            for (i = 0; i < HPET_NUM_TIMERS; i++) {
+                if (val & (1 << i)) {
+                    update_irq(&s->timer[i], 0);
+                }
+            }
             break;
         case HPET_COUNTER:
             if (hpet_enabled(s)) {
-- 
1.6.0.2
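
The level-triggered handling added above can be sketched in isolation. This is a simplified model, not the patch itself: each timer is assumed to be routed to its own output line, edge interrupts are only counted, and the status register is modeled as write-1-to-clear, which is how the HPET_STATUS hunk appears to treat it. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_TIMERS 3

struct toy_hpet {
    int enabled;                        /* models HPET_CFG_ENABLE */
    uint64_t isr;                       /* models the status register */
    int level_type[TOY_NUM_TIMERS];     /* models HPET_TN_TYPE_LEVEL */
    int timer_enabled[TOY_NUM_TIMERS];  /* models HPET_TN_ENABLE */
    int line[TOY_NUM_TIMERS];           /* current output level */
    int pulses[TOY_NUM_TIMERS];         /* number of edge pulses sent */
};

void toy_update_irq(struct toy_hpet *s, int tn, int set)
{
    uint64_t mask;

    assert(tn >= 0 && tn < TOY_NUM_TIMERS);
    mask = (uint64_t)1 << tn;
    if (!set || !s->timer_enabled[tn] || !s->enabled) {
        /* Deassert: clear the status bit and lower the line. */
        s->isr &= ~mask;
        s->line[tn] = 0;
    } else if (s->level_type[tn]) {
        /* Level: the line stays high until the guest acknowledges. */
        s->isr |= mask;
        s->line[tn] = 1;
    } else {
        /* Edge: no status bit, just a pulse. */
        s->isr &= ~mask;
        s->pulses[tn]++;
    }
}

/* Guest write to the status register: set bits acknowledge timers. */
void toy_write_status(struct toy_hpet *s, uint64_t new_val)
{
    uint64_t val = new_val & s->isr;
    int i;

    for (i = 0; i < TOY_NUM_TIMERS; i++) {
        if (val & ((uint64_t)1 << i)) {
            toy_update_irq(s, i, 0);
        }
    }
}
```

Only bits that are currently set in the status register are acted on, so a stray write cannot lower the line of a timer that has nothing pending.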


* [Qemu-devel] [RFT][PATCH 12/15] vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (10 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 11/15] hpet: Add support for level-triggered interrupts Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 13/15] hpet: Make number of timers configurable Jan Kiszka
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

Required for hpet.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hw.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index fc2d184..36be0be 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -474,6 +474,16 @@ extern const VMStateInfo vmstate_info_unused_buffer;
     .offset     = vmstate_offset_array(_state, _field, _type, _num), \
 }
 
+#define VMSTATE_STRUCT_VARRAY_UINT8(_field, _state, _field_num, _version, _vmsd, _type) { \
+    .name       = (stringify(_field)),                               \
+    .num_offset = vmstate_offset_value(_state, _field_num, uint8_t),  \
+    .version_id = (_version),                                        \
+    .vmsd       = &(_vmsd),                                          \
+    .size       = sizeof(_type),                                     \
+    .flags      = VMS_STRUCT|VMS_VARRAY_INT32,                       \
+    .offset     = offsetof(_state, _field),                          \
+}
+
 #define VMSTATE_STATIC_BUFFER(_field, _state, _version, _test, _start, _size) { \
     .name         = (stringify(_field)),                             \
     .version_id   = (_version),                                      \
-- 
1.6.0.2


* [Qemu-devel] [RFT][PATCH 13/15] hpet: Make number of timers configurable
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (11 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 12/15] vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8 Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 14/15] hpet: Add MSI support Jan Kiszka
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

One HPET block supports up to 32 timers. Allow instantiating more than
the recommended and implemented minimum of 3. The number is configured
via the qdev property "timers". It is also saved/restored so that it
need not match between migration peers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/hpet.c      |   53 ++++++++++++++++++++++++++++++++++++++++-------------
 hw/hpet_emul.h |    6 +++++-
 2 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/hw/hpet.c b/hw/hpet.c
index b272181..b7d4771 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -60,7 +60,8 @@ typedef struct HPETState {
     uint64_t hpet_offset;
     qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
     uint8_t rtc_irq_level;
-    HPETTimer timer[HPET_NUM_TIMERS];
+    uint8_t num_timers;
+    HPETTimer timer[HPET_MAX_TIMERS];
 
     /* Memory-mapped, software visible registers */
     uint64_t capability;        /* capabilities */
@@ -196,12 +197,25 @@ static void hpet_pre_save(void *opaque)
     s->hpet_counter = hpet_get_ticks(s);
 }
 
+static int hpet_pre_load(void *opaque)
+{
+    HPETState *s = opaque;
+
+    /* version 1 only supports 3, later versions will load the actual value */
+    s->num_timers = HPET_MIN_TIMERS;
+    return 0;
+}
+
 static int hpet_post_load(void *opaque, int version_id)
 {
     HPETState *s = opaque;
 
     /* Recalculate the offset between the main counter and guest time */
     s->hpet_offset = ticks_to_ns(s->hpet_counter) - qemu_get_clock(vm_clock);
+
+    /* Push number of timers into capability returned via HPET_ID */
+    s->capability &= ~HPET_ID_NUM_TIM_MASK;
+    s->capability |= (s->num_timers - 1) << HPET_ID_NUM_TIM_SHIFT;
     return 0;
 }
 
@@ -224,17 +238,19 @@ static const VMStateDescription vmstate_hpet_timer = {
 
 static const VMStateDescription vmstate_hpet = {
     .name = "hpet",
-    .version_id = 1,
+    .version_id = 2,
     .minimum_version_id = 1,
     .minimum_version_id_old = 1,
     .pre_save = hpet_pre_save,
+    .pre_load = hpet_pre_load,
     .post_load = hpet_post_load,
     .fields      = (VMStateField []) {
         VMSTATE_UINT64(config, HPETState),
         VMSTATE_UINT64(isr, HPETState),
         VMSTATE_UINT64(hpet_counter, HPETState),
-        VMSTATE_STRUCT_ARRAY(timer, HPETState, HPET_NUM_TIMERS, 0,
-                             vmstate_hpet_timer, HPETTimer),
+        VMSTATE_UINT8_V(num_timers, HPETState, 2),
+        VMSTATE_STRUCT_VARRAY_UINT8(timer, HPETState, num_timers, 0,
+                                    vmstate_hpet_timer, HPETTimer),
         VMSTATE_END_OF_LIST()
     }
 };
@@ -330,7 +346,7 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
         uint8_t timer_id = (addr - 0x100) / 0x20;
         HPETTimer *timer = &s->timer[timer_id];
 
-        if (timer_id > HPET_NUM_TIMERS - 1) {
+        if (timer_id >= s->num_timers) {
             DPRINTF("qemu: timer id out of range\n");
             return 0;
         }
@@ -421,7 +437,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
         HPETTimer *timer = &s->timer[timer_id];
 
         DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
-        if (timer_id > HPET_NUM_TIMERS - 1) {
+        if (timer_id >= s->num_timers) {
             DPRINTF("qemu: timer id out of range\n");
             return;
         }
@@ -504,7 +520,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                 /* Enable main counter and interrupt generation. */
                 s->hpet_offset =
                     ticks_to_ns(s->hpet_counter) - qemu_get_clock(vm_clock);
-                for (i = 0; i < HPET_NUM_TIMERS; i++) {
+                for (i = 0; i < s->num_timers; i++) {
                     if ((&s->timer[i])->cmp != ~0ULL) {
                         hpet_set_timer(&s->timer[i]);
                     }
@@ -512,7 +528,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             } else if (deactivating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
                 /* Halt main counter and disable interrupt generation. */
                 s->hpet_counter = hpet_get_ticks(s);
-                for (i = 0; i < HPET_NUM_TIMERS; i++) {
+                for (i = 0; i < s->num_timers; i++) {
                     hpet_del_timer(&s->timer[i]);
                 }
             }
@@ -530,7 +546,7 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
             break;
         case HPET_STATUS:
             val = new_val & s->isr;
-            for (i = 0; i < HPET_NUM_TIMERS; i++) {
+            for (i = 0; i < s->num_timers; i++) {
                 if (val & (1 << i)) {
                     update_irq(&s->timer[i], 0);
                 }
@@ -589,7 +605,7 @@ static void hpet_reset(DeviceState *d)
     int i;
     static int count = 0;
 
-    for (i = 0; i < HPET_NUM_TIMERS; i++) {
+    for (i = 0; i < s->num_timers; i++) {
         HPETTimer *timer = &s->timer[i];
 
         hpet_del_timer(timer);
@@ -603,8 +619,9 @@ static void hpet_reset(DeviceState *d)
 
     s->hpet_counter = 0ULL;
     s->hpet_offset = 0ULL;
-    /* 64-bit main counter; 3 timers supported; LegacyReplacementRoute. */
-    s->capability = 0x8086a201ULL;
+    /* 64-bit main counter; LegacyReplacementRoute. */
+    s->capability = 0x8086a001ULL;
+    s->capability |= (s->num_timers - 1) << HPET_ID_NUM_TIM_SHIFT;
     s->capability |= ((HPET_CLK_PERIOD) << 32);
     s->config = 0ULL;
     if (count > 0) {
@@ -627,7 +644,13 @@ static int hpet_init(SysBusDevice *dev)
     for (i = 0; i < HPET_NUM_IRQ_ROUTES; i++) {
         sysbus_init_irq(dev, &s->irqs[i]);
     }
-    for (i = 0; i < HPET_NUM_TIMERS; i++) {
+
+    if (s->num_timers < HPET_MIN_TIMERS) {
+        s->num_timers = HPET_MIN_TIMERS;
+    } else if (s->num_timers > HPET_MAX_TIMERS) {
+        s->num_timers = HPET_MAX_TIMERS;
+    }
+    for (i = 0; i < HPET_MAX_TIMERS; i++) {
         timer = &s->timer[i];
         timer->qemu_timer = qemu_new_timer(vm_clock, hpet_timer, timer);
         timer->tn = i;
@@ -660,6 +683,10 @@ static SysBusDeviceInfo hpet_device_info = {
     .qdev.vmsd    = &vmstate_hpet,
     .qdev.reset   = hpet_reset,
     .init         = hpet_init,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
+        DEFINE_PROP_END_OF_LIST(),
+    },
 };
 
 static void hpet_register_device(void)
diff --git a/hw/hpet_emul.h b/hw/hpet_emul.h
index 0608e34..4b537d2 100644
--- a/hw/hpet_emul.h
+++ b/hw/hpet_emul.h
@@ -17,7 +17,8 @@
 #define HPET_CLK_PERIOD         10000000ULL /* 10000000 femtoseconds == 10ns*/
 
 #define FS_PER_NS 1000000
-#define HPET_NUM_TIMERS 3
+#define HPET_MIN_TIMERS         3
+#define HPET_MAX_TIMERS         32
 
 #define HPET_NUM_IRQ_ROUTES     32
 
@@ -34,6 +35,9 @@
 #define HPET_TN_ROUTE   0x010
 #define HPET_CFG_WRITE_MASK  0x3
 
+#define HPET_ID_NUM_TIM_SHIFT   8
+#define HPET_ID_NUM_TIM_MASK    0x1f00
+
 #define HPET_TN_TYPE_LEVEL       0x002
 #define HPET_TN_ENABLE           0x004
 #define HPET_TN_PERIODIC         0x008
-- 
1.6.0.2
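
The way the timer count ends up in the capability register can be checked with a few lines of standalone C. The four constants are copied from the hpet_emul.h hunk above; the two helper functions are hypothetical names used only for this sketch. The field holds the index of the last timer, hence the "- 1".

```c
#include <assert.h>
#include <stdint.h>

#define HPET_MIN_TIMERS         3
#define HPET_MAX_TIMERS         32
#define HPET_ID_NUM_TIM_SHIFT   8
#define HPET_ID_NUM_TIM_MASK    0x1f00

/* Clamp a requested timer count into the supported range. */
uint8_t clamp_num_timers(uint8_t n)
{
    if (n < HPET_MIN_TIMERS) {
        return HPET_MIN_TIMERS;
    } else if (n > HPET_MAX_TIMERS) {
        return HPET_MAX_TIMERS;
    }
    return n;
}

/* Replace the timer-count field of a capability value. */
uint64_t set_num_timers(uint64_t capability, uint8_t num_timers)
{
    assert(num_timers >= 1 && num_timers <= HPET_MAX_TIMERS);
    capability &= ~(uint64_t)HPET_ID_NUM_TIM_MASK;
    capability |= (uint64_t)(num_timers - 1) << HPET_ID_NUM_TIM_SHIFT;
    return capability;
}
```

For the default of 3 timers, set_num_timers(0x8086a001ULL, 3) gives 0x8086a201, which is the constant that hpet_reset() used before this patch, so the default configuration is unchanged from the guest's point of view.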


* [Qemu-devel] [RFT][PATCH 14/15] hpet: Add MSI support
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (12 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 13/15] hpet: Make number of timers configurable Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 15/15] monitor/QMP: Drop info hpet / query-hpet Jan Kiszka
  2010-05-24 22:16 ` [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Anthony Liguori
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

This implements the HPET capability of routing IRQs to the front-side
bus, aka MSI support. This feature can be enabled via the qdev property
"msi" and is off by default.

Note that switching it on can cause guests (at least Linux) to use the
HPET as timer instead of the LAPIC. KVM users should recall that only
the latter is currently available as a fast in-kernel model.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 hw/apic.c      |    2 +-
 hw/apic.h      |    1 +
 hw/hpet.c      |   39 +++++++++++++++++++++++++++++++++++----
 hw/hpet_emul.h |    4 +++-
 4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 641825c..9d3f248 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -760,7 +760,7 @@ static uint32_t apic_mem_readl(void *opaque, target_phys_addr_t addr)
     return val;
 }
 
-static void apic_send_msi(target_phys_addr_t addr, uint32 data)
+void apic_send_msi(target_phys_addr_t addr, uint32 data)
 {
     uint8_t dest = (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
     uint8_t vector = (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
diff --git a/hw/apic.h b/hw/apic.h
index 738d98a..9c646f0 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -5,6 +5,7 @@ typedef struct IOAPICState IOAPICState;
 int apic_deliver_irq(uint8_t dest, uint8_t dest_mode,
                      uint8_t delivery_mode, uint8_t vector_num,
                      uint8_t polarity, uint8_t trigger_mode);
+void apic_send_msi(target_phys_addr_t addr, uint32 data);
 int apic_init(CPUState *env);
 int apic_accept_pic_intr(CPUState *env);
 void apic_deliver_pic_intr(CPUState *env, int level);
diff --git a/hw/hpet.c b/hw/hpet.c
index b7d4771..bd669dc 100644
--- a/hw/hpet.c
+++ b/hw/hpet.c
@@ -31,6 +31,7 @@
 #include "hpet_emul.h"
 #include "sysbus.h"
 #include "mc146818rtc.h"
+#include "apic.h"
 
 //#define HPET_DEBUG
 #ifdef HPET_DEBUG
@@ -39,6 +40,8 @@
 #define DPRINTF(...)
 #endif
 
+#define HPET_MSI_SUPPORT        0
+
 struct HPETState;
 typedef struct HPETTimer {  /* timers */
     uint8_t tn;             /*timer number*/
@@ -47,7 +50,7 @@ typedef struct HPETTimer {  /* timers */
     /* Memory-mapped, software visible timer registers */
     uint64_t config;        /* configuration/cap */
     uint64_t cmp;           /* comparator */
-    uint64_t fsb;           /* FSB route, not supported now */
+    uint64_t fsb;           /* FSB route */
     /* Hidden register state */
     uint64_t period;        /* Last value written to comparator */
     uint8_t wrap_flag;      /* timer pop will indicate wrap for one-shot 32-bit
@@ -59,6 +62,7 @@ typedef struct HPETState {
     SysBusDevice busdev;
     uint64_t hpet_offset;
     qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
+    uint32_t flags;
     uint8_t rtc_irq_level;
     uint8_t num_timers;
     HPETTimer timer[HPET_MAX_TIMERS];
@@ -80,6 +84,11 @@ static uint32_t timer_int_route(struct HPETTimer *timer)
     return (timer->config & HPET_TN_INT_ROUTE_MASK) >> HPET_TN_INT_ROUTE_SHIFT;
 }
 
+static uint32_t timer_fsb_route(HPETTimer *t)
+{
+    return t->config & HPET_TN_FSB_ENABLE;
+}
+
 static uint32_t hpet_enabled(HPETState *s)
 {
     return s->config & HPET_CFG_ENABLE;
@@ -179,7 +188,11 @@ static void update_irq(struct HPETTimer *timer, int set)
     mask = 1 << timer->tn;
     if (!set || !timer_enabled(timer) || !hpet_enabled(timer->state)) {
         s->isr &= ~mask;
-        qemu_irq_lower(s->irqs[route]);
+        if (!timer_fsb_route(timer)) {
+            qemu_irq_lower(s->irqs[route]);
+        }
+    } else if (timer_fsb_route(timer)) {
+        apic_send_msi(timer->fsb >> 32, timer->fsb & 0xffffffff);
     } else if (timer->config & HPET_TN_TYPE_LEVEL) {
         s->isr |= mask;
         qemu_irq_raise(s->irqs[route]);
@@ -216,6 +229,12 @@ static int hpet_post_load(void *opaque, int version_id)
     /* Push number of timers into capability returned via HPET_ID */
     s->capability &= ~HPET_ID_NUM_TIM_MASK;
     s->capability |= (s->num_timers - 1) << HPET_ID_NUM_TIM_SHIFT;
+
+    /* Derive HPET_MSI_SUPPORT from the capability of the first timer. */
+    s->flags &= ~(1 << HPET_MSI_SUPPORT);
+    if (s->timer[0].config & HPET_TN_FSB_CAP) {
+        s->flags |= 1 << HPET_MSI_SUPPORT;
+    }
     return 0;
 }
 
@@ -361,6 +380,8 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
         case HPET_TN_CMP + 4:
             return timer->cmp >> 32;
         case HPET_TN_ROUTE:
+            return timer->fsb;
+        case HPET_TN_ROUTE + 4:
             return timer->fsb >> 32;
         default:
             DPRINTF("qemu: invalid hpet_ram_readl\n");
@@ -444,6 +465,9 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
         switch ((addr - 0x100) % 0x20) {
         case HPET_TN_CFG:
             DPRINTF("qemu: hpet_ram_writel HPET_TN_CFG\n");
+            if (activating_bit(old_val, new_val, HPET_TN_FSB_ENABLE)) {
+                update_irq(timer, 0);
+            }
             val = hpet_fixup_reg(new_val, old_val, HPET_TN_CFG_WRITE_MASK);
             timer->config = (timer->config & 0xffffffff00000000ULL) | val;
             if (new_val & HPET_TN_32BIT) {
@@ -501,8 +525,11 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
                     hpet_set_timer(timer);
                 }
                 break;
+        case HPET_TN_ROUTE:
+            timer->fsb = (timer->fsb & 0xffffffff00000000ULL) | new_val;
+            break;
         case HPET_TN_ROUTE + 4:
-            DPRINTF("qemu: hpet_ram_writel HPET_TN_ROUTE + 4\n");
+            timer->fsb = (new_val << 32) | (timer->fsb & 0xffffffff);
             break;
         default:
             DPRINTF("qemu: invalid hpet_ram_writel\n");
@@ -610,7 +637,10 @@ static void hpet_reset(DeviceState *d)
 
         hpet_del_timer(timer);
         timer->cmp = ~0ULL;
-        timer->config =  HPET_TN_PERIODIC_CAP | HPET_TN_SIZE_CAP;
+        timer->config = HPET_TN_PERIODIC_CAP | HPET_TN_SIZE_CAP;
+        if (s->flags & (1 << HPET_MSI_SUPPORT)) {
+            timer->config |= HPET_TN_FSB_CAP;
+        }
         /* advertise availability of ioapic inti2 */
         timer->config |=  0x00000004ULL << 32;
         timer->period = 0ULL;
@@ -685,6 +715,7 @@ static SysBusDeviceInfo hpet_device_info = {
     .init         = hpet_init,
     .qdev.props = (Property[]) {
         DEFINE_PROP_UINT8("timers", HPETState, num_timers, HPET_MIN_TIMERS),
+        DEFINE_PROP_BIT("msi", HPETState, flags, HPET_MSI_SUPPORT, false),
         DEFINE_PROP_END_OF_LIST(),
     },
 };
diff --git a/hw/hpet_emul.h b/hw/hpet_emul.h
index 4b537d2..d94d84b 100644
--- a/hw/hpet_emul.h
+++ b/hw/hpet_emul.h
@@ -46,7 +46,9 @@
 #define HPET_TN_SETVAL           0x040
 #define HPET_TN_32BIT            0x100
 #define HPET_TN_INT_ROUTE_MASK  0x3e00
-#define HPET_TN_CFG_WRITE_MASK  0x3f4e
+#define HPET_TN_FSB_ENABLE      0x4000
+#define HPET_TN_FSB_CAP         0x8000
+#define HPET_TN_CFG_WRITE_MASK  0x7f4e
 #define HPET_TN_INT_ROUTE_SHIFT      9
 #define HPET_TN_INT_ROUTE_CAP_SHIFT 32
 #define HPET_TN_CFG_BITS_READONLY_OR_RESERVED 0xffff80b1U
-- 
1.6.0.2
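
For readers following the FSB route handling, here is a small standalone
sketch of how the 64-bit fsb value is assembled from two 32-bit register
writes and then split again on delivery. In the diff, the upper half is
passed to apic_send_msi() as the address and the lower half as the data.
The helper names below are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: a write to HPET_TN_ROUTE replaces the low half. */
static uint64_t fsb_write_low(uint64_t fsb, uint32_t val)
{
    return (fsb & 0xffffffff00000000ULL) | val;
}

/* Hypothetical helper: a write to HPET_TN_ROUTE + 4 replaces the high half.
 * The cast matters: shifting a 32-bit value by 32 would be undefined. */
static uint64_t fsb_write_high(uint64_t fsb, uint32_t val)
{
    return ((uint64_t)val << 32) | (fsb & 0xffffffffULL);
}

/* Hypothetical helpers: split the route into MSI address and MSI data. */
static uint32_t fsb_msi_addr(uint64_t fsb)
{
    return (uint32_t)(fsb >> 32);
}

static uint32_t fsb_msi_data(uint64_t fsb)
{
    return (uint32_t)(fsb & 0xffffffffULL);
}
```

A guest that programs the high half with 0xfee00000 and the low half with
0x4041 would, under this sketch, get exactly those two values back as
address and data, regardless of the order of the two writes.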

* [Qemu-devel] [RFT][PATCH 15/15] monitor/QMP: Drop info hpet / query-hpet
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (13 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 14/15] hpet: Add MSI support Jan Kiszka
@ 2010-05-24 20:13 ` Jan Kiszka
  2010-05-24 22:16 ` [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Anthony Liguori
  15 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Juan Quintela

From: Jan Kiszka <jan.kiszka@siemens.com>

This command was of minimal use before; now it is useless, as the HPET
has become a qdev device and is thus easily discoverable. We should
definitely not set query-hpet in stone in QMP, and there is also no good
reason to keep it for the interactive monitor.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 QMP/vm-info |    2 +-
 monitor.c   |   33 ---------------------------------
 2 files changed, 1 insertions(+), 34 deletions(-)

diff --git a/QMP/vm-info b/QMP/vm-info
index b150d82..8ebaeb3 100755
--- a/QMP/vm-info
+++ b/QMP/vm-info
@@ -25,7 +25,7 @@ def main():
     qemu = qmp.QEMUMonitorProtocol(argv[1])
     qemu.connect()
 
-    for cmd in [ 'version', 'hpet', 'kvm', 'status', 'uuid', 'balloon' ]:
+    for cmd in [ 'version', 'kvm', 'status', 'uuid', 'balloon' ]:
         print cmd + ': ' + str(qemu.send('query-' + cmd))
 
 if __name__ == '__main__':
diff --git a/monitor.c b/monitor.c
index ad50f12..d2abf07 100644
--- a/monitor.c
+++ b/monitor.c
@@ -777,31 +777,6 @@ static void do_info_commands(Monitor *mon, QObject **ret_data)
     *ret_data = QOBJECT(cmd_list);
 }
 
-#if defined(TARGET_I386)
-static void do_info_hpet_print(Monitor *mon, const QObject *data)
-{
-    monitor_printf(mon, "HPET is %s by QEMU\n",
-                   qdict_get_bool(qobject_to_qdict(data), "enabled") ?
-                   "enabled" : "disabled");
-}
-
-/**
- * do_info_hpet(): Show HPET state
- *
- * Return a QDict with the following information:
- *
- * - "enabled": true if hpet if enabled, false otherwise
- *
- * Example:
- *
- * { "enabled": true }
- */
-static void do_info_hpet(Monitor *mon, QObject **ret_data)
-{
-    *ret_data = qobject_from_jsonf("{ 'enabled': %i }", !no_hpet);
-}
-#endif
-
 static void do_info_uuid_print(Monitor *mon, const QObject *data)
 {
     monitor_printf(mon, "%s\n", qdict_get_str(qobject_to_qdict(data), "UUID"));
@@ -2611,14 +2586,6 @@ static const mon_cmd_t info_cmds[] = {
         .help       = "show the active virtual memory mappings",
         .mhandler.info = mem_info,
     },
-    {
-        .name       = "hpet",
-        .args_type  = "",
-        .params     = "",
-        .help       = "show state of HPET",
-        .user_print = do_info_hpet_print,
-        .mhandler.info_new = do_info_hpet,
-    },
 #endif
     {
         .name       = "jit",
-- 
1.6.0.2

* [Qemu-devel] Re: [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access Jan Kiszka
@ 2010-05-24 20:34   ` Juan Quintela
  2010-05-24 20:36     ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Juan Quintela @ 2010-05-24 20:34 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel

Jan Kiszka <jan.kiszka@web.de> wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> Also prevent out-of-bounds write access to the timers but don't spam the
> host console if it triggers.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  hw/hpet.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/hw/hpet.c b/hw/hpet.c
> index 8729fb2..1980906 100644
> --- a/hw/hpet.c
> +++ b/hw/hpet.c
> @@ -294,7 +294,7 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
>      if (index >= 0x100 && index <= 0x3ff) {
>          uint8_t timer_id = (addr - 0x100) / 0x20;
>          if (timer_id > HPET_NUM_TIMERS - 1) {
> -            printf("qemu: timer id out of range\n");
> +            DPRINTF("qemu: timer id out of range\n");
>              return 0;
>          }
>          HPETTimer *timer = &s->timer[timer_id];
> @@ -383,6 +383,10 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
>          DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);

if you are going to check timer_id, check it before accessing the array?

>          HPETTimer *timer = &s->timer[timer_id];
>  
> +        if (timer_id > HPET_NUM_TIMERS - 1) {
> +            DPRINTF("qemu: timer id out of range\n");
> +            return;
> +        }
>          switch ((addr - 0x100) % 0x20) {
>              case HPET_TN_CFG:
>                  DPRINTF("qemu: hpet_ram_writel HPET_TN_CFG\n");

* [Qemu-devel] Re: [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access
  2010-05-24 20:34   ` [Qemu-devel] " Juan Quintela
@ 2010-05-24 20:36     ` Jan Kiszka
  2010-05-24 20:50       ` Juan Quintela
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-24 20:36 UTC (permalink / raw)
  To: Juan Quintela; +Cc: blue Swirl, qemu-devel

Juan Quintela wrote:
> Jan Kiszka <jan.kiszka@web.de> wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Also prevent out-of-bounds write access to the timers but don't spam the
>> host console if it triggers.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>  hw/hpet.c |    6 +++++-
>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/hpet.c b/hw/hpet.c
>> index 8729fb2..1980906 100644
>> --- a/hw/hpet.c
>> +++ b/hw/hpet.c
>> @@ -294,7 +294,7 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
>>      if (index >= 0x100 && index <= 0x3ff) {
>>          uint8_t timer_id = (addr - 0x100) / 0x20;
>>          if (timer_id > HPET_NUM_TIMERS - 1) {
>> -            printf("qemu: timer id out of range\n");
>> +            DPRINTF("qemu: timer id out of range\n");
>>              return 0;
>>          }
>>          HPETTimer *timer = &s->timer[timer_id];
>> @@ -383,6 +383,10 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
>>          DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
> 
> if you are going to check timer_id, check it before accessing the array?

That's just address arithmetic, nothing is dereferenced at this point.

Jan
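
The ordering question raised in this exchange can be illustrated with a
minimal standalone sketch. Forming &s->timer[timer_id] reads no memory,
as noted above. Strictly speaking, though, ISO C only defines pointer
arithmetic within an array or one element past its end, so checking the
index first is the more conservative ordering. All type and function
names below are hypothetical; the offsets mirror those in the quoted
hunk.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_TIMERS 3   /* mirrors HPET_NUM_TIMERS in the quoted patch */

typedef struct SketchTimer {
    uint64_t config;
} SketchTimer;

typedef struct SketchState {
    SketchTimer timer[SKETCH_NUM_TIMERS];
} SketchState;

/* Return the timer addressed by a register offset, or NULL if the guest
 * addressed a timer block that does not exist. */
static SketchTimer *sketch_timer_lookup(SketchState *s, uint64_t addr)
{
    uint64_t timer_id;

    if (addr < 0x100 || addr > 0x3ff) {
        return NULL;            /* not in the per-timer register window */
    }
    timer_id = (addr - 0x100) / 0x20;
    if (timer_id > SKETCH_NUM_TIMERS - 1) {
        return NULL;            /* out of range: reject before indexing */
    }
    return &s->timer[timer_id];
}
```

With this shape, the read and write paths can share one lookup and bail
out on NULL, so the bounds check cannot be forgotten in either of them.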



* [Qemu-devel] Re: [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings Jan Kiszka
@ 2010-05-24 20:37   ` Juan Quintela
  0 siblings, 0 replies; 122+ messages in thread
From: Juan Quintela @ 2010-05-24 20:37 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel

Jan Kiszka <jan.kiszka@web.de> wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> This moves the private HPET structures into the C module,

I almost did this one on my previous series, thanks.

> simplifies
> some helper functions and fixes most coding style issues (biggest chunk
> was improper switch-case indention). No functional changes.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

* [Qemu-devel] Re: [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access
  2010-05-24 20:36     ` Jan Kiszka
@ 2010-05-24 20:50       ` Juan Quintela
  0 siblings, 0 replies; 122+ messages in thread
From: Juan Quintela @ 2010-05-24 20:50 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, qemu-devel

Jan Kiszka <jan.kiszka@web.de> wrote:
> Juan Quintela wrote:
>> Jan Kiszka <jan.kiszka@web.de> wrote:
>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>
>>> Also prevent out-of-bounds write access to the timers but don't spam the
>>> host console if it triggers.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>> ---
>>>  hw/hpet.c |    6 +++++-
>>>  1 files changed, 5 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/hw/hpet.c b/hw/hpet.c
>>> index 8729fb2..1980906 100644
>>> --- a/hw/hpet.c
>>> +++ b/hw/hpet.c
>>> @@ -294,7 +294,7 @@ static uint32_t hpet_ram_readl(void *opaque, target_phys_addr_t addr)
>>>      if (index >= 0x100 && index <= 0x3ff) {
>>>          uint8_t timer_id = (addr - 0x100) / 0x20;
>>>          if (timer_id > HPET_NUM_TIMERS - 1) {
>>> -            printf("qemu: timer id out of range\n");
>>> +            DPRINTF("qemu: timer id out of range\n");
>>>              return 0;
>>>          }
>>>          HPETTimer *timer = &s->timer[timer_id];
>>> @@ -383,6 +383,10 @@ static void hpet_ram_writel(void *opaque, target_phys_addr_t addr,
>>>          DPRINTF("qemu: hpet_ram_writel timer_id = %#x \n", timer_id);
>> 
>> if you are going to check timer_id, check it before accessing the array?
>
> That's just address arithmetic, nothing is dereferenced at this point.

hahahahahha

/me back to the pointer class.

Later, Juan.

* Re: [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements
  2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
                   ` (14 preceding siblings ...)
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 15/15] monitor/QMP: Drop info hpet / query-hpet Jan Kiszka
@ 2010-05-24 22:16 ` Anthony Liguori
  15 siblings, 0 replies; 122+ messages in thread
From: Anthony Liguori @ 2010-05-24 22:16 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, qemu-devel, Juan Quintela

On 05/24/2010 03:13 PM, Jan Kiszka wrote:
> Not yet for merge (unless I happened to forgot adding bugs), just a
> Request For Testing (and for review, of course). This series grew beyond
> my initial plans and my current testing capabilities, Linux and Win7 are
> apparently still fine, but that's all I can say so far.
>
> To summarize contributions to the HPET model:
>   - fixed host memory corruptions by guest (patch 1, likely stable stuff)
>   - coding style cleanup
>   - qdev conversion
>   - detangling of RTC from HPET dependencies, specifically via providing
>     "feedback" IRQ handlers to easy IRQ coalescing workaround
>     implementations
>   - support for level-triggered HPET IRQs (untested - does anyone know of
>     a user?)
>   - up to 32 comparators (configurable via qdev prop)
>   - MSI support (configurable via qdev prop)
>   - dropped obsolete "info hpet" and "query-hpet"
>    

Very nice series!

Regards,

Anthony Liguori

> Yet missing:
>   - IRQ coalescing workaround
>   - maybe some refactoring to easy compile-time disabling (Juan?)
>   - build once (I leave this to Blue Swirl :) )
>   - multiple HPET blocks (no urgent need yet)
>
> Please give this hell.
>
> Jan Kiszka (15):
>    hpet: Catch out-of-bounds timer access
>    hpet: Coding style cleanups and some refactorings
>    hpet: Silence warning on write to running main counter
>    hpet: Move static timer field initialization
>    hpet: Convert to qdev
>    hpet: Start/stop timer when HPET_TN_ENABLE is modified
>    qemu_irq: Add IRQ handlers with delivery feedback
>    x86: Refactor RTC IRQ coalescing workaround
>    hpet/rtc: Rework RTC IRQ replacement by HPET
>    hpet: Drop static state
>    hpet: Add support for level-triggered interrupts
>    vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8
>    hpet: Make number of timers configurable
>    hpet: Add MSI support
>    monitor/QMP: Drop info hpet / query-hpet
>
>   QMP/vm-info      |    2 +-
>   hw/apic.c        |   63 +++---
>   hw/apic.h        |   11 +-
>   hw/hpet.c        |  582 +++++++++++++++++++++++++++++++++--------------------
>   hw/hpet_emul.h   |   46 +----
>   hw/hw.h          |   10 +
>   hw/i8259.c       |   20 ++-
>   hw/ioapic.c      |   34 ++--
>   hw/irq.c         |   38 +++-
>   hw/irq.h         |   22 ++-
>   hw/mc146818rtc.c |   60 ++----
>   hw/mc146818rtc.h |    4 +-
>   hw/mips_jazz.c   |    2 +-
>   hw/mips_malta.c  |    2 +-
>   hw/mips_r4k.c    |    2 +-
>   hw/pc.c          |   33 +++-
>   hw/pc.h          |    2 +-
>   hw/pc_piix.c     |    2 +-
>   hw/ppc_prep.c    |    2 +-
>   monitor.c        |   33 ---
>   20 files changed, 552 insertions(+), 418 deletions(-)
>
>
>
>    

* Re: [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Jan Kiszka
@ 2010-05-25  6:07   ` Gleb Natapov
  2010-05-25  6:31     ` Jan Kiszka
  2010-05-25 19:09   ` [Qemu-devel] " Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-25  6:07 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> This allows to communicate potential IRQ coalescing during delivery from
> the sink back to the source. Targets that support IRQ coalescing
> workarounds need to register handlers that return the appropriate
> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> apply its workaround. If multiple sinks exist, the source may only
> consider an IRQ coalesced if all other sinks either report
> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> 
Well, almost two years have passed since this approach was first
proposed[1] ;). Back then it generated a bunch of nonsensical comments
about real hardware not working this way, so the hack that we have now was
introduced to overcome this resistance. I hope enough time has passed for
people to gain some sense and the approach will be adopted this time.
Really, this should have been done two years ago.

[1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html

--
			Gleb.

* Re: [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25  6:07   ` Gleb Natapov
@ 2010-05-25  6:31     ` Jan Kiszka
  2010-05-25  6:40       ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25  6:31 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> This allows to communicate potential IRQ coalescing during delivery from
>> the sink back to the source. Targets that support IRQ coalescing
>> workarounds need to register handlers that return the appropriate
>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> apply its workaround. If multiple sinks exist, the source may only
>> consider an IRQ coalesced if all other sinks either report
>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>
> Well, almost two years passed since this approach was proposed first
> time[1] ;). Back then it generated bunch of nonsensical comments about
> real hardware not working this way, so the hack that we have now was
> introduce to overcome this resistance. I hope enough time passed for
> people to gain some sense and the approach will be adopted this time.
> Really this should have been done two year ago.
> 
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html

Yeah, I somehow had a vague feeling that there must have been an earlier
attempt. I think my approach could be slightly easier to accept as it
does not require converting all platforms; that can happen on demand
(or not at all). Moreover, I think that the third return state,
QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.

However, as I see it now, we just have two options long term:
 - drop IRQ coalescing workarounds
 - properly support them via qemu_irq

The current hack cannot stay. E.g., it does not scale because it depends
on a global variable of the APIC. So we would never be able to protect
the APICs with individual locks.

Jan
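
To make the role of the third state concrete, here is a minimal sketch of
how a source might fold the results of several sinks into one verdict.
It follows the rule from the patch description: an IRQ counts as
coalesced only if no sink delivered it and every sink reported either
coalesced or masked. The enum and its values are hypothetical stand-ins
for the QEMU_IRQ_* codes, and combine_sinks() is not a QEMU function.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the QEMU_IRQ_* result codes. */
typedef enum {
    SKETCH_IRQ_MASKED,      /* sink has the line masked, no statement made */
    SKETCH_IRQ_COALESCED,   /* sink merged this IRQ with a pending one */
    SKETCH_IRQ_DELIVERED,   /* sink accepted the IRQ */
} sketch_irq_result;

/* Fold per-sink results into the value reported back to the source. */
static sketch_irq_result combine_sinks(const sketch_irq_result *r, size_t n)
{
    int coalesced = 0;

    for (size_t i = 0; i < n; i++) {
        if (r[i] == SKETCH_IRQ_DELIVERED) {
            return SKETCH_IRQ_DELIVERED;    /* one delivery is enough */
        }
        if (r[i] == SKETCH_IRQ_COALESCED) {
            coalesced = 1;
        }
    }
    return coalesced ? SKETCH_IRQ_COALESCED : SKETCH_IRQ_MASKED;
}
```

Without a distinct masked state, a sink that simply ignores the line
would have to answer "delivered" or "coalesced", and either answer would
distort the verdict of the sink that is actually in use.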



* Re: [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25  6:31     ` Jan Kiszka
@ 2010-05-25  6:40       ` Gleb Natapov
  2010-05-25  6:54         ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-25  6:40 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, qemu-devel, Juan Quintela

On Tue, May 25, 2010 at 08:31:06AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
> >> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>
> >> This allows to communicate potential IRQ coalescing during delivery from
> >> the sink back to the source. Targets that support IRQ coalescing
> >> workarounds need to register handlers that return the appropriate
> >> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
> >> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> apply its workaround. If multiple sinks exist, the source may only
> >> consider an IRQ coalesced if all other sinks either report
> >> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>
> > Well, almost two years passed since this approach was proposed first
> > time[1] ;). Back then it generated bunch of nonsensical comments about
> > real hardware not working this way, so the hack that we have now was
> > introduce to overcome this resistance. I hope enough time passed for
> > people to gain some sense and the approach will be adopted this time.
> > Really this should have been done two year ago.
> > 
> > [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html
> 
> Yeah, I somehow had a vague feeling that there must have been an earlier
> attempt. I think my approach could be a slightly easier to accept as it
> does not require converting all platforms, but this can happen on demand
> (or not at all).
I proposed that too at some point (too lazy to look for it in the archives,
and it was in words, not a patch). But since resistance to that approach
was baseless from the beginning, no sane arguments or compromises would
have helped at that point.

>                   Moreover, I think that the third return state,
> QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.
My patch had that too: <0 = coalesced, 0 = masked, >0 = delivered

> 
> However, as I would see it now, we just have two options long term:
>  - drop IRQ coalescing workarounds
>  - properly support them via qemu_irq
> 
> The current hack cannot stay. E.g., it does not scale because it depends
> on a global variable of the APIC. So we would never able to protect the
> APICs with individual locks.
> 
Agreed, and that was well understood at the time the hack was introduced.
But we can't just drop RTC IRQ reinjection. It would cripple qemu.
So we have only _one_ option:
 - properly support them via qemu_irq

--
			Gleb.

* Re: [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25  6:40       ` Gleb Natapov
@ 2010-05-25  6:54         ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25  6:54 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Tue, May 25, 2010 at 08:31:06AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Mon, May 24, 2010 at 10:13:40PM +0200, Jan Kiszka wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>> the sink back to the source. Targets that support IRQ coalescing
>>>> workarounds need to register handlers that return the appropriate
>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>> apply its workaround. If multiple sinks exist, the source may only
>>>> consider an IRQ coalesced if all other sinks either report
>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>
>>> Well, almost two years passed since this approach was proposed first
>>> time[1] ;). Back then it generated bunch of nonsensical comments about
>>> real hardware not working this way, so the hack that we have now was
>>> introduce to overcome this resistance. I hope enough time passed for
>>> people to gain some sense and the approach will be adopted this time.
>>> Really this should have been done two year ago.
>>>
>>> [1] http://lists.nongnu.org/archive/html/qemu-devel/2008-06/msg00757.html
>> Yeah, I somehow had a vague feeling that there must have been an earlier
>> attempt. I think my approach could be a slightly easier to accept as it
>> does not require converting all platforms, but this can happen on demand
>> (or not at all).
> I proposed that too at some point (to lazy to look for it in archives
> and it was in words not patch). But since resistance to that approach
> from the beginning was baseless no sane arguments or compromises would
> help at that point.
> 
>>                   Moreover, I think that the third return state,
>> QEMU_IRQ_MASKED, is important for correct handling of multiple IRQ sinks.
> My patch had that too: <0 = coalesced, 0 = masked, >0=delivered

Oh, indeed! Good to see that we came up with the same logic.

> 
>> However, as I would see it now, we just have two options long term:
>>  - drop IRQ coalescing workarounds
>>  - properly support them via qemu_irq
>>
>> The current hack cannot stay. E.g., it does not scale because it depends
>> on a global variable of the APIC. So we would never able to protect the
>> APICs with individual locks.
>>
> Agree, and that was well understood at the time the hack was introduced.
> But we can't just drop RTC IRQ reinjectoin. It would be crippling of qemu.
> So we have only _one_ option:
>  - properly support them via qemu_irq

You won't hear me disagreeing.

Jan



* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET Jan Kiszka
@ 2010-05-25  9:29   ` Paul Brook
  2010-05-25 10:23     ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-25  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Jan Kiszka, Juan Quintela

>          for (i = 0; i < 24; i++) {
>              sysbus_connect_irq(sysbus_from_qdev(hpet), i, isa_irq[i]);
>          }
> +        rtc_irq = qemu_allocate_feedback_irqs(hpet_handle_rtc_irq,
> +                                              sysbus_from_qdev(hpet), 1);
>      }

This is wrong. The hpet device should expose this as an IO pin.

Paul

* Re: [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev Jan Kiszka
@ 2010-05-25  9:37   ` Paul Brook
  2010-05-25 10:14     ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-25  9:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: blue Swirl, Jan Kiszka, Jan Kiszka, Juan Quintela

> +static SysBusDeviceInfo hpet_device_info = {
> +    .qdev.name    = "hpet",
> +    .qdev.size    = sizeof(HPETState),
> +    .qdev.no_user = 1,

Why shouldn't the user create HPET devices? I thought you'd removed all the 
global state.

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev
  2010-05-25  9:37   ` Paul Brook
@ 2010-05-25 10:14     ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 10:14 UTC (permalink / raw)
  To: Paul Brook; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

Paul Brook wrote:
>> +static SysBusDeviceInfo hpet_device_info = {
>> +    .qdev.name    = "hpet",
>> +    .qdev.size    = sizeof(HPETState),
>> +    .qdev.no_user = 1,
> 
> Why shouldn't the user create HPET devices? I thought you'd removed all the 
> global state.

Long-term, there is no reason to deny this.

But the code is not yet ready for this: we statically instantiate it
during PC setup to establish the routings and respect -no-hpet. Also,
the BIOS isn't prepared for > 1 HPET.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25  9:29   ` Paul Brook
@ 2010-05-25 10:23     ` Jan Kiszka
  2010-05-25 11:05       ` Paul Brook
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 10:23 UTC (permalink / raw)
  To: Paul Brook; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

Paul Brook wrote:
>>          for (i = 0; i < 24; i++) {
>>              sysbus_connect_irq(sysbus_from_qdev(hpet), i, isa_irq[i]);
>>          }
>> +        rtc_irq = qemu_allocate_feedback_irqs(hpet_handle_rtc_irq,
>> +                                              sysbus_from_qdev(hpet), 1);
>>      }
> 
> This is wrong. The hpet device should expose this as an IO pin.

Will look into this.

BTW, I just realized that the GPIO handling is apparently lacking
support for attaching an output to multiple inputs. Or am I missing
something?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 10:23     ` Jan Kiszka
@ 2010-05-25 11:05       ` Paul Brook
  2010-05-25 11:19         ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-25 11:05 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

> > This is wrong. The hpet device should expose this as an IO pin.
> 
> Will look into this.
> 
> BTW, I just realized that the GPIO handling is apparently lacking
> support for attaching an output to multiple inputs. Or am I missing
> something?

Use an explicit mux.

Incidentally I suspect your handling of the ISA IRQs is broken. You may never 
have more than one source connected to a sink.  Shared IRQ lines must be done 
explicitly.

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 11:05       ` Paul Brook
@ 2010-05-25 11:19         ` Jan Kiszka
  2010-05-25 11:23           ` Paul Brook
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 11:19 UTC (permalink / raw)
  To: Paul Brook; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

Paul Brook wrote:
>>> This is wrong. The hpet device should expose this as an IO pin.
>> Will look into this.
>>
>> BTW, I just realized that the GPIO handling is apparently lacking
>> support for attaching an output to multiple inputs. Or am I missing
>> something?
> 
> Use an explicit mux.
> 
> Incidentally I suspect your handling of the ISA IRQs is broken. You may never 
> have more than one source connected to a sink.  Shared IRQ lines must be done 
> explicitly.

No, the other way around: one source (RTC) multiple sinks (HPET, ACPI).
Will probably draft a generic irq/gpio mux.
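
For illustration only, a minimal sketch of what such a fan-out mux could look like, assuming a toy stand-in for the qemu_irq machinery. Every name here (ToyIRQ, ToyIRQMux, toy_mux_connect, and so on) is hypothetical and not part of QEMU; the sketch forwards plain levels and does not attempt delivery feedback.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for QEMU's IRQ types; all names are hypothetical. */
typedef void (*toy_irq_handler)(void *opaque, int n, int level);

typedef struct ToyIRQ {
    toy_irq_handler handler;
    void *opaque;
    int n;
} ToyIRQ;

/* Drive one line; an unconnected (NULL) line is silently ignored. */
static void toy_set_irq(ToyIRQ *irq, int level)
{
    if (irq) {
        irq->handler(irq->opaque, irq->n, level);
    }
}

#define TOY_MUX_MAX_OUT 4

/* One input (e.g. the RTC output), several sinks (e.g. HPET, ACPI). */
typedef struct ToyIRQMux {
    ToyIRQ *out[TOY_MUX_MAX_OUT];
    int num_out;
} ToyIRQMux;

/* Attach another sink; returns 0 on success, -1 if the mux is full. */
static int toy_mux_connect(ToyIRQMux *mux, ToyIRQ *sink)
{
    assert(sink != NULL);
    if (mux->num_out >= TOY_MUX_MAX_OUT) {
        return -1;
    }
    mux->out[mux->num_out++] = sink;
    return 0;
}

/* Input handler of the mux: forward the level to every connected sink. */
static void toy_mux_input(void *opaque, int n, int level)
{
    ToyIRQMux *mux = opaque;
    int i;

    (void)n; /* the mux has a single input line */
    for (i = 0; i < mux->num_out; i++) {
        toy_set_irq(mux->out[i], level);
    }
}

/* Example sink: remember the last level seen on the line. */
static void toy_record_level(void *opaque, int n, int level)
{
    (void)n;
    *(int *)opaque = level;
}
```

The source would own a ToyIRQ whose handler is toy_mux_input and whose opaque pointer is the mux, so the source itself still drives exactly one output.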

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 11:19         ` Jan Kiszka
@ 2010-05-25 11:23           ` Paul Brook
  2010-05-25 11:26             ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-25 11:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

> Paul Brook wrote:
> >>> This is wrong. The hpet device should expose this as an IO pin.
> >> 
> >> Will look into this.
> >> 
> >> BTW, I just realized that the GPIO handling is apparently lacking
> >> support for attaching an output to multiple inputs. Or am I missing
> >> something?
> > 
> > Use an explicit mux.
> > 
> > Incidentally I suspect your handling of the ISA IRQs is broken. You may
> > never have more than one source connected to a sink.  Shared IRQ lines
> > must be done explicitly.
> 
> No, the other way around: one source (RTC) multiple sinks (HPET, ACPI).
> Will probably draft a generic irq/gpio mux.

I realise that. However I'd expect things to break if the guest OS decides to 
share an IRQ line between the HPET and some other device.

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 11:23           ` Paul Brook
@ 2010-05-25 11:26             ` Jan Kiszka
  2010-05-25 12:03               ` Paul Brook
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 11:26 UTC (permalink / raw)
  To: Paul Brook; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

Paul Brook wrote:
>> Paul Brook wrote:
>>>>> This is wrong. The hpet device should expose this as an IO pin.
>>>> Will look into this.
>>>>
>>>> BTW, I just realized that the GPIO handling is apparently lacking
>>>> support for attaching an output to multiple inputs. Or am I missing
>>>> something?
>>> Use an explicit mux.
>>>
>>> Incidentally I suspect your handling of the ISA IRQs is broken. You may
>>> never have more than one source connected to a sink.  Shared IRQ lines
>>> must be done explicitly.
>> No, the other way around: one source (RTC) multiple sinks (HPET, ACPI).
>> Will probably draft a generic irq/gpio mux.
> 
> I realise that. However I'd expect things to break if the guest OS decides to 
> share an IRQ line between the HPET and some other device.

The guest would share IRQ8, not the RTC output. So there would be no
difference to the current situation.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 11:26             ` Jan Kiszka
@ 2010-05-25 12:03               ` Paul Brook
  2010-05-25 12:39                 ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-25 12:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

> > I realise that. However I'd expect things to break if the guest OS
> > decides to share an IRQ line between the HPET and some other device.
> 
> The guest would share IRQ8, not the RTC output. So there would be no
> difference to the current situation.

The difference is that you've removed the check that prevented overlap between 
the PIC and another device.  You should be using isa_reserve_irq/isa_init_irq 
before you use an ISA IRQ line.  Any uses of isa_bus_irqs (including the 
existing HPET code) are probably broken.

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET
  2010-05-25 12:03               ` Paul Brook
@ 2010-05-25 12:39                 ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 12:39 UTC (permalink / raw)
  To: Paul Brook; +Cc: blue Swirl, Jan Kiszka, qemu-devel, Juan Quintela

Paul Brook wrote:
>>> I realise that. However I'd expect things to break if the guest OS
>>> decides to share an IRQ line between the HPET and some other device.
>> The guest would share IRQ8, not the RTC output. So there would be no
>> difference to the current situation.
> 
> The difference is that you've removed the check that prevented overlap between 
> the PIC and another device.  You should be using isa_reserve_irq/isa_init_irq 
> before you use an ISA IRQ line.  Any uses of isa_bus_irqs (including the 
> existing HPET code) are probably broken.

...at least fragile. OK, will address this as well.

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 122+ messages in thread

* [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Jan Kiszka
  2010-05-25  6:07   ` Gleb Natapov
@ 2010-05-25 19:09   ` Blue Swirl
  2010-05-25 20:16     ` Anthony Liguori
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-25 19:09 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> This allows to communicate potential IRQ coalescing during delivery from
> the sink back to the source. Targets that support IRQ coalescing
> workarounds need to register handlers that return the appropriate
> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> apply its workaround. If multiple sinks exist, the source may only
> consider an IRQ coalesced if all other sinks either report
> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

No real devices are interested whether any of their output lines are
even connected. This would introduce a new signal type, bidirectional
multi-level, which is not correct.

I think the real solution to coalescing is to put the logic inside one
device, in this case APIC because it has the information about irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an internal
timer is installed which injects the lost irqs.

Of course, no real device could do such de-coalescing, but with this
approach, the voodoo is contained inside one device, the APIC.

We should also take a step back to think what was the cause of lost
irqs, IIRC uneven execution rate in QEMU. Could this be fixed or taken
into account in timer handling? For example, CPU loop could analyze
the wall clock time between CPU exits and use that to offset the
timers. Thus the timer frequency (in wall clock time) could be made to
correspond a bit more to VCPU execution rate.

>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  hw/irq.c |   38 +++++++++++++++++++++++++++++---------
>  hw/irq.h |   22 +++++++++++++++-------
>  2 files changed, 44 insertions(+), 16 deletions(-)
>
> diff --git a/hw/irq.c b/hw/irq.c
> index 7703f62..db2cce6 100644
> --- a/hw/irq.c
> +++ b/hw/irq.c
> @@ -26,19 +26,27 @@
>
>  struct IRQState {
>     qemu_irq_handler handler;
> +    qemu_irq_fb_handler feedback_handler;
>     void *opaque;
>     int n;
>  };
>
> -void qemu_set_irq(qemu_irq irq, int level)
> +int qemu_set_irq(qemu_irq irq, int level)
>  {
> -    if (!irq)
> -        return;
> -
> -    irq->handler(irq->opaque, irq->n, level);
> +    if (!irq) {
> +        return 0;
> +    }
> +    if (irq->feedback_handler) {
> +        return irq->feedback_handler(irq->opaque, irq->n, level);
> +    } else {
> +        irq->handler(irq->opaque, irq->n, level);
> +        return QEMU_IRQ_DELIVERED;
> +    }
>  }
>
> -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
> +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
> +                               qemu_irq_fb_handler feedback_handler,
> +                               void *opaque, int n)
>  {
>     qemu_irq *s;
>     struct IRQState *p;
> @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
>     for (i = 0; i < n; i++) {
>         p->handler = handler;
> +        p->feedback_handler = feedback_handler;
>         p->opaque = opaque;
>         p->n = i;
>         s[i] = p;
> @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
>     return s;
>  }
>
> +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
> +{
> +    return allocate_irqs(handler, NULL, opaque, n);
> +}
> +
> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
> +                                      void *opaque, int n)
> +{
> +    return allocate_irqs(NULL, handler, opaque, n);
> +}
> +
>  void qemu_free_irqs(qemu_irq *s)
>  {
>     qemu_free(s[0]);
>     qemu_free(s);
>  }
>
> -static void qemu_notirq(void *opaque, int line, int level)
> +static int qemu_notirq(void *opaque, int line, int level)
>  {
>     struct IRQState *irq = opaque;
>
> -    irq->handler(irq->opaque, irq->n, !level);
> +    return qemu_set_irq(irq, !level);
>  }
>
>  qemu_irq qemu_irq_invert(qemu_irq irq)
>  {
>     /* The default state for IRQs is low, so raise the output now.  */
>     qemu_irq_raise(irq);
> -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
> +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
>  }
> diff --git a/hw/irq.h b/hw/irq.h
> index 5daae44..eee03e6 100644
> --- a/hw/irq.h
> +++ b/hw/irq.h
> @@ -3,15 +3,18 @@
>
>  /* Generic IRQ/GPIO pin infrastructure.  */
>
> -/* FIXME: Rmove one of these.  */
> +#define QEMU_IRQ_DELIVERED      0
> +#define QEMU_IRQ_COALESCED      (-1)
> +#define QEMU_IRQ_MASKED         (-2)
> +
>  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
> -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
> +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
>
> -void qemu_set_irq(qemu_irq irq, int level);
> +int qemu_set_irq(qemu_irq irq, int level);
>
> -static inline void qemu_irq_raise(qemu_irq irq)
> +static inline int qemu_irq_raise(qemu_irq irq)
>  {
> -    qemu_set_irq(irq, 1);
> +    return qemu_set_irq(irq, 1);
>  }
>
>  static inline void qemu_irq_lower(qemu_irq irq)
> @@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
>     qemu_set_irq(irq, 0);
>  }
>
> -static inline void qemu_irq_pulse(qemu_irq irq)
> +static inline int qemu_irq_pulse(qemu_irq irq)
>  {
> -    qemu_set_irq(irq, 1);
> +    int ret;
> +
> +    ret = qemu_set_irq(irq, 1);
>     qemu_set_irq(irq, 0);
> +    return ret;
>  }
>
>  /* Returns an array of N IRQs.  */
>  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n);
> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
> +                                      void *opaque, int n);
>  void qemu_free_irqs(qemu_irq *s);
>
>  /* Returns a new IRQ with opposite polarity.  */
> --
> 1.6.0.2
>
>
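
As a rough illustration of the multi-sink rule in the commit message quoted above, here is a minimal sketch of how a source (or a mux in front of it) might fold several feedback codes into one. The QEMU_IRQ_* values are copied from the quoted hw/irq.h hunk; the helper combine_irq_feedback is hypothetical and not part of the patch, and returning QEMU_IRQ_MASKED when every sink is masked is an assumption of this sketch.

```c
#include <assert.h>

/* Return codes as defined in the quoted hw/irq.h hunk. */
#define QEMU_IRQ_DELIVERED      0
#define QEMU_IRQ_COALESCED      (-1)
#define QEMU_IRQ_MASKED         (-2)

/*
 * Hypothetical helper: combine the per-sink results of one IRQ event.
 *
 * - If any sink reports DELIVERED, the event was not lost.
 * - Otherwise, if at least one sink reports COALESCED (and all others
 *   report COALESCED or MASKED), the source may treat it as coalesced.
 * - If every sink is MASKED, report MASKED (assumption of this sketch).
 */
static int combine_irq_feedback(const int *results, int n)
{
    int coalesced = 0;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (results[i] == QEMU_IRQ_DELIVERED) {
            return QEMU_IRQ_DELIVERED;
        }
        if (results[i] == QEMU_IRQ_COALESCED) {
            coalesced = 1;
        }
    }
    return coalesced ? QEMU_IRQ_COALESCED : QEMU_IRQ_MASKED;
}
```

With two sinks, one reporting QEMU_IRQ_COALESCED and one QEMU_IRQ_MASKED, the source would see QEMU_IRQ_COALESCED and could apply its workaround; a single QEMU_IRQ_DELIVERED among the results suppresses it.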

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25 19:09   ` [Qemu-devel] " Blue Swirl
@ 2010-05-25 20:16     ` Anthony Liguori
  2010-05-25 21:44       ` Jan Kiszka
  2010-05-26 19:49       ` Blue Swirl
  0 siblings, 2 replies; 122+ messages in thread
From: Anthony Liguori @ 2010-05-25 20:16 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, Jan Kiszka, qemu-devel, Juan Quintela

On 05/25/2010 02:09 PM, Blue Swirl wrote:
> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>    
>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>
>> This allows to communicate potential IRQ coalescing during delivery from
>> the sink back to the source. Targets that support IRQ coalescing
>> workarounds need to register handlers that return the appropriate
>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> apply its workaround. If multiple sinks exist, the source may only
>> consider an IRQ coalesced if all other sinks either report
>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>      
> No real devices are interested whether any of their output lines are
> even connected. This would introduce a new signal type, bidirectional
> multi-level, which is not correct.
>    

I don't think it's really an issue of correctness, but I wouldn't disagree 
with a suggestion that we ought to introduce a new signal type for this 
type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has an 
interface similar to qemu_irq's.

> I think the real solution to coalescing is put the logic inside one
> device, in this case APIC because it has the information about irq
> delivery. APIC could monitor incoming RTC irqs for frequency
> information and whether they get delivered or not. If not, an internal
> timer is installed which injects the lost irqs.
>
> Of course, no real device could do such de-coalescing, but with this
> approach, the voodoo is contained to insides of one device, APIC.
>
> We should also take a step back to think what was the cause of lost
> irqs, IIRC uneven execution rate in QEMU.

Not only that.  The pathological case is where a host is limited to a 
1kHz timer frequency and the guest requests a 1kHz timer frequency.  
Practically speaking, there is no way we'll ever be able to adjust 
timers to reinject lost interrupts because of the host timer limitation.

>   Could this be fixed or taken
> into account in timer handling? For example, CPU loop could analyze
> the wall clock time between CPU exits and use that to offset the
> timers. Thus the timer frequency (in wall clock time) could be made to
> correspond a bit more to VCPU execution rate.
>    

A lot of what motivates the timer reinjection work is very old linux 
kernels that had fixed userspace timer frequencies.  On newer host 
kernels, it's probably not nearly as important except when you get into 
pathological cases like exposing a high frequency HPET timer to the 
guest where you cannot keep up with the host.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25 20:16     ` Anthony Liguori
@ 2010-05-25 21:44       ` Jan Kiszka
  2010-05-26  8:08         ` Gleb Natapov
  2010-05-26 19:55         ` Blue Swirl
  2010-05-26 19:49       ` Blue Swirl
  1 sibling, 2 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-25 21:44 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Blue Swirl, qemu-devel, Juan Quintela

Anthony Liguori wrote:
> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>   
>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>
>>> This allows to communicate potential IRQ coalescing during delivery from
>>> the sink back to the source. Targets that support IRQ coalescing
>>> workarounds need to register handlers that return the appropriate
>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>> apply its workaround. If multiple sinks exist, the source may only
>>> consider an IRQ coalesced if all other sinks either report
>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>      
>> No real devices are interested whether any of their output lines are
>> even connected. This would introduce a new signal type, bidirectional
>> multi-level, which is not correct.
>>    
> 
> I don't think it's really an issue of correct, but I wouldn't disagree
> to a suggestion that we ought to introduce a new signal type for this
> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> similar interface as qemu_irq.

A separate type would complicate the delivery of the feedback value
across GPIO pins (as Paul requested for the RTC->HPET routing).

> 
>> I think the real solution to coalescing is put the logic inside one
>> device, in this case APIC because it has the information about irq
>> delivery. APIC could monitor incoming RTC irqs for frequency
>> information and whether they get delivered or not. If not, an internal
>> timer is installed which injects the lost irqs.

That won't fly as the IRQs will already arrive at the APIC with a
sufficiently high jitter. At the bare minimum, you need to tell the
interrupt controller about the fact that a particular IRQ should be
delivered at a specific regular rate. For this, you also need a generic
interface - nothing really "won".

Jan


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25 21:44       ` Jan Kiszka
@ 2010-05-26  8:08         ` Gleb Natapov
  2010-05-26 20:14           ` Blue Swirl
  2010-05-26 19:55         ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-26  8:08 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
> > 
> >> I think the real solution to coalescing is put the logic inside one
> >> device, in this case APIC because it has the information about irq
> >> delivery. APIC could monitor incoming RTC irqs for frequency
> >> information and whether they get delivered or not. If not, an internal
> >> timer is installed which injects the lost irqs.
> 
> That won't fly as the IRQs will already arrive at the APIC with a
> sufficiently high jitter. At the bare minimum, you need to tell the
> interrupt controller about the fact that a particular IRQ should be
> delivered at a specific regular rate. For this, you also need a generic
> interface - nothing really "won".
> 
Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
were coalesced during some period. Now the guest reprograms RTC to
1024Hz. N should be scaled accordingly otherwise reinjection will not
fix the drift. Do you propose to put this logic into the APIC too?
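
To make the scaling concrete: 3 ticks lost at 64Hz represent 3/64 s of guest time, which corresponds to 3 * 1024 / 64 = 48 ticks at 1024Hz. A minimal sketch of that conversion follows; the helper name rescale_coalesced is hypothetical, and rounding down is a choice of this sketch, not something the thread specifies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a count of coalesced (lost) ticks that was
 * accumulated at old_hz into the equivalent count at new_hz, so that the
 * amount of guest time still owed stays the same.
 */
static uint32_t rescale_coalesced(uint32_t coalesced,
                                  uint32_t old_hz, uint32_t new_hz)
{
    assert(old_hz > 0);
    /* 64-bit intermediate avoids overflow; the division rounds down. */
    return (uint32_t)(((uint64_t)coalesced * new_hz) / old_hz);
}
```

Going the other way loses precision: 10 ticks owed at 1024Hz are less than one tick at 64Hz, so this sketch would drop them unless the remainder is tracked separately.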

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25 20:16     ` Anthony Liguori
  2010-05-25 21:44       ` Jan Kiszka
@ 2010-05-26 19:49       ` Blue Swirl
  1 sibling, 0 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-26 19:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Jan Kiszka, Jan Kiszka, qemu-devel, Juan Quintela

On Tue, May 25, 2010 at 8:16 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>
>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>
>>>
>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>
>>> This allows to communicate potential IRQ coalescing during delivery from
>>> the sink back to the source. Targets that support IRQ coalescing
>>> workarounds need to register handlers that return the appropriate
>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>> apply its workaround. If multiple sinks exist, the source may only
>>> consider an IRQ coalesced if all other sinks either report
>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>
>>
>> No real devices are interested whether any of their output lines are
>> even connected. This would introduce a new signal type, bidirectional
>> multi-level, which is not correct.
>>
>
> I don't think it's really an issue of correct, but I wouldn't disagree to a
> suggestion that we ought to introduce a new signal type for this type of
> bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a similar
> interface as qemu_irq.

I'd rather try harder to find a solution for the problem without this.

>> I think the real solution to coalescing is put the logic inside one
>> device, in this case APIC because it has the information about irq
>> delivery. APIC could monitor incoming RTC irqs for frequency
>> information and whether they get delivered or not. If not, an internal
>> timer is installed which injects the lost irqs.
>>
>> Of course, no real device could do such de-coalescing, but with this
>> approach, the voodoo is contained to insides of one device, APIC.
>>
>> We should also take a step back to think what was the cause of lost
>> irqs, IIRC uneven execution rate in QEMU.
>
> Not only that.  The pathological case is where a host is limited to a 1khz
> timer frequency and the guest requests a 1khz timer frequency.  Practically
> speaking, there is no way we'll ever be able to adjust timers to reinject
> lost interrupts because of the host timer limitation.

But APIC knows when the guest has acked a timer irq and so can
immediately issue another one to cover a lost irq. APIC could also
arrange a CPU exit. The irqs would not arrive at even rate, but in
this pathological case, the host can't ever schedule any delays less
than 1ms.
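
A minimal sketch of this idea, using a toy interrupt controller model rather than the real APIC code. All names (ToyIntCtl, toy_timer_tick, toy_guest_ack) are hypothetical: ticks that arrive while the previous one is still in service are counted, and each guest acknowledgement immediately re-injects one of them.

```c
#include <assert.h>

/* Toy interrupt controller state; not the real APIC model. */
typedef struct ToyIntCtl {
    int in_service;      /* 1 while the guest has not acked the timer irq */
    unsigned pending;    /* ticks that arrived while one was in service */
    unsigned delivered;  /* ticks actually injected into the guest */
} ToyIntCtl;

/* A timer tick arrives. Returns 1 if injected now, 0 if it was deferred. */
static int toy_timer_tick(ToyIntCtl *s)
{
    if (s->in_service) {
        s->pending++;        /* would have been coalesced: remember it */
        return 0;
    }
    s->in_service = 1;
    s->delivered++;
    return 1;
}

/* The guest acks the irq. Returns 1 if a deferred tick was re-injected. */
static int toy_guest_ack(ToyIntCtl *s)
{
    assert(s->in_service);
    if (s->pending > 0) {
        s->pending--;
        s->delivered++;      /* re-inject at once; line stays in service */
        return 1;
    }
    s->in_service = 0;
    return 0;
}
```

As noted above, the re-injected ticks arrive back to back rather than at an even rate, and the sketch does not address the frequency rescaling problem raised elsewhere in the thread.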

>>  Could this be fixed or taken
>> into account in timer handling? For example, CPU loop could analyze
>> the wall clock time between CPU exits and use that to offset the
>> timers. Thus the timer frequency (in wall clock time) could be made to
>> correspond a bit more to VCPU execution rate.
>>
>
> A lot of what motivates the timer reinjection work is very old linux kernels
> that had fixed userspace timer frequencies.  On newer host kernels, it's
> probably not nearly as important except when you get into pathological cases
> like exposing a high frequency HPET timer to the guest where you cannot keep
> up with the host.

There will always be some delays and overhead because of the
emulation. We can either hide this completely from the guest (by
stretching the guest time) or we can try to keep the guest time base
in sync with the host, but then the guest can't expect native-class
performance to be available between timer irqs.

On native hardware, you could have a 1kHz timer and a kernel task
always running for max. 999us with interrupts disabled, then for at
least 1us the interrupts are enabled and the timer interrupt handler
gets a chance to run. On QEMU, such a small time window may be
impossible to emulate. No coalescing would help in this case. With a
cycle counting emulation it's possible, but the speed would probably
be slower than native. With the interrupt coalescing you may be able
to cover some easy cases.

>
> Regards,
>
> Anthony Liguori
>
>>> Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
>>> ---
>>>  hw/irq.c |   38 +++++++++++++++++++++++++++++---------
>>>  hw/irq.h |   22 +++++++++++++++-------
>>>  2 files changed, 44 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/hw/irq.c b/hw/irq.c
>>> index 7703f62..db2cce6 100644
>>> --- a/hw/irq.c
>>> +++ b/hw/irq.c
>>> @@ -26,19 +26,27 @@
>>>
>>>  struct IRQState {
>>>     qemu_irq_handler handler;
>>> +    qemu_irq_fb_handler feedback_handler;
>>>     void *opaque;
>>>     int n;
>>>  };
>>>
>>> -void qemu_set_irq(qemu_irq irq, int level)
>>> +int qemu_set_irq(qemu_irq irq, int level)
>>>  {
>>> -    if (!irq)
>>> -        return;
>>> -
>>> -    irq->handler(irq->opaque, irq->n, level);
>>> +    if (!irq) {
>>> +        return 0;
>>> +    }
>>> +    if (irq->feedback_handler) {
>>> +        return irq->feedback_handler(irq->opaque, irq->n, level);
>>> +    } else {
>>> +        irq->handler(irq->opaque, irq->n, level);
>>> +        return QEMU_IRQ_DELIVERED;
>>> +    }
>>>  }
>>>
>>> -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n)
>>> +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
>>> +                               qemu_irq_fb_handler feedback_handler,
>>> +                               void *opaque, int n)
>>>  {
>>>     qemu_irq *s;
>>>     struct IRQState *p;
>>> @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler,
>>> void *opaque, int n)
>>>     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
>>>     for (i = 0; i<  n; i++) {
>>>         p->handler = handler;
>>> +        p->feedback_handler = feedback_handler;
>>>         p->opaque = opaque;
>>>         p->n = i;
>>>         s[i] = p;
>>> @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler
>>> handler, void *opaque, int n)
>>>     return s;
>>>  }
>>>
>>> +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n)
>>> +{
>>> +    return allocate_irqs(handler, NULL, opaque, n);
>>> +}
>>> +
>>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>>> +                                      void *opaque, int n)
>>> +{
>>> +    return allocate_irqs(NULL, handler, opaque, n);
>>> +}
>>> +
>>>  void qemu_free_irqs(qemu_irq *s)
>>>  {
>>>     qemu_free(s[0]);
>>>     qemu_free(s);
>>>  }
>>>
>>> -static void qemu_notirq(void *opaque, int line, int level)
>>> +static int qemu_notirq(void *opaque, int line, int level)
>>>  {
>>>     struct IRQState *irq = opaque;
>>>
>>> -    irq->handler(irq->opaque, irq->n, !level);
>>> +    return qemu_set_irq(irq, !level);
>>>  }
>>>
>>>  qemu_irq qemu_irq_invert(qemu_irq irq)
>>>  {
>>>     /* The default state for IRQs is low, so raise the output now.  */
>>>     qemu_irq_raise(irq);
>>> -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
>>> +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
>>>  }
>>> diff --git a/hw/irq.h b/hw/irq.h
>>> index 5daae44..eee03e6 100644
>>> --- a/hw/irq.h
>>> +++ b/hw/irq.h
>>> @@ -3,15 +3,18 @@
>>>
>>>  /* Generic IRQ/GPIO pin infrastructure.  */
>>>
>>> -/* FIXME: Rmove one of these.  */
>>> +#define QEMU_IRQ_DELIVERED      0
>>> +#define QEMU_IRQ_COALESCED      (-1)
>>> +#define QEMU_IRQ_MASKED         (-2)
>>> +
>>>  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
>>> -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
>>> +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);
>>>
>>> -void qemu_set_irq(qemu_irq irq, int level);
>>> +int qemu_set_irq(qemu_irq irq, int level);
>>>
>>> -static inline void qemu_irq_raise(qemu_irq irq)
>>> +static inline int qemu_irq_raise(qemu_irq irq)
>>>  {
>>> -    qemu_set_irq(irq, 1);
>>> +    return qemu_set_irq(irq, 1);
>>>  }
>>>
>>>  static inline void qemu_irq_lower(qemu_irq irq)
>>> @@ -19,14 +22,19 @@ static inline void qemu_irq_lower(qemu_irq irq)
>>>     qemu_set_irq(irq, 0);
>>>  }
>>>
>>> -static inline void qemu_irq_pulse(qemu_irq irq)
>>> +static inline int qemu_irq_pulse(qemu_irq irq)
>>>  {
>>> -    qemu_set_irq(irq, 1);
>>> +    int ret;
>>> +
>>> +    ret = qemu_set_irq(irq, 1);
>>>     qemu_set_irq(irq, 0);
>>> +    return ret;
>>>  }
>>>
>>>  /* Returns an array of N IRQs.  */
>>>  qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
>>> n);
>>> +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
>>> +                                      void *opaque, int n);
>>>  void qemu_free_irqs(qemu_irq *s);
>>>
>>>  /* Returns a new IRQ with opposite polarity.  */
>>> --
>>> 1.6.0.2
>>>
>>>
>>>
>>
>>
>
>


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-25 21:44       ` Jan Kiszka
  2010-05-26  8:08         ` Gleb Natapov
@ 2010-05-26 19:55         ` Blue Swirl
  2010-05-26 20:09           ` Jan Kiszka
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-26 19:55 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Juan Quintela

On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Anthony Liguori wrote:
>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>
>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>
>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>> the sink back to the source. Targets that support IRQ coalescing
>>>> workarounds need to register handlers that return the appropriate
>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>> apply its workaround. If multiple sinks exist, the source may only
>>>> consider an IRQ coalesced if all other sinks either report
>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>
>>> No real devices are interested whether any of their output lines are
>>> even connected. This would introduce a new signal type, bidirectional
>>> multi-level, which is not correct.
>>>
>>
>> I don't think it's really an issue of correct, but I wouldn't disagree
>> to a suggestion that we ought to introduce a new signal type for this
>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> similar interface as qemu_irq.
>
> A separate type would complicate the delivery of the feedback value
> across GPIO pins (as Paul requested for the RTC->HPET routing).
>
>>
>>> I think the real solution to coalescing is put the logic inside one
>>> device, in this case APIC because it has the information about irq
>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>> information and whether they get delivered or not. If not, an internal
>>> timer is installed which injects the lost irqs.
>
> That won't fly as the IRQs will already arrive at the APIC with a
> sufficiently high jitter. At the bare minimum, you need to tell the
> interrupt controller about the fact that a particular IRQ should be
> delivered at a specific regular rate. For this, you also need a generic
> interface - nothing really "won".

OK, let's simplify: just reinject at next possible chance. No need to
monitor or tell anything.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 19:55         ` Blue Swirl
@ 2010-05-26 20:09           ` Jan Kiszka
  2010-05-26 20:35             ` Blue Swirl
  2010-05-27  5:58             ` Gleb Natapov
  0 siblings, 2 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-26 20:09 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Juan Quintela

Blue Swirl wrote:
> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Anthony Liguori wrote:
>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>
>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>
>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>> workarounds need to register handlers that return the appropriate
>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>> consider an IRQ coalesced if all other sinks either report
>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>
>>>> No real devices are interested whether any of their output lines are
>>>> even connected. This would introduce a new signal type, bidirectional
>>>> multi-level, which is not correct.
>>>>
>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>> to a suggestion that we ought to introduce a new signal type for this
>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>> similar interface as qemu_irq.
>> A separate type would complicate the delivery of the feedback value
>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>
>>>> I think the real solution to coalescing is put the logic inside one
>>>> device, in this case APIC because it has the information about irq
>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>> information and whether they get delivered or not. If not, an internal
>>>> timer is installed which injects the lost irqs.
>> That won't fly as the IRQs will already arrive at the APIC with a
>> sufficiently high jitter. At the bare minimum, you need to tell the
>> interrupt controller about the fact that a particular IRQ should be
>> delivered at a specific regular rate. For this, you also need a generic
>> interface - nothing really "won".
> 
> OK, let's simplify: just reinject at next possible chance. No need to
> monitor or tell anything.

There are guests that won't like this (I know of one in-house, but
others may even have more examples), specifically if you end up firing
multiple IRQs in a row due to a longer backlog. For that reason, the RTC
spreads the reinjection according to the current rate.

And even if the rate did not matter, the APIC would still have to know
about the fact that an IRQ is really periodic and does not only appear
as such for a certain interval. This really does not sound like
simplifying things or even making them cleaner.

Jan




* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26  8:08         ` Gleb Natapov
@ 2010-05-26 20:14           ` Blue Swirl
  2010-05-27  5:42             ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-26 20:14 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
>> >
>> >> I think the real solution to coalescing is put the logic inside one
>> >> device, in this case APIC because it has the information about irq
>> >> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> information and whether they get delivered or not. If not, an internal
>> >> timer is installed which injects the lost irqs.
>>
>> That won't fly as the IRQs will already arrive at the APIC with a
>> sufficiently high jitter. At the bare minimum, you need to tell the
>> interrupt controller about the fact that a particular IRQ should be
>> delivered at a specific regular rate. For this, you also need a generic
>> interface - nothing really "won".
>>
> Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
> were coalesced during some period. Now the guest reprograms RTC to
> 1024Hz. N should be scaled accordingly otherwise reinjection will not
> fix the drift. Do you propose to put this logic into APIC too?

Interesting case, I don't think this is handled by the current code
either. Could this happen if the target machine does not have HPET and
the guest runs a tickless kernel?

I think the guest would be buggy to reprogram the timer without
checking that the interrupt from the previous timer frequency won't
interfere, for example by stopping the clock, or doing the
reprogramming only in the timer interrupt handler. Otherwise the period
may be unpredictable for one period, which means that the time base
has been lost.

But let's consider how this could be handled with the current code (or
with the magical interrupts). When doing the scaling, the guest, while
reprogramming, is unaware of the old queued interrupts with the
previous rate. Why would scaling these be more correct? I'd think the
old ones should be just reinjected ASAP without any scaling.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:09           ` Jan Kiszka
@ 2010-05-26 20:35             ` Blue Swirl
  2010-05-26 22:35               ` Jan Kiszka
                                 ` (2 more replies)
  2010-05-27  5:58             ` Gleb Natapov
  1 sibling, 3 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-26 20:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Anthony Liguori wrote:
>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>
>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>
>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>> workarounds need to register handlers that return the appropriate
>>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>
>>>>> No real devices are interested whether any of their output lines are
>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>> multi-level, which is not correct.
>>>>>
>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>> to a suggestion that we ought to introduce a new signal type for this
>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>> similar interface as qemu_irq.
>>> A separate type would complicate the delivery of the feedback value
>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>
>>>>> I think the real solution to coalescing is put the logic inside one
>>>>> device, in this case APIC because it has the information about irq
>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>> information and whether they get delivered or not. If not, an internal
>>>>> timer is installed which injects the lost irqs.
>>> That won't fly as the IRQs will already arrive at the APIC with a
>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>> interrupt controller about the fact that a particular IRQ should be
>>> delivered at a specific regular rate. For this, you also need a generic
>>> interface - nothing really "won".
>>
>> OK, let's simplify: just reinject at next possible chance. No need to
>> monitor or tell anything.
>
> There are guests that won't like this (I know of one in-house, but
> others may even have more examples), specifically if you end up firing
> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> spreads the reinjection according to the current rate.

Then reinject with a constant delay, or at the next CPU exit. Such buggy
guests could also be assisted with special handling (like win2k
install hack), for example guest instructions could be counted
(approximately, for example using TB size or TSC) and only inject
after at least N instructions have passed.

> And even if the rate did not matter, the APIC would still have to know
> about the fact that an IRQ is really periodic and does not only appear
> as such for a certain interval. This really does not sound like
> simplifying things or even make them cleaner.

It would, the voodoo would be contained only in APIC, RTC would be
just like any other device. With the bidirectional irqs, this voodoo
would probably eventually spread to many other devices. The logical
conclusion of that would be a system where all devices would be
careful not to disturb the guest at the wrong moment because that would
trigger a bug.

At the other extreme, would it be possible to make the educated guests
aware of the virtualization also in clock aspect: virtio-clock?


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:35             ` Blue Swirl
@ 2010-05-26 22:35               ` Jan Kiszka
  2010-05-26 23:26               ` Paul Brook
  2010-05-27  6:13               ` Gleb Natapov
  2 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-26 22:35 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Juan Quintela

Blue Swirl wrote:
>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>> device, in this case APIC because it has the information about irq
>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>> timer is installed which injects the lost irqs.
>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>> interrupt controller about the fact that a particular IRQ should be
>>>> delivered at a specific regular rate. For this, you also need a generic
>>>> interface - nothing really "won".
>>> OK, let's simplify: just reinject at next possible chance. No need to
>>> monitor or tell anything.
>> There are guests that won't like this (I know of one in-house, but
>> others may even have more examples), specifically if you end up firing
>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> spreads the reinjection according to the current rate.
> 
> Then reinject with a constant delay, or next CPU exit. Such buggy
> guests could also be assisted with special handling (like win2k
> install hack), for example guest instructions could be counted
> (approximately, for example using TB size or TSC) and only inject
> after at least N instructions have passed.

Let's assume that would be an alternative...

> 
>> And even if the rate did not matter, the APIC would still have to know
>> about the fact that an IRQ is really periodic and does not only appear
>> as such for a certain interval. This really does not sound like
>> simplifying things or even make them cleaner.
> 
> It would, the voodoo would be contained only in APIC, RTC would be
> just like any other device.

...we would still need that communication channel from the RTC to the
APIC to tell the latter about entering/leaving periodic IRQ generation.
And as we don't want the RTC requiring any clue about who's finally
delivering the IRQ, we still need some generic interface.

This cannot be contained in the APIC - even leaving the PIC as another
IRQ delivery service aside for now, and maybe other virtualized
architectures.
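
For illustration, the generic interface argued for above can be reduced to a
source that only looks at the return code of the delivery call. What follows
is a minimal standalone sketch, not code from the patch: the QEMU_IRQ_* values
are copied from the hw/irq.h hunk, while ToySink, ToyTimer and the toy_*
functions are hypothetical names invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes as defined in the hw/irq.h hunk of the patch. */
#define QEMU_IRQ_DELIVERED      0
#define QEMU_IRQ_COALESCED      (-1)
#define QEMU_IRQ_MASKED         (-2)

/* Hypothetical sink: a one-deep interrupt latch standing in for an
 * interrupt controller (APIC, PIC, ...). */
typedef struct ToySink {
    int pending;    /* previous IRQ not yet acknowledged by the guest */
    int masked;     /* line masked by the guest */
} ToySink;

/* Feedback-style handler: reports what happened to the edge. */
static int toy_sink_set_irq(ToySink *s, int level)
{
    if (!level) {
        return QEMU_IRQ_DELIVERED;  /* lowering the line never coalesces */
    }
    if (s->masked) {
        return QEMU_IRQ_MASKED;
    }
    if (s->pending) {
        return QEMU_IRQ_COALESCED;  /* merged with the pending IRQ */
    }
    s->pending = 1;
    return QEMU_IRQ_DELIVERED;
}

/* Guest acknowledges the interrupt, freeing the latch. */
static void toy_sink_ack(ToySink *s)
{
    s->pending = 0;
}

/* Hypothetical periodic source: knows nothing about the sink's type. */
typedef struct ToyTimer {
    ToySink *sink;
    unsigned irq_coalesced;     /* backlog a workaround could reinject */
} ToyTimer;

/* One periodic tick: pulse the line and record a coalesced delivery. */
static void toy_timer_tick(ToyTimer *t)
{
    assert(t->sink != NULL);
    if (toy_sink_set_irq(t->sink, 1) == QEMU_IRQ_COALESCED) {
        t->irq_coalesced++;
    }
    toy_sink_set_irq(t->sink, 0);
}
```

The timer never learns whether the sink is an APIC, a PIC or something else;
it only counts QEMU_IRQ_COALESCED results, which is the backlog a reinjection
workaround would later work off. A masked line is deliberately not counted.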

> With the bidirectional irqs, this voodoo
> would probably eventually spread to many other devices.

Yes, the HPET was reported to need this feature as well.

> The logical
> conclusion of that would be a system where all devices would be
> careful not to disturb the guest at wrong moment because that would
> trigger a bug.
> 
> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?

Paravirtual clocks have been around for ages (e.g. kvm-clock), just
unfortunately not long enough to find them supported by legacy guests.

Jan




* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:35             ` Blue Swirl
  2010-05-26 22:35               ` Jan Kiszka
@ 2010-05-26 23:26               ` Paul Brook
  2010-05-27 17:56                 ` Blue Swirl
  2010-05-27  6:13               ` Gleb Natapov
  2 siblings, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-26 23:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: Blue Swirl, Jan Kiszka, Juan Quintela

> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?

The guest doesn't even need to be aware of virtualization. It just needs to be 
able to accommodate the lack of guaranteed realtime behavior.

The fundamental problem here is that some guest operating systems assume that 
the hardware provides certain realtime guarantees with respect to execution of 
interrupt handlers.  In particular they assume that the CPU will always be 
able to complete execution of the timer IRQ handler before the periodic timer 
triggers again.  In most virtualized environments you have absolutely no 
guarantee of realtime response.

With Linux guests this was solved a long time ago by the introduction of 
tickless kernels.  These separate the timekeeping from wakeup events, so it 
doesn't matter if several wakeup triggers end up getting merged (either at the 
hardware level or via top/bottom half guest IRQ handlers).
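
The difference between counting ticks and separating timekeeping from wakeups
can be sketched as follows. This is an illustrative model only, not kernel
code: TickClock, CounterClock and their functions are made-up names, and the
"hardware counter" is simply a millisecond value passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Tick-counting guest: time advances only when an interrupt is seen. */
typedef struct TickClock {
    uint64_t ticks;     /* number of timer interrupts handled */
    uint32_t hz;        /* programmed timer frequency */
} TickClock;

static void tick_clock_irq(TickClock *c)
{
    c->ticks++;
}

static uint64_t tick_clock_now_ms(const TickClock *c)
{
    assert(c->hz > 0);
    return c->ticks * 1000 / c->hz;
}

/* Tickless-style guest: the interrupt is only a wakeup; the time is
 * read from a free-running counter inside the handler. */
typedef struct CounterClock {
    uint64_t now_ms;
} CounterClock;

static void counter_clock_irq(CounterClock *c, uint64_t hw_counter_ms)
{
    c->now_ms = hw_counter_ms;
}
```

If, say, only every fourth interrupt of a 1kHz timer reaches the guest during
one second, the tick-counting clock has advanced by 250ms, while the
counter-based clock is correct as of the last interrupt it handled.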


It's worth mentioning that this problem also occurs on real hardware, 
typically due to lame hardware/drivers which end up masking interrupts or 
otherwise stalling the CPU for long periods of time.


The PIT hack attempts to work around broken guests by adding artificial latency 
to the timer event, ensuring that the guest "sees" them all.  Unfortunately 
guests vary on when it is safe for them to see the next timer event, and 
trying to observe this behavior involves potentially harmful heuristics and 
collusion between unrelated devices (e.g. interrupt controller and timer).

In some cases we don't even do that, and just reschedule the event some 
arbitrarily small amount of time later. This assumes the guest does useful 
work in that time. In a single threaded environment this is probably true - 
qemu got enough CPU to inject the first interrupt, so will probably manage to 
execute some guest code before the end of its timeslice. In an environment 
where interrupt processing/delivery and execution of the guest code happen in 
different threads this becomes increasingly likely to fail.

Paul


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:14           ` Blue Swirl
@ 2010-05-27  5:42             ` Gleb Natapov
  0 siblings, 0 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-27  5:42 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 08:14:38PM +0000, Blue Swirl wrote:
> On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
> >> >
> >> >> I think the real solution to coalescing is put the logic inside one
> >> >> device, in this case APIC because it has the information about irq
> >> >> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >> information and whether they get delivered or not. If not, an internal
> >> >> timer is installed which injects the lost irqs.
> >>
> >> That won't fly as the IRQs will already arrive at the APIC with a
> >> sufficiently high jitter. At the bare minimum, you need to tell the
> >> interrupt controller about the fact that a particular IRQ should be
> >> delivered at a specific regular rate. For this, you also need a generic
> >> interface - nothing really "won".
> >>
> > Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
> > were coalesced during some period. Now the guest reprograms RTC to
> > 1024Hz. N should be scaled accordingly otherwise reinjection will not
> > fix the drift. Do you propose to put this logic into APIC too?
> 
> Interesting case, I don't think this is handled by the current code
> either.
Of course it is:
        if(period != s->period)
            s->irq_coalesced = (s->irq_coalesced * s->period) / period;

Qemu is not a toy anymore and deals with real loads in production now.

>          Could this happen if the target machine does not have HPET and
> the guest runs a tickless kernel?
> 
Windows doesn't have a tickless kernel and Linux does not use RTC as a time
source.

> I think the guest would be buggy to reprogram the timer without
> checking that the interrupt from the previous timer frequency won't
> interfere, for example by stopping the clock, or doing the
> reprogramming only at timer interrupt handler. Otherwise the period
> may be unpredictable for one period, which means that the time base
> has been lost.
The guest knows nothing about an interrupt that was not yet delivered. And
Windows doesn't stop the clock while changing frequency.

> 
> But let's consider how this could be handled with the current code (or
> with the magical interrupts). When doing the scaling, the guest, while
> reprogramming, is unaware of the old queued interrupts with the
> previous rate. Why would scaling these be more correct? I'd think the
> old ones should be just reinjected ASAP without any scaling.
If the old ones are reinjected as is, the time calculation in the guest
will be incorrect. Windows is really stupid when it comes to time
keeping. It just counts interrupts and multiplies the count by the clock
period to calculate the time. So if qemu accumulated 64 coalesced
interrupts while the clock frequency was 64Hz, it means that the guest is
1 second behind. If the guest changes the frequency to 1024Hz, reinjecting
64 interrupts will not fix the guest time; 1024 interrupts should be
injected instead.
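
As a worked example of that arithmetic: assuming periods are expressed in
ticks of a 32768Hz base clock (so 64Hz is a period of 512 and 1024Hz is a
period of 32), the rescaling quoted above can be written as a small helper.
rescale_coalesced is a hypothetical name used for this sketch, not a QEMU
function.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a backlog of coalesced interrupts counted at old_period into
 * the equivalent number of interrupts at new_period, so that the same
 * amount of guest time is made up. Both periods use the same unit. */
static uint32_t rescale_coalesced(uint32_t irq_coalesced,
                                  uint32_t old_period, uint32_t new_period)
{
    assert(new_period > 0);
    /* Widen to 64 bits so the intermediate product cannot overflow. */
    return (uint32_t)(((uint64_t)irq_coalesced * old_period) / new_period);
}
```

With 64 interrupts lost at a period of 512, the backlog represents
64 * 512 = 32768 base ticks, i.e. one second. At a period of 32 the same
second takes 32768 / 32 = 1024 interrupts, matching the numbers above.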

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:09           ` Jan Kiszka
  2010-05-26 20:35             ` Blue Swirl
@ 2010-05-27  5:58             ` Gleb Natapov
  1 sibling, 0 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-27  5:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 10:09:52PM +0200, Jan Kiszka wrote:
> Blue Swirl wrote:
> > On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> Anthony Liguori wrote:
> >>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >>>>
> >>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >>>>>
> >>>>> This allows to communicate potential IRQ coalescing during delivery from
> >>>>> the sink back to the source. Targets that support IRQ coalescing
> >>>>> workarounds need to register handlers that return the appropriate
> >>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
> >>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >>>>> apply its workaround. If multiple sinks exist, the source may only
> >>>>> consider an IRQ coalesced if all other sinks either report
> >>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>>>>
> >>>> No real devices are interested whether any of their output lines are
> >>>> even connected. This would introduce a new signal type, bidirectional
> >>>> multi-level, which is not correct.
> >>>>
> >>> I don't think it's really an issue of correct, but I wouldn't disagree
> >>> to a suggestion that we ought to introduce a new signal type for this
> >>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >>> similar interface as qemu_irq.
> >> A separate type would complicate the delivery of the feedback value
> >> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >>
> >>>> I think the real solution to coalescing is put the logic inside one
> >>>> device, in this case APIC because it has the information about irq
> >>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >>>> information and whether they get delivered or not. If not, an internal
> >>>> timer is installed which injects the lost irqs.
> >> That won't fly as the IRQs will already arrive at the APIC with a
> >> sufficiently high jitter. At the bare minimum, you need to tell the
> >> interrupt controller about the fact that a particular IRQ should be
> >> delivered at a specific regular rate. For this, you also need a generic
> >> interface - nothing really "won".
> > 
> > OK, let's simplify: just reinject at next possible chance. No need to
> > monitor or tell anything.
> 
> There are guests that won't like this (I know of one in-house, but
> others may even have more examples), specifically if you end up firing
> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> spreads the reinjection according to the current rate.
> 
No need to look far for such guests. See commit dd17765b5f77ca.

> And even if the rate did not matter, the APIC would still have to know
> about the fact that an IRQ is really periodic and does not only appear
> as such for a certain interval. This really does not sound like
> simplifying things or even make them cleaner.
> 
> Jan
> 



--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 20:35             ` Blue Swirl
  2010-05-26 22:35               ` Jan Kiszka
  2010-05-26 23:26               ` Paul Brook
@ 2010-05-27  6:13               ` Gleb Natapov
  2010-05-27 18:37                 ` Blue Swirl
  2 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-27  6:13 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> > Blue Swirl wrote:
> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >>> Anthony Liguori wrote:
> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >>>>>
> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >>>>>>
> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >>>>>> workarounds need to register handlers that return the appropriate
> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >>>>>> consider an IRQ coalesced if all other sinks either report
> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >>>>>>
> >>>>> No real devices are interested whether any of their output lines are
> >>>>> even connected. This would introduce a new signal type, bidirectional
> >>>>> multi-level, which is not correct.
> >>>>>
> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >>>> to a suggestion that we ought to introduce a new signal type for this
> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >>>> similar interface as qemu_irq.
> >>> A separate type would complicate the delivery of the feedback value
> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >>>
> >>>>> I think the real solution to coalescing is put the logic inside one
> >>>>> device, in this case APIC because it has the information about irq
> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >>>>> information and whether they get delivered or not. If not, an internal
> >>>>> timer is installed which injects the lost irqs.
> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >>> interrupt controller about the fact that a particular IRQ should be
> >>> delivered at a specific regular rate. For this, you also need a generic
> >>> interface - nothing really "won".
> >>
> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> monitor or tell anything.
> >
> > There are guests that won't like this (I know of one in-house, but
> > others may even have more examples), specifically if you end up firing
> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> > spreads the reinjection according to the current rate.
> 
> Then reinject with a constant delay, or next CPU exit. Such buggy
If the guest's timer frequency is the same as the host's timer frequency,
you can't reinject with a constant delay. That is why the current code
mixes two approaches: reinject M interrupts in a row, then delay.
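
To illustrate the mixed approach, here is a minimal sketch in C of a
burst-then-delay policy. It is not the code in the tree: ReinjectState,
reinject_next() and the burst/delay parameters are hypothetical names
chosen for illustration, and the caller is assumed to own the actual
timer and IRQ line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one periodic timer source.
 * None of these names come from QEMU. */
typedef struct ReinjectState {
    uint32_t backlog;     /* coalesced ticks still owed to the guest */
    uint32_t burst_max;   /* "M": ticks injected back-to-back per round */
    uint32_t burst_left;  /* ticks remaining in the current burst */
    int64_t  delay_ns;    /* pause inserted once a burst is used up */
} ReinjectState;

static void reinject_init(ReinjectState *s, uint32_t burst_max,
                          int64_t delay_ns)
{
    assert(burst_max > 0 && delay_ns > 0);
    s->backlog = 0;
    s->burst_max = burst_max;
    s->burst_left = burst_max;
    s->delay_ns = delay_ns;
}

/* Record one tick that the sink reported as coalesced. */
static void reinject_note_coalesced(ReinjectState *s)
{
    s->backlog++;
}

/*
 * Decide what to do about the next owed tick.
 * Returns  0 -> inject one tick right now (still inside a burst),
 *         >0 -> wait that many nanoseconds, then call again,
 *         -1 -> nothing is owed.
 */
static int64_t reinject_next(ReinjectState *s)
{
    if (s->backlog == 0) {
        s->burst_left = s->burst_max;  /* idle: next backlog gets a full burst */
        return -1;
    }
    if (s->burst_left == 0) {
        s->burst_left = s->burst_max;  /* a new burst starts after the delay */
        return s->delay_ns;
    }
    s->burst_left--;
    s->backlog--;
    return 0;
}
```

A caller would invoke reinject_next() from its timer callback: on 0 it
raises the IRQ and calls again, on a positive value it re-arms its timer
for that long. How M and the delay should be chosen is exactly the
guest-dependent part, so both are left as parameters here.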

> guests could also be assisted with special handling (like win2k
> install hack), for example guest instructions could be counted
> (approximately, for example using TB size or TSC) and only inject
> after at least N instructions have passed.
Guest instructions cannot be easily counted in KVM (it can be done more
or less reliably using perf counters, maybe).

> 
> > And even if the rate did not matter, the APIC would still have to know
> > about the fact that an IRQ is really periodic and does not only appear
> > as such for a certain interval. This really does not sound like
> > simplifying things or even make them cleaner.
> 
> It would, the voodoo would be contained only in APIC, RTC would be
> just like any other device. With the bidirectional irqs, this voodoo
> would probably eventually spread to many other devices. The logical
> conclusion of that would be a system where all devices would be
> careful not to disturb the guest at wrong moment because that would
> trigger a bug.
> 
This voodoo will be so complex and unreliable that it will make the RTC
hack pale in comparison (and I still don't see how you are going to make
it actually work). The fact is that a timer device is not "just like any
other device" in the virtual world. Any other device is easy: you just
implement the spec as closely as possible and everything works. For a
time source device this is not enough. You can implement RTC+HPET to the
letter and your guest will still drift like crazy.

> At the other extreme, would it be possible to make the educated guests
> aware of the virtualization also in clock aspect: virtio-clock?
No.

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-26 23:26               ` Paul Brook
@ 2010-05-27 17:56                 ` Blue Swirl
  2010-05-27 18:31                   ` Jan Kiszka
  2010-05-27 22:21                   ` Paul Brook
  0 siblings, 2 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-27 17:56 UTC (permalink / raw)
  To: Paul Brook; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>> At the other extreme, would it be possible to make the educated guests
>> aware of the virtualization also in clock aspect: virtio-clock?
>
> The guest doesn't even need to be aware of virtualization. It just needs to be
> able to accommodate the lack of guaranteed realtime behavior.
>
> The fundamental problem here is that some guest operating systems assume that
> the hardware provides certain realtime guarantees with respect to execution of
> interrupt handlers.  In particular they assume that the CPU will always be
> able to complete execution of the timer IRQ handler before the periodic timer
> triggers again.  In most virtualized environments you have absolutely no
> guarantee of realtime response.
>
> With Linux guests this was solved a long time ago by the introduction of
> tickless kernels.  These separate the timekeeping from wakeup events, so it
> doesn't matter if several wakeup triggers end up getting merged (either at the
> hardware level or via top/bottom half guest IRQ handlers).
>
>
> It's worth mentioning that this problem also occurs on real hardware,
> typically due to lame hardware/drivers which end up masking interrupts or
> otherwise stall the CPU for long periods of time.
>
>
> The PIT hack attempts to workaround broken guests by adding artificial latency
> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
> guests vary on when it is safe for them to see the next timer event, and
> trying to observe this behavior involves potentially harmful heuristics and
> collusion between unrelated devices (e.g. interrupt controller and timer).
>
> In some cases we don't even do that, and just reschedule the event some
> arbitrarily small amount of time later. This assumes the guest to do useful
> work in that time. In a single threaded environment this is probably true -
> qemu got enough CPU to inject the first interrupt, so will probably manage to
> execute some guest code before the end of its timeslice. In an environment
> where interrupt processing/delivery and execution of the guest code happen in
> different threads this becomes increasingly likely to fail.

So any voodoo around timer events is doomed to fail in some cases.
What amount of hacks do we want then? Is there any generic
solution, like slowing down the guest system to the point where we can
guarantee the interrupt rate vs. CPU execution speed?


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 17:56                 ` Blue Swirl
@ 2010-05-27 18:31                   ` Jan Kiszka
  2010-05-27 18:53                     ` Blue Swirl
  2010-05-27 22:21                   ` Paul Brook
  1 sibling, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-27 18:31 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Juan Quintela, Paul Brook, qemu-devel

Blue Swirl wrote:
> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>> At the other extreme, would it be possible to make the educated guests
>>> aware of the virtualization also in clock aspect: virtio-clock?
>> The guest doesn't even need to be aware of virtualization. It just needs to be
>> able to accommodate the lack of guaranteed realtime behavior.
>>
>> The fundamental problem here is that some guest operating systems assume that
>> the hardware provides certain realtime guarantees with respect to execution of
>> interrupt handlers.  In particular they assume that the CPU will always be
>> able to complete execution of the timer IRQ handler before the periodic timer
>> triggers again.  In most virtualized environments you have absolutely no
>> guarantee of realtime response.
>>
>> With Linux guests this was solved a long time ago by the introduction of
>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>> doesn't matter if several wakeup triggers end up getting merged (either at the
>> hardware level or via top/bottom half guest IRQ handlers).
>>
>>
>> It's worth mentioning that this problem also occurs on real hardware,
>> typically due to lame hardware/drivers which end up masking interrupts or
>> otherwise stall the CPU for long periods of time.
>>
>>
>> The PIT hack attempts to workaround broken guests by adding artificial latency
>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>> guests vary on when it is safe for them to see the next timer event, and
>> trying to observe this behavior involves potentially harmful heuristics and
>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>
>> In some cases we don't even do that, and just reschedule the event some
>> arbitrarily small amount of time later. This assumes the guest to do useful
>> work in that time. In a single threaded environment this is probably true -
>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>> execute some guest code before the end of its timeslice. In an environment
>> where interrupt processing/delivery and execution of the guest code happen in
>> different threads this becomes increasingly likely to fail.
> 
> So any voodoo around timer events is doomed to fail in some cases.
> What amount of hacks do we want then? Is there any generic

The aim of this patch is to reduce the amount of existing and upcoming
hacks. It may still require some refinements, but I think we haven't
found any smarter approach yet that fits existing use cases.
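
For readers following along, the feedback rule from the patch
description can be sketched as below. This is only an illustration under
assumptions: the IrqFeedback enum and combine_feedback() are made-up
names, FB_COALESCED and FB_MASKED stand in for the QEMU_IRQ_COALESCED
and QEMU_IRQ_MASKED codes mentioned in the patch description, and
FB_DELIVERED is an assumed code for a normal delivery.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the QEMU_IRQ_* feedback codes; the names and numeric
 * values are assumptions made for this sketch only. */
typedef enum {
    FB_DELIVERED,  /* assumed: the sink accepted the interrupt */
    FB_MASKED,     /* the sink has this line masked */
    FB_COALESCED   /* the sink merged it with a still-pending one */
} IrqFeedback;

/*
 * Combine the feedback of all sinks of one IRQ line. The source may
 * only treat the IRQ as coalesced if at least one sink coalesced it
 * and every other sink reported either coalesced or masked.
 */
static IrqFeedback combine_feedback(const IrqFeedback *results, size_t n)
{
    int saw_coalesced = 0;
    size_t i;

    assert(results != NULL && n > 0);
    for (i = 0; i < n; i++) {
        if (results[i] == FB_DELIVERED) {
            return FB_DELIVERED;  /* one real delivery is enough */
        }
        if (results[i] == FB_COALESCED) {
            saw_coalesced = 1;
        }
    }
    return saw_coalesced ? FB_COALESCED : FB_MASKED;
}
```

A redirecting device (for example a GPIO pin routing one line to
another) would pass the code of its downstream handler through
unchanged, so the source only ever looks at the combined value.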

> solution, like slowing down the guest system to the point where we can
> guarantee the interrupt rate vs. CPU execution speed?

That's generally a non-option in virtualized production environments.
Specifically if the guest system lost interrupts due to host
overcommitment, you do not want it to slow down even further.

Jan



* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27  6:13               ` Gleb Natapov
@ 2010-05-27 18:37                 ` Blue Swirl
  2010-05-28  7:31                   ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-27 18:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/27 Gleb Natapov <gleb@redhat.com>:
> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> > Blue Swirl wrote:
>> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >>> Anthony Liguori wrote:
>> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >>>>>
>> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >>>>>>
>> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >>>>>> workarounds need to register handlers that return the appropriate
>> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >>>>>>
>> >>>>> No real devices are interested whether any of their output lines are
>> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >>>>> multi-level, which is not correct.
>> >>>>>
>> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >>>> similar interface as qemu_irq.
>> >>> A separate type would complicate the delivery of the feedback value
>> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >>>
>> >>>>> I think the real solution to coalescing is put the logic inside one
>> >>>>> device, in this case APIC because it has the information about irq
>> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >>>>> information and whether they get delivered or not. If not, an internal
>> >>>>> timer is installed which injects the lost irqs.
>> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >>> interrupt controller about the fact that a particular IRQ should be
>> >>> delivered at a specific regular rate. For this, you also need a generic
>> >>> interface - nothing really "won".
>> >>
>> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> monitor or tell anything.
>> >
>> > There are guests that won't like this (I know of one in-house, but
>> > others may even have more examples), specifically if you end up firing
>> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> > spreads the reinjection according to the current rate.
>>
>> Then reinject with a constant delay, or next CPU exit. Such buggy
> If guest's time frequency is the same as host time frequency you can't
> reinject with constant delay. That is why current code mixes two
> approaches: reinject M interrupts in a row then delay.

This approach can also be used by an APIC-only version.

>> guests could also be assisted with special handling (like win2k
>> install hack), for example guest instructions could be counted
>> (approximately, for example using TB size or TSC) and only inject
>> after at least N instructions have passed.
> Guest instructions cannot be easily counted in KVM (it can be done more
> or less reliably using perf counters, maybe).

Aren't there any debug registers or perf counters that can generate
an interrupt after some number of instructions have been executed?

>>
>> > And even if the rate did not matter, the APIC would still have to know
>> > about the fact that an IRQ is really periodic and does not only appear
>> > as such for a certain interval. This really does not sound like
>> > simplifying things or even make them cleaner.
>>
>> It would, the voodoo would be contained only in APIC, RTC would be
>> just like any other device. With the bidirectional irqs, this voodoo
>> would probably eventually spread to many other devices. The logical
>> conclusion of that would be a system where all devices would be
>> careful not to disturb the guest at wrong moment because that would
>> trigger a bug.
>>
> This voodoo will be so complex and unreliable that it will make RTC hack
> pale in comparison (and I still don't see how you are going to make it
> actually work).

Implement everything inside APIC: only coalescing and reinjection.
Maybe that version would not bend over backwards as much as the current
one to cater for buggy hosts.

> The fact is that timer device is not "just like any
> other device" in virtual world. Any other device is easy: you just
> implement spec as close as possible and everything works. For time
> source device this is not enough. You can implement RTC+HPET to the
> letter and your guest will drift like crazy.

It's doable: a cycle-accurate emulator will not cause any drift,
without any voodoo. The interrupts would arrive after executing the same
instruction as on real HW. For emulating any sufficiently buggy
guests in any sufficiently desperate low-resource conditions, this may
be the only option that will always work.

>
>> At the other extreme, would it be possible to make the educated guests
>> aware of the virtualization also in clock aspect: virtio-clock?
> No.
>
> --
>                        Gleb.
>


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 18:31                   ` Jan Kiszka
@ 2010-05-27 18:53                     ` Blue Swirl
  2010-05-27 19:08                       ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-27 18:53 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Juan Quintela, Paul Brook, qemu-devel

On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>> At the other extreme, would it be possible to make the educated guests
>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>> able to accommodate the lack of guaranteed realtime behavior.
>>>
>>> The fundamental problem here is that some guest operating systems assume that
>>> the hardware provides certain realtime guarantees with respect to execution of
>>> interrupt handlers.  In particular they assume that the CPU will always be
>>> able to complete execution of the timer IRQ handler before the periodic timer
>>> triggers again.  In most virtualized environments you have absolutely no
>>> guarantee of realtime response.
>>>
>>> With Linux guests this was solved a long time ago by the introduction of
>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>> hardware level or via top/bottom half guest IRQ handlers).
>>>
>>>
>>> It's worth mentioning that this problem also occurs on real hardware,
>>> typically due to lame hardware/drivers which end up masking interrupts or
>>> otherwise stall the CPU for long periods of time.
>>>
>>>
>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>> guests vary on when it is safe for them to see the next timer event, and
>>> trying to observe this behavior involves potentially harmful heuristics and
>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>
>>> In some cases we don't even do that, and just reschedule the event some
>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>> work in that time. In a single threaded environment this is probably true -
>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>> execute some guest code before the end of its timeslice. In an environment
>>> where interrupt processing/delivery and execution of the guest code happen in
>>> different threads this becomes increasingly likely to fail.
>>
>> So any voodoo around timer events is doomed to fail in some cases.
>> What amount of hacks do we want then? Is there any generic
>
> The aim of this patch is to reduce the amount of existing and upcoming
> hacks. It may still require some refinements, but I think we haven't
> found any smarter approach yet that fits existing use cases.

I don't feel we have tried other possibilities hard enough.

>> solution, like slowing down the guest system to the point where we can
>> guarantee the interrupt rate vs. CPU execution speed?
>
> That's generally a non-option in virtualized production environments.
> Specifically if the guest system lost interrupts due to host
> overcommitment, you do not want it to slow down even further.

I meant that the guest time could be scaled down, for example 2s of
wall clock time would be presented to the guest as 1s. Then the number
of CPU cycles between timer interrupts would increase and hopefully
the guest can keep up. If the guest sleeps, the time base could be
accelerated to catch up with wall clock and then set back to 1:1 rate.

Slowing down could be triggered by measuring the guest load (for
example, by checking for the presence of halt instructions). If it's
close to 1, time would be slowed down. If the guest starts to issue halt
instructions because it's more idle, we can increase the speed.

If this approach worked, even APIC could be made ignorant about
coalescing voodoo so it should be a major cleanup.
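
As a rough sketch of what I mean (nothing like this exists in the tree;
ScaledClock, pick_rate_percent() and the 90% / 50% / 200% numbers are
all made up for illustration, and the load figure is assumed to come
from something like counting halt instructions):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scaled time base; names are not taken from QEMU. */
typedef struct ScaledClock {
    int64_t guest_ns;  /* time presented to the guest so far */
    int64_t wall_ns;   /* wall clock time elapsed so far */
} ScaledClock;

/* Choose the guest clock rate as a percentage of wall clock speed. */
static int pick_rate_percent(const ScaledClock *c, int load_percent)
{
    if (load_percent >= 90) {
        return 50;   /* guest saturated: 2s of wall time become 1s */
    }
    if (c->guest_ns < c->wall_ns) {
        return 200;  /* guest has idle time and is behind: catch up */
    }
    return 100;      /* in sync: back to the 1:1 rate */
}

/* Advance both clocks by one wall clock interval. */
static void scaled_clock_advance(ScaledClock *c, int64_t wall_delta_ns,
                                 int load_percent)
{
    int rate;

    assert(wall_delta_ns >= 0);
    assert(load_percent >= 0 && load_percent <= 100);

    rate = pick_rate_percent(c, load_percent);
    c->wall_ns += wall_delta_ns;
    c->guest_ns += wall_delta_ns * rate / 100;
    if (c->guest_ns > c->wall_ns) {
        c->guest_ns = c->wall_ns;  /* never run ahead of wall clock */
    }
}
```

With this, a fully loaded guest sees 500 ns for 1000 ns of wall time,
and once it idles the clock runs at double speed until it is back in
sync. Whether a guest tolerates its time base changing speed like this
is of course the open question.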


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 18:53                     ` Blue Swirl
@ 2010-05-27 19:08                       ` Jan Kiszka
  2010-05-27 19:19                         ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-27 19:08 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Juan Quintela, Paul Brook, qemu-devel

Blue Swirl wrote:
> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>> At the other extreme, would it be possible to make the educated guests
>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>
>>>> The fundamental problem here is that some guest operating systems assume that
>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>> triggers again.  In most virtualized environments you have absolutely no
>>>> guarantee of realtime response.
>>>>
>>>> With Linux guests this was solved a long time ago by the introduction of
>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>
>>>>
>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>> otherwise stall the CPU for long periods of time.
>>>>
>>>>
>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>> guests vary on when it is safe for them to see the next timer event, and
>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>
>>>> In some cases we don't even do that, and just reschedule the event some
>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>> work in that time. In a single threaded environment this is probably true -
>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>> execute some guest code before the end of its timeslice. In an environment
>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>> different threads this becomes increasingly likely to fail.
>>> So any voodoo around timer events is doomed to fail in some cases.
>>> What amount of hacks do we want then? Is there any generic
>> The aim of this patch is to reduce the amount of existing and upcoming
>> hacks. It may still require some refinements, but I think we haven't
>> found any smarter approach yet that fits existing use cases.
> 
> I don't feel we have tried other possibilities hard enough.

Well, seeing prototypes wouldn't be bad, also to run real load against
them. But at least I'm currently clueless what to implement.

> 
>>> solution, like slowing down the guest system to the point where we can
>>> guarantee the interrupt rate vs. CPU execution speed?
>> That's generally a non-option in virtualized production environments.
>> Specifically if the guest system lost interrupts due to host
>> overcommitment, you do not want it to slow down even further.
> 
> I meant that the guest time could be scaled down, for example 2s in
> wall clock time would be presented to the guest as 1s.

But that is precisely what already happens when the guest loses timer
interrupts. There is no other time source for this kind of guest,
except often for some external events generated by systems that you
don't want to fall behind arbitrarily.

> Then the amount
> of CPU cycles between timer interrupts would increase and hopefully
> the guest can keep up. If the guest sleeps, time base could be
> accelerated to catch up with wall clock and then set back to 1:1 rate.

Can't follow you ATM, sorry. What should be slowed down then? And how
precisely?

Jan

> 
> Slowing down could be triggered by measuring the guest load (for
> example, by checking for presence of halt instructions), if it's close
> to 1, time would be slowed down. If the guest starts to issue halt
> instructions because it's more idle, we can increase speed.
> 
> If this approach worked, even APIC could be made ignorant about
> coalescing voodoo so it should be a major cleanup.




* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 19:08                       ` Jan Kiszka
@ 2010-05-27 19:19                         ` Blue Swirl
  2010-05-27 22:19                           ` Jan Kiszka
  2010-05-27 22:21                           ` Paul Brook
  0 siblings, 2 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-27 19:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Juan Quintela, Paul Brook, qemu-devel

On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>
>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>> guarantee of realtime response.
>>>>>
>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>
>>>>>
>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>> otherwise stall the CPU for long periods of time.
>>>>>
>>>>>
>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>
>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>> work in that time. In a single threaded environment this is probably true -
>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>> different threads this becomes increasingly likely to fail.
>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>> What amount of hacks do we want then? Is there any generic
>>> The aim of this patch is to reduce the amount of existing and upcoming
>>> hacks. It may still require some refinements, but I think we haven't
>>> found any smarter approach yet that fits existing use cases.
>>
>> I don't feel we have tried other possibilities hard enough.
>
> Well, seeing prototypes wouldn't be bad, also to run real load against
> them. But at least I'm currently clueless what to implement.

Perhaps now is not the time to rush into implementing something, then,
but to brainstorm a clean solution.

>>
>>>> solution, like slowing down the guest system to the point where we can
>>>> guarantee the interrupt rate vs. CPU execution speed?
>>> That's generally a non-option in virtualized production environments.
>>> Specifically if the guest system lost interrupts due to host
>>> overcommitment, you do not want it to slow down even further.
>>
>> I meant that the guest time could be scaled down, for example 2s in
>> wall clock time would be presented to the guest as 1s.
>
> But that is precisely what already happens when the guest loses timer
> interrupts. There is no other time source for this kind of guest -
> often except for some external events generated by systems which you
> don't want to fall behind arbitrarily.
>
>> Then the amount
>> of CPU cycles between timer interrupts would increase and hopefully
>> the guest can keep up. If the guest sleeps, time base could be
>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>
> Can't follow you ATM, sorry. What should be slowed down then? And how
> precisely?

I think vm_clock and everything that depends on vm_clock. Also,
rtc_clock should be tied to vm_clock in this mode, not host_clock.

>
> Jan
>
>>
>> Slowing down could be triggered by measuring the guest load (for
>> example, by checking for presence of halt instructions), if it's close
>> to 1, time would be slowed down. If the guest starts to issue halt
>> instructions because it's more idle, we can increase speed.
>>
>> If this approach worked, even APIC could be made ignorant about
>> coalescing voodoo so it should be a major cleanup.
>
>
>


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 19:19                         ` Blue Swirl
@ 2010-05-27 22:19                           ` Jan Kiszka
  2010-05-28 19:00                             ` Blue Swirl
  2010-05-30 12:00                             ` Avi Kivity
  2010-05-27 22:21                           ` Paul Brook
  1 sibling, 2 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-27 22:19 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Juan Quintela, Paul Brook, qemu-devel

Blue Swirl wrote:
> On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> Blue Swirl wrote:
>>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>>
>>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>>> guarantee of realtime response.
>>>>>>
>>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>>
>>>>>>
>>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>>> otherwise stall the CPU for long periods of time.
>>>>>>
>>>>>>
>>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>>
>>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>>> work in that time. In a single threaded environment this is probably true -
>>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>>> different threads this becomes increasingly likely to fail.
>>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>>> What's the amount of hacks that we want then? Is there any generic
>>>> The aim of this patch is to reduce the amount of existing and upcoming
>>>> hacks. It may still require some refinements, but I think we haven't
>>>> found any smarter approach yet that fits existing use cases.
>>> I don't feel we have tried other possibilities hard enough.
>> Well, seeing prototypes wouldn't be bad, also to run real load against
>> them. But at least I'm currently clueless what to implement.
> 
> Perhaps now is then not the time to rush to implement something, but
> to brainstorm for a clean solution.

And sometimes it can help to understand how ideas could even be improved
or why others don't work at all.

> 
>>>>> solution, like slowing down the guest system to the point where we can
>>>>> guarantee the interrupt rate vs. CPU execution speed?
>>>> That's generally a non-option in virtualized production environments.
>>>> Specifically if the guest system lost interrupts due to host
>>>> overcommitment, you do not want it to slow down even further.
>>> I meant that the guest time could be scaled down, for example 2s in
>>> wall clock time would be presented to the guest as 1s.
>> But that is precisely what already happens when the guest loses timer
>> interrupts. There is no other time source for this kind of guests -
>> often except for some external events generated by systems which you
>> don't want to fall behind arbitrarily.
>>
>>> Then the amount
>>> of CPU cycles between timer interrupts would increase and hopefully
>>> the guest can keep up. If the guest sleeps, time base could be
>>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>> Can't follow you ATM, sorry. What should be slowed down then? And how
>> precisely?
> 
> I think vm_clock and everything that depends on vm_clock, also
> rtc_clock should be tied to vm_clock in this mode, not host_clock.

Let me check if I got this idea correctly: Instead of tuning just the
tick frequency of the affected timer device / sending its backlog in a
row, you rather want to tune the vm_clock correspondingly? Maybe a way
to abstract the required logic currently sitting only in the RTC for use
by other timer sources as well.

But just switching rtc_clock to vm_clock when the user wants host_clock
is obviously not an option. We would rather have to tune host_clock in
parallel.

Still, this does not answer:

- How do you want to detect lost timer ticks?

- What subsystem(s) keeps track of the backlog?

- And depending on the above: How to detect at all that a specific IRQ
  is a timer tick?

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 17:56                 ` Blue Swirl
  2010-05-27 18:31                   ` Jan Kiszka
@ 2010-05-27 22:21                   ` Paul Brook
  1 sibling, 0 replies; 122+ messages in thread
From: Paul Brook @ 2010-05-27 22:21 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

> > In some cases we don't even do that, and just reschedule the event some
> > arbitrarily small amount of time later. This assumes the guest to do
> > useful work in that time. In a single threaded environment this is
> > probably true - qemu got enough CPU to inject the first interrupt, so
> > will probably manage to execute some guest code before the end of its
> > timeslice. In an environment where interrupt processing/delivery and
> > execution of the guest code happen in different threads this becomes
> > increasingly likely to fail.
> 
> So any voodoo around timer events is doomed to fail in some cases.

Depends on the level of voodoo.
My guess is that common guest operating systems require hacks which result in 
demonstrably incorrect behavior.

> What's the amount of hacks that we want then? Is there any generic
> solution, like slowing down the guest system to the point where we can
> guarantee the interrupt rate vs. CPU execution speed?

The "-icount N" option gives deterministic virtual realtime behavior, however
the guest is completely decoupled from real-world time.
The "-icount auto" option gives semi-deterministic behavior while maintaining 
overall consistency with the real world.  This may introduce some small-scale 
time jitter, but will still satisfy all but the most demanding hard-real-time 
assumptions.

Neither of these options works with KVM. It may be possible to implement
something using performance counters.  I don't know how much additional
overhead this would involve.
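
To illustrate the fixed "-icount N" mode mentioned above: QEMU's documentation describes the virtual CPU as executing one instruction every 2^N ns of virtual time, so virtual time becomes a pure function of the instruction count. A minimal sketch of that mapping follows; the helper names icount_to_ns() and insns_until_deadline() are hypothetical and are not QEMU's internal functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * With "-icount N", each executed guest instruction accounts for
 * 2^N ns of virtual time. "shift" plays the role of N here.
 */
static inline int64_t icount_to_ns(int64_t insns, unsigned shift)
{
    assert(insns >= 0 && shift < 32);
    return insns << shift;
}

/*
 * Hypothetical helper: number of instructions the guest may run
 * before a timer deadline is reached, rounded up so that the
 * deadline has passed once the budget is used up.
 */
static inline int64_t insns_until_deadline(int64_t now_ns,
                                           int64_t deadline_ns,
                                           unsigned shift)
{
    assert(shift < 32);
    if (deadline_ns <= now_ns) {
        return 0; /* deadline already expired, fire immediately */
    }
    return (deadline_ns - now_ns + (((int64_t)1 << shift) - 1)) >> shift;
}
```

With N = 3, 1000 instructions always correspond to 8000 ns of virtual time, no matter how long the host needed to execute them. That is what makes the timer interrupt rate deterministic relative to guest progress, and also what decouples the guest from wall-clock time.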

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 19:19                         ` Blue Swirl
  2010-05-27 22:19                           ` Jan Kiszka
@ 2010-05-27 22:21                           ` Paul Brook
  2010-05-28 19:10                             ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Paul Brook @ 2010-05-27 22:21 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela


> >> Then the amount
> >> of CPU cycles between timer interrupts would increase and hopefully
> >> the guest can keep up. If the guest sleeps, time base could be
> >> accelerated to catch up with wall clock and then set back to 1:1 rate.
> > 
> > Can't follow you ATM, sorry. What should be slowed down then? And how
> > precisely?
> 
> I think vm_clock and everything that depends on vm_clock, also
> rtc_clock should be tied to vm_clock in this mode, not host_clock.

The problem is more fundamental than that. There is no real correlation 
between vm_clock and the amount of code executed by the guest, especially not 
on timescales less than a second.

Paul

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 18:37                 ` Blue Swirl
@ 2010-05-28  7:31                   ` Gleb Natapov
  2010-05-28 20:06                     ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-28  7:31 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> > Blue Swirl wrote:
> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >>> Anthony Liguori wrote:
> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >> >>>>>
> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >> >>>>>>
> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >> >>>>>> workarounds need to register handlers that return the appropriate
> >> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >> >>>>>> consider an IRQ coalesced if all other sinks either report
> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >> >>>>>>
> >> >>>>> No real devices are interested whether any of their output lines are
> >> >>>>> even connected. This would introduce a new signal type, bidirectional
> >> >>>>> multi-level, which is not correct.
> >> >>>>>
> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >> >>>> to a suggestion that we ought to introduce a new signal type for this
> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >> >>>> similar interface as qemu_irq.
> >> >>> A separate type would complicate the delivery of the feedback value
> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >> >>>
> >> >>>>> I think the real solution to coalescing is put the logic inside one
> >> >>>>> device, in this case APIC because it has the information about irq
> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >>>>> information and whether they get delivered or not. If not, an internal
> >> >>>>> timer is installed which injects the lost irqs.
> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >> >>> interrupt controller about the fact that a particular IRQ should be
> >> >>> delivered at a specific regular rate. For this, you also need a generic
> >> >>> interface - nothing really "won".
> >> >>
> >> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> >> monitor or tell anything.
> >> >
> >> > There are guests that won't like this (I know of one in-house, but
> >> > others may even have more examples), specifically if you end up firing
> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> >> > spreads the reinjection according to the current rate.
> >>
> >> Then reinject with a constant delay, or next CPU exit. Such buggy
> > If guest's time frequency is the same as host time frequency you can't
> > reinject with constant delay. That is why current code mixes two
> > approaches: reinject M interrupts in a row then delay.
> 
> This approach can be also used by APIC-only version.
> 
I don't know what APIC-only version you are talking about. I haven't
seen the code and I don't understand hand waving, sorry.

> >> guests could also be assisted with special handling (like win2k
> >> install hack), for example guest instructions could be counted
> >> (approximately, for example using TB size or TSC) and only inject
> >> after at least N instructions have passed.
> > Guest instructions cannot be easily counted in KVM (it can be done more
> > or less reliably using perf counters, maybe).
> 
> Aren't there any debug registers or perf counters, which can generate
> an interrupt after some number of instructions have been executed?
Don't think debug registers have something like that and they are
available for guest use anyway. Perf counters differ greatly from CPU
to CPU (even between two CPUs of the same manufacturer), and we want to
keep using them for profiling guests. And I don't see what problem it
would solve anyway that cannot be solved by a simple delay between IRQ
reinjections.

> 
> >>
> >> > And even if the rate did not matter, the APIC would still have to know
> >> > about the fact that an IRQ is really periodic and does not only appear
> >> > as such for a certain interval. This really does not sound like
> >> > simplifying things or even make them cleaner.
> >>
> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> just like any other device. With the bidirectional irqs, this voodoo
> >> would probably eventually spread to many other devices. The logical
> >> conclusion of that would be a system where all devices would be
> >> careful not to disturb the guest at wrong moment because that would
> >> trigger a bug.
> >>
> > This voodoo will be so complex and unreliable that it will make RTC hack
> > pale in comparison (and I still don't see how you are going to make it
> > actually work).
> 
> Implement everything inside APIC: only coalescing and reinjection.
APIC has zero info needed to implement reinjection correctly as was
shown to you several times in this thread and you simply keep ignoring
it.

> Maybe that version would not bend backwards as much as the current to
> cater for buggy hosts.
> 
You mean "buggy guests"? What guests are not buggy in your opinion?
Linux tries hard to be smart and as a result the only way to have stable
clock with it is to go paravirt.

> > The fact is that timer device is not "just like any
> > other device" in virtual world. Any other device is easy: you just
> > implement spec as close as possible and everything works. For time
> > source device this is not enough. You can implement RTC+HPET to the
> > letter and your guest will drift like crazy.
> 
> It's doable: a cycle accurate emulator will not cause any drift,
> without any voodoo. The interrupts would come after executing the same
> instruction as the real HW. For emulating any sufficiently buggy
> guests in any sufficiently desperate low resource conditions, this may
> be the only option that will always work.
> 
Yes, but qemu and kvm are not cycle accurate emulators and don't strive
to be one. On the contrary KVM runs at native host CPU speed most of the
time, so any emulation done between two instructions is theoretically
noticeable for a guest. TSC is bypassed directly to a guest too, so
keeping all time sources in perfect sync is also impossible.

> >
> >> At the other extreme, would it be possible to make the educated guests
> >> aware of the virtualization also in clock aspect: virtio-clock?
> > No.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.
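
For reference, the multi-sink rule from the quoted patch description (a source may only treat an IRQ as coalesced if every other sink reports either QEMU_IRQ_COALESCED or QEMU_IRQ_MASKED) could be sketched roughly as below. The names QEMU_IRQ_COALESCED and QEMU_IRQ_MASKED come from the patch description; QEMU_IRQ_DELIVERED, the numeric values, irq_combine_results() and the TickSource type are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Feedback codes; DELIVERED and the numeric values are assumed. */
enum {
    QEMU_IRQ_DELIVERED = 0,
    QEMU_IRQ_COALESCED = 1,
    QEMU_IRQ_MASKED    = 2,
};

/*
 * Hypothetical helper: fold the feedback of all sinks of one IRQ line
 * into a single code for the source. One successful delivery is enough
 * to count the tick as seen. Otherwise the tick counts as coalesced if
 * at least one sink coalesced it and the rest coalesced or masked it.
 */
static int irq_combine_results(const int *results, size_t n)
{
    int seen_coalesced = 0;

    assert(results != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (results[i] == QEMU_IRQ_DELIVERED) {
            return QEMU_IRQ_DELIVERED;
        }
        if (results[i] == QEMU_IRQ_COALESCED) {
            seen_coalesced = 1;
        }
    }
    return seen_coalesced ? QEMU_IRQ_COALESCED : QEMU_IRQ_MASKED;
}

/* Hypothetical periodic source that tracks ticks it still owes. */
typedef struct {
    unsigned backlog;
} TickSource;

/* Only coalesced ticks are queued for later reinjection. */
static void tick_source_account(TickSource *s, int result)
{
    if (result == QEMU_IRQ_COALESCED) {
        s->backlog++;
    }
}
```

How the backlog is drained (several in a row, spread at the current rate, or a mix of both) is exactly the policy question debated in this thread, so the sketch deliberately leaves it out.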

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 22:19                           ` Jan Kiszka
@ 2010-05-28 19:00                             ` Blue Swirl
  2010-05-30 12:00                             ` Avi Kivity
  1 sibling, 0 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-28 19:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Juan Quintela, Paul Brook, qemu-devel

On Thu, May 27, 2010 at 10:19 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>> Blue Swirl wrote:
>>>>>> On Wed, May 26, 2010 at 11:26 PM, Paul Brook <paul@codesourcery.com> wrote:
>>>>>>>> At the other extreme, would it be possible to make the educated guests
>>>>>>>> aware of the virtualization also in clock aspect: virtio-clock?
>>>>>>> The guest doesn't even need to be aware of virtualization. It just needs to be
>>>>>>> able to accommodate the lack of guaranteed realtime behavior.
>>>>>>>
>>>>>>> The fundamental problem here is that some guest operating systems assume that
>>>>>>> the hardware provides certain realtime guarantees with respect to execution of
>>>>>>> interrupt handlers.  In particular they assume that the CPU will always be
>>>>>>> able to complete execution of the timer IRQ handler before the periodic timer
>>>>>>> triggers again.  In most virtualized environments you have absolutely no
>>>>>>> guarantee of realtime response.
>>>>>>>
>>>>>>> With Linux guests this was solved a long time ago by the introduction of
>>>>>>> tickless kernels.  These separate the timekeeping from wakeup events, so it
>>>>>>> doesn't matter if several wakeup triggers end up getting merged (either at the
>>>>>>> hardware level or via top/bottom half guest IRQ handlers).
>>>>>>>
>>>>>>>
>>>>>>> It's worth mentioning that this problem also occurs on real hardware,
>>>>>>> typically due to lame hardware/drivers which end up masking interrupts or
>>>>>>> otherwise stall the CPU for long periods of time.
>>>>>>>
>>>>>>>
>>>>>>> The PIT hack attempts to workaround broken guests by adding artificial latency
>>>>>>> to the timer event, ensuring that the guest "sees" them all.  Unfortunately
>>>>>>> guests vary on when it is safe for them to see the next timer event, and
>>>>>>> trying to observe this behavior involves potentially harmful heuristics and
>>>>>>> collusion between unrelated devices (e.g. interrupt controller and timer).
>>>>>>>
>>>>>>> In some cases we don't even do that, and just reschedule the event some
>>>>>>> arbitrarily small amount of time later. This assumes the guest to do useful
>>>>>>> work in that time. In a single threaded environment this is probably true -
>>>>>>> qemu got enough CPU to inject the first interrupt, so will probably manage to
>>>>>>> execute some guest code before the end of its timeslice. In an environment
>>>>>>> where interrupt processing/delivery and execution of the guest code happen in
>>>>>>> different threads this becomes increasingly likely to fail.
>>>>>> So any voodoo around timer events is doomed to fail in some cases.
>>>>>> What's the amount of hacks that we want then? Is there any generic
>>>>> The aim of this patch is to reduce the amount of existing and upcoming
>>>>> hacks. It may still require some refinements, but I think we haven't
>>>>> found any smarter approach yet that fits existing use cases.
>>>> I don't feel we have tried other possibilities hard enough.
>>> Well, seeing prototypes wouldn't be bad, also to run real load against
>>> them. But at least I'm currently clueless what to implement.
>>
>> Perhaps now is then not the time to rush to implement something, but
>> to brainstorm for a clean solution.
>
> And sometimes it can help to understand how ideas could even be improved
> or why others don't work at all.
>
>>
>>>>>> solution, like slowing down the guest system to the point where we can
>>>>>> guarantee the interrupt rate vs. CPU execution speed?
>>>>> That's generally a non-option in virtualized production environments.
>>>>> Specifically if the guest system lost interrupts due to host
>>>>> overcommitment, you do not want it to slow down even further.
>>>> I meant that the guest time could be scaled down, for example 2s in
>>>> wall clock time would be presented to the guest as 1s.
>>> But that is precisely what already happens when the guest loses timer
>>> interrupts. There is no other time source for this kind of guests -
>>> often except for some external events generated by systems which you
>>> don't want to fall behind arbitrarily.
>>>
>>>> Then the amount
>>>> of CPU cycles between timer interrupts would increase and hopefully
>>>> the guest can keep up. If the guest sleeps, time base could be
>>>> accelerated to catch up with wall clock and then set back to 1:1 rate.
>>> Can't follow you ATM, sorry. What should be slowed down then? And how
>>> precisely?
>>
>> I think vm_clock and everything that depends on vm_clock, also
>> rtc_clock should be tied to vm_clock in this mode, not host_clock.
>
> Let me check if I got this idea correctly: Instead of tuning just the
> tick frequency of the affected timer device / sending its backlog in a
> row, you rather want to tune the vm_clock correspondingly? Maybe a way
> to abstract the required logic currently sitting only in the RTC for use
> by other timer sources as well.

Yes, that would be a good starting point.

> But just switching rtc_clock to vm_clock when the user wants host_clock
> is obviously not an option. We would rather have to tune host_clock in
> parallel.
>
> Still, this does not answer:
>
> - How do you want to detect lost timer ticks?

With APIC, just like now. But I think detecting lost ticks
shouldn't be the only way. It's an indication that things have gone
wrong, in the past. It would be better to use also other measurements
to see that the guest is close to a stall _before_ it happens.

> - What subsystem(s) keeps track of the backlog?

Preferably high level (vl.c, exec.c, cpu-exec.c).

> - And depending on the above: How to detect at all that a specific IRQ
>  is a timer tick?

Actually any IRQ acks can be delayed, those could be taken as a sign
of guest overload.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 22:21                           ` Paul Brook
@ 2010-05-28 19:10                             ` Blue Swirl
  0 siblings, 0 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-28 19:10 UTC (permalink / raw)
  To: Paul Brook; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Thu, May 27, 2010 at 10:21 PM, Paul Brook <paul@codesourcery.com> wrote:
>
>> >> Then the amount
>> >> of CPU cycles between timer interrupts would increase and hopefully
>> >> the guest can keep up. If the guest sleeps, time base could be
>> >> accelerated to catch up with wall clock and then set back to 1:1 rate.
>> >
>> > Can't follow you ATM, sorry. What should be slowed down then? And how
>> > precisely?
>>
>> I think vm_clock and everything that depends on vm_clock, also
>> rtc_clock should be tied to vm_clock in this mode, not host_clock.
>
> The problem is more fundamental than that. There is no real correlation
> between vm_clock and the amount of code executed by the guest, especially not
> on timescales less than a second.

Can we measure (or at least estimate with higher accuracy than the
tick IRQ delivery jitter) the amount of code executed, somehow? For
example, add TSC sampling to all TB or KVM VCPU exit and load/store
paths?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-28  7:31                   ` Gleb Natapov
@ 2010-05-28 20:06                     ` Blue Swirl
  2010-05-28 20:47                       ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-28 20:06 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/28 Gleb Natapov <gleb@redhat.com>:
> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> > Blue Swirl wrote:
>> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >>> Anthony Liguori wrote:
>> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >> >>>>>
>> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >> >>>>>>
>> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >> >>>>>> workarounds need to register handlers that return the appropriate
>> >> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >>>>>>
>> >> >>>>> No real devices are interested whether any of their output lines are
>> >> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >> >>>>> multi-level, which is not correct.
>> >> >>>>>
>> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >> >>>> similar interface as qemu_irq.
>> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >>>
>> >> >>>>> I think the real solution to coalescing is put the logic inside one
>> >> >>>>> device, in this case APIC because it has the information about irq
>> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >>>>> information and whether they get delivered or not. If not, an internal
>> >> >>>>> timer is installed which injects the lost irqs.
>> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >>> delivered at a specific regular rate. For this, you also need a generic
>> >> >>> interface - nothing really "won".
>> >> >>
>> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> monitor or tell anything.
>> >> >
>> >> > There are guests that won't like this (I know of one in-house, but
>> >> > others may even have more examples), specifically if you end up firing
>> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> >> > spreads the reinjection according to the current rate.
>> >>
>> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> > If guest's time frequency is the same as host time frequency you can't
>> > reinject with constant delay. That is why current code mixes two
>> > approaches: reinject M interrupts in a row then delay.
>>
>> This approach can be also used by APIC-only version.
>>
> I don't know what APIC-only version you are talking about. I haven't
> seen the code and I don't understand hand waving, sorry.

There is no code, because we're still at architecture design stage.

>> >> guests could also be assisted with special handling (like win2k
>> >> install hack), for example guest instructions could be counted
>> >> (approximately, for example using TB size or TSC) and only inject
>> >> after at least N instructions have passed.
>> > Guest instructions cannot be easily counted in KVM (it can be done more
>> > or less reliably using perf counters, maybe).
>>
>> Aren't there any debug registers or perf counters, which can generate
>> an interrupt after some number of instructions have been executed?
> Don't think debug registers have something like that and they are
> available for guest use anyway. Perf counters differ greatly from CPU
> to CPU (even between two CPUs of the same manufacturer), and we want to
> keep using them for profiling guests. And I don't see what problem it
> would solve anyway that cannot be solved by a simple delay between IRQ
> reinjections.

This would allow counting the executed instructions and limiting them.
Thus we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
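
As a worked example of that arithmetic, here is a small sketch of the instruction budget per timer period and the host time needed to consume it. It assumes, purely for illustration, one instruction per clock cycle on both guest and host; insn_budget() and host_ns_for_budget() are hypothetical names, not existing QEMU or KVM functions.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000LL

/*
 * Instructions a guest CPU of guest_hz may execute in period_ns,
 * assuming one instruction per cycle. Bounds keep the product
 * within int64_t range.
 */
static int64_t insn_budget(int64_t guest_hz, int64_t period_ns)
{
    assert(guest_hz > 0 && guest_hz <= 4 * NS_PER_SEC);
    assert(period_ns >= 0 && period_ns <= NS_PER_SEC);
    return guest_hz * period_ns / NS_PER_SEC;
}

/*
 * Host time needed to execute that many instructions at host_hz,
 * under the same one-instruction-per-cycle assumption.
 */
static int64_t host_ns_for_budget(int64_t insns, int64_t host_hz)
{
    assert(insns >= 0 && insns <= 4 * NS_PER_SEC);
    assert(host_hz > 0);
    return insns * NS_PER_SEC / host_hz;
}
```

For a 1 ms timer period, a 500MHz guest gets a budget of 500000 instructions, which a 2GHz host finishes in 250000 ns under these assumptions; the VCPU would then have to be held back for the remaining 750000 ns of the period.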

>>
>> >>
>> >> > And even if the rate did not matter, the APIC would still have to know
>> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> > as such for a certain interval. This really does not sound like
>> >> > simplifying things or even make them cleaner.
>> >>
>> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> would probably eventually spread to many other devices. The logical
>> >> conclusion of that would be a system where all devices would be
>> >> careful not to disturb the guest at wrong moment because that would
>> >> trigger a bug.
>> >>
>> > This voodoo will be so complex and unreliable that it will make RTC hack
>> > pale in comparison (and I still don't see how you are going to make it
>> > actually work).
>>
>> Implement everything inside APIC: only coalescing and reinjection.
> APIC has zero info needed to implement reinjection correctly as was
> shown to you several times in this thread and you simply keep ignoring
> it.

On the contrary, APIC is actually the only source of the IRQ ack
information. RTC hack would not work without APIC (or the
bidirectional IRQ) passing this info to RTC.

What APIC doesn't have now is the timer frequency or period info. This
is known by RTC and also higher levels managing the clocks.

I keep ignoring the idea that the current model, where both RTC and
APIC must somehow work together to make coalescing work, is the only
possible one just because it is committed and it happens to work in some
cases. It would be much better to concentrate this in one place, APIC
or preferably a higher level where it may benefit other timers too.
Provided of course that the other models can be made to work.

>> Maybe that version would not bend backwards as much as the current to
>> cater for buggy hosts.
>>
> You mean "buggy guests"?

Yes, sorry.

> What guests are not buggy in your opinion?
> Linux tries hard to be smart and as a result the only way to have stable
> clock with it is to go paravirt.

I'm not an OS designer, but I think an OS should never crash, even if
a burst of IRQs is received. Reprogramming the timer should consider
the pending IRQ situation (0 or 1 with real HW). Those bugs are one
cause of the problem.

>> > The fact is that timer device is not "just like any
>> > other device" in virtual world. Any other device is easy: you just
>> > implement spec as close as possible and everything works. For time
>> > source device this is not enough. You can implement RTC+HPET to the
>> > letter and your guest will drift like crazy.
>>
>> It's doable: a cycle accurate emulator will not cause any drift,
>> without any voodoo. The interrupts would come after executing the same
>> instruction as the real HW. For emulating any sufficiently buggy
>> guests in any sufficiently desperate low resource conditions, this may
>> be the only option that will always work.
>>
> Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> to be one. On the contrary KVM runs at native host CPU speed most of the
> time, so any emulation done between two instructions is theoretically
> noticeable for a guest. TSC is bypassed directly to a guest too, so
> keeping all time sources in perfect sync is also impossible.

That is actually another cause of the problem. KVM gives the guest an
illusion that the VCPU speed is equal to host speed. When they don't
match, especially in critical code, there can be problems. It would be
better to tell the guest a lower speed, which also can be guaranteed.

Maybe we should also offload the device emulation to another host CPU
with threading. A load from a device will always be much slower than
on real HW though.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-28 20:06                     ` Blue Swirl
@ 2010-05-28 20:47                       ` Gleb Natapov
  2010-05-29  7:58                         ` Jan Kiszka
  2010-05-29  9:15                         ` Blue Swirl
  0 siblings, 2 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-28 20:47 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
> >> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >> > Blue Swirl wrote:
> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> >> >> >>> Anthony Liguori wrote:
> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
> >> >> >>>>>
> >> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
> >> >> >>>>>>
> >> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
> >> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
> >> >> >>>>>> workarounds need to register handlers that return the appropriate
> >> >> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
> >> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
> >> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
> >> >> >>>>>> consider an IRQ coalesced if all other sinks either report
> >> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
> >> >> >>>>>>
> >> >> >>>>> No real devices are interested whether any of their output lines are
> >> >> >>>>> even connected. This would introduce a new signal type, bidirectional
> >> >> >>>>> multi-level, which is not correct.
> >> >> >>>>>
> >> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
> >> >> >>>> to a suggestion that we ought to introduce a new signal type for this
> >> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
> >> >> >>>> similar interface as qemu_irq.
> >> >> >>> A separate type would complicate the delivery of the feedback value
> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
> >> >> >>>
> >> >> >>>>> I think the real solution to coalescing is put the logic inside one
> >> >> >>>>> device, in this case APIC because it has the information about irq
> >> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
> >> >> >>>>> information and whether they get delivered or not. If not, an internal
> >> >> >>>>> timer is installed which injects the lost irqs.
> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
> >> >> >>> interrupt controller about the fact that a particular IRQ should be
> >> >> >>> delivered at a specific regular rate. For this, you also need a generic
> >> >> >>> interface - nothing really "won".
> >> >> >>
> >> >> >> OK, let's simplify: just reinject at next possible chance. No need to
> >> >> >> monitor or tell anything.
> >> >> >
> >> >> > There are guests that won't like this (I know of one in-house, but
> >> >> > others may even have more examples), specifically if you end up firing
> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
> >> >> > spreads the reinjection according to the current rate.
> >> >>
> >> >> Then reinject with a constant delay, or next CPU exit. Such buggy
> >> > If guest's time frequency is the same as host time frequency you can't
> >> > reinject with constant delay. That is why current code mixes two
> >> > approaches: reinject M interrupts in a row, then delay.
> >>
> >> This approach can be also used by APIC-only version.
> >>
> > I don't know what APIC-only version you are talking about. I haven't
> > seen the code and I don't understand hand waving, sorry.
> 
> There is no code, because we're still at architecture design stage.
> 
Try to write test code to understand the problem better.

> >> >> guests could also be assisted with special handling (like win2k
> >> >> install hack), for example guest instructions could be counted
> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> after at least N instructions have passed.
> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> > or less reliably using perf counters, may be).
> >>
> >> Aren't there any debug registers or perf counters, which can generate
> >> an interrupt after some number of instructions have been executed?
> > Don't think debug registers have something like that and they are
> > available for guest use anyway. Perf counters differs greatly from CPU
> > to CPU (even between two CPUs of the same manufacturer), and we want to
> > keep using them for profiling guests. And I don't see what problem it
> > will solve anyway that can be solved by simple delay between irq
> > reinjection.
> 
> This would allow counting the executed instructions and limit it. Thus
> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> 
Why would you want to limit the number of instructions executed by the guest
if the CPU has nothing else to do anyway? The problem occurs not when we have
spare cycles to give to a guest, but in the opposite case.

> >>
> >> >>
> >> >> > And even if the rate did not matter, the APIC would still have to know
> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> > as such for a certain interval. This really does not sound like
> >> >> > simplifying things or even make them cleaner.
> >> >>
> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> would probably eventually spread to many other devices. The logical
> >> >> conclusion of that would be a system where all devices would be
> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> trigger a bug.
> >> >>
> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> > pale in comparison (and I still don't see how you are going to make it
> >> > actually work).
> >>
> >> Implement everything inside APIC: only coalescing and reinjection.
> > APIC has zero info needed to implement reinjection correctly as was
> > shown to you several time in this thread and you simply keep ignoring
> > it.
> 
> On the contrary, APIC is actually the only source of the IRQ ack
> information. RTC hack would not work without APIC (or the
> bidirectional IRQ) passing this info to RTC.
> 
> What APIC doesn't have now is the timer frequency or period info. This
> is known by RTC and also higher levels managing the clocks.
> 
So APIC has one bit of information and RTC everything else. The current
approach (and proposed patch) brings this one bit of information to RTC;
you are arguing that RTC should be able to communicate all its info to
APIC. Sorry, I don't see that your way has any advantage. It is just a more
complex interface, and it is much easier to get it wrong for other time
sources.
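
To illustrate that one bit travelling back to the source, here is a minimal
sketch in C. It is not the patch code: QEMU_IRQ_COALESCED and QEMU_IRQ_MASKED
are the names used in the patch description, while QEMU_IRQ_DELIVERED, the
numeric values, the IrqSink/TimerSource structs and all function names are
assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* COALESCED and MASKED are named in the patch description;
 * DELIVERED and the numeric values are assumptions. */
enum { QEMU_IRQ_DELIVERED, QEMU_IRQ_COALESCED, QEMU_IRQ_MASKED };

/* Hypothetical sink: one input line of an interrupt controller. */
typedef struct {
    bool masked;   /* line masked by the guest */
    bool pending;  /* earlier IRQ not yet acknowledged */
} IrqSink;

/* Hypothetical source: a periodic timer that counts lost ticks. */
typedef struct {
    IrqSink *sink;
    unsigned coalesced;  /* ticks still owed to the guest */
} TimerSource;

/* Sink side: accept the IRQ and report what happened to it. */
int sink_raise(IrqSink *s)
{
    if (s->masked) {
        return QEMU_IRQ_MASKED;      /* not counted as coalesced */
    }
    if (s->pending) {
        return QEMU_IRQ_COALESCED;   /* merged into the pending IRQ */
    }
    s->pending = true;
    return QEMU_IRQ_DELIVERED;
}

/* Guest acknowledged the interrupt at the controller. */
void sink_ack(IrqSink *s)
{
    s->pending = false;
}

/* Source side: the return code is the only information it needs. */
void timer_tick(TimerSource *t)
{
    assert(t->sink);
    if (sink_raise(t->sink) == QEMU_IRQ_COALESCED) {
        t->coalesced++;
    }
}
```

In this sketch everything else (period, reset, acknowledge at the device)
stays inside the source, so the interface between the layers is just the
return value of the raise call.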

> I keep ignoring the idea that the current model, where both RTC and
> APIC must somehow work together to make coalescing work, is the only
> possible just because it is committed and it happens to work in some
> cases. It would be much better to concentrate this to one place, APIC
> or preferably higher level where it may benefit other timers too.
> Provided of course that the other models can be made to work.
> 
So write the code and show us. You haven't shown any evidence that RTC is
the wrong place. RTC knows when the interrupt was acknowledged to RTC, it
knows when the clock frequency changes, it knows when a device reset happened.
APIC knows only that the interrupt was coalesced. It doesn't even know that
it may be masked by a guest in IOAPIC (interrupts delivered while they
are masked are not considered coalesced). Time source knows only when
frequency changes and maybe when device reset happens if the timer is
stopped by the device on reset. So RTC is actually a sweet spot if you want
to minimize the amount of info you need to pass between various layers.
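
A rough sketch of how a device could use exactly these events, following the
"reinject M interrupts in a row, then delay" scheme mentioned earlier in the
thread. Everything here is hypothetical: the struct, the function names, the
value of M (REINJECT_ON_ACK_MAX) and the choice to drop the backlog on reset
are assumptions for illustration, not the actual RTC code.

```c
#include <assert.h>
#include <stdbool.h>

#define REINJECT_ON_ACK_MAX 2   /* hypothetical value for "M" */

typedef struct {
    unsigned backlog;            /* coalesced ticks still owed to the guest */
    unsigned reinjected_on_ack;  /* burst length since the last regular tick */
    bool timer_armed;            /* delayed reinjection has been requested */
} RtcCoalesceState;

/* Regular periodic tick; delivered == false means the sink reported
 * that the IRQ was coalesced. */
void rtc_periodic_tick(RtcCoalesceState *s, bool delivered)
{
    assert(s);
    s->reinjected_on_ack = 0;
    if (!delivered) {
        s->backlog++;
    }
}

/* The guest acknowledged the interrupt at the device. Returns true if one
 * lost tick should be reinjected right away; once M ticks were reinjected
 * back to back, the remaining backlog is left to a delayed timer. */
bool rtc_on_ack(RtcCoalesceState *s)
{
    assert(s);
    if (s->backlog == 0) {
        return false;
    }
    if (s->reinjected_on_ack < REINJECT_ON_ACK_MAX) {
        s->reinjected_on_ack++;
        s->backlog--;
        return true;
    }
    s->timer_armed = true;
    return false;
}

/* Device reset: an event only the device itself sees.
 * Dropping the backlog here is an assumption. */
void rtc_on_reset(RtcCoalesceState *s)
{
    assert(s);
    s->backlog = 0;
    s->reinjected_on_ack = 0;
    s->timer_armed = false;
}
```

The point of the sketch is only that acknowledge and reset are handled where
they are observed, and the coalesced/delivered flag is the single input that
comes from outside the device.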

> >> Maybe that version would not bend backwards as much as the current to
> >> cater for buggy hosts.
> >>
> > You mean "buggy guests"?
> 
> Yes, sorry.
> 
> > What guests are not buggy in your opinion?
> > Linux tries hard to be smart and as a result the only way to have stable
> > clock with it is to go paravirt.
> 
> I'm not an OS designer, but I think an OS should never crash, even if
> a burst of IRQs is received. Reprogramming the timer should consider
> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> cause of the problem.
An OS should never crash in the absence of HW bugs? I doubt you can design
an OS that can run in the face of any HW failure. Anyway, here we are
trying to solve the guest timekeeping problem, not crashes. Do you think
you can design an OS that keeps time accurately no matter how crazily all
HW clocks behave?

> 
> >> > The fact is that timer device is not "just like any
> >> > other device" in virtual world. Any other device is easy: you just
> >> > implement spec as close as possible and everything works. For time
> >> > source device this is not enough. You can implement RTC+HPET to the
> >> > letter and your guest will drift like crazy.
> >>
> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> without any voodoo. The interrupts would come after executing the same
> >> instruction as the real HW. For emulating any sufficiently buggy
> >> guests in any sufficiently desperate low resource conditions, this may
> >> be the only option that will always work.
> >>
> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> > to be one. On the contrary KVM runs at native host CPU speed most of the
> > time, so any emulation done between two instruction is theoretically
> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> > keeping all time source in perfect sync is also impossible.
> 
> That is actually another cause of the problem. KVM gives the guest an
> illusion that the VCPU speed is equal to host speed. When they don't
> match, especially in critical code, there can be problems. It would be
> better to tell the guest a lower speed, which also can be guaranteed.
> 
Not possible. It's that simple. You should take it into account in your
architecture design stage. In the case of KVM, a real physical CPU executes
guest instructions, and it does this as fast as it can. The only way we can
hide that from a guest is by intercepting each access to TSC, and at that
point we can use bochs instead.

> Maybe we should also offload the device emulation to another host CPU
> with threading. A load from a device will always be much slower than
> on real HW though.
The time drift problem starts to happen on loaded servers, so you do not have
a spare CPU to offload device emulation to.

--
			Gleb.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-28 20:47                       ` Gleb Natapov
@ 2010-05-29  7:58                         ` Jan Kiszka
  2010-05-29  9:35                           ` Blue Swirl
  2010-05-29  9:15                         ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-29  7:58 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>> Blue Swirl wrote:
>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>
>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>
>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>
>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>
>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>> interface - nothing really "won".
>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>> monitor or tell anything.
>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>> spreads the reinjection according to the current rate.
>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>> reinject with constant delay. That is why current code mixes two
>>>>> approaches: reinject M interrupts in a row, then delay.
>>>> This approach can be also used by APIC-only version.
>>>>
>>> I don't know what APIC-only version you are talking about. I haven't
>>> seen the code and I don't understand hand waving, sorry.
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.
> 
>>>>>> guests could also be assisted with special handling (like win2k
>>>>>> install hack), for example guest instructions could be counted
>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>> after at least N instructions have passed.
>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>> or less reliably using perf counters, may be).
>>>> Aren't there any debug registers or perf counters, which can generate
>>>> an interrupt after some number of instructions have been executed?
>>> Don't think debug registers have something like that and they are
>>> available for guest use anyway. Perf counters differs greatly from CPU
>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>> keep using them for profiling guests. And I don't see what problem it
>>> will solve anyway that can be solved by simple delay between irq
>>> reinjection.
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit the number of instructions executed by the guest
> if the CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles to give to a guest, but in the opposite case.
> 
>>>>>>> And even if the rate did not matter, the APIC would still have to know
>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>> simplifying things or even make them cleaner.
>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>> would probably eventually spread to many other devices. The logical
>>>>>> conclusion of that would be a system where all devices would be
>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>> trigger a bug.
>>>>>>
>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>> actually work).
>>>> Implement everything inside APIC: only coalescing and reinjection.
>>> APIC has zero info needed to implement reinjection correctly as was
>>> shown to you several time in this thread and you simply keep ignoring
>>> it.
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else. The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.
> 
>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases. It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't shown any evidence that RTC is
> the wrong place. RTC knows when the interrupt was acknowledged to RTC, it
> knows when the clock frequency changes, it knows when a device reset happened.
> APIC knows only that the interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked are not considered coalesced). Time source knows only when
> frequency changes and maybe when device reset happens if the timer is
> stopped by the device on reset. So RTC is actually a sweet spot if you want
> to minimize the amount of info you need to pass between various layers.

This is - according to my current understanding - the proposed
alternative architecture:

                                          .---------------.
                                          | de-coalescing |
                                          |     logic     |
                                          '---------------'
                                                ^   |
                                         period,|   |IRQ
                                       coalesced|   |(or tuned VM clock)
                                        (yes/no)|   v
.-------.              .--------.             .-------.
|  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
'-------'              '--------'             '-------'
    |                    ^    |                   ^
    |                    |    |                   |
    '-------period-------'    '------period-------'


And here is what this patch implements (except for the not yet factored
out de-coalescing logic):

   .---------------.
   | de-coalescing |
   |     logic     |
   '---------------'
         ^   |
  period,|   |next IRQ date
coalesced|   |(or tuned VM clock)
 (yes/no)|   v
       .-------.              .--------.             .-------.
       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
       '-------'              '--------'             '-------'
           ^                    |    ^                   |
           |                    |    |                   |
           '------coalesced-----'    '-----coalesced-----'
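
The "coalesced" arrows in the second diagram, together with the multi-sink
rule from the patch description quoted above, could be sketched roughly like
this. It is a hedged illustration, not the patch itself: QEMU_IRQ_DELIVERED,
the numeric values and the helper name irq_combine_feedback() are assumptions,
and returning QEMU_IRQ_MASKED when every sink is masked is my own guess at a
sensible default.

```c
#include <assert.h>
#include <stddef.h>

/* COALESCED and MASKED are named in the patch description;
 * DELIVERED and the numeric values are assumptions. */
enum { QEMU_IRQ_DELIVERED, QEMU_IRQ_COALESCED, QEMU_IRQ_MASKED };

/* Hypothetical helper for a router that fans one IRQ out to n sinks and
 * has to hand a single feedback code back to the source:
 *  - any sink delivered the IRQ        -> DELIVERED
 *  - otherwise, any sink coalesced it  -> COALESCED
 *    (all remaining sinks are then COALESCED or MASKED)
 *  - every sink masked                 -> MASKED (assumption) */
int irq_combine_feedback(const int *codes, size_t n)
{
    int result = QEMU_IRQ_MASKED;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (codes[i] == QEMU_IRQ_DELIVERED) {
            return QEMU_IRQ_DELIVERED;
        }
        if (codes[i] == QEMU_IRQ_COALESCED) {
            result = QEMU_IRQ_COALESCED;
        }
    }
    return result;
}
```

A router with a single downstream sink would simply pass the code through
unchanged, which is the case drawn in the diagram.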


I still don't see how the alternative is supposed to simplify our life
or improve the efficiency of the de-coalescing workaround. It's rather
problematic, as Gleb pointed out: The de-coalescing logic needs to be
informed about periodicity changes that can only be delivered along with
IRQs. So what to do with the backlog when the timer is stopped?

Regarding an improved de-coalescing logic: As its final design is largely
independent of how we collect the information, it could perfectly well be
done after we have laid the elementary foundation.

Jan


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-28 20:47                       ` Gleb Natapov
  2010-05-29  7:58                         ` Jan Kiszka
@ 2010-05-29  9:15                         ` Blue Swirl
  2010-05-29  9:36                           ` Jan Kiszka
                                             ` (2 more replies)
  1 sibling, 3 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-29  9:15 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/28 Gleb Natapov <gleb@redhat.com>:
> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>> > On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>> >> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>> >> > On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>> >> >> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >> > Blue Swirl wrote:
>> >> >> >> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> >> >> >>> Anthony Liguori wrote:
>> >> >> >>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>> >> >> >>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>> >> >> >>>>>
>> >> >> >>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>> >> >> >>>>>>
>> >> >> >>>>>> This allows to communicate potential IRQ coalescing during delivery from
>> >> >> >>>>>> the sink back to the source. Targets that support IRQ coalescing
>> >> >> >>>>>> workarounds need to register handlers that return the appropriate
>> >> >> >>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>> >> >> >>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>> >> >> >>>>>> apply its workaround. If multiple sinks exist, the source may only
>> >> >> >>>>>> consider an IRQ coalesced if all other sinks either report
>> >> >> >>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>> >> >> >>>>>>
>> >> >> >>>>> No real devices are interested whether any of their output lines are
>> >> >> >>>>> even connected. This would introduce a new signal type, bidirectional
>> >> >> >>>>> multi-level, which is not correct.
>> >> >> >>>>>
>> >> >> >>>> I don't think it's really an issue of correct, but I wouldn't disagree
>> >> >> >>>> to a suggestion that we ought to introduce a new signal type for this
>> >> >> >>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>> >> >> >>>> similar interface as qemu_irq.
>> >> >> >>> A separate type would complicate the delivery of the feedback value
>> >> >> >>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>> >> >> >>>
>> >> >> >>>>> I think the real solution to coalescing is put the logic inside one
>> >> >> >>>>> device, in this case APIC because it has the information about irq
>> >> >> >>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>> >> >> >>>>> information and whether they get delivered or not. If not, an internal
>> >> >> >>>>> timer is installed which injects the lost irqs.
>> >> >> >>> That won't fly as the IRQs will already arrive at the APIC with a
>> >> >> >>> sufficiently high jitter. At the bare minimum, you need to tell the
>> >> >> >>> interrupt controller about the fact that a particular IRQ should be
>> >> >> >>> delivered at a specific regular rate. For this, you also need a generic
>> >> >> >>> interface - nothing really "won".
>> >> >> >>
>> >> >> >> OK, let's simplify: just reinject at next possible chance. No need to
>> >> >> >> monitor or tell anything.
>> >> >> >
>> >> >> > There are guests that won't like this (I know of one in-house, but
>> >> >> > others may even have more examples), specifically if you end up firing
>> >> >> > multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>> >> >> > spreads the reinjection according to the current rate.
>> >> >>
>> >> >> Then reinject with a constant delay, or next CPU exit. Such buggy
>> >> > If guest's time frequency is the same as host time frequency you can't
>> >> > reinject with constant delay. That is why current code mixes two
>> >> > approaches: reinject M interrupts in a row, then delay.
>> >>
>> >> This approach can be also used by APIC-only version.
>> >>
>> > I don't know what APIC-only version you are talking about. I haven't
>> > seen the code and I don't understand hand waving, sorry.
>>
>> There is no code, because we're still at architecture design stage.
>>
> Try to write test code to understand the problem better.

I will.

>> >> >> guests could also be assisted with special handling (like win2k
>> >> >> install hack), for example guest instructions could be counted
>> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> after at least N instructions have passed.
>> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> > or less reliably using perf counters, may be).
>> >>
>> >> Aren't there any debug registers or perf counters, which can generate
>> >> an interrupt after some number of instructions have been executed?
>> > Don't think debug registers have something like that and they are
>> > available for guest use anyway. Perf counters differs greatly from CPU
>> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> > keep using them for profiling guests. And I don't see what problem it
>> > will solve anyway that can be solved by simple delay between irq
>> > reinjection.
>>
>> This would allow counting the executed instructions and limit it. Thus
>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>
> Why would you want to limit the number of instructions executed by the guest
> if the CPU has nothing else to do anyway? The problem occurs not when we have
> spare cycles to give to a guest, but in the opposite case.

I think one problem is that the guest has executed too much compared
to what would happen with real HW with a lesser CPU. That explains the
RTC frequency reprogramming case.

>
>> >>
>> >> >>
>> >> >> > And even if the rate did not matter, the APIC would still have to know
>> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> > as such for a certain interval. This really does not sound like
>> >> >> > simplifying things or even make them cleaner.
>> >> >>
>> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> would probably eventually spread to many other devices. The logical
>> >> >> conclusion of that would be a system where all devices would be
>> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> trigger a bug.
>> >> >>
>> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> > pale in comparison (and I still don't see how you are going to make it
>> >> > actually work).
>> >>
>> >> Implement everything inside APIC: only coalescing and reinjection.
>> > APIC has zero info needed to implement reinjection correctly as was
>> > shown to you several time in this thread and you simply keep ignoring
>> > it.
>>
>> On the contrary, APIC is actually the only source of the IRQ ack
>> information. RTC hack would not work without APIC (or the
>> bidirectional IRQ) passing this info to RTC.
>>
>> What APIC doesn't have now is the timer frequency or period info. This
>> is known by RTC and also higher levels managing the clocks.
>>
> So APIC has one bit of information and RTC everything else.

The information known by RTC (timer period) is also known by higher levels.

> The current
> approach (and proposed patch) brings this one bit of information to RTC,
> you are arguing that RTC should be able to communicate all its info to
> APIC. Sorry I don't see that your way has any advantage. Just more
> complex interface and it is much easier to get it wrong for other time
> sources.

I no longer think that APIC should be handling this, but rather the
generic code, like vl.c or exec.c. Then information would only be
passed from APIC to higher levels.

>> I keep ignoring the idea that the current model, where both RTC and
>> APIC must somehow work together to make coalescing work, is the only
>> possible just because it is committed and it happens to work in some
>> cases. It would be much better to concentrate this to one place, APIC
>> or preferably higher level where it may benefit other timers too.
>> Provided of course that the other models can be made to work.
>>
> So write the code and show us. You haven't shown any evidence that RTC is
> the wrong place. RTC knows when the interrupt was acknowledged to RTC, it
> knows when the clock frequency changes, it knows when a device reset happened.
> APIC knows only that the interrupt was coalesced. It doesn't even know that
> it may be masked by a guest in IOAPIC (interrupts delivered while they
> are masked are not considered coalesced).

Oh, I thought interrupt masking was the reason for coalescing! What
exactly is the reason then?

> Time source knows only when
> frequency changes and maybe when device reset happens if the timer is
> stopped by the device on reset. So RTC is actually a sweet spot if you want
> to minimize the amount of info you need to pass between various layers.
>
>> >> Maybe that version would not bend backwards as much as the current to
>> >> cater for buggy hosts.
>> >>
>> > You mean "buggy guests"?
>>
>> Yes, sorry.
>>
>> > What guests are not buggy in your opinion?
>> > Linux tries hard to be smart and as a result the only way to have stable
>> > clock with it is to go paravirt.
>>
>> I'm not an OS designer, but I think an OS should never crash, even if
>> a burst of IRQs is received. Reprogramming the timer should consider
>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> cause of the problem.
> OS should never crash in the absence of HW bugs? I doubt you can design
> an OS that can run in a face of any HW failure. Anyway here we are
> trying to solve guests time keeping problem not crashes. Do you think
> you can design OS that can keep time accurately no matter how crazy all
> HW clock behaves?

I think my OS design skills are not relevant in this discussion, but
IIRC there are fault tolerant operating systems for extreme conditions
so it can be done.

>
>>
>> >> > The fact is that timer device is not "just like any
>> >> > other device" in virtual world. Any other device is easy: you just
>> >> > implement spec as close as possible and everything works. For time
>> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> > letter and your guest will drift like crazy.
>> >>
>> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> without any voodoo. The interrupts would come after executing the same
>> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> guests in any sufficiently desperate low resource conditions, this may
>> >> be the only option that will always work.
>> >>
>> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> > time, so any emulation done between two instruction is theoretically
>> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> > keeping all time source in perfect sync is also impossible.
>>
>> That is actually another cause of the problem. KVM gives the guest an
>> illusion that the VCPU speed is equal to host speed. When they don't
>> match, especially in critical code, there can be problems. It would be
>> better to tell the guest a lower speed, which also can be guaranteed.
>>
> Not possible. It's that simple. You should take it into account in your
> architecture design stage. In case of KVM real physical CPU executes guest
> instruction and it does this as fast as it can. The only way we can hide
> that from a guest is by intercepting each access to TSC and at that
> point we can use bochs instead.

Well, as Paul pointed out, there's also the icount option.

>> Maybe we should also offline the device emulation to another host CPU
>> with threading. A load from a device will always be much slower than
>> on real HW though.
> Time drift problems start to happen on loaded servers, so you do not have
> a spare CPU to offload device emulation to.
>
> --
>                        Gleb.
>

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  7:58                         ` Jan Kiszka
@ 2010-05-29  9:35                           ` Blue Swirl
  2010-05-29  9:45                             ` Jan Kiszka
  2010-05-29 14:46                             ` Gleb Natapov
  0 siblings, 2 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-29  9:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
> Gleb Natapov wrote:
>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>> Blue Swirl wrote:
>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>
>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>
>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>
>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>> interface - nothing really "won".
>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>> monitor or tell anything.
>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>> approaches: reinject M interrupts in a row then delay.
>>>>> This approach can be also used by APIC-only version.
>>>>>
>>>> I don't know what APIC-only version you are talking about. I haven't
>>>> seen the code and I don't understand hand waving, sorry.
>>> There is no code, because we're still at architecture design stage.
>>>
>> Try to write test code to understand the problem better.
>>
>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>> install hack), for example guest instructions could be counted
>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>> after at least N instructions have passed.
>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>> or less reliably using perf counters, may be).
>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>> an interrupt after some number of instructions have been executed?
>>>> Don't think debug registers have something like that and they are
>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>> keep using them for profiling guests. And I don't see what problem it
>>>> will solve anyway that can be solved by simple delay between irq
>>>> reinjection.
>>> This would allow counting the executed instructions and limit it. Thus
>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>
>> Why would you want to limit number of instruction executed by guest if
>> CPU has nothing else to do anyway? The problem occurs not when we have
>> spare cycles to give to a guest, but in the opposite case.
>>
>>>>>>>> And even if the rate did not matter, the APIC would still have to know
>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>> simplifying things or even make them cleaner.
>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>> trigger a bug.
>>>>>>>
>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>> actually work).
>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>> APIC has zero info needed to implement reinjection correctly as was
>>>> shown to you several times in this thread and you simply keep ignoring
>>>> it.
>>> On the contrary, APIC is actually the only source of the IRQ ack
>>> information. RTC hack would not work without APIC (or the
>>> bidirectional IRQ) passing this info to RTC.
>>>
>>> What APIC doesn't have now is the timer frequency or period info. This
>>> is known by RTC and also higher levels managing the clocks.
>>>
>> So APIC has one bit of information and RTC everything else. The current
>> approach (and proposed patch) brings this one bit of information to RTC,
>> you are arguing that RTC should be able to communicate all its info to
>> APIC. Sorry I don't see that your way has any advantage. Just more
>> complex interface and it is much easier to get it wrong for other time
>> sources.
>>
>>> I keep ignoring the idea that the current model, where both RTC and
>>> APIC must somehow work together to make coalescing work, is the only
>>> possible just because it is committed and it happens to work in some
>>> cases. It would be much better to concentrate this to one place, APIC
>>> or preferably higher level where it may benefit other timers too.
>>> Provided of course that the other models can be made to work.
>>>
>> So write the code and show us. You haven't shown any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledged to RTC, it
>> knows when clock frequency changes, it knows when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced). Time source knows only when
>> frequency changes and may be when device reset happens if timer is
>> stopped by device on reset. So RTC is actually a sweet spot if you want
>> to minimize amount of info you need to pass between various layers.
>
> This is - according to my current understanding - the proposed
> alternative architecture:
>
>                                          .---------------.
>                                          | de-coalescing |
>                                          |     logic     |
>                                          '---------------'
>                                                ^   |
>                                         period,|   |IRQ
>                                       coalesced|   |(or tuned VM clock)
>                                        (yes/no)|   v
> .-------.              .--------.             .-------.
> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
> '-------'              '--------'             '-------'
>    |                    ^    |                   ^
>    |                    |    |                   |
>    '-------period-------'    '------period-------'

The period information is already known by the higher level clock
management, it should be available (or made available) to
de-coalescing logic (which should probably be located close to other
clock stuff) but otherwise there shouldn't be a need to pass it
around. The tuned VM clock of course only affects the RTC. Otherwise
correct.

> And here is what this patch implements (except for the not yet factored
> out de-coalescing logic):
>
>   .---------------.
>   | de-coalescing |
>   |     logic     |
>   '---------------'
>         ^   |
>  period,|   |next IRQ date
> coalesced|   |(or tuned VM clock)
>  (yes/no)|   v
>       .-------.              .--------.             .-------.
>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>       '-------'              '--------'             '-------'
>           ^                    |    ^                   |
>           |                    |    |                   |
>           '------coalesced-----'    '-----coalesced-----'

Why not route the coalesced signal directly from APIC to de-coalescing
logic? Then our designs would match!

>
> I still don't see how the alternative is supposed to simplify our life
> or improve the efficiency of the de-coalescing workaround. It's rather
> problematic like Gleb pointed out: The de-coalescing logic needs to be
> informed about periodicity changes that can only be delivered along
> IRQs. So what to do with the backlog when the timer is stopped?

What happens with the current design? Gleb only mentioned the
frequency change, I thought that was not such a big problem. But I don't
think this case should be allowed to happen at all, it can't exist on
real HW.

> Regarding an improved de-coalescing logic: As its final design is widely
> independent of how we collect the information, it could perfectly be
> done after we laid the elementary foundation.

True. But what is the foundation, do we need bidirectional IRQs or
not? I still think current IRQs should be enough.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:15                         ` Blue Swirl
@ 2010-05-29  9:36                           ` Jan Kiszka
  2010-05-29 14:38                           ` Gleb Natapov
  2010-05-30 12:05                           ` Avi Kivity
  2 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-29  9:36 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

Blue Swirl wrote:
>>> On the contrary, APIC is actually the only source of the IRQ ack
>>> information. RTC hack would not work without APIC (or the
>>> bidirectional IRQ) passing this info to RTC.
>>>
>>> What APIC doesn't have now is the timer frequency or period info. This
>>> is known by RTC and also higher levels managing the clocks.
>>>
>> So APIC has one bit of information and RTC everything else.
> 
> The information known by RTC (timer period) is also known by higher levels.

Curious to see where you'll find this.

> 
>> The current
>> approach (and proposed patch) brings this one bit of information to RTC,
>> you are arguing that RTC should be able to communicate all its info to
>> APIC. Sorry I don't see that your way has any advantage. Just more
>> complex interface and it is much easier to get it wrong for other time
>> sources.
> 
> I no longer think the APIC should handle this; generic code such as
> vl.c or exec.c should. Then information would only need to pass from
> the APIC to higher levels.

You neglect the information required to associate a periodic source
(e.g. RTC) with an IRQ sink (e.g. APIC). Without that, you will have a
hard time figuring out if a reported IRQ coalescing requires any
activities or should simply be welcomed (for I/O IRQs).

> 
>>> I keep ignoring the idea that the current model, where both RTC and
>>> APIC must somehow work together to make coalescing work, is the only
>>> possible just because it is committed and it happens to work in some
>>> cases. It would be much better to concentrate this to one place, APIC
>>> or preferably higher level where it may benefit other timers too.
>>> Provided of course that the other models can be made to work.
>>>
>> So write the code and show us. You haven't shown any evidence that RTC is
>> the wrong place. RTC knows when interrupt was acknowledged to RTC, it
>> knows when clock frequency changes, it knows when device reset happened.
>> APIC knows only that interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked not considered coalesced).
> 
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?

Missing acks, i.e. the IRQ is still pending when the next one arrives.
You want to filter out masked/suppressed IRQs to avoid running the
de-coalescing logic on sources that are actually cut off (like the RTC
IRQ when the HPET took over).
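
The distinction can be sketched in a few lines of C. This is only a
minimal model, not code from the patch: the QEMU_IRQ_COALESCED and
QEMU_IRQ_MASKED names follow the patch description, while
QEMU_IRQ_DELIVERED, the numeric values, the IrqLine type and both
helper functions are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Delivery results reported back to the IRQ source. COALESCED and MASKED
 * are named in the patch description; DELIVERED and the numeric values
 * are assumptions of this sketch. */
enum {
    QEMU_IRQ_DELIVERED,
    QEMU_IRQ_COALESCED,
    QEMU_IRQ_MASKED,
};

/* Hypothetical, heavily simplified model of one interrupt line at the sink. */
typedef struct IrqLine {
    bool masked;   /* the guest has cut this line off */
    bool pending;  /* raised earlier and not yet acknowledged */
} IrqLine;

/* Raise the line and report what happened to the event. */
static int irq_line_raise(IrqLine *line)
{
    assert(line != NULL);
    if (line->masked) {
        /* Cut-off source: must not trigger any de-coalescing. */
        return QEMU_IRQ_MASKED;
    }
    if (line->pending) {
        /* Missing ack: the previous event is still pending. */
        return QEMU_IRQ_COALESCED;
    }
    line->pending = true;
    return QEMU_IRQ_DELIVERED;
}

/* Guest acknowledges the interrupt; the next raise can be delivered. */
static void irq_line_ack(IrqLine *line)
{
    assert(line != NULL);
    line->pending = false;
}
```

The masked check comes first on purpose: a masked line never reports
coalescing, which matches the filtering described above.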

> 
>> Time source knows only when
>> frequency changes and may be when device reset happens if timer is
>> stopped by device on reset. So RTC is actually a sweet spot if you want
>> to minimize amount of info you need to pass between various layers.
>>
>>>>> Maybe that version would not bend backwards as much as the current to
>>>>> cater for buggy hosts.
>>>>>
>>>> You mean "buggy guests"?
>>> Yes, sorry.
>>>
>>>> What guests are not buggy in your opinion?
>>>> Linux tries hard to be smart and as a result the only way to have stable
>>>> clock with it is to go paravirt.
>>> I'm not an OS designer, but I think an OS should never crash, even if
>>> a burst of IRQs is received. Reprogramming the timer should consider
>>> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>>> cause of the problem.
>> OS should never crash in the absence of HW bugs? I doubt you can design
>> an OS that can run in a face of any HW failure. Anyway here we are
>> trying to solve guests time keeping problem not crashes. Do you think
>> you can design OS that can keep time accurately no matter how crazy all
>> HW clock behaves?
> 
> I think my OS design skills are not relevant in this discussion, but
> IIRC there are fault tolerant operating systems for extreme conditions
> so it can be done.

No one can influence the design of released OS versions anymore.

> 
>>>>>> The fact is that timer device is not "just like any
>>>>>> other device" in virtual world. Any other device is easy: you just
>>>>>> implement spec as close as possible and everything works. For time
>>>>>> source device this is not enough. You can implement RTC+HPET to the
>>>>>> letter and your guest will drift like crazy.
>>>>> It's doable: a cycle accurate emulator will not cause any drift,
>>>>> without any voodoo. The interrupts would come after executing the same
>>>>> instruction as the real HW. For emulating any sufficiently buggy
>>>>> guests in any sufficiently desperate low resource conditions, this may
>>>>> be the only option that will always work.
>>>>>
>>>> Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>>>> to be one. On the contrary KVM runs at native host CPU speed most of the
>>>> time, so any emulation done between two instruction is theoretically
>>>> noticeable for a guest. TSC is bypassed directly to a guest too, so
>>>> keeping all time source in perfect sync is also impossible.
>>> That is actually another cause of the problem. KVM gives the guest an
>>> illusion that the VCPU speed is equal to host speed. When they don't
>>> match, especially in critical code, there can be problems. It would be
>>> better to tell the guest a lower speed, which also can be guaranteed.
>>>
>> Not possible. It's that simple. You should take it into account in your
>> architecture design stage. In case of KVM real physical CPU executes guest
>> instruction and it does this as fast as it can. The only way we can hide
>> that from a guest is by intercepting each access to TSC and at that
>> point we can use bochs instead.
> 
> Well, as Paul pointed out, there's also the icount option.

Which is not available in virtualization mode.

Jan


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:35                           ` Blue Swirl
@ 2010-05-29  9:45                             ` Jan Kiszka
  2010-05-29 10:04                               ` Blue Swirl
  2010-05-29 14:46                             ` Gleb Natapov
  1 sibling, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-29  9:45 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

Blue Swirl wrote:
> 2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
>> Gleb Natapov wrote:
>>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propagate the code across all IRQ
>>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>>
>>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>>
>>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>>> interface - nothing really "won".
>>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>>> monitor or tell anything.
>>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>>> approaches: reinject M interrupts in a row then delay.
>>>>>> This approach can be also used by APIC-only version.
>>>>>>
>>>>> I don't know what APIC-only version you are talking about. I haven't
>>>>> seen the code and I don't understand hand waving, sorry.
>>>> There is no code, because we're still at architecture design stage.
>>>>
>>> Try to write test code to understand the problem better.
>>>
>>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>>> install hack), for example guest instructions could be counted
>>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>>> after at least N instructions have passed.
>>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>>> or less reliably using perf counters, may be).
>>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>>> an interrupt after some number of instructions have been executed?
>>>>> Don't think debug registers have something like that and they are
>>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>>> keep using them for profiling guests. And I don't see what problem it
>>>>> will solve anyway that can be solved by simple delay between irq
>>>>> reinjection.
>>>> This would allow counting the executed instructions and limit it. Thus
>>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>>
>>> Why would you want to limit number of instruction executed by guest if
>>> CPU has nothing else to do anyway? The problem occurs not when we have
>>> spare cycles to give to a guest, but in the opposite case.
>>>
>>>>>>>>> And even if the rate did not matter, the APIC would still have to know
>>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>>> simplifying things or even make them cleaner.
>>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>>> trigger a bug.
>>>>>>>>
>>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>>> actually work).
>>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>>> APIC has zero info needed to implement reinjection correctly as was
>>>>> shown to you several times in this thread and you simply keep ignoring
>>>>> it.
>>>> On the contrary, APIC is actually the only source of the IRQ ack
>>>> information. RTC hack would not work without APIC (or the
>>>> bidirectional IRQ) passing this info to RTC.
>>>>
>>>> What APIC doesn't have now is the timer frequency or period info. This
>>>> is known by RTC and also higher levels managing the clocks.
>>>>
>>> So APIC has one bit of information and RTC everything else. The current
>>> approach (and proposed patch) brings this one bit of information to RTC,
>>> you are arguing that RTC should be able to communicate all its info to
>>> APIC. Sorry I don't see that your way has any advantage. Just more
>>> complex interface and it is much easier to get it wrong for other time
>>> sources.
>>>
>>>> I keep ignoring the idea that the current model, where both RTC and
>>>> APIC must somehow work together to make coalescing work, is the only
>>>> possible just because it is committed and it happens to work in some
>>>> cases. It would be much better to concentrate this to one place, APIC
>>>> or preferably higher level where it may benefit other timers too.
>>>> Provided of course that the other models can be made to work.
>>>>
>>> So write the code and show us. You haven't shown any evidence that RTC is
>>> the wrong place. RTC knows when interrupt was acknowledged to RTC, it
>>> knows when clock frequency changes, it knows when device reset happened.
>>> APIC knows only that interrupt was coalesced. It doesn't even know that
>>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>>> are masked not considered coalesced). Time source knows only when
>>> frequency changes and may be when device reset happens if timer is
>>> stopped by device on reset. So RTC is actually a sweet spot if you want
>>> to minimize amount of info you need to pass between various layers.
>> This is - according to my current understanding - the proposed
>> alternative architecture:
>>
>>                                          .---------------.
>>                                          | de-coalescing |
>>                                          |     logic     |
>>                                          '---------------'
>>                                                ^   |
>>                                         period,|   |IRQ
>>                                       coalesced|   |(or tuned VM clock)
>>                                        (yes/no)|   v
>> .-------.              .--------.             .-------.
>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>> '-------'              '--------'             '-------'
>>    |                    ^    |                   ^
>>    |                    |    |                   |
>>    '-------period-------'    '------period-------'
> 
> The period information is already known by the higher level clock
> management,

The timer device models program the next event of some qemu-timer. There
is no tag attached that explains the reason behind this programming to
the timer subsystem or anyone else.

> it should be available (or made available) to
> de-coalescing logic (which should probably be located close to other
> clock stuff) but otherwise there shouldn't be a need to pass it
> around. The tuned VM clock of course only affects the RTC. Otherwise
> correct.
> 
>> And here is what this patch implements (except for the not yet factored
>> out de-coalescing logic):
>>
>>   .---------------.
>>   | de-coalescing |
>>   |     logic     |
>>   '---------------'
>>         ^   |
>>  period,|   |next IRQ date
>> coalesced|   |(or tuned VM clock)
>>  (yes/no)|   v
>>       .-------.              .--------.             .-------.
>>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>       '-------'              '--------'             '-------'
>>           ^                    |    ^                   |
>>           |                    |    |                   |
>>           '------coalesced-----'    '-----coalesced-----'
> 
> Why not route the coalesced signal directly from APIC to de-coalescing
> logic? Then our designs would match!

Because the de-coalescing logic will not be able to associate it with
the period information delivered by the RTC (without applying ugly hacks).

> 
>> I still don't see how the alternative is supposed to simplify our life
>> or improve the efficiency of the de-coalescing workaround. It's rather
>> problematic like Gleb pointed out: The de-coalescing logic needs to be
>> informed about periodicity changes that can only be delivered along
>> IRQs. So what to do with the backlog when the timer is stopped?
> 
> What happens with the current design? Gleb only mentioned the
> frequency change, I thought that was not so big problem. But I don't
> think this case should be allowed happen at all, it can't exist on
> real HW.
> 
>> Regarding an improved de-coalescing logic: As its final design is widely
>> independent of how we collect the information, it could perfectly be
>> done after we laid the elementary foundation.
> 
> True. But what is the foundation, do we need bidirectional IRQs or
> not? I still think current IRQs should be enough.

Then explain how you want to do the association between IRQ injection
and IRQ period. Both of my diagrams require an additional channel
between source and sink, either forward (tag IRQ event with its period)
or backward (report injection result - aka feedback IRQs). We currently
have no such channel apart from the APIC hack (apic_reset/get_irq_delivered).
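
To illustrate the backward variant, here is a minimal sketch. All names
(irq_result, sink_raise, source_tick) are made up for this example and are
not the interface from the patch; it only shows how a return code lets the
source learn about coalescing while the sink stays ignorant of periods.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical delivery codes, modelled on the QEMU_IRQ_* idea of this series. */
enum irq_result { IRQ_DELIVERED, IRQ_COALESCED, IRQ_MASKED };

/* Sink, e.g. one input of an interrupt controller. */
struct sink {
    bool masked;   /* line is masked by the guest */
    bool pending;  /* previous IRQ has not been acknowledged yet */
};

/* Source, e.g. a periodic timer device. */
struct source {
    struct sink *out;
    unsigned backlog;  /* coalesced ticks that still await re-injection */
};

/* Raise the line and report what happened to the event. */
static enum irq_result sink_raise(struct sink *s)
{
    if (s->masked) {
        return IRQ_MASKED;      /* masked IRQs do not count as coalesced */
    }
    if (s->pending) {
        return IRQ_COALESCED;   /* merged into the still-pending IRQ */
    }
    s->pending = true;
    return IRQ_DELIVERED;
}

/* Guest acknowledged the IRQ at the sink. */
static void sink_ack(struct sink *s)
{
    s->pending = false;
}

/* One period elapsed at the source: raise the IRQ and evaluate the feedback. */
static void source_tick(struct source *src)
{
    assert(src->out != NULL);
    if (sink_raise(src->out) == IRQ_COALESCED) {
        src->backlog++;         /* the source knows its own period, so it
                                   can schedule the compensation itself */
    }
}
```

The forward variant would instead need an extra parameter on the raise
path (the period), and the sink would have to keep the backlog.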

Jan



* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:45                             ` Jan Kiszka
@ 2010-05-29 10:04                               ` Blue Swirl
  2010-05-29 10:16                                 ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 10:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

On Sat, May 29, 2010 at 9:45 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> 2010/5/29 Jan Kiszka <jan.kiszka@web.de>:
>>> Gleb Natapov wrote:
>>>> On Fri, May 28, 2010 at 08:06:45PM +0000, Blue Swirl wrote:
>>>>> 2010/5/28 Gleb Natapov <gleb@redhat.com>:
>>>>>> On Thu, May 27, 2010 at 06:37:10PM +0000, Blue Swirl wrote:
>>>>>>> 2010/5/27 Gleb Natapov <gleb@redhat.com>:
>>>>>>>> On Wed, May 26, 2010 at 08:35:00PM +0000, Blue Swirl wrote:
>>>>>>>>> On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>>> On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>>>>>>>>>> Anthony Liguori wrote:
>>>>>>>>>>>>> On 05/25/2010 02:09 PM, Blue Swirl wrote:
>>>>>>>>>>>>>> On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka<jan.kiszka@web.de>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Jan Kiszka<jan.kiszka@siemens.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This allows to communicate potential IRQ coalescing during delivery from
>>>>>>>>>>>>>>> the sink back to the source. Targets that support IRQ coalescing
>>>>>>>>>>>>>>> workarounds need to register handlers that return the appropriate
>>>>>>>>>>>>>>> QEMU_IRQ_* code, and they have to propergate the code across all IRQ
>>>>>>>>>>>>>>> redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
>>>>>>>>>>>>>>> apply its workaround. If multiple sinks exist, the source may only
>>>>>>>>>>>>>>> consider an IRQ coalesced if all other sinks either report
>>>>>>>>>>>>>>> QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No real devices are interested whether any of their output lines are
>>>>>>>>>>>>>> even connected. This would introduce a new signal type, bidirectional
>>>>>>>>>>>>>> multi-level, which is not correct.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think it's really an issue of correct, but I wouldn't disagree
>>>>>>>>>>>>> to a suggestion that we ought to introduce a new signal type for this
>>>>>>>>>>>>> type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
>>>>>>>>>>>>> similar interface as qemu_irq.
>>>>>>>>>>>> A separate type would complicate the delivery of the feedback value
>>>>>>>>>>>> across GPIO pins (as Paul requested for the RTC->HPET routing).
>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the real solution to coalescing is put the logic inside one
>>>>>>>>>>>>>> device, in this case APIC because it has the information about irq
>>>>>>>>>>>>>> delivery. APIC could monitor incoming RTC irqs for frequency
>>>>>>>>>>>>>> information and whether they get delivered or not. If not, an internal
>>>>>>>>>>>>>> timer is installed which injects the lost irqs.
>>>>>>>>>>>> That won't fly as the IRQs will already arrive at the APIC with a
>>>>>>>>>>>> sufficiently high jitter. At the bare minimum, you need to tell the
>>>>>>>>>>>> interrupt controller about the fact that a particular IRQ should be
>>>>>>>>>>>> delivered at a specific regular rate. For this, you also need a generic
>>>>>>>>>>>> interface - nothing really "won".
>>>>>>>>>>> OK, let's simplify: just reinject at next possible chance. No need to
>>>>>>>>>>> monitor or tell anything.
>>>>>>>>>> There are guests that won't like this (I know of one in-house, but
>>>>>>>>>> others may even have more examples), specifically if you end up firing
>>>>>>>>>> multiple IRQs in a row due to a longer backlog. For that reason, the RTC
>>>>>>>>>> spreads the reinjection according to the current rate.
>>>>>>>>> Then reinject with a constant delay, or next CPU exit. Such buggy
>>>>>>>> If guest's time frequency is the same as host time frequency you can't
>>>>>>>> reinject with constant delay. That is why current code mixes two
>>>>>>>> approaches: reinject M interrupts in a raw then delay.
>>>>>>> This approach can be also used by APIC-only version.
>>>>>>>
>>>>>> I don't know what APIC-only version you are talking about. I haven't
>>>>>> seen the code and I don't understand hand waving, sorry.
>>>>> There is no code, because we're still at architecture design stage.
>>>>>
>>>> Try to write test code to understand the problem better.
>>>>
>>>>>>>>> guests could also be assisted with special handling (like win2k
>>>>>>>>> install hack), for example guest instructions could be counted
>>>>>>>>> (approximately, for example using TB size or TSC) and only inject
>>>>>>>>> after at least N instructions have passed.
>>>>>>>> Guest instructions cannot be easily counted in KVM (it can be done more
>>>>>>>> or less reliably using perf counters, may be).
>>>>>>> Aren't there any debug registers or perf counters, which can generate
>>>>>>> an interrupt after some number of instructions have been executed?
>>>>>> Don't think debug registers have something like that and they are
>>>>>> available for guest use anyway. Perf counters differs greatly from CPU
>>>>>> to CPU (even between two CPUs of the same manufacturer), and we want to
>>>>>> keep using them for profiling guests. And I don't see what problem it
>>>>>> will solve anyway that can be solved by simple delay between irq
>>>>>> reinjection.
>>>>> This would allow counting the executed instructions and limit it. Thus
>>>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>>>
>>>> Why would you want to limit number of instruction executed by guest if
>>>> CPU has nothing else to do anyway? The problem occurs not when we have
>>>> spare cycles so give to a guest, but in opposite case.
>>>>
>>>>>>>>>> And even if the rate did not matter, the APIC woult still have to now
>>>>>>>>>> about the fact that an IRQ is really periodic and does not only appear
>>>>>>>>>> as such for a certain interval. This really does not sound like
>>>>>>>>>> simplifying things or even make them cleaner.
>>>>>>>>> It would, the voodoo would be contained only in APIC, RTC would be
>>>>>>>>> just like any other device. With the bidirectional irqs, this voodoo
>>>>>>>>> would probably eventually spread to many other devices. The logical
>>>>>>>>> conclusion of that would be a system where all devices would be
>>>>>>>>> careful not to disturb the guest at wrong moment because that would
>>>>>>>>> trigger a bug.
>>>>>>>>>
>>>>>>>> This voodoo will be so complex and unreliable that it will make RTC hack
>>>>>>>> pale in comparison (and I still don't see how you are going to make it
>>>>>>>> actually work).
>>>>>>> Implement everything inside APIC: only coalescing and reinjection.
>>>>>> APIC has zero info needed to implement reinjection correctly as was
>>>>>> shown to you several time in this thread and you simply keep ignoring
>>>>>> it.
>>>>> On the contrary, APIC is actually the only source of the IRQ ack
>>>>> information. RTC hack would not work without APIC (or the
>>>>> bidirectional IRQ) passing this info to RTC.
>>>>>
>>>>> What APIC doesn't have now is the timer frequency or period info. This
>>>>> is known by RTC and also higher levels managing the clocks.
>>>>>
>>>> So APIC has one bit of information and RTC everything else. The current
>>>> approach (and proposed patch) brings this one bit of information to RTC,
>>>> you are arguing that RTC should be able to communicate all its info to
>>>> APIC. Sorry I don't see that your way has any advantage. Just more
>>>> complex interface and it is much easier to get it wrong for other time
>>>> sources.
>>>>
>>>>> I keep ignoring the idea that the current model, where both RTC and
>>>>> APIC must somehow work together to make coalescing work, is the only
>>>>> possible just because it is committed and it happens to work in some
>>>>> cases. It would be much better to concentrate this to one place, APIC
>>>>> or preferably higher level where it may benefit other timers too.
>>>>> Provided of course that the other models can be made to work.
>>>>>
>>>> So write the code and show us. You haven't show any evidence that RTC is
>>>> the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>>>> know when clock frequency changes, it know when device reset happened.
>>>> APIC knows only that interrupt was coalesced. It doesn't even know that
>>>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>>>> are masked not considered coalesced). Time source knows only when
>>>> frequency changes and may be when device reset happens if timer is
>>>> stopped by device on reset. So RTC is actually a sweet spot if you want
>>>> to minimize amount of info you need to pass between various layers.
>>> This is - according to my current understanding - the proposed
>>> alternative architecture:
>>>
>>>                                          .---------------.
>>>                                          | de-coalescing |
>>>                                          |     logic     |
>>>                                          '---------------'
>>>                                                ^   |
>>>                                         period,|   |IRQ
>>>                                       coalesced|   |(or tuned VM clock)
>>>                                        (yes/no)|   v
>>> .-------.              .--------.             .-------.
>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>> '-------'              '--------'             '-------'
>>>    |                    ^    |                   ^
>>>    |                    |    |                   |
>>>    '-------period-------'    '------period-------'
>>
>> The period information is already known by the higher level clock
>> management,
>
> The timer device models program the next event of some qemu-timer. There
> is no tag attached explaining the timer subsystem or anyone else the
> reason behind this programming.

Yes, but why would we care? All timers in the system besides the RTC
should be affected likewise; this would in fact be the benefit of
handling the problem at a higher level.

>> it should be available (or made available) to
>> de-coalescing logic (which should probably be located close to other
>> clock stuff) but otherwise there shouldn't be a need to pass it
>> around. The tuned VM clock of course only affects the RTC. Otherwise
>> correct.
>>
>>> And here is what this patch implements (except for the not yet factored
>>> out de-coalescing logic):
>>>
>>>   .---------------.
>>>   | de-coalescing |
>>>   |     logic     |
>>>   '---------------'
>>>         ^   |
>>>  period,|   |next IRQ date
>>> coalesced|   |(or tuned VM clock)
>>>  (yes/no)|   v
>>>       .-------.              .--------.             .-------.
>>>       |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>       '-------'              '--------'             '-------'
>>>           ^                    |    ^                   |
>>>           |                    |    |                   |
>>>           '------coalesced-----'    '-----coalesced-----'
>>
>> Why not route the coalesced signal directly from APIC to de-coalescing
>> logic? Then our designs would match!
>
> Because the de-coalescing logic will not be able to associate it with
> the period information delivered by the RTC (without apply ugly hacks).
>
>>
>>> I still don't see how the alternative is supposed to simplify our life
>>> or improve the efficiency of the de-coalescing workaround. It's rather
>>> problematic like Gleb pointed out: The de-coalescing logic needs to be
>>> informed about periodicity changes that can only be delivered along
>>> IRQs. So what to do with the backlog when the timer is stopped?
>>
>> What happens with the current design? Gleb only mentioned the
>> frequency change, I thought that was not so big problem. But I don't
>> think this case should be allowed happen at all, it can't exist on
>> real HW.
>>
>>> Regarding an improved de-coalescing logic: As its final design is widely
>>> independent of how we collect the information, it could perfectly be
>>> done after we laid the elementary foundation.
>>
>> True. But what is the foundation, do we need bidirectional IRQs or
>> not? I still think current IRQs should be enough.
>
> Then explain how you want to do the association between IRQ injection
> and IRQ period. Both of my diagrams require an additional channel
> between source and sink, either forward (tag IRQ event with its period)
> or backward (report injection result - aka feedback IRQs). We currently
> don't have such channel but the APIC hack (apic_reset/get_irq_delivered).
>
> Jan
>
>


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 10:04                               ` Blue Swirl
@ 2010-05-29 10:16                                 ` Jan Kiszka
  2010-05-29 10:26                                   ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-29 10:16 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela


Blue Swirl wrote:
>>>> This is - according to my current understanding - the proposed
>>>> alternative architecture:
>>>>
>>>>                                          .---------------.
>>>>                                          | de-coalescing |
>>>>                                          |     logic     |
>>>>                                          '---------------'
>>>>                                                ^   |
>>>>                                         period,|   |IRQ
>>>>                                       coalesced|   |(or tuned VM clock)
>>>>                                        (yes/no)|   v
>>>> .-------.              .--------.             .-------.
>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>> '-------'              '--------'             '-------'
>>>>    |                    ^    |                   ^
>>>>    |                    |    |                   |
>>>>    '-------period-------'    '------period-------'
>>> The period information is already known by the higher level clock
>>> management,
>> The timer device models program the next event of some qemu-timer. There
>> is no tag attached explaining the timer subsystem or anyone else the
>> reason behind this programming.
> 
> Yes, but why would we care? All timers in the system besides RTC
> should be affected likewise, this would in fact be the benefit from
> handling the problem at higher level.

And what does your equation for calculating the clock slow-down look like?

Jan



* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 10:16                                 ` Jan Kiszka
@ 2010-05-29 10:26                                   ` Blue Swirl
  2010-05-29 10:38                                     ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 10:26 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>>>>> This is - according to my current understanding - the proposed
>>>>> alternative architecture:
>>>>>
>>>>>                                          .---------------.
>>>>>                                          | de-coalescing |
>>>>>                                          |     logic     |
>>>>>                                          '---------------'
>>>>>                                                ^   |
>>>>>                                         period,|   |IRQ
>>>>>                                       coalesced|   |(or tuned VM clock)
>>>>>                                        (yes/no)|   v
>>>>> .-------.              .--------.             .-------.
>>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>>> '-------'              '--------'             '-------'
>>>>>    |                    ^    |                   ^
>>>>>    |                    |    |                   |
>>>>>    '-------period-------'    '------period-------'
>>>> The period information is already known by the higher level clock
>>>> management,
>>> The timer device models program the next event of some qemu-timer. There
>>> is no tag attached explaining the timer subsystem or anyone else the
>>> reason behind this programming.
>>
>> Yes, but why would we care? All timers in the system besides RTC
>> should be affected likewise, this would in fact be the benefit from
>> handling the problem at higher level.
>
> And how does your equation for calculating the clock slow-down look like?

How about icount_adjust()?


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 10:26                                   ` Blue Swirl
@ 2010-05-29 10:38                                     ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-29 10:38 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela


Blue Swirl wrote:
> On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>>>>> This is - according to my current understanding - the proposed
>>>>>> alternative architecture:
>>>>>>
>>>>>>                                          .---------------.
>>>>>>                                          | de-coalescing |
>>>>>>                                          |     logic     |
>>>>>>                                          '---------------'
>>>>>>                                                ^   |
>>>>>>                                         period,|   |IRQ
>>>>>>                                       coalesced|   |(or tuned VM clock)
>>>>>>                                        (yes/no)|   v
>>>>>> .-------.              .--------.             .-------.
>>>>>> |  RTC  |-----IRQ----->| router |-----IRQ---->| APIC  |
>>>>>> '-------'              '--------'             '-------'
>>>>>>    |                    ^    |                   ^
>>>>>>    |                    |    |                   |
>>>>>>    '-------period-------'    '------period-------'
>>>>> The period information is already known by the higher level clock
>>>>> management,
>>>> The timer device models program the next event of some qemu-timer. There
>>>> is no tag attached explaining the timer subsystem or anyone else the
>>>> reason behind this programming.
>>> Yes, but why would we care? All timers in the system besides RTC
>>> should be affected likewise, this would in fact be the benefit from
>>> handling the problem at higher level.
>> And how does your equation for calculating the clock slow-down look like?
> 
> How about icount_adjust()?

I would suggest that you implement your ideas now. Please keep us
informed about the progress as this series (and more) depends on a decision.

Jan



* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:15                         ` Blue Swirl
  2010-05-29  9:36                           ` Jan Kiszka
@ 2010-05-29 14:38                           ` Gleb Natapov
  2010-05-29 16:03                             ` Blue Swirl
  2010-05-30 12:05                           ` Avi Kivity
  2 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-29 14:38 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> There is no code, because we're still at architecture design stage.
> >>
> > Try to write test code to understand the problem better.
> 
> I will.
> 
Please do ASAP. This discussion shows that you don't understand what
problem we are dealing with.

> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> after at least N instructions have passed.
> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> > or less reliably using perf counters, may be).
> >> >>
> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> an interrupt after some number of instructions have been executed?
> >> > Don't think debug registers have something like that and they are
> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> > keep using them for profiling guests. And I don't see what problem it
> >> > will solve anyway that can be solved by simple delay between irq
> >> > reinjection.
> >>
> >> This would allow counting the executed instructions and limit it. Thus
> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >>
> > Why would you want to limit number of instruction executed by guest if
> > CPU has nothing else to do anyway? The problem occurs not when we have
> > spare cycles so give to a guest, but in opposite case.
> 
> I think one problem is that the guest has executed too much compared
> to what would happen with real HW with a lesser CPU. That explains the
> RTC frequency reprogramming case.
You think wrong. The problem is exactly the opposite: the guest hasn't
had enough execution time between two timer interrupts. I don't know what
RTC frequency reprogramming case you are talking about here.

> 
> >
> >> >>
> >> >> >>
> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> > simplifying things or even make them cleaner.
> >> >> >>
> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> trigger a bug.
> >> >> >>
> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> > actually work).
> >> >>
> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> > APIC has zero info needed to implement reinjection correctly as was
> >> > shown to you several time in this thread and you simply keep ignoring
> >> > it.
> >>
> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> information. RTC hack would not work without APIC (or the
> >> bidirectional IRQ) passing this info to RTC.
> >>
> >> What APIC doesn't have now is the timer frequency or period info. This
> >> is known by RTC and also higher levels managing the clocks.
> >>
> > So APIC has one bit of information and RTC everything else.
> 
> The information known by RTC (timer period) is also known by higher levels.
> 
What do you mean by higher level here? vl.c or APIC?

> > The current
> > approach (and proposed patch) brings this one bit of information to RTC,
> > you are arguing that RTC should be able to communicate all its info to
> > APIC. Sorry I don't see that your way has any advantage. Just more
> > complex interface and it is much easier to get it wrong for other time
> > sources.
> 
> I don't think anymore that APIC should be handling this but the
> generic stuff, like vl.c or exec.c. Then there would be only
> information passing from APIC to higher levels.
Handling reinjection in the general timer code makes infinitely more sense
than handling it in the APIC. One thing (off the top of my head) that can't
be implemented at that level is back-to-back injection of interrupts (i.e.
injecting the next interrupt immediately after the guest acknowledges the
previous one to the RTC).
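
A rough sketch of what back-to-back injection means, assuming a
simplified device model. The struct and function names are hypothetical
and this is not the actual RTC code; the only device fact relied on is
that the guest acknowledges an RTC interrupt at the device itself (by
reading register C), which generic timer code does not see.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified periodic timer device. */
struct rtc_model {
    bool irq_pending;    /* guest has not acknowledged the last IRQ yet */
    unsigned coalesced;  /* periodic ticks lost so far */
    unsigned injected;   /* IRQs the guest actually got to see */
};

/* The periodic timer fired. */
static void rtc_tick(struct rtc_model *r)
{
    if (r->irq_pending) {
        r->coalesced++;      /* guest is still busy: remember the lost tick */
        return;
    }
    r->irq_pending = true;
    r->injected++;
}

/* The guest acknowledged the IRQ at the device. */
static void rtc_ack(struct rtc_model *r)
{
    assert(r->irq_pending);
    r->irq_pending = false;
    if (r->coalesced > 0) {
        /* Back-to-back: inject the next IRQ right away instead of
         * waiting for the next timer expiry. */
        r->coalesced--;
        r->irq_pending = true;
        r->injected++;
    }
}
```

The trigger for the re-injection is the acknowledge inside the device
model, so code that only sees timer expiries has no hook for it.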

> 
> >> I keep ignoring the idea that the current model, where both RTC and
> >> APIC must somehow work together to make coalescing work, is the only
> >> possible just because it is committed and it happens to work in some
> >> cases. It would be much better to concentrate this to one place, APIC
> >> or preferably higher level where it may benefit other timers too.
> >> Provided of course that the other models can be made to work.
> >>
> > So write the code and show us. You haven't show any evidence that RTC is
> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> > know when clock frequency changes, it know when device reset happened.
> > APIC knows only that interrupt was coalesced. It doesn't even know that
> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> > are masked not considered coalesced).
> 
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?
> 
The reason is that the guest has no time to process the previous interrupt
before it is time to inject the next one.

> > Time source knows only when
> > frequency changes and may be when device reset happens if timer is
> > stopped by device on reset. So RTC is actually a sweet spot if you want
> > to minimize amount of info you need to pass between various layers.
> >
> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> cater for buggy hosts.
> >> >>
> >> > You mean "buggy guests"?
> >>
> >> Yes, sorry.
> >>
> >> > What guests are not buggy in your opinion?
> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> > clock with it is to go paravirt.
> >>
> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> cause of the problem.
> > OS should never crash in the absence of HW bugs? I doubt you can design
> > an OS that can run in a face of any HW failure. Anyway here we are
> > trying to solve guests time keeping problem not crashes. Do you think
> > you can design OS that can keep time accurately no matter how crazy all
> > HW clock behaves?
> 
> I think my OS design skills are not relevant in this discussion, but
> IIRC there are fault tolerant operating systems for extreme conditions
> so it can be done.
> 
> >
> >>
> >> >> > The fact is that timer device is not "just like any
> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> > implement spec as close as possible and everything works. For time
> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> > letter and your guest will drift like crazy.
> >> >>
> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> be the only option that will always work.
> >> >>
> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> > time, so any emulation done between two instruction is theoretically
> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> > keeping all time source in perfect sync is also impossible.
> >>
> >> That is actually another cause of the problem. KVM gives the guest an
> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> match, especially in critical code, there can be problems. It would be
> >> better to tell the guest a lower speed, which also can be guaranteed.
> >>
> > Not possible. It's that simple. You should take it into account in your
> > architecture design stage. In case of KVM real physical CPU executes guest
> > instruction and it does this as fast as it can. The only way we can hide
> > that from a guest is by intercepting each access to TSC and at that
> > point we can use bochs instead.
> 
> Well, as Paul pointed out, there's also icount option.
> 
icount is not an option for KVM.

> >> Maybe we should also offline the device emulation to another host CPU
> >> with threading. A load from a device will always be much slower than
> >> on real HW though.
> > Time drift problem start to happen on loaded servers, so you do not have
> > spare CPU to offload device emulation too.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:35                           ` Blue Swirl
  2010-05-29  9:45                             ` Jan Kiszka
@ 2010-05-29 14:46                             ` Gleb Natapov
  2010-05-29 16:13                               ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-29 14:46 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> > I still don't see how the alternative is supposed to simplify our life
> > or improve the efficiency of the de-coalescing workaround. It's rather
> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> > informed about periodicity changes that can only be delivered along
> > IRQs. So what to do with the backlog when the timer is stopped?
> 
> What happens with the current design? Gleb only mentioned the
> frequency change, I thought that was not so big problem. But I don't
> think this case should be allowed happen at all, it can't exist on
> real HW.
> 
Hm, why can't it exist on real HW? Do a simple exercise. Run Windows XP
inside QEMU, connect to the QEMU process with gdb and check what frequency
the RTC is configured with (hint: it will be 64Hz), then play a video
inside the guest and check the frequency again (hint: it will be 1kHz).

--
			Gleb.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 14:38                           ` Gleb Natapov
@ 2010-05-29 16:03                             ` Blue Swirl
  2010-05-29 16:32                               ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 16:03 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/29 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> There is no code, because we're still at architecture design stage.
>> >>
>> > Try to write test code to understand the problem better.
>>
>> I will.
>>
> Please do ASAP. This discussion shows that you don't understand what is the
> problem that we are dialing with.

Which part of the problem do you think I don't understand?

>> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> after at least N instructions have passed.
>> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> > or less reliably using perf counters, may be).
>> >> >>
>> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> an interrupt after some number of instructions have been executed?
>> >> > Don't think debug registers have something like that and they are
>> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> > keep using them for profiling guests. And I don't see what problem it
>> >> > will solve anyway that can be solved by simple delay between irq
>> >> > reinjection.
>> >>
>> >> This would allow counting the executed instructions and limit it. Thus
>> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >>
>> > Why would you want to limit number of instruction executed by guest if
>> > CPU has nothing else to do anyway? The problem occurs not when we have
>> > spare cycles so give to a guest, but in opposite case.
>>
>> I think one problem is that the guest has executed too much compared
>> to what would happen with real HW with a lesser CPU. That explains the
>> RTC frequency reprogramming case.
> You think wrong. The problem is exactly opposite: the guest haven't
> had enough execution time between two time interrupts. I don't know what
> RTC frequency reprogramming case you are talking about here.

The case you told me about, where N pending tick IRQs exist but the
guest wants to change the RTC frequency from 64Hz to 1024Hz.

Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
the frequency to 1000Hz.

The problem for emulation is that during those same 3 ticks, the guest
has had so little execution time that the ticks have been coalesced.
But isn't the guest cycle count then much lower than 30Mcyc?

Isn't it so that the guest must be above 30Mcyc to be able to want the
change? But if we reach that point, the problem must have been not too
little execution time, but too much.
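
A minimal sketch of the arithmetic in this scenario, assuming two
hypothetical helpers (cycles_per_tick and ticks_due are illustrative
names, not QEMU functions). It only restates the numbers above: at 1 GHz
and 100Hz there are 10Mcycles per tick, so 3 ticks are due at 30Mcycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not QEMU code: CPU cycles elapsing per timer tick. */
static uint64_t cycles_per_tick(uint64_t cpu_hz, uint64_t timer_hz)
{
    assert(timer_hz > 0 && timer_hz <= cpu_hz);
    return cpu_hz / timer_hz;
}

/* Whole ticks that are due once the guest has run for `cycles` cycles. */
static uint64_t ticks_due(uint64_t cycles, uint64_t cpu_hz, uint64_t timer_hz)
{
    return cycles / cycles_per_tick(cpu_hz, timer_hz);
}
```

If the guest has in fact executed fewer cycles than that when the ticks
fire, the difference between ticks fired and ticks_due is the backlog
that the coalescing workaround has to deal with.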

>
>>
>> >
>> >> >>
>> >> >> >>
>> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >>
>> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> trigger a bug.
>> >> >> >>
>> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> > actually work).
>> >> >>
>> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> > shown to you several time in this thread and you simply keep ignoring
>> >> > it.
>> >>
>> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> information. RTC hack would not work without APIC (or the
>> >> bidirectional IRQ) passing this info to RTC.
>> >>
>> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> is known by RTC and also higher levels managing the clocks.
>> >>
>> > So APIC has one bit of information and RTC everything else.
>>
>> The information known by RTC (timer period) is also known by higher levels.
>>
> What do you mean by higher level here? vl.c or APIC.

vl.c, qemu-timer.c.

>> > The current
>> > approach (and proposed patch) brings this one bit of information to RTC,
>> > you are arguing that RTC should be able to communicate all its info to
>> > APIC. Sorry I don't see that your way has any advantage. Just more
>> > complex interface and it is much easier to get it wrong for other time
>> > sources.
>>
>> I don't think anymore that APIC should be handling this but the
>> generic stuff, like vl.c or exec.c. Then there would be only
>> information passing from APIC to higher levels.
> Handling reinjection by general timer code makes infinitely more sense
> then handling it in APIC.

I'm glad you agree, or did you mean 'less'?

> One thing (from the top of my head) that can't
> be implemented at that level is injection of interrupt back to back (i.e
> injecting next interrupt immediately after guest acknowledge previous
> one to RTC).

But Jan said this confuses some buggy OSes.

>
>>
>> >> I keep ignoring the idea that the current model, where both RTC and
>> >> APIC must somehow work together to make coalescing work, is the only
>> >> possible just because it is committed and it happens to work in some
>> >> cases. It would be much better to concentrate this to one place, APIC
>> >> or preferably higher level where it may benefit other timers too.
>> >> Provided of course that the other models can be made to work.
>> >>
>> > So write the code and show us. You haven't show any evidence that RTC is
>> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> > know when clock frequency changes, it know when device reset happened.
>> > APIC knows only that interrupt was coalesced. It doesn't even know that
>> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> > are masked not considered coalesced).
>>
>> Oh, I thought interrupt masking was the reason for coalescing! What
>> exactly is the reason then?
>>
> The reason is that guest has no time to process previous interrupt
> before it is time to inject next one.

Because of other host load or other emulation done by the same QEMU
process, I suppose?

>> > Time source knows only when
>> > frequency changes and may be when device reset happens if timer is
>> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> > to minimize amount of info you need to pass between various layers.
>> >
>> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> cater for buggy hosts.
>> >> >>
>> >> > You mean "buggy guests"?
>> >>
>> >> Yes, sorry.
>> >>
>> >> > What guests are not buggy in your opinion?
>> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> > clock with it is to go paravirt.
>> >>
>> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> cause of the problem.
>> > OS should never crash in the absence of HW bugs? I doubt you can design
>> > an OS that can run in a face of any HW failure. Anyway here we are
>> > trying to solve guests time keeping problem not crashes. Do you think
>> > you can design OS that can keep time accurately no matter how crazy all
>> > HW clock behaves?
>>
>> I think my OS design skills are not relevant in this discussion, but
>> IIRC there are fault tolerant operating systems for extreme conditions
>> so it can be done.
>>
>> >
>> >>
>> >> >> > The fact is that timer device is not "just like any
>> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> > implement spec as close as possible and everything works. For time
>> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> > letter and your guest will drift like crazy.
>> >> >>
>> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> be the only option that will always work.
>> >> >>
>> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> > time, so any emulation done between two instruction is theoretically
>> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> > keeping all time source in perfect sync is also impossible.
>> >>
>> >> That is actually another cause of the problem. KVM gives the guest an
>> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> match, especially in critical code, there can be problems. It would be
>> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >>
>> > Not possible. It's that simple. You should take it into account in your
>> > architecture design stage. In case of KVM real physical CPU executes guest
>> > instruction and it does this as fast as it can. The only way we can hide
>> > that from a guest is by intercepting each access to TSC and at that
>> > point we can use bochs instead.
>>
>> Well, as Paul pointed out, there's also icount option.
>>
> icount is not an option for KVM.

I think the icount timer adjustment model might make sense for this work
too. We'd then just need some measure of executed CPU instructions, TSC
cycles or even kernel scheduler time slice information (how much time
the process got).

>
>> >> Maybe we should also offline the device emulation to another host CPU
>> >> with threading. A load from a device will always be much slower than
>> >> on real HW though.
>> > Time drift problem start to happen on loaded servers, so you do not have
>> > spare CPU to offload device emulation too.
>> >
>> > --
>> >                        Gleb.
>> >
>
> --
>                        Gleb.
>

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 14:46                             ` Gleb Natapov
@ 2010-05-29 16:13                               ` Blue Swirl
  2010-05-29 16:37                                 ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 16:13 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> > I still don't see how the alternative is supposed to simplify our life
>> > or improve the efficiency of the de-coalescing workaround. It's rather
>> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> > informed about periodicity changes that can only be delivered along
>> > IRQs. So what to do with the backlog when the timer is stopped?
>>
>> What happens with the current design? Gleb only mentioned the
>> frequency change, I thought that was not so big problem. But I don't
>> think this case should be allowed happen at all, it can't exist on
>> real HW.
>>
> Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> inside QEMU, connect with gdb to QEMU process and check what frequency
> RTC configured with (hint: it will be 64Hz), now run video inside the
> guest and check frequency again (hint: it will be 1Khz).

You missed the key word 'stopped'. If the timer is really stopped, no
IRQs should ever come out afterwards, just like on real HW. For the
emulation, this means loss of ticks which should have been delivered
before the change.

But what if the guest changed the frequency very often, and used a zero
value between changes, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 16:03                             ` Blue Swirl
@ 2010-05-29 16:32                               ` Gleb Natapov
  2010-05-29 20:52                                 ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-29 16:32 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> >> There is no code, because we're still at architecture design stage.
> >> >>
> >> > Try to write test code to understand the problem better.
> >>
> >> I will.
> >>
> > Please do ASAP. This discussion shows that you don't understand what is the
> > problem that we are dialing with.
> 
> Which part of the problem you think I don't understand?
> 
It seems to me you don't understand how Windows uses the RTC for
timekeeping and how QEMU solves the problem today.

> >> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> >> after at least N instructions have passed.
> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> >> > or less reliably using perf counters, may be).
> >> >> >>
> >> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> >> an interrupt after some number of instructions have been executed?
> >> >> > Don't think debug registers have something like that and they are
> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> >> > keep using them for profiling guests. And I don't see what problem it
> >> >> > will solve anyway that can be solved by simple delay between irq
> >> >> > reinjection.
> >> >>
> >> >> This would allow counting the executed instructions and limit it. Thus
> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >> >>
> >> > Why would you want to limit number of instruction executed by guest if
> >> > CPU has nothing else to do anyway? The problem occurs not when we have
> >> > spare cycles so give to a guest, but in opposite case.
> >>
> >> I think one problem is that the guest has executed too much compared
> >> to what would happen with real HW with a lesser CPU. That explains the
> >> RTC frequency reprogramming case.
> > You think wrong. The problem is exactly opposite: the guest haven't
> > had enough execution time between two time interrupts. I don't know what
> > RTC frequency reprogramming case you are talking about here.
> 
> The case you told me where N pending tick IRQs exist but the guest
> wants to change the RTC frequency from 64Hz to 1024Hz.
> 
> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
> the frequency to 1000Hz.
> 
> The problem for emulation is that for the same 3 ticks, there has been
> so little execution power that the ticks have been coalesced. But
> isn't the guest cycle count then much lower than 30Mcyc?
> 
> Isn't it so that the guest must be above 30Mcyc to be able to want the
> change? But if we reach that point,  the problem must have not been
> too little execution time, but too much.
> 
Sorry, I tried hard to understand what you said above but failed.
What do you mean by "to be able to want the change"? The guest sometimes
wants to get 64 timer interrupts per second and sometimes it wants to get
1024 timer interrupts per second. It wants this not as a result of time
drift or anything. It's just how the guest behaves. You seem to be too
fixated on the guest frequency change. It's just something you have to
take into account when you reinject interrupts.
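
One way to picture "taking the frequency change into account" is to keep
the owed guest time constant when the rate changes. This is only a
sketch of one possible policy, under the assumption that the backlog is
tracked as a tick count; rescale_backlog is a hypothetical name and is
not taken from the QEMU source or from this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical policy: `coalesced` ticks at old_hz represent
 * coalesced / old_hz seconds of owed time. Express the same amount
 * of time as ticks at new_hz (fractions of a tick are dropped).
 */
static uint32_t rescale_backlog(uint32_t coalesced,
                                uint32_t old_hz, uint32_t new_hz)
{
    assert(old_hz > 0);
    return (uint32_t)(((uint64_t)coalesced * new_hz) / old_hz);
}
```

With this policy, 3 ticks owed at 64Hz become 48 ticks at 1024Hz, and a
new rate of 0Hz (timer stopped) drops the backlog entirely.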


> >
> >>
> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> >> > simplifying things or even make them cleaner.
> >> >> >> >>
> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> >> trigger a bug.
> >> >> >> >>
> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> >> > actually work).
> >> >> >>
> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> >> > APIC has zero info needed to implement reinjection correctly as was
> >> >> > shown to you several time in this thread and you simply keep ignoring
> >> >> > it.
> >> >>
> >> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> >> information. RTC hack would not work without APIC (or the
> >> >> bidirectional IRQ) passing this info to RTC.
> >> >>
> >> >> What APIC doesn't have now is the timer frequency or period info. This
> >> >> is known by RTC and also higher levels managing the clocks.
> >> >>
> >> > So APIC has one bit of information and RTC everything else.
> >>
> >> The information known by RTC (timer period) is also known by higher levels.
> >>
> > What do you mean by higher level here? vl.c or APIC.
> 
> vl.c, qemu-timer.c.
> 
> >> > The current
> >> > approach (and proposed patch) brings this one bit of information to RTC,
> >> > you are arguing that RTC should be able to communicate all its info to
> >> > APIC. Sorry I don't see that your way has any advantage. Just more
> >> > complex interface and it is much easier to get it wrong for other time
> >> > sources.
> >>
> >> I don't think anymore that APIC should be handling this but the
> >> generic stuff, like vl.c or exec.c. Then there would be only
> >> information passing from APIC to higher levels.
> > Handling reinjection by general timer code makes infinitely more sense
> > then handling it in APIC.
> 
> I'm glad you agree, or did you mean 'less'?
> 
Compared to APIC I would agree that even putting it in IDE is a better idea :)

> > One thing (from the top of my head) that can't
> > be implemented at that level is injection of interrupt back to back (i.e
> > injecting next interrupt immediately after guest acknowledge previous
> > one to RTC).
> 
> But Jan told this confuses some buggy OSes.
> 
You keep calling them buggy, but I don't agree. They are written with
certain assumptions that are true on real HW, but hard to achieve on
virtual HW. Anyway, we use this technique (back-to-back reinjection)
because otherwise you can't solve the drift problem if the guest wants to
receive timer interrupts at the max frequency that the host time source
supports.
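
As a rough illustration of back-to-back reinjection, here is a minimal
sketch of a periodic tick source that counts coalesced ticks and
reinjects one as soon as the guest acknowledges the previous one. The
TickSource type and the tick_fire/tick_ack functions are hypothetical
and much simpler than the real RTC model; they only show the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical periodic timer state; field names are illustrative only. */
typedef struct {
    bool     irq_pending;  /* a tick was delivered and not yet acknowledged */
    uint32_t coalesced;    /* ticks that fired while one was still pending */
    uint32_t delivered;    /* ticks actually handed to the guest */
} TickSource;

/* Called whenever the periodic timer expires. */
static void tick_fire(TickSource *t)
{
    if (t->irq_pending) {
        t->coalesced++;    /* guest had no time to handle the previous tick */
    } else {
        t->irq_pending = true;
        t->delivered++;
    }
}

/* Called when the guest acknowledges the interrupt at the timer device. */
static void tick_ack(TickSource *t)
{
    assert(t->irq_pending);
    t->irq_pending = false;
    if (t->coalesced > 0) {
        /* Back-to-back: reinject immediately after the acknowledge. */
        t->coalesced--;
        t->irq_pending = true;
        t->delivered++;
    }
}
```

Because the reinjection is driven by the acknowledge rather than by a
host timer, the backlog can drain even when the guest already runs at
the highest rate the host time source can provide.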


> >
> >>
> >> >> I keep ignoring the idea that the current model, where both RTC and
> >> >> APIC must somehow work together to make coalescing work, is the only
> >> >> possible just because it is committed and it happens to work in some
> >> >> cases. It would be much better to concentrate this to one place, APIC
> >> >> or preferably higher level where it may benefit other timers too.
> >> >> Provided of course that the other models can be made to work.
> >> >>
> >> > So write the code and show us. You haven't show any evidence that RTC is
> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> >> > know when clock frequency changes, it know when device reset happened.
> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> >> > are masked not considered coalesced).
> >>
> >> Oh, I thought interrupt masking was the reason for coalescing! What
> >> exactly is the reason then?
> >>
> > The reason is that guest has no time to process previous interrupt
> > before it is time to inject next one.
> 
> Because of other host load or other emulation done by the same QEMU
> process, I suppose?
Yes, both.

> 
> >> > Time source knows only when
> >> > frequency changes and may be when device reset happens if timer is
> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
> >> > to minimize amount of info you need to pass between various layers.
> >> >
> >> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> >> cater for buggy hosts.
> >> >> >>
> >> >> > You mean "buggy guests"?
> >> >>
> >> >> Yes, sorry.
> >> >>
> >> >> > What guests are not buggy in your opinion?
> >> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> >> > clock with it is to go paravirt.
> >> >>
> >> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> >> cause of the problem.
> >> > OS should never crash in the absence of HW bugs? I doubt you can design
> >> > an OS that can run in a face of any HW failure. Anyway here we are
> >> > trying to solve guests time keeping problem not crashes. Do you think
> >> > you can design OS that can keep time accurately no matter how crazy all
> >> > HW clock behaves?
> >>
> >> I think my OS design skills are not relevant in this discussion, but
> >> IIRC there are fault tolerant operating systems for extreme conditions
> >> so it can be done.
> >>
> >> >
> >> >>
> >> >> >> > The fact is that timer device is not "just like any
> >> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> >> > implement spec as close as possible and everything works. For time
> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> >> > letter and your guest will drift like crazy.
> >> >> >>
> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> >> be the only option that will always work.
> >> >> >>
> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> >> > time, so any emulation done between two instruction is theoretically
> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> >> > keeping all time source in perfect sync is also impossible.
> >> >>
> >> >> That is actually another cause of the problem. KVM gives the guest an
> >> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> >> match, especially in critical code, there can be problems. It would be
> >> >> better to tell the guest a lower speed, which also can be guaranteed.
> >> >>
> >> > Not possible. It's that simple. You should take it into account in your
> >> > architecture design stage. In case of KVM real physical CPU executes guest
> >> > instruction and it does this as fast as it can. The only way we can hide
> >> > that from a guest is by intercepting each access to TSC and at that
> >> > point we can use bochs instead.
> >>
> >> Well, as Paul pointed out, there's also icount option.
> >>
> > icount is not an option for KVM.
> 
> I think icount timer adjustment model might make sense for this work
> too. We'd then just need some figure of executed CPU instructions, TSC
> cycles or even kernel scheduler time slice information (how much time
> the process got).
>
And then? icount makes guest time flow dependent on the number of
emulated instructions. It relies on the fact that all time sources are
synchronized for a guest during emulation (including TSC). This is not
true for virtualization.
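
The dependency can be sketched as follows, assuming the fixed-shift mode
of QEMU's -icount option, where each executed instruction accounts for
2^N ns of virtual time. The function name icount_virtual_ns is
hypothetical and the real implementation is more involved.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: guest-visible time derived purely from the
 * number of instructions the emulator has executed.
 */
static uint64_t icount_virtual_ns(uint64_t executed_insns, unsigned shift)
{
    assert(shift < 64);
    return executed_insns << shift;  /* each instruction costs 2^shift ns */
}
```

This works when every clock the guest can read is derived from the same
instruction counter. Under KVM the guest reads the TSC directly, so a
clock computed this way would disagree with it.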
 
--
			Gleb.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 16:13                               ` Blue Swirl
@ 2010-05-29 16:37                                 ` Gleb Natapov
  2010-05-29 21:21                                   ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-29 16:37 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> >> > I still don't see how the alternative is supposed to simplify our life
> >> > or improve the efficiency of the de-coalescing workaround. It's rather
> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> >> > informed about periodicity changes that can only be delivered along
> >> > IRQs. So what to do with the backlog when the timer is stopped?
> >>
> >> What happens with the current design? Gleb only mentioned the
> >> frequency change, I thought that was not so big problem. But I don't
> >> think this case should be allowed happen at all, it can't exist on
> >> real HW.
> >>
> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> > inside QEMU, connect with gdb to QEMU process and check what frequency
> > RTC configured with (hint: it will be 64Hz), now run video inside the
> > guest and check frequency again (hint: it will be 1Khz).
> 
> You missed the key word 'stopped'. If the timer is really stopped, no
> IRQs should ever come out afterwards, just like on real HW. For the
> emulation, this means loss of ticks which should have been delivered
> before the change.
> 
I haven't missed it. I am describing the reality of the situation to you.
You want to change reality to be closer to what you want it to be by
adding words to my description. Please just go write code, experiment,
debug and _then_ come here with a design.

> But what if the guest changed the frequency very often, and between
> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
Too bad, the world is not perfect.

--
			Gleb.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 16:32                               ` Gleb Natapov
@ 2010-05-29 20:52                                 ` Blue Swirl
  2010-05-30  5:41                                   ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 20:52 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
>> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> There is no code, because we're still at architecture design stage.
>> >> >>
>> >> > Try to write test code to understand the problem better.
>> >>
>> >> I will.
>> >>
>> > Please do ASAP. This discussion shows that you don't understand what is the
>> > problem that we are dialing with.
>>
>> Which part of the problem you think I don't understand?
>>
> It seams to me you don't understand how Windows uses RTC for time
> keeping and how the QEMU solves the problem today.

The RTC causes periodic interrupts and the Windows interrupt handler
increments jiffies, like Linux?

>> >> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> >> after at least N instructions have passed.
>> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> >> > or less reliably using perf counters, may be).
>> >> >> >>
>> >> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> >> an interrupt after some number of instructions have been executed?
>> >> >> > Don't think debug registers have something like that and they are
>> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> >> > keep using them for profiling guests. And I don't see what problem it
>> >> >> > will solve anyway that can be solved by simple delay between irq
>> >> >> > reinjection.
>> >> >>
>> >> >> This would allow counting the executed instructions and limit it. Thus
>> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >> >>
>> >> > Why would you want to limit number of instruction executed by guest if
>> >> > CPU has nothing else to do anyway? The problem occurs not when we have
>> >> > spare cycles so give to a guest, but in opposite case.
>> >>
>> >> I think one problem is that the guest has executed too much compared
>> >> to what would happen with real HW with a lesser CPU. That explains the
>> >> RTC frequency reprogramming case.
>> > You think wrong. The problem is exactly opposite: the guest haven't
>> > had enough execution time between two time interrupts. I don't know what
>> > RTC frequency reprogramming case you are talking about here.
>>
>> The case you told me where N pending tick IRQs exist but the guest
>> wants to change the RTC frequency from 64Hz to 1024Hz.
>>
>> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
>> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
>> the frequency to 1000Hz.
>>
>> The problem for emulation is that for the same 3 ticks, there has been
>> so little execution power that the ticks have been coalesced. But
>> isn't the guest cycle count then much lower than 30Mcyc?
>>
>> Isn't it so that the guest must be above 30Mcyc to be able to want the
>> change? But if we reach that point,  the problem must have not been
>> too little execution time, but too much.
>>
> Sorry, I tried hard to understand what you said above but failed.
> What do you mean by "to be able to want the change"? The guest sometimes
> wants to get 64 timer interrupts per second and sometimes it wants to get
> 1024 timer interrupts per second. It wants that not as a result of time
> drift or anything. It's just how the guest behaves. You seem to be too
> fixated on guest frequency changes. It's just something you have to take
> into account when you reinject interrupts.

I meant that in the scenario, the guest won't change the RTC before
30Mcyc because of some built-in determinism in the guest. At that
point, for some reason, the change would happen.
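
To put numbers on the scenario, here is a rough sketch of the arithmetic.
The helper names are made up for illustration and are not QEMU code. With
a 1 GHz CPU and a 100Hz RTC it gives 10Mcycles per tick, so 3 ticks are
due at T = 30Mcycles.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles the CPU executes per timer tick at the given rates. */
static uint64_t cycles_per_tick(uint64_t cpu_hz, uint64_t timer_hz)
{
    assert(timer_hz != 0);
    return cpu_hz / timer_hz;
}

/* Number of whole ticks that are due after 'cycles' executed cycles. */
static uint64_t ticks_due(uint64_t cycles, uint64_t cpu_hz, uint64_t timer_hz)
{
    return cycles / cycles_per_tick(cpu_hz, timer_hz);
}
```

After the switch to 1000Hz the same helper gives 1Mcycles per tick.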

>
>> >
>> >>
>> >> >
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
>> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >> >>
>> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> >> trigger a bug.
>> >> >> >> >>
>> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> >> > actually work).
>> >> >> >>
>> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> >> > shown to you several time in this thread and you simply keep ignoring
>> >> >> > it.
>> >> >>
>> >> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> >> information. RTC hack would not work without APIC (or the
>> >> >> bidirectional IRQ) passing this info to RTC.
>> >> >>
>> >> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> >> is known by RTC and also higher levels managing the clocks.
>> >> >>
>> >> > So APIC has one bit of information and RTC everything else.
>> >>
>> >> The information known by RTC (timer period) is also known by higher levels.
>> >>
>> > What do you mean by higher level here? vl.c or APIC.
>>
>> vl.c, qemu-timer.c.
>>
>> >> > The current
>> >> > approach (and proposed patch) brings this one bit of information to RTC,
>> >> > you are arguing that RTC should be able to communicate all its info to
>> >> > APIC. Sorry I don't see that your way has any advantage. Just more
>> >> > complex interface and it is much easier to get it wrong for other time
>> >> > sources.
>> >>
>> >> I don't think anymore that APIC should be handling this but the
>> >> generic stuff, like vl.c or exec.c. Then there would be only
>> >> information passing from APIC to higher levels.
>> > Handling reinjection by general timer code makes infinitely more sense
>> > then handling it in APIC.
>>
>> I'm glad you agree, or did you mean 'less'?
>>
> Compared to APIC I would agree that even putting it in IDE is better idea :)
>
>> > One thing (from the top of my head) that can't
>> > be implemented at that level is injection of interrupt back to back (i.e
>> > injecting next interrupt immediately after guest acknowledge previous
>> > one to RTC).
>>
>> But Jan told this confuses some buggy OSes.
>>
> You keep calling them buggy, but I don't agree. They are written with
> certain assumptions that are true on real HW, but hard to achieve on
> virtual. Anyway, we use this technique (back-to-back reinjection);
> otherwise you can't solve the drift problem if the guest wants to receive
> timer interrupts at the max frequency that the host time source supports.

Even if this confuses some OSes?

>> >
>> >>
>> >> >> I keep ignoring the idea that the current model, where both RTC and
>> >> >> APIC must somehow work together to make coalescing work, is the only
>> >> >> possible just because it is committed and it happens to work in some
>> >> >> cases. It would be much better to concentrate this to one place, APIC
>> >> >> or preferably higher level where it may benefit other timers too.
>> >> >> Provided of course that the other models can be made to work.
>> >> >>
>> >> > So write the code and show us. You haven't show any evidence that RTC is
>> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
>> >> > know when clock frequency changes, it know when device reset happened.
>> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
>> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> >> > are masked not considered coalesced).
>> >>
>> >> Oh, I thought interrupt masking was the reason for coalescing! What
>> >> exactly is the reason then?
>> >>
>> > The reason is that guest has no time to process previous interrupt
>> > before it is time to inject next one.
>>
>> Because of other host load or other emulation done by the same QEMU
>> process, I suppose?
> Yes, both.
>
>>
>> >> > Time source knows only when
>> >> > frequency changes and may be when device reset happens if timer is
>> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> >> > to minimize amount of info you need to pass between various layers.
>> >> >
>> >> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> >> cater for buggy hosts.
>> >> >> >>
>> >> >> > You mean "buggy guests"?
>> >> >>
>> >> >> Yes, sorry.
>> >> >>
>> >> >> > What guests are not buggy in your opinion?
>> >> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> >> > clock with it is to go paravirt.
>> >> >>
>> >> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> >> cause of the problem.
>> >> > OS should never crash in the absence of HW bugs? I doubt you can design
>> >> > an OS that can run in a face of any HW failure. Anyway here we are
>> >> > trying to solve guests time keeping problem not crashes. Do you think
>> >> > you can design OS that can keep time accurately no matter how crazy all
>> >> > HW clock behaves?
>> >>
>> >> I think my OS design skills are not relevant in this discussion, but
>> >> IIRC there are fault tolerant operating systems for extreme conditions
>> >> so it can be done.
>> >>
>> >> >
>> >> >>
>> >> >> >> > The fact is that timer device is not "just like any
>> >> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> >> > implement spec as close as possible and everything works. For time
>> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >>
>> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> >> be the only option that will always work.
>> >> >> >>
>> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> >> > time, so any emulation done between two instruction is theoretically
>> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> >> > keeping all time source in perfect sync is also impossible.
>> >> >>
>> >> >> That is actually another cause of the problem. KVM gives the guest an
>> >> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> >> match, especially in critical code, there can be problems. It would be
>> >> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >> >>
>> >> > Not possible. It's that simple. You should take it into account in your
>> >> > architecture design stage. In case of KVM real physical CPU executes guest
>> >> > instruction and it does this as fast as it can. The only way we can hide
>> >> > that from a guest is by intercepting each access to TSC and at that
>> >> > point we can use bochs instead.
>> >>
>> >> Well, as Paul pointed out, there's also icount option.
>> >>
>> > icount is not an option for KVM.
>>
>> I think icount timer adjustment model might make sense for this work
>> too. We'd then just need some figure of executed CPU instructions, TSC
>> cycles or even kernel scheduler time slice information (how much time
>> the process got).
>>
> And then? icount makes guest time flow dependent on the number of emulated
> instructions. It relies on the fact that all time sources are
> synchronized for a guest during emulation (including TSC). This is not
> true for virtualization.

So for virtualization, is it OK then to keep time sources unsynchronized?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 16:37                                 ` Gleb Natapov
@ 2010-05-29 21:21                                   ` Blue Swirl
  2010-05-30  6:02                                     ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-29 21:21 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> >> > I still don't see how the alternative is supposed to simplify our life
>> >> > or improve the efficiency of the de-coalescing workaround. It's rather
>> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> >> > informed about periodicity changes that can only be delivered along
>> >> > IRQs. So what to do with the backlog when the timer is stopped?
>> >>
>> >> What happens with the current design? Gleb only mentioned the
>> >> frequency change, I thought that was not so big problem. But I don't
>> >> think this case should be allowed happen at all, it can't exist on
>> >> real HW.
>> >>
>> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
>> > inside QEMU, connect with gdb to QEMU process and check what frequency
>> > RTC configured with (hint: it will be 64Hz), now run video inside the
>> > guest and check frequency again (hint: it will be 1Khz).
>>
>> You missed the key word 'stopped'. If the timer is really stopped, no
>> IRQs should ever come out afterwards, just like on real HW. For the
>> emulation, this means loss of ticks which should have been delivered
>> before the change.
>>
> I haven't missed it. I describe to you reality of the situation. You want
> to change reality to be more close to what you want it to be by adding
> words to my description.

Quoting Jan: 'So what to do with the backlog when the timer is
stopped?' I didn't add any words to your description, please be more
careful with your attributions. Why do you think I want to change the
reality?

XP frequency change isn't the same case as timer being stopped.

> Please just go write code, experiment, debug
> and _then_ come here with design.

I added some debugging to RTC, PIC and APIC. I also built a small
guest in x86 assembly to test the coalescing. However, in the tests
with this guest and others I noticed that the coalescing only happens
in some obscure conditions.

By default the APIC's delivery method for IRQs is ExtInt and
coalescing counting happens only with Fixed. This means that the guest
needs to reprogram the APIC. It also looks like RTC interrupts need to be
triggered. But I didn't see both of these happen simultaneously in
my tests with Linux and Windows guests. Of course, the -rtc-td-hack flag
must be used and I also disabled HPET to be sure that the RTC would be
used.

With DEBUG_COALESCING enabled, I just get increasing numbers for
apic_irq_delivered:
apic: apic_set_irq: coalescing 67123
apic: apic_set_irq: coalescing 67124
apic: apic_set_irq: coalescing 67125

If the hack were active, the numbers would stay close to zero (or at
least drop back at some point) because apic_reset_irq_delivered would be
called, but this does not happen. Could you specify a clear test case with
which the coalescing action could be tested? Linux or BSD based,
please.

>> But what if the guest changed the frequency very often, and between
>> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
> Too bad, the world is not perfect.
>
> --
>                        Gleb.
>

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 20:52                                 ` Blue Swirl
@ 2010-05-30  5:41                                   ` Gleb Natapov
  2010-05-30 11:41                                     ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30  5:41 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
> >> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> >> >> There is no code, because we're still at architecture design stage.
> >> >> >>
> >> >> > Try to write test code to understand the problem better.
> >> >>
> >> >> I will.
> >> >>
> >> > Please do ASAP. This discussion shows that you don't understand what is the
> >> > problem that we are dialing with.
> >>
> >> Which part of the problem you think I don't understand?
> >>
> > It seems to me you don't understand how Windows uses the RTC for time
> > keeping and how QEMU solves the problem today.
> 
> RTC causes periodic interrupts and Windows interrupt handler
> increments jiffies, like Linux?
> 
Linux does much more complicated things than that to keep time, so the
only way to fix time drift in Linux was to introduce pvclock. For Windows
that description is not accurate either: since Windows can change the
clock frequency at any time, it can't calculate time from jiffies; it
needs to update the clock at each timer tick.
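
A minimal sketch of what I mean, assuming a guest that simply adds one
tick period to its clock on every timer interrupt. The struct and function
names are invented for the example; this is not Windows or QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest clock that is advanced only by delivered timer ticks. */
struct tick_clock {
    uint64_t now_ns;    /* the guest's idea of elapsed time */
    uint32_t freq_hz;   /* currently programmed tick frequency */
};

/* The guest reprograms the timer; later ticks count for a different period. */
static void tick_clock_set_freq(struct tick_clock *c, uint32_t freq_hz)
{
    assert(freq_hz != 0);
    c->freq_hz = freq_hz;
}

/* Called from the guest's tick handler: add one period at the current rate. */
static void tick_clock_on_tick(struct tick_clock *c)
{
    c->now_ns += 1000000000ULL / c->freq_hz;
}
```

With 64Hz programmed, 64 delivered ticks advance the clock by exactly one
second. Every coalesced tick leaves the guest 15.625ms behind, and in this
model nothing corrects that later.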

> >> >> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> >> >> after at least N instructions have passed.
> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> >> >> > or less reliably using perf counters, may be).
> >> >> >> >>
> >> >> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> >> >> an interrupt after some number of instructions have been executed?
> >> >> >> > Don't think debug registers have something like that and they are
> >> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
> >> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> >> >> > keep using them for profiling guests. And I don't see what problem it
> >> >> >> > will solve anyway that can be solved by simple delay between irq
> >> >> >> > reinjection.
> >> >> >>
> >> >> >> This would allow counting the executed instructions and limit it. Thus
> >> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >> >> >>
> >> >> > Why would you want to limit number of instruction executed by guest if
> >> >> > CPU has nothing else to do anyway? The problem occurs not when we have
> >> >> > spare cycles so give to a guest, but in opposite case.
> >> >>
> >> >> I think one problem is that the guest has executed too much compared
> >> >> to what would happen with real HW with a lesser CPU. That explains the
> >> >> RTC frequency reprogramming case.
> >> > You think wrong. The problem is exactly opposite: the guest haven't
> >> > had enough execution time between two time interrupts. I don't know what
> >> > RTC frequency reprogramming case you are talking about here.
> >>
> >> The case you told me where N pending tick IRQs exist but the guest
> >> wants to change the RTC frequency from 64Hz to 1024Hz.
> >>
> >> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
> >> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
> >> the frequency to 1000Hz.
> >>
> >> The problem for emulation is that for the same 3 ticks, there has been
> >> so little execution power that the ticks have been coalesced. But
> >> isn't the guest cycle count then much lower than 30Mcyc?
> >>
> >> Isn't it so that the guest must be above 30Mcyc to be able to want the
> >> change? But if we reach that point,  the problem must have not been
> >> too little execution time, but too much.
> >>
> > Sorry, I tried hard to understand what you said above but failed.
> > What do you mean by "to be able to want the change"? The guest sometimes
> > wants to get 64 timer interrupts per second and sometimes it wants to get
> > 1024 timer interrupts per second. It wants that not as a result of time
> > drift or anything. It's just how the guest behaves. You seem to be too
> > fixated on guest frequency changes. It's just something you have to take
> > into account when you reinject interrupts.
> 
> I meant that in the scenario, the guest won't change the RTC before
> 30Mcyc because of some built in determinism in the guest. At that
> point, because of some reason, the change would happen.
> 
I still don't understand what you are trying to say here. The guest changes
the frequency because of some event in the guest. It is totally independent
of what happens in QEMU's RTC emulation.

> >
> >> >
> >> >>
> >> >> >
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> > And even if the rate did not matter, the APIC woult still have to now
> >> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> >> >> > simplifying things or even make them cleaner.
> >> >> >> >> >>
> >> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> >> >> trigger a bug.
> >> >> >> >> >>
> >> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> >> >> > actually work).
> >> >> >> >>
> >> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> >> >> > APIC has zero info needed to implement reinjection correctly as was
> >> >> >> > shown to you several time in this thread and you simply keep ignoring
> >> >> >> > it.
> >> >> >>
> >> >> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> >> >> information. RTC hack would not work without APIC (or the
> >> >> >> bidirectional IRQ) passing this info to RTC.
> >> >> >>
> >> >> >> What APIC doesn't have now is the timer frequency or period info. This
> >> >> >> is known by RTC and also higher levels managing the clocks.
> >> >> >>
> >> >> > So APIC has one bit of information and RTC everything else.
> >> >>
> >> >> The information known by RTC (timer period) is also known by higher levels.
> >> >>
> >> > What do you mean by higher level here? vl.c or APIC.
> >>
> >> vl.c, qemu-timer.c.
> >>
> >> >> > The current
> >> >> > approach (and proposed patch) brings this one bit of information to RTC,
> >> >> > you are arguing that RTC should be able to communicate all its info to
> >> >> > APIC. Sorry I don't see that your way has any advantage. Just more
> >> >> > complex interface and it is much easier to get it wrong for other time
> >> >> > sources.
> >> >>
> >> >> I don't think anymore that APIC should be handling this but the
> >> >> generic stuff, like vl.c or exec.c. Then there would be only
> >> >> information passing from APIC to higher levels.
> >> > Handling reinjection by general timer code makes infinitely more sense
> >> > then handling it in APIC.
> >>
> >> I'm glad you agree, or did you mean 'less'?
> >>
> > Compared to APIC I would agree that even putting it in IDE is better idea :)
> >
> >> > One thing (from the top of my head) that can't
> >> > be implemented at that level is injection of interrupt back to back (i.e
> >> > injecting next interrupt immediately after guest acknowledge previous
> >> > one to RTC).
> >>
> >> But Jan told this confuses some buggy OSes.
> >>
> > You keep calling them buggy, but I don't agree. They are written with
> > certain assumptions that are true on real HW, but hard to achieve on
> > virtual. Anyway, we use this technique (back-to-back reinjection);
> > otherwise you can't solve the drift problem if the guest wants to receive
> > timer interrupts at the max frequency that the host time source supports.
> 
> Even if this confuses some OSes?
> 
We make it so that it will not confuse the relevant OSes. It is not like you
have a better choice.
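
Roughly, the back-to-back reinjection looks like the sketch below. It is a
simplified model with hypothetical names, not the actual RTC code: a tick
that arrives while the previous interrupt is still unacknowledged is
counted as coalesced, and each acknowledge immediately reinjects one tick
from the backlog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a periodic timer device with tick de-coalescing. */
struct timer_model {
    bool irq_pending;     /* raised but not yet acknowledged by the guest */
    uint32_t coalesced;   /* ticks that could not be delivered in time */
    uint32_t delivered;   /* ticks the guest actually got */
};

/* Periodic timer fires: deliver if possible, else remember the lost tick. */
static void timer_model_tick(struct timer_model *t)
{
    if (t->irq_pending) {
        t->coalesced++;
    } else {
        t->irq_pending = true;
        t->delivered++;
    }
}

/* Guest acknowledges the interrupt: reinject one backlog tick right away. */
static void timer_model_ack(struct timer_model *t)
{
    assert(t->irq_pending);
    t->irq_pending = false;
    if (t->coalesced > 0) {
        t->coalesced--;
        t->irq_pending = true;
        t->delivered++;
    }
}
```

Three ticks without an acknowledge leave one delivered and two coalesced;
the next two acknowledges drain the backlog, so the guest still sees all
three ticks.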

> >> >
> >> >>
> >> >> >> I keep ignoring the idea that the current model, where both RTC and
> >> >> >> APIC must somehow work together to make coalescing work, is the only
> >> >> >> possible just because it is committed and it happens to work in some
> >> >> >> cases. It would be much better to concentrate this to one place, APIC
> >> >> >> or preferably higher level where it may benefit other timers too.
> >> >> >> Provided of course that the other models can be made to work.
> >> >> >>
> >> >> > So write the code and show us. You haven't show any evidence that RTC is
> >> >> > the wrong place. RTC knows when interrupt was acknowledge to RTC, it
> >> >> > know when clock frequency changes, it know when device reset happened.
> >> >> > APIC knows only that interrupt was coalesced. It doesn't even know that
> >> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> >> >> > are masked not considered coalesced).
> >> >>
> >> >> Oh, I thought interrupt masking was the reason for coalescing! What
> >> >> exactly is the reason then?
> >> >>
> >> > The reason is that guest has no time to process previous interrupt
> >> > before it is time to inject next one.
> >>
> >> Because of other host load or other emulation done by the same QEMU
> >> process, I suppose?
> > Yes, both.
> >
> >>
> >> >> > Time source knows only when
> >> >> > frequency changes and may be when device reset happens if timer is
> >> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
> >> >> > to minimize amount of info you need to pass between various layers.
> >> >> >
> >> >> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> >> >> cater for buggy hosts.
> >> >> >> >>
> >> >> >> > You mean "buggy guests"?
> >> >> >>
> >> >> >> Yes, sorry.
> >> >> >>
> >> >> >> > What guests are not buggy in your opinion?
> >> >> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> >> >> > clock with it is to go paravirt.
> >> >> >>
> >> >> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> >> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> >> >> cause of the problem.
> >> >> > OS should never crash in the absence of HW bugs? I doubt you can design
> >> >> > an OS that can run in a face of any HW failure. Anyway here we are
> >> >> > trying to solve guests time keeping problem not crashes. Do you think
> >> >> > you can design OS that can keep time accurately no matter how crazy all
> >> >> > HW clock behaves?
> >> >>
> >> >> I think my OS design skills are not relevant in this discussion, but
> >> >> IIRC there are fault tolerant operating systems for extreme conditions
> >> >> so it can be done.
> >> >>
> >> >> >
> >> >> >>
> >> >> >> >> > The fact is that timer device is not "just like any
> >> >> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> >> >> > implement spec as close as possible and everything works. For time
> >> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> >> >> > letter and your guest will drift like crazy.
> >> >> >> >>
> >> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> >> >> be the only option that will always work.
> >> >> >> >>
> >> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> >> >> > time, so any emulation done between two instruction is theoretically
> >> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> >> >> > keeping all time source in perfect sync is also impossible.
> >> >> >>
> >> >> >> That is actually another cause of the problem. KVM gives the guest an
> >> >> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> >> >> match, especially in critical code, there can be problems. It would be
> >> >> >> better to tell the guest a lower speed, which also can be guaranteed.
> >> >> >>
> >> >> > Not possible. It's that simple. You should take it into account in your
> >> >> > architecture design stage. In case of KVM real physical CPU executes guest
> >> >> > instruction and it does this as fast as it can. The only way we can hide
> >> >> > that from a guest is by intercepting each access to TSC and at that
> >> >> > point we can use bochs instead.
> >> >>
> >> >> Well, as Paul pointed out, there's also icount option.
> >> >>
> >> > icount is not an option for KVM.
> >>
> >> I think icount timer adjustment model might make sense for this work
> >> too. We'd then just need some figure of executed CPU instructions, TSC
> >> cycles or even kernel scheduler time slice information (how much time
> >> the process got).
> >>
> > And then? icount makes guest time flow dependent on the number of emulated
> > instructions. It relies on the fact that all time sources are
> > synchronized for a guest during emulation (including TSC). This is not
> > true for virtualization.
> 
> So for virtualization, is it OK then to keep time sources unsynchronized?
No, it is not. This is the reality that can't be changed for now. Maybe
when HW virtualization introduces TSC scaling, but even then we
don't want to run four 500MHz guests on a 2GHz host, we want to run four
2GHz guests on a 2GHz host, so overcommit occurs and you can't guarantee
that no coalescing will happen. And I hope you are aware of the fact that
using icount introduces a ~10% performance penalty in QEMU.

--
			Gleb.

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29 21:21                                   ` Blue Swirl
@ 2010-05-30  6:02                                     ` Gleb Natapov
  2010-05-30 12:10                                       ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30  6:02 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sat, May 29, 2010 at 09:21:14PM +0000, Blue Swirl wrote:
> On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
> >> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
> >> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
> >> >> > I still don't see how the alternative is supposed to simplify our life
> >> >> > or improve the efficiency of the de-coalescing workaround. It's rather
> >> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
> >> >> > informed about periodicity changes that can only be delivered along
> >> >> > IRQs. So what to do with the backlog when the timer is stopped?
> >> >>
> >> >> What happens with the current design? Gleb only mentioned the
> >> >> frequency change, I thought that was not so big problem. But I don't
> >> >> think this case should be allowed happen at all, it can't exist on
> >> >> real HW.
> >> >>
> >> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
> >> > inside QEMU, connect with gdb to QEMU process and check what frequency
> >> > RTC configured with (hint: it will be 64Hz), now run video inside the
> >> > guest and check frequency again (hint: it will be 1Khz).
> >>
> >> You missed the key word 'stopped'. If the timer is really stopped, no
> >> IRQs should ever come out afterwards, just like on real HW. For the
> >> emulation, this means loss of ticks which should have been delivered
> >> before the change.
> >>
> > I haven't missed it. I describe to you reality of the situation. You want
> > to change reality to be more close to what you want it to be by adding
> > words to my description.
> 
> Quoting Jan: 'So what to do with the backlog when the timer is
> stopped?' I didn't add any words to your description, please be more
> careful with your attributions. Why do you think I want to change the
> reality?
Please refer to my words when you answer my quote. You quoted my
answer to your statement:
 Gleb only mentioned the frequency change, I thought that was not so big
 problem. But I don't think this case should be allowed happen at all,
 it can't exist on real HW.
'Stopped' was not under discussion anywhere there. FWIW 'stopped' is just
a case of frequency change.
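
To illustrate: as far as I recall from the MC146818 datasheet, the periodic
rate is selected by a 4-bit field in register A, and the value 0 simply
means no periodic interrupt. The helper below is a sketch with a made-up
name, assuming the usual 32.768kHz time base.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Periodic interrupt frequency in Hz for a 4-bit rate-select value,
 * assuming a 32.768 kHz time base. Returns 0 when the periodic
 * interrupt is disabled (rate select 0).
 */
static uint32_t rtc_periodic_hz(uint8_t rate_select)
{
    assert(rate_select <= 15);
    if (rate_select == 0) {
        return 0;               /* "stopped" is just one more setting */
    }
    if (rate_select <= 2) {
        rate_select += 7;       /* 1 and 2 alias 256 Hz and 128 Hz */
    }
    return 32768u >> (rate_select - 1);
}
```

Rate select 10 gives the 64Hz mentioned above, 6 gives 1024Hz, and 0 gives
'stopped'.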

> 
> XP frequency change isn't the same case as timer being stopped.
> 
And what is the big difference exactly?

> > Please just go write code, experiment, debug
> > and _then_ come here with design.
> 
> I added some debugging to RTC, PIC and APIC. I also built a small
> guest in x86 assembly to test the coalescing. However, in the tests
> with this guest and others I noticed that the coalescing only happens
> in some obscure conditions.
So try with a real guest and a real load.

> 
> By default the APIC's delivery method for IRQs is ExtInt and
> coalescing counting happens only with Fixed. This means that the guest
> needs to reprogram APIC. It also looks like RTC interrupts need to be
> triggered. But I didn't see both of these to happen simultaneously in
> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
> must be used and I also disabled HPET to be sure that RTC would be
> used.
> 
> With DEBUG_COALESCING enabled, I just get increasing numbers for
> apic_irq_delivered:
> apic: apic_set_irq: coalescing 67123
> apic: apic_set_irq: coalescing 67124
> apic: apic_set_irq: coalescing 67125
So have you actually used the -rtc-td-hack option? I compiled the head of
qemu.git with DEBUG_COALESCING, ran a Windows XP guest with -rtc-td-hack,
and I get:
apic: apic_reset_irq_delivered: old coalescing 3
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_set_irq: coalescing 2
apic: apic_set_irq: coalescing 3
apic: apic_set_irq: coalescing 4
apic: apic_set_irq: coalescing 5
apic: apic_set_irq: coalescing 6
apic: apic_reset_irq_delivered: old coalescing 6
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1

> 
> If the hack were active, the numbers would be close to zero (or at
> least some point) because apic_reset_irq_delivered would be called,
> but this does not happen. Could you specify a clear test case with
> which the coalescing action could be tested? Linux or BSD based,
> please.
Linux doesn't use the RTC as a time source and I don't know about BSD, so no
Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
and put heavy load on the host. You can play video inside the guest to
trigger coalescing more easily.

> 
> >> But what if the guest changed the frequency very often, and between
> >> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
> > Too bad, the world is not perfect.
> >
> > --
> >                        Gleb.
> >

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30  5:41                                   ` Gleb Natapov
@ 2010-05-30 11:41                                     ` Blue Swirl
  2010-05-30 11:52                                       ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 11:41 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 08:52:34PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
>> >> 2010/5/29 Gleb Natapov <gleb@redhat.com>:
>> >> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
>> >> >> >> There is no code, because we're still at architecture design stage.
>> >> >> >>
>> >> >> > Try to write test code to understand the problem better.
>> >> >>
>> >> >> I will.
>> >> >>
>> >> > Please do ASAP. This discussion shows that you don't understand the
>> >> > problem that we are dealing with.
>> >>
>> >> Which part of the problem you think I don't understand?
>> >>
>> > It seems to me you don't understand how Windows uses the RTC for
>> > timekeeping and how QEMU solves the problem today.
>>
>> RTC causes periodic interrupts and Windows interrupt handler
>> increments jiffies, like Linux?
>>
> Linux does much more complicated things than that to keep time, so the
> only way to fix time drift in Linux was to introduce pvclock. For Windows
> it is not so accurate either: since Windows can change the clock frequency
> at any time, it can't calculate time from jiffies; it needs to update the
> clock at each time tick.
>
>> >> >> >> >> >> guests could also be assisted with special handling (like win2k
>> >> >> >> >> >> install hack), for example guest instructions could be counted
>> >> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
>> >> >> >> >> >> after at least N instructions have passed.
>> >> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
>> >> >> >> >> > or less reliably using perf counters, may be).
>> >> >> >> >>
>> >> >> >> >> Aren't there any debug registers or perf counters, which can generate
>> >> >> >> >> an interrupt after some number of instructions have been executed?
>> >> >> >> > Don't think debug registers have something like that and they are
>> >> >> >> > available for guest use anyway. Perf counters differs greatly from CPU
>> >> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
>> >> >> >> > keep using them for profiling guests. And I don't see what problem it
>> >> >> >> > will solve anyway that can be solved by simple delay between irq
>> >> >> >> > reinjection.
>> >> >> >>
>> >> >> >> This would allow counting the executed instructions and limit it. Thus
>> >> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>> >> >> >>
>> >> >> > Why would you want to limit the number of instructions executed by the
>> >> >> > guest if the CPU has nothing else to do anyway? The problem occurs not
>> >> >> > when we have spare cycles to give to a guest, but in the opposite case.
>> >> >>
>> >> >> I think one problem is that the guest has executed too much compared
>> >> >> to what would happen with real HW with a lesser CPU. That explains the
>> >> >> RTC frequency reprogramming case.
>> >> > You think wrong. The problem is the exact opposite: the guest hasn't
>> >> > had enough execution time between two timer interrupts. I don't know what
>> >> > RTC frequency reprogramming case you are talking about here.
>> >>
>> >> The case you told me where N pending tick IRQs exist but the guest
>> >> wants to change the RTC frequency from 64Hz to 1024Hz.
>> >>
>> >> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
>> >> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
>> >> the frequency to 1000Hz.
>> >>
>> >> The problem for emulation is that for the same 3 ticks, there has been
>> >> so little execution power that the ticks have been coalesced. But
>> >> isn't the guest cycle count then much lower than 30Mcyc?
>> >>
>> >> Isn't it so that the guest must be above 30Mcyc to be able to want the
>> >> change? But if we reach that point,  the problem must have not been
>> >> too little execution time, but too much.
>> >>
>> > Sorry, I tried hard to understand what you have said above but failed.
>> > What do you mean by "to be able to want the change"? The guest sometimes
>> > wants to get 64 timer interrupts per second and sometimes it wants to get
>> > 1024 timer interrupts per second. It wants that not as a result of time
>> > drift or anything. It's just how the guest behaves. You seem to be too
>> > fixated on guest frequency changes. It's just something you have to take
>> > into account when you reinject interrupts.
>>
>> I meant that in the scenario, the guest won't change the RTC before
>> 30Mcyc because of some built in determinism in the guest. At that
>> point, because of some reason, the change would happen.
>>
> I still don't understand what you are trying to say here. The guest changes
> frequency because of some event in the guest. It is totally independent
> of what happens in QEMU's RTC emulation.

I'm trying to understand the order of events. In the scenario, the
order of events on real HW would be:
10Mcyc: tick IRQ 1
20Mcyc: tick IRQ 2
30Mcyc: tick IRQ 3
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4
32Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

With QEMU, the order could become:
30Mcyc: reprogram timer
30.5Mcyc: tick IRQ 1
31Mcyc: tick IRQ 2
31.5Mcyc: tick IRQ 3
32Mcyc: tick IRQ 4
32.5Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

Correct?

>
>> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> > And even if the rate did not matter, the APIC would still have to know
>> >> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
>> >> >> >> >> >> > as such for a certain interval. This really does not sound like
>> >> >> >> >> >> > simplifying things or even make them cleaner.
>> >> >> >> >> >>
>> >> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
>> >> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
>> >> >> >> >> >> would probably eventually spread to many other devices. The logical
>> >> >> >> >> >> conclusion of that would be a system where all devices would be
>> >> >> >> >> >> careful not to disturb the guest at wrong moment because that would
>> >> >> >> >> >> trigger a bug.
>> >> >> >> >> >>
>> >> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
>> >> >> >> >> > pale in comparison (and I still don't see how you are going to make it
>> >> >> >> >> > actually work).
>> >> >> >> >>
>> >> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
>> >> >> >> > APIC has zero info needed to implement reinjection correctly as was
>> >> >> >> > shown to you several times in this thread and you simply keep ignoring
>> >> >> >> > it.
>> >> >> >>
>> >> >> >> On the contrary, APIC is actually the only source of the IRQ ack
>> >> >> >> information. RTC hack would not work without APIC (or the
>> >> >> >> bidirectional IRQ) passing this info to RTC.
>> >> >> >>
>> >> >> >> What APIC doesn't have now is the timer frequency or period info. This
>> >> >> >> is known by RTC and also higher levels managing the clocks.
>> >> >> >>
>> >> >> > So APIC has one bit of information and RTC everything else.
>> >> >>
>> >> >> The information known by RTC (timer period) is also known by higher levels.
>> >> >>
>> >> > What do you mean by higher level here? vl.c or APIC.
>> >>
>> >> vl.c, qemu-timer.c.
>> >>
>> >> >> > The current
>> >> >> > approach (and proposed patch) brings this one bit of information to RTC,
>> >> >> > you are arguing that RTC should be able to communicate all its info to
>> >> >> > APIC. Sorry I don't see that your way has any advantage. Just more
>> >> >> > complex interface and it is much easier to get it wrong for other time
>> >> >> > sources.
>> >> >>
>> >> >> I don't think anymore that APIC should be handling this but the
>> >> >> generic stuff, like vl.c or exec.c. Then there would be only
>> >> >> information passing from APIC to higher levels.
>> >> > Handling reinjection by general timer code makes infinitely more sense
>> >> > than handling it in APIC.
>> >>
>> >> I'm glad you agree, or did you mean 'less'?
>> >>
>> > Compared to APIC I would agree that even putting it in IDE is better idea :)
>> >
>> >> > One thing (from the top of my head) that can't
>> >> > be implemented at that level is injection of interrupt back to back (i.e
>> >> > injecting next interrupt immediately after guest acknowledge previous
>> >> > one to RTC).
>> >>
>> >> But Jan told this confuses some buggy OSes.
>> >>
>> > You keep calling them buggy, but I don't agree. They are written with
>> > certain assumptions that are true on real HW, but hard to achieve on
>> > virtual. Anyway we use this technique (back to back reinjection);
>> > otherwise you can't solve the drift problem if the guest wants to receive
>> > timer interrupts at the max frequency that the host time source supports.
>>
>> Even if this confuses some OSes?
>>
> We make it so that it will not confuse relevant OSes. It is not like you
> have a better choice.
>
>> >> >
>> >> >>
>> >> >> >> I keep ignoring the idea that the current model, where both RTC and
>> >> >> >> APIC must somehow work together to make coalescing work, is the only
>> >> >> >> possible just because it is committed and it happens to work in some
>> >> >> >> cases. It would be much better to concentrate this to one place, APIC
>> >> >> >> or preferably higher level where it may benefit other timers too.
>> >> >> >> Provided of course that the other models can be made to work.
>> >> >> >>
>> >> >> > So write the code and show us. You haven't shown any evidence that RTC is
>> >> >> > the wrong place. RTC knows when an interrupt was acknowledged to RTC, it
>> >> >> > knows when the clock frequency changes, it knows when a device reset happened.
>> >> >> > APIC knows only that an interrupt was coalesced. It doesn't even know that
>> >> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
>> >> >> > are masked are not considered coalesced).
>> >> >>
>> >> >> Oh, I thought interrupt masking was the reason for coalescing! What
>> >> >> exactly is the reason then?
>> >> >>
>> >> > The reason is that guest has no time to process previous interrupt
>> >> > before it is time to inject next one.
>> >>
>> >> Because of other host load or other emulation done by the same QEMU
>> >> process, I suppose?
>> > Yes, both.
>> >
>> >>
>> >> >> > Time source knows only when
>> >> >> > frequency changes and may be when device reset happens if timer is
>> >> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
>> >> >> > to minimize amount of info you need to pass between various layers.
>> >> >> >
>> >> >> >> >> Maybe that version would not bend backwards as much as the current to
>> >> >> >> >> cater for buggy hosts.
>> >> >> >> >>
>> >> >> >> > You mean "buggy guests"?
>> >> >> >>
>> >> >> >> Yes, sorry.
>> >> >> >>
>> >> >> >> > What guests are not buggy in your opinion?
>> >> >> >> > Linux tries hard to be smart and as a result the only way to have stable
>> >> >> >> > clock with it is to go paravirt.
>> >> >> >>
>> >> >> >> I'm not an OS designer, but I think an OS should never crash, even if
>> >> >> >> a burst of IRQs is received. Reprogramming the timer should consider
>> >> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
>> >> >> >> cause of the problem.
>> >> >> > OS should never crash in the absence of HW bugs? I doubt you can design
>> >> >> > an OS that can run in the face of any HW failure. Anyway here we are
>> >> >> > trying to solve the guest's timekeeping problem, not crashes. Do you think
>> >> >> > you can design an OS that can keep time accurately no matter how crazy all
>> >> >> > HW clocks behave?
>> >> >>
>> >> >> I think my OS design skills are not relevant in this discussion, but
>> >> >> IIRC there are fault tolerant operating systems for extreme conditions
>> >> >> so it can be done.
>> >> >>
>> >> >> >
>> >> >> >>
>> >> >> >> >> > The fact is that timer device is not "just like any
>> >> >> >> >> > other device" in virtual world. Any other device is easy: you just
>> >> >> >> >> > implement spec as close as possible and everything works. For time
>> >> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
>> >> >> >> >> > letter and your guest will drift like crazy.
>> >> >> >> >>
>> >> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
>> >> >> >> >> without any voodoo. The interrupts would come after executing the same
>> >> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
>> >> >> >> >> guests in any sufficiently desperate low resource conditions, this may
>> >> >> >> >> be the only option that will always work.
>> >> >> >> >>
>> >> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
>> >> >> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
>> >> >> >> > time, so any emulation done between two instruction is theoretically
>> >> >> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
>> >> >> >> > keeping all time source in perfect sync is also impossible.
>> >> >> >>
>> >> >> >> That is actually another cause of the problem. KVM gives the guest an
>> >> >> >> illusion that the VCPU speed is equal to host speed. When they don't
>> >> >> >> match, especially in critical code, there can be problems. It would be
>> >> >> >> better to tell the guest a lower speed, which also can be guaranteed.
>> >> >> >>
>> >> >> > Not possible. It's that simple. You should take it into account in your
>> >> >> > architecture design stage. In case of KVM real physical CPU executes guest
>> >> >> > instruction and it does this as fast as it can. The only way we can hide
>> >> >> > that from a guest is by intercepting each access to TSC and at that
>> >> >> > point we can use bochs instead.
>> >> >>
>> >> >> Well, as Paul pointed out, there's also icount option.
>> >> >>
>> >> > icount is not an option for KVM.
>> >>
>> >> I think icount timer adjustment model might make sense for this work
>> >> too. We'd then just need some figure of executed CPU instructions, TSC
>> >> cycles or even kernel scheduler time slice information (how much time
>> >> the process got).
>> >>
>> > And then? icount makes guest time flow dependent on the amount of emulated
>> > instructions. It relies on the fact that all time sources are
>> > synchronized for a guest during emulation (including TSC). This is not
>> > true for virtualization.
>>
>> So for virtualization, is it OK then to keep time sources unsynchronized?
> No, it is not. This is the reality that can't be changed for now. Maybe
> when HW virtualization introduces TSC scaling, but even then we don't
> want to run four 500MHz guests on a 2GHz host, we want to run four 2GHz
> guests on a 2GHz host, so overcommit occurs and you can't guarantee that
> no coalescing will happen. And I hope you are aware of the fact that
> using icount introduces a ~10% performance penalty in QEMU.
>
> --
>                        Gleb.
>


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 11:41                                     ` Blue Swirl
@ 2010-05-30 11:52                                       ` Gleb Natapov
  0 siblings, 0 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30 11:52 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 11:41:36AM +0000, Blue Swirl wrote:
> >> I meant that in the scenario, the guest won't change the RTC before
> >> 30Mcyc because of some built in determinism in the guest. At that
> >> point, because of some reason, the change would happen.
> >>
> > I still don't understand what you are trying to say here. The guest changes
> > frequency because of some event in the guest. It is totally independent
> > of what happens in QEMU's RTC emulation.
> 
> I'm trying to understand the order of events. In the scenario, the
> order of events on real HW would be:
> 10Mcyc: tick IRQ 1
> 20Mcyc: tick IRQ 2
> 30Mcyc: tick IRQ 3
> 30Mcyc: reprogram timer
> 31Mcyc: tick IRQ 4
> 32Mcyc: tick IRQ 5
> 33Mcyc: tick IRQ 6
> 34Mcyc: tick IRQ 7
> 
> With QEMU, the order could become:
> 30Mcyc: reprogram timer
> 30.5Mcyc: tick IRQ 1
> 31Mcyc: tick IRQ 2
> 31.5Mcyc: tick IRQ 3
> 32Mcyc: tick IRQ 4
> 32.5Mcyc: tick IRQ 5
> 33Mcyc: tick IRQ 6
> 34Mcyc: tick IRQ 7
> 
> Correct?
Not sure, your description is not complete. Let me try:
On real HW:
10Mcyc: tick IRQ 1 -> delivered to an OS
20Mcyc: tick IRQ 2 -> delivered to an OS
30Mcyc: tick IRQ 3 -> delivered to an OS
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4 -> delivered to an OS
32Mcyc: tick IRQ 5 -> delivered to an OS
33Mcyc: tick IRQ 6 -> delivered to an OS
34Mcyc: tick IRQ 7 -> delivered to an OS

With QEMU:
10Mcyc: tick IRQ 1 -> coalesced
20Mcyc: tick IRQ 2 -> coalesced
30Mcyc: tick IRQ 3 -> coalesced
30Mcyc: reprogram timer
30.1Mcyc: tick IRQ 1 -> delivered to an OS
30.2Mcyc: tick IRQ 2 -> delivered to an OS
30.3Mcyc: tick IRQ 3 -> delivered to an OS
31Mcyc: tick IRQ 4 -> delivered to an OS
32Mcyc: tick IRQ 5 -> delivered to an OS
33Mcyc: tick IRQ 6 -> delivered to an OS
34Mcyc: tick IRQ 7 -> delivered to an OS
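
The two timelines above can be replayed with a toy model. This is only a
sketch in Python, not QEMU code: simulate() and its parameters are made up
for illustration, times are in units of 0.1 Mcycles so everything stays in
integers, and the 0.1 Mcycle gap between reinjected ticks is simply taken
from the example, not from any real reinjection policy.

```python
def simulate(guest_blocked_until, reprogram_at, old_period, new_period,
             reinject_gap, end):
    """Return a list of (time, event) tuples, times in units of 0.1 Mcycles.

    Ticks that fall due while the guest cannot take them (time <=
    guest_blocked_until) are counted as coalesced and reinjected later.
    """
    events = []
    backlog = 0

    # Phase 1: ticks at the old period, up to the reprogramming point.
    t = old_period
    while t <= reprogram_at:
        if t <= guest_blocked_until:
            backlog += 1
            events.append((t, "coalesced"))
        else:
            events.append((t, "delivered"))
        t += old_period

    events.append((reprogram_at, "reprogram"))

    # Phase 2: reinject the backlog, one tick per reinject_gap.
    t = reprogram_at
    for _ in range(backlog):
        t += reinject_gap
        events.append((t, "reinjected"))

    # Phase 3: regular ticks at the new period.
    t = reprogram_at + new_period
    while t <= end:
        events.append((t, "delivered"))
        t += new_period
    return events


def seen_by_guest(events):
    """Count the tick interrupts that actually reach the guest."""
    return sum(1 for _, e in events if e in ("delivered", "reinjected"))


# 100 Hz -> 1000 Hz at 30 Mcycles on a 1 GHz CPU, as in the example.
real_hw = simulate(0, 300, 100, 10, 1, 340)
qemu = simulate(300, 300, 100, 10, 1, 340)
```

In both runs the guest ends up seeing seven ticks; the only difference is
that in the second run the first three arrive late and close together.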



--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-27 22:19                           ` Jan Kiszka
  2010-05-28 19:00                             ` Blue Swirl
@ 2010-05-30 12:00                             ` Avi Kivity
  1 sibling, 0 replies; 122+ messages in thread
From: Avi Kivity @ 2010-05-30 12:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Paul Brook, Juan Quintela

On 05/28/2010 01:19 AM, Jan Kiszka wrote:
>
> Still, this does not answer:
>
> - How do you want to detect lost timer ticks?
>
>    

Your (and Gleb's) approach: during injection
An alternative: during guest ack

The normal pattern is inj/ack/inj/ack; if we see inj/inj/inj we know the 
guest isn't keeping up.
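
A rough sketch of that pattern check, in Python for brevity. The event
names and count_unacked() are invented here; how the acks would actually be
reported to the clock emulation is exactly what is under discussion.

```python
def count_unacked(events):
    """Count injections that arrived before the previous one was acked.

    events is a sequence of "inj" and "ack" strings in the order they
    were observed by the clock emulation.
    """
    awaiting_ack = False
    missed = 0
    for ev in events:
        if ev == "inj":
            if awaiting_ack:
                # inj directly after inj: the guest is not keeping up.
                missed += 1
            awaiting_ack = True
        elif ev == "ack":
            awaiting_ack = False
        else:
            raise ValueError("unknown event: %r" % (ev,))
    return missed
```

For inj/ack/inj/ack this returns 0; for inj/inj/inj/ack it returns 2, which
would be the backlog to make up for.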

> - What subsystem(s) keeps track of the backlog?
>    

Clearly, the clock chip (hpet/rtc/pit).

With the alternative approach, the clock emulation requests that acks be 
reported via a qemu_irq interface.

> - And depending on the above: How to detect at all that a specific IRQ
>    is a timer tick?
>    

Clearly, a blocker unless the clock is responsible for timekeeping.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-29  9:15                         ` Blue Swirl
  2010-05-29  9:36                           ` Jan Kiszka
  2010-05-29 14:38                           ` Gleb Natapov
@ 2010-05-30 12:05                           ` Avi Kivity
  2 siblings, 0 replies; 122+ messages in thread
From: Avi Kivity @ 2010-05-30 12:05 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Gleb Natapov, Juan Quintela

On 05/29/2010 12:15 PM, Blue Swirl wrote:
>
>>> This would allow counting the executed instructions and limit it. Thus
>>> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
>>>
>>>        
>> Why would you want to limit the number of instructions executed by the
>> guest if the CPU has nothing else to do anyway? The problem occurs not
>> when we have spare cycles to give to a guest, but in the opposite case.
>>      
> I think one problem is that the guest has executed too much compared
> to what would happen with real HW with a lesser CPU. That explains the
> RTC frequency reprogramming case.
>    

The root cause is that while qemu is scheduled out, the real time clock 
keeps ticking.  Since we can't stop real time, we must compensate in 
other ways.
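
To put numbers on it, a small sketch (Python, names invented for this mail)
that counts how many periodic deadlines pass while the process is not
running. It assumes deadlines sit at exact multiples of the period, which
is a simplification.

```python
def ticks_due(start_ns, end_ns, period_ns):
    """Number of tick deadlines in the interval (start_ns, end_ns].

    Deadlines are assumed to be at exact multiples of period_ns.
    """
    if period_ns <= 0:
        raise ValueError("period must be positive")
    return end_ns // period_ns - start_ns // period_ns


# 64 Hz timer (15.625 ms period), qemu scheduled out for 50 ms.
missed_64hz = ticks_due(0, 50_000_000, 15_625_000)

# 1 kHz timer, same 50 ms gap.
missed_1khz = ticks_due(0, 50_000_000, 1_000_000)
```

With the 64 Hz timer three ticks fall due during the gap, with the 1 kHz
timer fifty. Whatever cannot be delivered when qemu runs again is what the
compensation has to account for.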

>> So write the code and show us. You haven't shown any evidence that RTC is
>> the wrong place. RTC knows when an interrupt was acknowledged to RTC, it
>> knows when the clock frequency changes, it knows when a device reset happened.
>> APIC knows only that an interrupt was coalesced. It doesn't even know that
>> it may be masked by a guest in IOAPIC (interrupts delivered while they
>> are masked are not considered coalesced).
>>      
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?
>    

The above.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30  6:02                                     ` Gleb Natapov
@ 2010-05-30 12:10                                       ` Blue Swirl
  2010-05-30 12:24                                         ` Jan Kiszka
  2010-05-30 12:33                                         ` Gleb Natapov
  0 siblings, 2 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 12:10 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 6788 bytes --]

2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sat, May 29, 2010 at 09:21:14PM +0000, Blue Swirl wrote:
>> On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sat, May 29, 2010 at 04:13:22PM +0000, Blue Swirl wrote:
>> >> On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> >> > On Sat, May 29, 2010 at 09:35:30AM +0000, Blue Swirl wrote:
>> >> >> > I still don't see how the alternative is supposed to simplify our life
>> >> >> > or improve the efficiency of the de-coalescing workaround. It's rather
>> >> >> > problematic like Gleb pointed out: The de-coalescing logic needs to be
>> >> >> > informed about periodicity changes that can only be delivered along
>> >> >> > IRQs. So what to do with the backlog when the timer is stopped?
>> >> >>
>> >> >> What happens with the current design? Gleb only mentioned the
>> >> >> frequency change, I thought that was not so big problem. But I don't
>> >> >> think this case should be allowed happen at all, it can't exist on
>> >> >> real HW.
>> >> >>
>> >> > Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
>> >> > inside QEMU, connect with gdb to QEMU process and check what frequency
>> >> > RTC configured with (hint: it will be 64Hz), now run video inside the
>> >> > guest and check frequency again (hint: it will be 1Khz).
>> >>
>> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> emulation, this means loss of ticks which should have been delivered
>> >> before the change.
>> >>
>> > I haven't missed it. I describe to you reality of the situation. You want
>> > to change reality to be more close to what you want it to be by adding
>> > words to my description.
>>
>> Quoting Jan: 'So what to do with the backlog when the timer is
>> stopped?' I didn't add any words to your description, please be more
>> careful with your attributions. Why do you think I want to change the
>> reality?
> Please refer to my words when you answer my quote. You quoted my
> answer to your statement:
>  Gleb only mentioned the frequency change, I thought that was not so big
>  problem. But I don't think this case should be allowed happen at all,
>  it can't exist on real HW.

With 'this case' I was referring to 'case with timer stopped', not
'case which Gleb mentioned'.

> 'Stopped' was not under discussion anywhere.

It's clearly written there in the sentence Jan wrote.

> FWIW 'stopped' is just a case
> of frequency change.

True.

>
>>
>> XP frequency change isn't the same case as timer being stopped.
>>
> And what is the big difference exactly?

Because after the timer is stopped, it's extremely unrealistic to send
any IRQs. Whereas if the frequency is changed to some other nonzero
value, we can cheat and inject some more queued IRQs.
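
To spell out the distinction I mean, here is a sketch in Python. It is not
what QEMU does today; TickBacklog and its methods are invented names, and
whether queued ticks should be rescaled to the new period is left out
entirely.

```python
class TickBacklog:
    """Toy bookkeeping for coalesced ticks across timer reprogramming."""

    def __init__(self, hz):
        self.hz = hz
        self.pending = 0   # coalesced ticks still waiting to be reinjected
        self.dropped = 0   # ticks given up because the timer was stopped

    def coalesced(self, n=1):
        """Record n ticks that could not be delivered."""
        self.pending += n

    def reprogram(self, new_hz):
        """Change the frequency; 0 Hz means the timer is stopped."""
        if new_hz == 0:
            # A stopped timer must not raise IRQs afterwards, so the
            # queued ticks are lost.
            self.dropped += self.pending
            self.pending = 0
        # For a nonzero frequency the backlog is kept and can still be
        # reinjected ("cheating").
        self.hz = new_hz


backlog = TickBacklog(64)
backlog.coalesced(3)
backlog.reprogram(1024)   # nonzero: the 3 queued ticks survive
backlog.coalesced(2)
backlog.reprogram(0)      # stopped: all 5 queued ticks are dropped
```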

Anyway, if this case is not interesting because it doesn't happen in
real life emulation scenarios, we can forget it no matter how buggy
the current QEMU implementation is.

>> > Please just go write code, experiment, debug
>> > and _then_ come here with design.
>>
>> I added some debugging to RTC, PIC and APIC. I also built a small
>> guest in x86 assembly to test the coalescing. However, in the tests
>> with this guest and others I noticed that the coalescing only happens
>> in some obscure conditions.
> So try with a real guest and a real load.

Well, I'd like to get the test program to trigger it too. Now I'm getting:
apic: write: 00000350 = 00000000
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_reset_irq_delivered: old coalescing 1
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
apic: apic_get_irq_delivered: returning coalescing 0
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0

It looks like some other IRQs cause the coalescing: judging from the RTC
code, it seems the RTC cannot raise the IRQ (except for the update IRQ,
alarm etc.) without calling apic_reset_irq_delivered().
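
My reading of the reset/set/get sequence in the log, written as a toy
model in Python. This is an assumption inferred from the debug output, not
a copy of the APIC code: the counter is taken to increase only when the
vector was not already pending, and guest_accepts() is an invented stand-in
for the guest taking the interrupt.

```python
class ApicModel:
    """Toy model of the delivered-IRQ counter seen in the debug output."""

    def __init__(self):
        self.pending = set()      # vectors waiting to be taken by the guest
        self.irq_delivered = 0

    def reset_irq_delivered(self):
        self.irq_delivered = 0

    def set_irq(self, vector):
        # Count the IRQ only if the vector was not already pending;
        # otherwise it merges with the pending one (coalesced).
        if vector not in self.pending:
            self.irq_delivered += 1
            self.pending.add(vector)

    def get_irq_delivered(self):
        return self.irq_delivered

    def guest_accepts(self, vector):
        self.pending.discard(vector)


def raise_timer_irq(apic, vector):
    """Return True if the tick was delivered, False if it was coalesced."""
    apic.reset_irq_delivered()
    apic.set_irq(vector)
    return apic.get_irq_delivered() != 0
```

In this model a second tick raised before the guest has taken the first
one reads back 0, like the "returning coalescing 0" lines above, and a
counter that is never reset can only grow.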

I've attached my test program. Compile:
gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin

Run:
qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
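
For reference, the periodic rate that the test program selects can be
derived from the value it writes to RTC register A (0x26). A small sketch
in Python, assuming the usual MC146818 rate-select encoding with the
32.768 kHz time base; rtc_periodic_hz() is a name invented here.

```python
RTC_BASE_HZ = 32768  # assumed 32.768 kHz time base


def rtc_periodic_hz(reg_a):
    """Periodic interrupt frequency for a given RTC register A value."""
    rate = reg_a & 0x0F          # rate selection bits RS3..RS0
    if rate == 0:
        return 0.0               # periodic interrupt disabled
    if rate <= 2:
        rate += 7                # RS=1 and RS=2 alias 256 Hz and 128 Hz
    return RTC_BASE_HZ / (1 << (rate - 1))


test_program_hz = rtc_periodic_hz(0x26)   # value written by coalescing.S
idle_xp_hz = rtc_periodic_hz(0x2A)        # rate select 10
```

So 0x26 gives 1024 Hz (the "1kHz" in the comment), and rate select 10
gives the 64 Hz mentioned earlier in the thread.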

>>
>> By default the APIC's delivery method for IRQs is ExtInt and
>> coalescing counting happens only with Fixed. This means that the guest
>> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> triggered. But I didn't see both of these to happen simultaneously in
>> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> must be used and I also disabled HPET to be sure that RTC would be
>> used.
>>
>> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> apic_irq_delivered:
>> apic: apic_set_irq: coalescing 67123
>> apic: apic_set_irq: coalescing 67124
>> apic: apic_set_irq: coalescing 67125
> So have you actually used the -rtc-td-hack option? I compiled the head of
> qemu.git with DEBUG_COALESCING, ran a Windows XP guest with -rtc-td-hack,
> and I get:
> apic: apic_reset_irq_delivered: old coalescing 3
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
> apic: apic_set_irq: coalescing 2
> apic: apic_set_irq: coalescing 3
> apic: apic_set_irq: coalescing 4
> apic: apic_set_irq: coalescing 5
> apic: apic_set_irq: coalescing 6
> apic: apic_reset_irq_delivered: old coalescing 6
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
>
>>
>> If the hack were active, the numbers would be close to zero (or at
>> least some point) because apic_reset_irq_delivered would be called,
>> but this does not happen. Could you specify a clear test case with
>> which the coalescing action could be tested? Linux or BSD based,
>> please.
> Linux doesn't use the RTC as a time source and I don't know about BSD, so no
> Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
> and put heavy load on the host. You can play video inside the guest to
> trigger coalescing more easily.

I don't have Windows XP, sorry.

>
>>
>> >> But what if the guest changed the frequency very often, and between
>> >> changes used zero value, like 64Hz -> 0Hz -> 128Hz -> 0Hz -> 64Hz?
>> > Too bad, the world is not perfect.
>> >
>> > --
>> >                        Gleb.
>> >
>
> --
>                        Gleb.
>

[-- Attachment #2: coalescing.S --]
[-- Type: application/octet-stream, Size: 2397 bytes --]


        .section .text, "ax"
        .code16gcc

#define O(port, val)            \
        movb    $val, %al;      \
        out     %al, $port
#define RTC_O(port, val)        \
        O(0x70, port);          \
        O(0x71, val)
#define RTC_I(port)             \
        O(0x70, port);          \
        in      $0x71, %al

        .org 0x10000
        .globl start
start:
        /* A20: bit 1 of port 0x92 is the A20 gate (bit 0 is fast reset) */
        O(0x92, 0x02)

        /* IDT */
        lidtw   %cs:0x2000

        /* GDT */
        lgdtw   %cs:0x2010

        /* Switch to protected mode */
        movl    %cr0, %eax
        orl     $1, %eax
        movl    %eax, %cr0
        ljmpl   $8, $0xf4000
        .org 0x11000

        .code32

        /* INT 70 handler */
        RTC_I(0xc)
        O(0xa0, 0x20)
        O(0x20, 0x20)
        iret

        .org 0x12000

        /* IDTD */
        .short 0x0400
        .long 0x0f3000

        .org 0x12010

        /* GDTD */
        .short 0x0020
        .long 0xf3400

        .org 0x13380

        /* IDT entry for INT 70 */
        .long 0x00081000, 0x000f8e00

        .org 0x13408

        /* GDT entry for 1st descriptor, CS */
        .short 0xffff, 0x0000, 0x9b00, 0x00cf
        /* GDT entry for 2nd descriptor, DS etc. */
        .short 0xffff, 0x0000, 0x9300, 0x00cf

        .org 0x14000
        movl    $0x10, %eax
        movw    %ax, %ds
        movw    %ax, %ss
        movl    $0x1000, %esp

        /* Master PIC */
        /* ICW1 */
        O(0x20, 0x11)
        /* ICW2 */
        O(0x21, 0x08)
        /* ICW3 */
        O(0x21, 0x04)
        /* ICW4 */
        O(0x21, 0x01)
        /* OCW1: only slave IRQs */
        O(0x21, 0xfb)

        /* Slave PIC */
        /* ICW1 */
        O(0xa0, 0x11)
        /* ICW2 */
        O(0xa1, 0x70)
        /* ICW3 */
        O(0xa1, 0x02)
        /* ICW4 */
        O(0xa1, 0x01)
        /* OCW1: only RTC IRQ */
        O(0xa1, 0xfe)

        /* set up APIC LVT */
        xor     %eax, %eax
        mov     %eax, 0xfee00350

        /* RTC: frequency 1kHz */
        RTC_O(0xa, 0x26)
        /* Enable IRQ */
        RTC_O(0xb, 0x40)

        sti
1:      jmp     1b
        hlt

        .section .reset, "ax"
        .globl  entry
        .code16gcc
entry:
        cli
        ljmp    $0xf000, $0x0000
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop


[-- Attachment #3: coalescing.ld --]
[-- Type: application/octet-stream, Size: 205 bytes --]

OUTPUT_FORMAT(elf32-i386)
OUTPUT_ARCH(i386)

ENTRY(entry)

SECTIONS
{
    . = 0xe0000;

    .text : { *(.text) }

    . = 0xf0000;

    .text : { *(.text) }

    . = 0xffff0;

    .reset : { *(.reset) }
}


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:10                                       ` Blue Swirl
@ 2010-05-30 12:24                                         ` Jan Kiszka
  2010-05-30 12:58                                           ` Blue Swirl
  2010-05-30 12:33                                         ` Gleb Natapov
  1 sibling, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-05-30 12:24 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 444 bytes --]

Blue Swirl wrote:
>> Linux doesn't use the RTC as a time source and I don't know about BSD, so no
>> Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
>> and put heavy load on the host. You can run video inside the guest to
>> trigger coalescing more easily.
> 
> I don't have Windows XP, sorry.

ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

Jan

[1] http://www.reactos.org/de/download.html




* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:10                                       ` Blue Swirl
  2010-05-30 12:24                                         ` Jan Kiszka
@ 2010-05-30 12:33                                         ` Gleb Natapov
  2010-05-30 12:56                                           ` Blue Swirl
  2010-05-30 13:22                                           ` Blue Swirl
  1 sibling, 2 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30 12:33 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
> >> >> IRQs should ever come out afterwards, just like on real HW. For the
> >> >> emulation, this means loss of ticks which should have been delivered
> >> >> before the change.
> >> >>
> >> > I haven't missed it. I describe to you reality of the situation. You want
> >> > to change reality to be more close to what you want it to be by adding
> >> > words to my description.
> >>
> >> Quoting Jan: 'So what to do with the backlog when the timer is
> >> stopped?' I didn't add any words to your description, please be more
> >> careful with your attributions. Why do you think I want to change the
> >> reality?
> > Please refer to my words when you answer to my quote. You quoted my
> > answer to you statement:
> >  Gleb only mentioned the frequency change, I thought that was not so big
> >  problem. But I don't think this case should be allowed happen at all,
> >  it can't exist on real HW.
> 
> With 'this case' I was referring to 'case with timer stopped', not
> 'case which Gleb mentioned'.
> 
> > No 'stopped' was under discussion nowhere.
> 
> It's clearly written there in the sentence Jan wrote.
> 
Jan, not me, but let's leave this topic alone since you agree that
stopped is just a case of frequency change anyway.

> > FWIW 'stopped' is just a case
> > of frequency change.
> 
> True.
> 
> >
> >>
> >> XP frequency change isn't the same case as timer being stopped.
> >>
> > And what is the big difference exactly?
> 
> Because after the timer is stopped, its extremely unrealistic to send
> any IRQs. Whereas if the frequency is changed to some other nonzero
> value, we can cheat and inject some more queued IRQs.
> 
Correct, when the guest disables the clock source (by reset or any other
means), the coalesced backlog should be forgotten.

> Anyway, if this case is not interesting because it doesn't happen in
> real life emulation scenarios, we can forget it no matter how buggy
> the current QEMU implementation is.
> 
> >> > Please just go write code, experiment, debug
> >> > and _then_ come here with design.
> >>
> >> I added some debugging to RTC, PIC and APIC. I also built a small
> >> guest in x86 assembly to test the coalescing. However, in the tests
> >> with this guest and others I noticed that the coalescing only happens
> >> in some obscure conditions.
> > So try with real guest and with real load.
> 
> Well, I'd like to get the test program also trigger it. Now I'm getting:
> apic: write: 00000350 = 00000000
> apic: apic_reset_irq_delivered: old coalescing 0
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 1
> apic: apic_get_irq_delivered: returning coalescing 1
> apic: apic_reset_irq_delivered: old coalescing 1
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 0
> apic: apic_get_irq_delivered: returning coalescing 0
> apic: apic_reset_irq_delivered: old coalescing 0
> apic: apic_local_deliver: vector 3 delivery mode 0
> apic: apic_set_irq: coalescing 0
> 
> It looks like some other IRQs cause the coalescing, because also
> looking at RTC code, it seems it's not possible for RTC to raise the
> IRQ (except update IRQ, alarm etc.) without calling
> apic_reset_irq_delivered().
> 
> I've attached my test program. Compile:
> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
> 
> Run:
> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
> 
The application does not work for me. Looks like it fails to enter
protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
and back.
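
For anyone retracing that jump, the addresses involved can be checked by
hand. The sketch below is not part of the test program. It is a small C
illustration (all function names are invented here) of the arithmetic
that coalescing.S and coalescing.ld appear to rely on: real-mode
segment:offset translation, the 8-byte stride of IDT gates, and the
layout of a 32-bit interrupt gate.

```c
#include <stdint.h>

/* Real-mode linear address: segment * 16 + offset. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Byte offset of a gate inside a protected-mode IDT (8 bytes per gate). */
uint32_t idt_entry_offset(uint8_t vector)
{
    return (uint32_t)vector * 8u;
}

/* Handler offset of a 32-bit gate given as its low and high dwords:
 * bits 0..15 come from the low dword, bits 16..31 from the high dword. */
uint32_t gate_handler_offset(uint32_t low, uint32_t high)
{
    return (high & 0xffff0000u) | (low & 0x0000ffffu);
}

/* Code segment selector of the gate: upper half of the low dword. */
uint16_t gate_selector(uint32_t low)
{
    return (uint16_t)(low >> 16);
}
```

With the constants from coalescing.S, real_mode_linear(0xf000, 0x0000)
gives 0xf0000 for the target of the reset jump, idt_entry_offset(0x70)
gives 0x380, and the gate ".long 0x00081000, 0x000f8e00" decodes to
selector 8 and handler offset 0xf1000.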

> >>
> >> By default the APIC's delivery method for IRQs is ExtInt and
> >> coalescing counting happens only with Fixed. This means that the guest
> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
> >> triggered. But I didn't see both of these to happen simultaneously in
> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
> >> must be used and I also disabled HPET to be sure that RTC would be
> >> used.
> >>
> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
> >> apic_irq_delivered:
> >> apic: apic_set_irq: coalescing 67123
> >> apic: apic_set_irq: coalescing 67124
> >> apic: apic_set_irq: coalescing 67125
> > So have you actually used -rtc-td-hack option? I compiled head of
> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
> > and I get:
> > apic: apic_reset_irq_delivered: old coalescing 3
> > apic: apic_set_irq: coalescing 1
> > apic: apic_get_irq_delivered: returning coalescing 1
> > apic: apic_set_irq: coalescing 2
> > apic: apic_set_irq: coalescing 3
> > apic: apic_set_irq: coalescing 4
> > apic: apic_set_irq: coalescing 5
> > apic: apic_set_irq: coalescing 6
> > apic: apic_reset_irq_delivered: old coalescing 6
> > apic: apic_set_irq: coalescing 1
> > apic: apic_get_irq_delivered: returning coalescing 1
> >
> >>
> >> If the hack were active, the numbers would be close to zero (or at
> >> least some point) because apic_reset_irq_delivered would be called,
> >> but this does not happen. Could you specify a clear test case with
> >> which the coalescing action could be tested? Linux or BSD based,
> >> please.
> > Linux doesn't use the RTC as a time source and I don't know about BSD, so no
> > Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
> > and put heavy load on the host. You can run video inside the guest to
> > trigger coalescing more easily.
> 
> I don't have Windows XP, sorry.
> 
It will be hard to debug Windows time drift without Windows ;) Do you know
what time source BSD uses?

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:33                                         ` Gleb Natapov
@ 2010-05-30 12:56                                           ` Blue Swirl
  2010-05-30 13:49                                             ` Gleb Natapov
  2010-05-30 13:22                                           ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 12:56 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 6150 bytes --]

2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
>> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> >> emulation, this means loss of ticks which should have been delivered
>> >> >> before the change.
>> >> >>
>> >> > I haven't missed it. I describe to you reality of the situation. You want
>> >> > to change reality to be more close to what you want it to be by adding
>> >> > words to my description.
>> >>
>> >> Quoting Jan: 'So what to do with the backlog when the timer is
>> >> stopped?' I didn't add any words to your description, please be more
>> >> careful with your attributions. Why do you think I want to change the
>> >> reality?
>> > Please refer to my words when you answer to my quote. You quoted my
>> > answer to you statement:
>> >  Gleb only mentioned the frequency change, I thought that was not so big
>> >  problem. But I don't think this case should be allowed happen at all,
>> >  it can't exist on real HW.
>>
>> With 'this case' I was referring to 'case with timer stopped', not
>> 'case which Gleb mentioned'.
>>
>> > No 'stopped' was under discussion nowhere.
>>
>> It's clearly written there in the sentence Jan wrote.
>>
> Jan, not me, but let's leave this topic alone since you agree that
> stopped is just a case of frequency change anyway.
>
>> > FWIW 'stopped' is just a case
>> > of frequency change.
>>
>> True.
>>
>> >
>> >>
>> >> XP frequency change isn't the same case as timer being stopped.
>> >>
>> > And what is the big difference exactly?
>>
>> Because after the timer is stopped, its extremely unrealistic to send
>> any IRQs. Whereas if the frequency is changed to some other nonzero
>> value, we can cheat and inject some more queued IRQs.
>>
> Correct, when the guest disables the clock source (by reset or any other
> means), the coalesced backlog should be forgotten.
>
>> Anyway, if this case is not interesting because it doesn't happen in
>> real life emulation scenarios, we can forget it no matter how buggy
>> the current QEMU implementation is.
>>
>> >> > Please just go write code, experiment, debug
>> >> > and _then_ come here with design.
>> >>
>> >> I added some debugging to RTC, PIC and APIC. I also built a small
>> >> guest in x86 assembly to test the coalescing. However, in the tests
>> >> with this guest and others I noticed that the coalescing only happens
>> >> in some obscure conditions.
>> > So try with real guest and with real load.
>>
>> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> apic: write: 00000350 = 00000000
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 1
>> apic: apic_get_irq_delivered: returning coalescing 1
>> apic: apic_reset_irq_delivered: old coalescing 1
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>> apic: apic_get_irq_delivered: returning coalescing 0
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>>
>> It looks like some other IRQs cause the coalescing, because also
>> looking at RTC code, it seems it's not possible for RTC to raise the
>> IRQ (except update IRQ, alarm etc.) without calling
>> apic_reset_irq_delivered().
>>
>> I've attached my test program. Compile:
>> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>>
>> Run:
>> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>>
> The application does not work for me. Looks like it fails to enter
> protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> and back.

Strange. Here's a working binary.

>
>> >>
>> >> By default the APIC's delivery method for IRQs is ExtInt and
>> >> coalescing counting happens only with Fixed. This means that the guest
>> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> >> triggered. But I didn't see both of these to happen simultaneously in
>> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> >> must be used and I also disabled HPET to be sure that RTC would be
>> >> used.
>> >>
>> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> >> apic_irq_delivered:
>> >> apic: apic_set_irq: coalescing 67123
>> >> apic: apic_set_irq: coalescing 67124
>> >> apic: apic_set_irq: coalescing 67125
>> > So have you actually used -rtc-td-hack option? I compiled head of
>> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
>> > and I get:
>> > apic: apic_reset_irq_delivered: old coalescing 3
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> > apic: apic_set_irq: coalescing 2
>> > apic: apic_set_irq: coalescing 3
>> > apic: apic_set_irq: coalescing 4
>> > apic: apic_set_irq: coalescing 5
>> > apic: apic_set_irq: coalescing 6
>> > apic: apic_reset_irq_delivered: old coalescing 6
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> >
>> >>
>> >> If the hack were active, the numbers would be close to zero (or at
>> >> least some point) because apic_reset_irq_delivered would be called,
>> >> but this does not happen. Could you specify a clear test case with
>> >> which the coalescing action could be tested? Linux or BSD based,
>> >> please.
>> > Linux doesn't use the RTC as a time source and I don't know about BSD, so no
>> > Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
>> > and put heavy load on the host. You can run video inside the guest to
>> > trigger coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>>
> Will be hard to debug Windows time drift without Windows ;) Do you know
> what time source BSD uses?

No, at least OpenBSD 4.4 doesn't seem to use RTC.

[-- Attachment #2: coalescing.bin.bz2 --]
[-- Type: application/x-bzip2, Size: 260 bytes --]


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:24                                         ` Jan Kiszka
@ 2010-05-30 12:58                                           ` Blue Swirl
  2010-05-31  7:46                                             ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 12:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>>> Linux doesn't use the RTC as a time source and I don't know about BSD, so no
>>> Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
>>> and put heavy load on the host. You can run video inside the guest to
>>> trigger coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>
> ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

I tried the LiveCD and QEMU versions; both seem to hang at boot. Is that expected?


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:33                                         ` Gleb Natapov
  2010-05-30 12:56                                           ` Blue Swirl
@ 2010-05-30 13:22                                           ` Blue Swirl
  1 sibling, 0 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 13:22 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

2010/5/30 Gleb Natapov <gleb@redhat.com>:
> On Sun, May 30, 2010 at 12:10:16PM +0000, Blue Swirl wrote:
>> >> >> You missed the key word 'stopped'. If the timer is really stopped, no
>> >> >> IRQs should ever come out afterwards, just like on real HW. For the
>> >> >> emulation, this means loss of ticks which should have been delivered
>> >> >> before the change.
>> >> >>
>> >> > I haven't missed it. I describe to you reality of the situation. You want
>> >> > to change reality to be more close to what you want it to be by adding
>> >> > words to my description.
>> >>
>> >> Quoting Jan: 'So what to do with the backlog when the timer is
>> >> stopped?' I didn't add any words to your description, please be more
>> >> careful with your attributions. Why do you think I want to change the
>> >> reality?
>> > Please refer to my words when you answer to my quote. You quoted my
>> > answer to you statement:
>> >  Gleb only mentioned the frequency change, I thought that was not so big
>> >  problem. But I don't think this case should be allowed happen at all,
>> >  it can't exist on real HW.
>>
>> With 'this case' I was referring to 'case with timer stopped', not
>> 'case which Gleb mentioned'.
>>
>> > No 'stopped' was under discussion nowhere.
>>
>> It's clearly written there in the sentence Jan wrote.
>>
> Jan, not me, but let's leave this topic alone since you agree that
> stopped is just a case of frequency change anyway.
>
>> > FWIW 'stopped' is just a case
>> > of frequency change.
>>
>> True.
>>
>> >
>> >>
>> >> XP frequency change isn't the same case as timer being stopped.
>> >>
>> > And what is the big difference exactly?
>>
>> Because after the timer is stopped, its extremely unrealistic to send
>> any IRQs. Whereas if the frequency is changed to some other nonzero
>> value, we can cheat and inject some more queued IRQs.
>>
> Correct, when the guest disables the clock source (by reset or any other
> means), the coalesced backlog should be forgotten.
>
>> Anyway, if this case is not interesting because it doesn't happen in
>> real life emulation scenarios, we can forget it no matter how buggy
>> the current QEMU implementation is.
>>
>> >> > Please just go write code, experiment, debug
>> >> > and _then_ come here with design.
>> >>
>> >> I added some debugging to RTC, PIC and APIC. I also built a small
>> >> guest in x86 assembly to test the coalescing. However, in the tests
>> >> with this guest and others I noticed that the coalescing only happens
>> >> in some obscure conditions.
>> > So try with real guest and with real load.
>>
>> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> apic: write: 00000350 = 00000000
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 1
>> apic: apic_get_irq_delivered: returning coalescing 1
>> apic: apic_reset_irq_delivered: old coalescing 1
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>> apic: apic_get_irq_delivered: returning coalescing 0
>> apic: apic_reset_irq_delivered: old coalescing 0
>> apic: apic_local_deliver: vector 3 delivery mode 0
>> apic: apic_set_irq: coalescing 0
>>
>> It looks like some other IRQs cause the coalescing, because also
>> looking at RTC code, it seems it's not possible for RTC to raise the
>> IRQ (except update IRQ, alarm etc.) without calling
>> apic_reset_irq_delivered().
>>
>> I've attached my test program. Compile:
>> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>>
>> Run:
>> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>>
> The application does not work for me. Looks like it fails to enter
> protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> and back.
>
>> >>
>> >> By default the APIC's delivery method for IRQs is ExtInt and
>> >> coalescing counting happens only with Fixed. This means that the guest
>> >> needs to reprogram APIC. It also looks like RTC interrupts need to be
>> >> triggered. But I didn't see both of these to happen simultaneously in
>> >> my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
>> >> must be used and I also disabled HPET to be sure that RTC would be
>> >> used.
>> >>
>> >> With DEBUG_COALESCING enabled, I just get increasing numbers for
>> >> apic_irq_delivered:
>> >> apic: apic_set_irq: coalescing 67123
>> >> apic: apic_set_irq: coalescing 67124
>> >> apic: apic_set_irq: coalescing 67125
>> > So have you actually used -rtc-td-hack option? I compiled head of
>> > qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
>> > and I get:
>> > apic: apic_reset_irq_delivered: old coalescing 3
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> > apic: apic_set_irq: coalescing 2
>> > apic: apic_set_irq: coalescing 3
>> > apic: apic_set_irq: coalescing 4
>> > apic: apic_set_irq: coalescing 5
>> > apic: apic_set_irq: coalescing 6
>> > apic: apic_reset_irq_delivered: old coalescing 6
>> > apic: apic_set_irq: coalescing 1
>> > apic: apic_get_irq_delivered: returning coalescing 1
>> >
>> >>
>> >> If the hack were active, the numbers would be close to zero (or at
>> >> least some point) because apic_reset_irq_delivered would be called,
>> >> but this does not happen. Could you specify a clear test case with
>> >> which the coalescing action could be tested? Linux or BSD based,
>> >> please.
>> > Linux doesn't use the RTC as a time source and I don't know about BSD, so no
>> > Linux or BSD test case for you, sorry. Run Windows XP with the standard HAL
>> > and put heavy load on the host. You can run video inside the guest to
>> > trigger coalescing more easily.
>>
>> I don't have Windows XP, sorry.
>>
> Will be hard to debug Windows time drift without Windows ;) Do you know
> what time source BSD uses?

Seems to be PIT. OpenBSD:
http://www.openbsd.org/cgi-bin/cvsweb/src/sys/arch/i386/isa/clock.c?rev=1.42

FreeBSD:
http://fxr.watson.org/fxr/source/x86/isa/clock.c?im=bigexcerpts

I didn't find the NetBSD clock code.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:56                                           ` Blue Swirl
@ 2010-05-30 13:49                                             ` Gleb Natapov
  2010-05-30 16:54                                               ` Blue Swirl
  2010-05-30 19:37                                               ` Blue Swirl
  0 siblings, 2 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30 13:49 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> apic: write: 00000350 = 00000000
> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 1
> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 0
> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> apic: apic_set_irq: coalescing 0
> >>
So the interrupt is _always_ coalesced. If apic_get_irq_delivered() returns
0, it means the interrupt was not delivered.
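
The reset/raise/check sequence visible in the log can be modelled in a
few lines. This is a toy sketch, not the QEMU source: the toy_* names,
the struct layout, and the rule that an interrupt counts as delivered
only when its pending bit was not already set are assumptions made for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy interrupt controller: a 256-bit pending register plus a counter
 * of interrupts accepted since the last reset. Hypothetical layout. */
struct toy_apic {
    uint32_t irr[8];
    int irq_delivered;
};

void toy_reset_irq_delivered(struct toy_apic *a)
{
    a->irq_delivered = 0;
}

int toy_get_irq_delivered(const struct toy_apic *a)
{
    return a->irq_delivered;
}

/* Request a vector. It only counts as delivered if it was not already
 * pending; otherwise it merges with the pending one (coalescing). */
void toy_set_irq(struct toy_apic *a, uint8_t vector)
{
    uint32_t mask = 1u << (vector & 31);
    uint32_t *word = &a->irr[vector >> 5];

    if (!(*word & mask)) {
        a->irq_delivered++;
    }
    *word |= mask;
}

/* The guest took the interrupt: clear its pending bit. */
void toy_ack_irq(struct toy_apic *a, uint8_t vector)
{
    a->irr[vector >> 5] &= ~(1u << (vector & 31));
}

/* What a timer device would do on each tick: reset the counter, raise
 * the IRQ, then ask whether it got through. */
bool toy_raise_and_check(struct toy_apic *a, uint8_t vector)
{
    toy_reset_irq_delivered(a);
    toy_set_irq(a, vector);
    return toy_get_irq_delivered(a) != 0;
}
```

In this model the first toy_raise_and_check() on a vector reports
delivery, and every later call reports a coalesced interrupt until
toy_ack_irq() clears the pending bit. That is the same 1, 0, 0 pattern
as in the quoted log, although the model says nothing about why the
pending bit stays set in the real run.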

> >> It looks like some other IRQs cause the coalescing, because also
> >> looking at RTC code, it seems it's not possible for RTC to raise the
> >> IRQ (except update IRQ, alarm etc.) without calling
> >> apic_reset_irq_delivered().
> >>
> >> I've attached my test program. Compile:
> >> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
> >> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
> >>
> >> Run:
> >> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
> >>
> > The application does not work for me. Looks like it fails to enter
> > protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
> > and back.
> 
> Strange. Here's a working binary.
> 
Your binary works here too. What compiler are you using?

--
			Gleb.


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 13:49                                             ` Gleb Natapov
@ 2010-05-30 16:54                                               ` Blue Swirl
  2010-05-30 19:37                                               ` Blue Swirl
  1 sibling, 0 replies; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 16:54 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> apic: write: 00000350 = 00000000
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 1
>> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >>
> So the interrupt is _always_ coalesced. If apic_get_irq_delivered() returns
> 0, it means the interrupt was not delivered.
>
>> >> It looks like some other IRQs cause the coalescing, because also
>> >> looking at RTC code, it seems it's not possible for RTC to raise the
>> >> IRQ (except update IRQ, alarm etc.) without calling
>> >> apic_reset_irq_delivered().
>> >>
>> >> I've attached my test program. Compile:
>> >> gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
>> >> coalescing.ld -g && objcopy -Obinary coalescing coalescing.bin
>> >>
>> >> Run:
>> >> qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
>> >>
>> > The application does not work for me. Looks like it fails to enter
>> > protected mode. $pc jumps from 0x00000000fffffff0 to 0x00000000000f003e
>> > and back.
>>
>> Strange. Here's a working binary.
>>
> Your binary works here too. What compiler are you using?

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian
4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-cld --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)


* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 13:49                                             ` Gleb Natapov
  2010-05-30 16:54                                               ` Blue Swirl
@ 2010-05-30 19:37                                               ` Blue Swirl
  2010-05-30 20:07                                                 ` Gleb Natapov
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 19:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 2037 bytes --]

On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> apic: write: 00000350 = 00000000
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 1
>> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> apic: apic_set_irq: coalescing 0
>> >>
> So the interrupt is _always_ coalesced. If apic_get_irq_delivered() returns
> 0, it means the interrupt was not delivered.

That seems strange. I changed the program so that the handler gets
executed and outputs a dot to the serial port. I also changed the
frequency to 2Hz.
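
The register A values used by the test program (0x26 for 1kHz in the
first version, 0x2f for 2Hz now) follow from the MC146818 rate-select
field in the low four bits. A minimal sketch of that arithmetic,
assuming the usual 32.768kHz time base; rtc_periodic_hz is a name
invented here:

```c
#include <stdint.h>

/* Periodic interrupt frequency in Hz for a given RTC register A value.
 * Only the rate-select field (bits 0..3) matters here; 0 means the
 * periodic interrupt is off. Rate codes 1 and 2 alias to 8 and 9,
 * which gives 256Hz and 128Hz. */
unsigned rtc_periodic_hz(uint8_t reg_a)
{
    unsigned rate = reg_a & 0x0fu;

    if (rate == 0) {
        return 0;
    }
    if (rate <= 2) {
        rate += 7;
    }
    return 32768u >> (rate - 1);
}
```

rtc_periodic_hz(0x26) yields 1024 and rtc_periodic_hz(0x2f) yields 2, so
"1kHz" in the earlier comment is really 1024Hz.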

Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
With -rtc-td-hack, the dots come out much faster. I added
DEBUG_COALESCING also to RTC, with that enabled I get:
qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
cmos: coalesced irqs scaled to 0
cmos: coalesced irqs increased to 1
cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
..cmos: injecting from timer
.cmos: coalesced irqs increased to 2
cmos: injecting on ack

So, there are bogus injections.
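
One way to picture a burst like this is a reinject-on-ack loop whose
delivery feedback never reports success. The sketch below is a toy
model under that assumption, not the RTC code: the toy_* names are
hypothetical, and the cap of 20 is only chosen to match the 20
consecutive "injecting on ack" lines above.

```c
#include <stdbool.h>

#define TOY_REINJECT_ON_ACK_MAX 20  /* assumed cap, see lead-in */

struct toy_rtc {
    int irq_coalesced;          /* ticks believed to be undelivered */
    int reinject_on_ack_count;  /* reinjections since the last full burst */
};

/* Periodic tick: raise the IRQ; 'delivered' is the feedback reported by
 * the interrupt controller. An undelivered tick goes to the backlog. */
void toy_rtc_tick(struct toy_rtc *r, bool delivered)
{
    if (r->reinject_on_ack_count >= TOY_REINJECT_ON_ACK_MAX) {
        r->reinject_on_ack_count = 0;
    }
    if (!delivered) {
        r->irq_coalesced++;
    }
}

/* Guest acknowledges the interrupt. If there is a backlog and the cap
 * is not reached, inject again right away. Returns true if it did. */
bool toy_rtc_ack(struct toy_rtc *r, bool delivered)
{
    if (r->irq_coalesced == 0 ||
        r->reinject_on_ack_count >= TOY_REINJECT_ON_ACK_MAX) {
        return false;
    }
    r->reinject_on_ack_count++;
    if (delivered) {
        r->irq_coalesced--;
    }
    return true;
}

/* Keep acknowledging until no further injection happens; returns the
 * number of injections triggered by acks. */
int toy_rtc_drain_acks(struct toy_rtc *r, bool delivered)
{
    int injected = 0;

    while (toy_rtc_ack(r, delivered)) {
        injected++;
    }
    return injected;
}
```

With feedback that always says "not delivered", one missed tick leads to
20 injections on ack and the backlog stays at 1. With working feedback,
the same missed tick is repaid by a single extra injection.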

[-- Attachment #2: coalescing.S --]
[-- Type: application/octet-stream, Size: 2837 bytes --]


        .section .text, "ax"
        .code16gcc

#define O(port, val)            \
        movb    $val, %al;      \
        out     %al, $port
#define O16P(port, val)         \
        movb    $val, %al;      \
        movw    $port, %dx;     \
        out     %al, (%dx)
#define RTC_O(port, val)        \
        O(0x70, port);          \
        O(0x71, val)
#define RTC_I(port)             \
        O(0x70, port);          \
        in      $0x71, %al

        .org 0x10000
        .globl start
start:
        /* A20 */
        O(0x92, 0x01)

        /* IDT */
        lidtw   %cs:0x2000

        /* GDT */
        lgdtw   %cs:0x2010

        /* Switch to protected mode */
        movl    %cr0, %eax
        orl     $1, %eax
        movl    %eax, %cr0
        ljmpl   $8, $0xf4000
        .org 0x11000

        .code32

        /* INT 70 handler */
        RTC_I(0xc)
        O(0xa0, 0x20)
        O(0x20, 0x20)
        /* Output a dot */
        O16P(0x3f8, '.')
        iret

        .org 0x12000

        /* IDTD */
        .short 0x0400
        .long 0x0f3000

        .org 0x12010

        /* GDTD */
        .short 0x0020
        .long 0xf3400

        .org 0x13380

        /* IDT entry for INT 70 */
        .long 0x00081000, 0x000f8e00

        .org 0x13408

        /* GDT entry for 1st descriptor, CS */
        .short 0xffff, 0x0000, 0x9b00, 0x00cf
        /* GDT entry for 2nd descriptor, DS etc. */
        .short 0xffff, 0x0000, 0x9300, 0x00cf

        .org 0x14000
        movl    $0x10, %eax
        movw    %ax, %ds
        movw    %ax, %ss
        movl    $0x1000, %esp

        /* Master PIC */
        /* ICW1 */
        O(0x20, 0x11)
        /* ICW2 */
        O(0x21, 0x08)
        /* ICW3 */
        O(0x21, 0x04)
        /* ICW4 */
        O(0x21, 0x01)
        /* OCW1: only slave IRQs */
        O(0x21, 0xfb)

        /* Slave PIC */
        /* ICW1 */
        O(0xa0, 0x11)
        /* ICW2 */
        O(0xa1, 0x70)
        /* ICW3 */
        O(0xa1, 0x02)
        /* ICW4 */
        O(0xa1, 0x01)
        /* OCW1: only RTC IRQ */
        O(0xa1, 0xfe)

        /* Serial */
        /* LCR */
        O16P(0x3fb, 0x87)
        /* DLM */
        O16P(0x3f9, 0x00)
        /* DLL */
        O16P(0x3f8, 0x0c)
        /* LCR */
        O16P(0x3fb, 0x07)
        /* MCR */
        O16P(0x3fc, 0x0f)

        /* set up APIC LVT */
        movl    $0x0700, %eax
        mov     %eax, 0xfee00350

        /* RTC: frequency 2Hz */
        RTC_O(0xa, 0x2f)
        /* Enable IRQ */
        RTC_O(0xb, 0x40)

        sti
        nop
1:
        hlt
        jmp     1b

        .section .reset, "ax"
        .globl  entry
        .code16gcc
entry:
        cli
        ljmp    $0xf000, $0x0000
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop


[-- Attachment #3: coalescing.bin.bz2 --]
[-- Type: application/x-bzip2, Size: 285 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 19:37                                               ` Blue Swirl
@ 2010-05-30 20:07                                                 ` Gleb Natapov
  2010-05-30 20:21                                                   ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-30 20:07 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> >> apic: write: 00000350 = 00000000
> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 1
> >> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 0
> >> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> apic: apic_set_irq: coalescing 0
> >> >>
> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> > 0 it means the interrupt was not delivered.
> 
> That seems strange. I changed the program so that the handler gets
> executed, also output a dot to serial from the handler. I changed the
> frequency to 2Hz.
> 
> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
> With -rtc-td-hack, the dots come out much faster. I added
> DEBUG_COALESCING also to RTC, with that enabled I get:
> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
> cmos: coalesced irqs scaled to 0
> cmos: coalesced irqs increased to 1
> cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> .cmos: injecting on ack
> ..cmos: injecting from timer
> .cmos: coalesced irqs increased to 2
> cmos: injecting on ack
> 
> So, there are bogus injections.

It looks like the IRR in the APIC is never cleared, probably a bug in the
userspace APIC emulation. I'll look into it. Try routing the interrupt via
the APIC (not ExtInt), or use qemu-kvm with the in-kernel irqchip.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 20:07                                                 ` Gleb Natapov
@ 2010-05-30 20:21                                                   ` Blue Swirl
  2010-05-31  5:19                                                     ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-05-30 20:21 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
>> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> >> apic: write: 00000350 = 00000000
>> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 1
>> >> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 0
>> >> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> apic: apic_set_irq: coalescing 0
>> >> >>
>> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
>> > 0 it means the interrupt was not delivered.
>>
>> That seems strange. I changed the program so that the handler gets
>> executed, also output a dot to serial from the handler. I changed the
>> frequency to 2Hz.
>>
>> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
>> With -rtc-td-hack, the dots come out much faster. I added
>> DEBUG_COALESCING also to RTC, with that enabled I get:
>> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
>> cmos: coalesced irqs scaled to 0
>> cmos: coalesced irqs increased to 1
>> cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> .cmos: injecting on ack
>> ..cmos: injecting from timer
>> .cmos: coalesced irqs increased to 2
>> cmos: injecting on ack
>>
>> So, there are bogus injections.
>
> Looks like irr in apic is never cleared. Probably bug in userspace apic
> emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> or use qemu-kvm with in kernel irq chip.

By APIC, do you mean Fixed delivery mode? With that, the IRQ is not
delivered at all.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 20:21                                                   ` Blue Swirl
@ 2010-05-31  5:19                                                     ` Gleb Natapov
  2010-06-01 18:00                                                       ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-05-31  5:19 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, May 30, 2010 at 08:21:30PM +0000, Blue Swirl wrote:
> On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
> >> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
> >> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
> >> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
> >> >> >> apic: write: 00000350 = 00000000
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 1
> >> >> >> apic: apic_get_irq_delivered: returning coalescing 1
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 1
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 0
> >> >> >> apic: apic_get_irq_delivered: returning coalescing 0
> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
> >> >> >> apic: apic_set_irq: coalescing 0
> >> >> >>
> >> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
> >> > 0 it means the interrupt was not delivered.
> >>
> >> That seems strange. I changed the program so that the handler gets
> >> executed, also output a dot to serial from the handler. I changed the
> >> frequency to 2Hz.
> >>
> >> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
> >> With -rtc-td-hack, the dots come out much faster. I added
> >> DEBUG_COALESCING also to RTC, with that enabled I get:
> >> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
> >> cmos: coalesced irqs scaled to 0
> >> cmos: coalesced irqs increased to 1
> >> cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> .cmos: injecting on ack
> >> ..cmos: injecting from timer
> >> .cmos: coalesced irqs increased to 2
> >> cmos: injecting on ack
> >>
> >> So, there are bogus injections.
> >
> > Looks like irr in apic is never cleared. Probably bug in userspace apic
> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> > or use qemu-kvm with in kernel irq chip.
> 
> With APIC you mean Fixed? Then the IRQ is not delivered at all.
You need to deliver it through the IOAPIC.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-30 12:58                                           ` Blue Swirl
@ 2010-05-31  7:46                                             ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-05-31  7:46 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 707 bytes --]

Blue Swirl wrote:
> On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>>> Linux don't use RTC as time source and I don't know about BSD, so no
>>>> Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
>>>> heavy load on the host. You can run video inside the gust to trigger
>>>> coalescing more easily.
>>> I don't have Windows XP, sorry.
>> ReactOS [1], at least its 32-bit version, appears to use the RTC as well.
> 
> I tried LiveCD and QEMU versions, both seem to hang at boot. Is that expected?

Live-CD works for me (in contrast to the pre-installed image in KVM
mode), but it doesn't show the desired RTC usage.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-05-31  5:19                                                     ` Gleb Natapov
@ 2010-06-01 18:00                                                       ` Blue Swirl
  2010-06-01 18:30                                                         ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-01 18:00 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 3245 bytes --]

On Mon, May 31, 2010 at 5:19 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, May 30, 2010 at 08:21:30PM +0000, Blue Swirl wrote:
>> On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sun, May 30, 2010 at 07:37:59PM +0000, Blue Swirl wrote:
>> >> On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> >> > On Sun, May 30, 2010 at 12:56:26PM +0000, Blue Swirl wrote:
>> >> >> >> Well, I'd like to get the test program also trigger it. Now I'm getting:
>> >> >> >> apic: write: 00000350 = 00000000
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 1
>> >> >> >> apic: apic_get_irq_delivered: returning coalescing 1
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 1
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 0
>> >> >> >> apic: apic_get_irq_delivered: returning coalescing 0
>> >> >> >> apic: apic_reset_irq_delivered: old coalescing 0
>> >> >> >> apic: apic_local_deliver: vector 3 delivery mode 0
>> >> >> >> apic: apic_set_irq: coalescing 0
>> >> >> >>
>> >> > So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
>> >> > 0 it means the interrupt was not delivered.
>> >>
>> >> That seems strange. I changed the program so that the handler gets
>> >> executed, also output a dot to serial from the handler. I changed the
>> >> frequency to 2Hz.
>> >>
>> >> Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
>> >> With -rtc-td-hack, the dots come out much faster. I added
>> >> DEBUG_COALESCING also to RTC, with that enabled I get:
>> >> qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
>> >> cmos: coalesced irqs scaled to 0
>> >> cmos: coalesced irqs increased to 1
>> >> cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> .cmos: injecting on ack
>> >> ..cmos: injecting from timer
>> >> .cmos: coalesced irqs increased to 2
>> >> cmos: injecting on ack
>> >>
>> >> So, there are bogus injections.
>> >
>> > Looks like irr in apic is never cleared. Probably bug in userspace apic
>> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
>> > or use qemu-kvm with in kernel irq chip.
>>
>> With APIC you mean Fixed? Then the IRQ is not delivered at all.
> You need to deliver it through IOAPIC.

In this version, when USE_APIC is defined, IRQs are routed via the IOAPIC
and APIC, and PIC interrupts are disabled. Otherwise, the IOAPIC and APIC
are left in their reset state and the PIC is used.

With the PIC version, there are bogus injections. With the IOAPIC and
APIC, the dots come at 2Hz. -enable-kvm has no effect.
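
For reference, the IOAPIC register indices used for the RTC pin can be
sketched in C as below. This is only an illustration: it assumes the
82093AA layout (redirection table starting at index 0x10, two 32-bit
registers per pin), and the helper names are made up. The explicit
parentheses matter in C, where + binds tighter than <<; the expression in
the attached assembly relies on the assembler's own precedence rules.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 82093AA layout: redirection table entry n occupies two
 * 32-bit indirect registers, starting at index 0x10. */
#define IOAPIC_REDTBL_BASE 0x10u

/* Hypothetical helper: index of the low dword (vector, delivery mode,
 * mask bit) for a given input pin. */
static uint8_t ioredtbl_lo(unsigned pin)
{
    return (uint8_t)(IOAPIC_REDTBL_BASE + (pin << 1));
}

/* Hypothetical helper: index of the high dword (destination field). */
static uint8_t ioredtbl_hi(unsigned pin)
{
    return (uint8_t)(IOAPIC_REDTBL_BASE + (pin << 1) + 1);
}
```

For pin 8 (the RTC) this gives indices 0x20 and 0x21, which are the two
registers the USE_APIC path programs.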

[-- Attachment #2: coalescing.S --]
[-- Type: application/octet-stream, Size: 3433 bytes --]


#define USE_APIC

        .section .text, "ax"
        .code16gcc

#define O(port, val)            \
        movb    $val, %al;      \
        out     %al, $port
#define O16P(port, val)         \
        movb    $val, %al;      \
        movw    $port, %dx;     \
        out     %al, (%dx)
#define RTC_O(port, val)        \
        O(0x70, port);          \
        O(0x71, val)
#define RTC_I(port)             \
        O(0x70, port);          \
        in      $0x71, %al
#define IOAPIC_O(port, val)     \
        movb    $port, 0xfec00000; \
        movl    $val, 0xfec00010

        .org 0x10000
        .globl start
start:
        /* A20 */
        O(0x92, 0x01)

        /* IDT */
        lidtw   %cs:0x2000

        /* GDT */
        lgdtw   %cs:0x2010

        /* Switch to protected mode */
        movl    %cr0, %eax
        orl     $1, %eax
        movl    %eax, %cr0
        ljmpl   $8, $0xf4000
        .org 0x11000

        .code32

        /* INT 70 handler */
        /* Clear RTC IRQ */
        RTC_I(0xc)
#ifdef USE_APIC
        /* Clear APIC IRQ */
        movl $0x00, 0xfee000b0
#else
        /* Clear PIC IRQs */
        O(0xa0, 0x20)
        O(0x20, 0x20)
#endif
        /* Output a dot */
        O16P(0x3f8, '.')
        iret

        .org 0x12000

        /* IDTD */
        .short 0x0400
        .long 0x0f3000

        .org 0x12010

        /* GDTD */
        .short 0x0020
        .long 0xf3400

        .org 0x13380

        /* IDT entry for INT 70 */
        .long 0x00081000, 0x000f8e00

        .org 0x13408

        /* GDT entry for 1st descriptor, CS */
        .short 0xffff, 0x0000, 0x9b00, 0x00cf
        /* GDT entry for 2nd descriptor, DS etc. */
        .short 0xffff, 0x0000, 0x9300, 0x00cf

        .org 0x14000
        movl    $0x10, %eax
        movw    %ax, %ds
        movw    %ax, %ss
        movl    $0x1000, %esp

        /* Master PIC */
        /* ICW1 */
        O(0x20, 0x11)
        /* ICW2 */
        O(0x21, 0x08)
        /* ICW3 */
        O(0x21, 0x04)
        /* ICW4 */
        O(0x21, 0x01)
#ifdef USE_APIC
        /* OCW1: no IRQs */
        O(0x21, 0xff)
#else
        /* OCW1: only slave IRQs */
        O(0x21, 0xfb)
#endif
        /* Slave PIC */
        /* ICW1 */
        O(0xa0, 0x11)
        /* ICW2 */
        O(0xa1, 0x70)
        /* ICW3 */
        O(0xa1, 0x02)
        /* ICW4 */
        O(0xa1, 0x01)
#ifdef USE_APIC
        /* OCW1: no IRQs */
        O(0xa1, 0xff)
#else
        /* OCW1: only RTC IRQs */
        O(0xa1, 0xfe)
#endif

        /* Serial */
        /* LCR */
        O16P(0x3fb, 0x87)
        /* DLM */
        O16P(0x3f9, 0x00)
        /* DLL */
        O16P(0x3f8, 0x0c)
        /* LCR */
        O16P(0x3fb, 0x07)
        /* MCR */
        O16P(0x3fc, 0x0f)

#ifdef USE_APIC
        /* set up APIC LVT */
        movl    $0x5700, 0xfee00350

        /* enable APIC */
        movl    $0x100, 0xfee000f0

        /* set up IOAPIC redirection table entry 8 (RTC): vector 0x70 */
        IOAPIC_O(0x10 + (8 << 1), 0x0070)
        IOAPIC_O(0x10 + (8 << 1) + 1, 0)
#endif

        /* RTC: frequency 2Hz */
        RTC_O(0xa, 0x2f)
        /* Enable IRQ */
        RTC_O(0xb, 0x40)

        sti
        nop
1:
        hlt
        jmp     1b

        .section .reset, "ax"
        .globl  entry
        .code16gcc
entry:
        cli
        ljmp    $0xf000, $0x0000
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop
        nop


[-- Attachment #3: coalescing_apic.bin.bz2 --]
[-- Type: application/x-bzip2, Size: 314 bytes --]

[-- Attachment #4: coalescing_pic.bin.bz2 --]
[-- Type: application/x-bzip2, Size: 273 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-01 18:00                                                       ` Blue Swirl
@ 2010-06-01 18:30                                                         ` Gleb Natapov
  2010-06-02 19:05                                                           ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-01 18:30 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Tue, Jun 01, 2010 at 06:00:20PM +0000, Blue Swirl wrote:
> >> > Looks like irr in apic is never cleared. Probably bug in userspace apic
> >> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
> >> > or use qemu-kvm with in kernel irq chip.
> >>
> >> With APIC you mean Fixed? Then the IRQ is not delivered at all.
> > You need to deliver it through IOAPIC.
> 
> In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
> and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
> left to reset state and PIC is used.
> 
> With the PIC version, there are bogus injections. With IOAPIC and APIC
> dots come at 2Hz. -enable-kvm has no effect.

I looked into this briefly. The virtual wire case is not handled
correctly, and the PIC lacks coalescing detection altogether (the reason,
BTW, is that the proper solution was rejected, so only the hack required
by Windows was added). Windows routes RTC interrupts through the IOAPIC,
and this case works, as you can see. -enable-kvm on qemu should not make
any difference since it uses the same userspace pic/ioapic/apic code as
TCG. It would be interesting to try with qemu-kvm, which uses in-kernel
irq chips by default.

BTW, I managed to compile your test program with gcc 4.2. 4.4 and 4.3
produced non-working binaries for me.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-01 18:30                                                         ` Gleb Natapov
@ 2010-06-02 19:05                                                           ` Blue Swirl
  2010-06-03  6:23                                                             ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-02 19:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Tue, Jun 1, 2010 at 6:30 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Tue, Jun 01, 2010 at 06:00:20PM +0000, Blue Swirl wrote:
>> >> > Looks like irr in apic is never cleared. Probably bug in userspace apic
>> >> > emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
>> >> > or use qemu-kvm with in kernel irq chip.
>> >>
>> >> With APIC you mean Fixed? Then the IRQ is not delivered at all.
>> > You need to deliver it through IOAPIC.
>>
>> In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
>> and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
>> left to reset state and PIC is used.
>>
>> With the PIC version, there are bogus injections. With IOAPIC and APIC
>> dots come at 2Hz. -enable-kvm has no effect.
>
> I looked into this briefly. Virtual wire case is not handled correctly,
> and pic lacks coalescing detection at all (the reason BTW because proper
> solution was rejected, so only hack required by Windows was added).
> Windows routes RTC interrupts through IOAPIC and this case works as you
> can see. -enable-kvm on qemu should not make any difference since it
> uses the same userspace pic/ioapic/apic code as TCG. It would be
> interesting to try with qemu-kvm which uses in-kernel irq chips by
> default.
>
> BTW I managed to compile you test program with gcc 4.2. 4.4 and 4.3
> produced non-working binaries for me.

This exercise showed me that the IRQ connectivity between the APIC and
the RTC can vary a lot depending on guest activity. So it may be
impossible to know at the APIC whether some IRQ came from the RTC, and so
the coalescing must be accounted for at the RTC (or close to it). This
makes a generic solution less useful; I think the net result would be to
refactor rtc_coalesced_timer*() into a new file. Also, my generic
approach of time base adjustment would need to be steered by the RTC,
because the APIC can't do it. So I think I'll have to abandon this idea
for now. Sorry for the delay, and thanks for the interesting discussion.

Coming back to $SUBJECT: I don't think the bidirectional IRQ is a good
solution either, but it may be slightly better than the current hack (I
didn't realize the full ugliness of the hack before, namely that the APIC
and RTC are really not connected at all). The same effect could be
achieved with two signals going in opposite directions, like
APICREQ/APICACK on the 82093 IOAPIC, but again, if several IRQs point to
the same vector at the IOAPIC, there is no way to get back to the
originator. With bidirectional IRQs, there's a one-shot chance at
delivery time, but not later at ACK/EOI time.

But how about introducing a message-based IRQ instead? The message could
specify the originator device, maybe ACK/coalesce/NACK callbacks, and a
bigger payload than just 1 bit of level. I think that could achieve the
same coalescing effect as the bidirectional IRQ. The payload could be
useful for other purposes; for example, Sparc64 IRQ messages contain
three 64-bit words.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-02 19:05                                                           ` Blue Swirl
@ 2010-06-03  6:23                                                             ` Jan Kiszka
  2010-06-03  6:34                                                               ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-03  6:23 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

Blue Swirl wrote:
> But how about if we introduced instead a message based IRQ? Then the
> message could specify the originator device, maybe ACK/coalesce/NACK
> callbacks and a bigger payload than just 1 bit of level. I think that
> could achieve the same coalescing effect as what the bidirectional
> IRQ. The payload could be useful for other purposes, for example
> Sparc64 IRQ messages contain three 64 bit words.

If there are more users than just IRQ de-coalescing, this indeed sounds
superior. We could pass objects like this one around:

struct qemu_irq_msg {
	void (*delivery_cb)(int result);
	void *payload;
};

They would be valid within the scope of the IRQ handlers. Whoever
terminates or actually delivers the IRQ would invoke the callback. And
platforms like sparc64 could evaluate the additional payload pointer in
their irqchips or wherever they need it. IRQ routers on platforms that
make use of these messages would have to replicate them when forwarding
an event.

OK?

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-03  6:23                                                             ` Jan Kiszka
@ 2010-06-03  6:34                                                               ` Gleb Natapov
  2010-06-03  6:59                                                                 ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-03  6:34 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> Blue Swirl wrote:
> > But how about if we introduced instead a message based IRQ? Then the
> > message could specify the originator device, maybe ACK/coalesce/NACK
> > callbacks and a bigger payload than just 1 bit of level. I think that
> > could achieve the same coalescing effect as what the bidirectional
> > IRQ. The payload could be useful for other purposes, for example
> > Sparc64 IRQ messages contain three 64 bit words.
> 
> If there are more users than just IRQ de-coalescing, this indeed sounds
> superior. We could pass objects like this one around:
> 
> struct qemu_irq_msg {
> 	void (*delivery_cb)(int result);
> 	void *payload;
> };
> 
> They would be valid within the scope of the IRQ handlers. Whoever
> terminates or actually delivers the IRQ would invoke the callback. And
> platforms like sparc64 could evaluate the additional payload pointer in
> their irqchips or wherever they need it. IRQ routers on platforms that
> make use of these messages would have to replicate them when forwarding
> an event.
> 
> OK?
> 
Let me see if I understand you correctly. qemu_set_irq() will get an
additional qemu_irq_msg parameter, and if the irq was not coalesced,
delivery_cb is called, so there is a guarantee that if delivery_cb is
called, it is done before qemu_set_irq() returns. Correct?

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-03  6:34                                                               ` Gleb Natapov
@ 2010-06-03  6:59                                                                 ` Jan Kiszka
  2010-06-03  7:03                                                                   ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-03  6:59 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Blue Swirl, qemu-devel, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 1782 bytes --]

Gleb Natapov wrote:
> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>> Blue Swirl wrote:
>>> But how about if we introduced instead a message based IRQ? Then the
>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>> could achieve the same coalescing effect as what the bidirectional
>>> IRQ. The payload could be useful for other purposes, for example
>>> Sparc64 IRQ messages contain three 64 bit words.
>> If there are more users than just IRQ de-coalescing, this indeed sounds
>> superior. We could pass objects like this one around:
>>
>> struct qemu_irq_msg {
>> 	void (*delivery_cb)(int result);
>> 	void *payload;
>> };
>>
>> They would be valid within the scope of the IRQ handlers. Whoever
>> terminates or actually delivers the IRQ would invoke the callback. And
>> platforms like sparc64 could evaluate the additional payload pointer in
>> their irqchips or wherever they need it. IRQ routers on platforms that
>> make use of these messages would have to replicate them when forwarding
>> an event.
>>
>> OK?
>>
> Let me see if I understand you correctly. qemu_set_irq() will get
> additional parameter qemu_irq_msg and if irq was not coalesced
> delivery_cb is called, so there is a guaranty that if delivery_cb is
> called it is done before qemu_set_irq() returns. Correct?

If the side that triggers an IRQ passes a message object with a non-NULL
callback, the callback is supposed to be called unconditionally, passing
the result of the delivery (delivered, masked, coalesced). And yes, the
callback will be invoked in the context of the irq handler, so before
qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
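
Roughly, and only as a sketch: the result codes, the toy controller and
toy_set_irq_msg() below are invented for illustration and stand in for
whatever a real qemu_set_irq_msg would look like. Only the struct layout
is taken from the proposal above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the thread only names the three outcomes. */
enum { IRQ_DELIVERED, IRQ_MASKED, IRQ_COALESCED };

struct qemu_irq_msg {
    void (*delivery_cb)(int result);
    void *payload;
};

/* Toy interrupt controller: one line, a mask bit and a pending bit. */
struct toy_pic {
    int masked;
    int pending;
};

/* Stand-in for the proposed qemu_set_irq_msg(); not an existing function. */
static void toy_set_irq_msg(struct toy_pic *pic, struct qemu_irq_msg *msg)
{
    int result;

    if (pic->masked) {
        result = IRQ_MASKED;
    } else if (pic->pending) {
        result = IRQ_COALESCED;   /* previous event not yet consumed */
    } else {
        pic->pending = 1;
        result = IRQ_DELIVERED;
    }

    /* Report unconditionally and synchronously, i.e. before returning
     * to the device that raised the IRQ. */
    if (msg && msg->delivery_cb) {
        msg->delivery_cb(result);
    }
}

/* Example source-side callback that just records the outcome. */
static int last_result = -1;

static void record_result(int result)
{
    last_result = result;
}
```

A source such as the RTC would pass record_result (or its own accounting
function) in the message and could inspect the outcome as soon as the
call returns.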

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-03  6:59                                                                 ` Jan Kiszka
@ 2010-06-03  7:03                                                                   ` Gleb Natapov
  2010-06-03  7:06                                                                     ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-03  7:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> >> Blue Swirl wrote:
> >>> But how about if we introduced instead a message based IRQ? Then the
> >>> message could specify the originator device, maybe ACK/coalesce/NACK
> >>> callbacks and a bigger payload than just 1 bit of level. I think that
> >>> could achieve the same coalescing effect as what the bidirectional
> >>> IRQ. The payload could be useful for other purposes, for example
> >>> Sparc64 IRQ messages contain three 64 bit words.
> >> If there are more users than just IRQ de-coalescing, this indeed sounds
> >> superior. We could pass objects like this one around:
> >>
> >> struct qemu_irq_msg {
> >> 	void (*delivery_cb)(int result);
> >> 	void *payload;
> >> };
> >>
> >> They would be valid within the scope of the IRQ handlers. Whoever
> >> terminates or actually delivers the IRQ would invoke the callback. And
> >> platforms like sparc64 could evaluate the additional payload pointer in
> >> their irqchips or wherever they need it. IRQ routers on platforms that
> >> make use of these messages would have to replicate them when forwarding
> >> an event.
> >>
> >> OK?
> >>
> > Let me see if I understand you correctly. qemu_set_irq() will get
> > additional parameter qemu_irq_msg and if irq was not coalesced
> > delivery_cb is called, so there is a guaranty that if delivery_cb is
> > called it is done before qemu_set_irq() returns. Correct?
> 
> If the side that triggers an IRQ passes a message object with a non-NULL
> callback, it is supposed to be called unconditionally, passing the
> result of the delivery (delivered, masked, coalesced). And yes, the
> callback will be invoked in the context of the irq handler, so before
> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
> 
Looks fine to me.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-03  7:03                                                                   ` Gleb Natapov
@ 2010-06-03  7:06                                                                     ` Gleb Natapov
  2010-06-04 19:05                                                                       ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-03  7:06 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
> > Gleb Natapov wrote:
> > > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
> > >> Blue Swirl wrote:
> > >>> But how about if we introduced instead a message based IRQ? Then the
> > >>> message could specify the originator device, maybe ACK/coalesce/NACK
> > >>> callbacks and a bigger payload than just 1 bit of level. I think that
> > >>> could achieve the same coalescing effect as what the bidirectional
> > >>> IRQ. The payload could be useful for other purposes, for example
> > >>> Sparc64 IRQ messages contain three 64 bit words.
> > >> If there are more users than just IRQ de-coalescing, this indeed sounds
> > >> superior. We could pass objects like this one around:
> > >>
> > >> struct qemu_irq_msg {
> > >> 	void (*delivery_cb)(int result);
> > >> 	void *payload;
> > >> };
> > >>
> > >> They would be valid within the scope of the IRQ handlers. Whoever
> > >> terminates or actually delivers the IRQ would invoke the callback. And
> > >> platforms like sparc64 could evaluate the additional payload pointer in
> > >> their irqchips or wherever they need it. IRQ routers on platforms that
> > >> make use of these messages would have to replicate them when forwarding
> > >> an event.
> > >>
> > >> OK?
> > >>
> > > Let me see if I understand you correctly. qemu_set_irq() will get
> > > additional parameter qemu_irq_msg and if irq was not coalesced
> > > delivery_cb is called, so there is a guaranty that if delivery_cb is
> > > called it is done before qemu_set_irq() returns. Correct?
> > 
> > If the side that triggers an IRQ passes a message object with a non-NULL
> > callback, it is supposed to be called unconditionally, passing the
> > result of the delivery (delivered, masked, coalesced). And yes, the
> > callback will be invoked in the context of the irq handler, so before
> > qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
> > 
> Looks fine to me.
> 
Except that delivery_cb should probably get a pointer to qemu_irq_msg as
a parameter.
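
A rough sketch of why the pointer helps, assuming the source embeds the
message in its own state: the callback can then find its device without any
global variable. The toy_rtc device, the result codes and the container
macro below are made up for illustration and are not taken from QEMU.

```c
#include <stddef.h>

/* Hypothetical result codes, values are an assumption. */
enum toy_irq_result { TOY_IRQ_DELIVERED, TOY_IRQ_MASKED, TOY_IRQ_COALESCED };

/* The proposed message, with the callback now receiving the message itself. */
struct qemu_irq_msg {
    void (*delivery_cb)(struct qemu_irq_msg *msg, int result);
    void *payload;
};

/* Illustrative device state that embeds its message. */
struct toy_rtc {
    int coalesced;             /* number of coalesced ticks seen so far */
    struct qemu_irq_msg msg;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void toy_rtc_delivery_cb(struct qemu_irq_msg *msg, int result)
{
    struct toy_rtc *rtc = toy_container_of(msg, struct toy_rtc, msg);

    if (result == TOY_IRQ_COALESCED) {
        rtc->coalesced++;
    }
}

static void toy_rtc_init(struct toy_rtc *rtc)
{
    rtc->coalesced = 0;
    rtc->msg.delivery_cb = toy_rtc_delivery_cb;
    rtc->msg.payload = NULL;
}

/* Stand-in for whoever terminates the IRQ and reports the outcome. */
static void toy_report(struct qemu_irq_msg *msg, int result)
{
    if (msg && msg->delivery_cb) {
        msg->delivery_cb(msg, result);
    }
}
```

With the int-only signature, two RTC instances sharing one callback could
not tell their results apart.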

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-03  7:06                                                                     ` Gleb Natapov
@ 2010-06-04 19:05                                                                       ` Blue Swirl
  2010-06-05  0:04                                                                         ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-04 19:05 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>> > Gleb Natapov wrote:
>> > > On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>> > >> Blue Swirl wrote:
>> > >>> But how about if we introduced instead a message based IRQ? Then the
>> > >>> message could specify the originator device, maybe ACK/coalesce/NACK
>> > >>> callbacks and a bigger payload than just 1 bit of level. I think that
>> > >>> could achieve the same coalescing effect as what the bidirectional
>> > >>> IRQ. The payload could be useful for other purposes, for example
>> > >>> Sparc64 IRQ messages contain three 64 bit words.
>> > >> If there are more users than just IRQ de-coalescing, this indeed sounds
>> > >> superior. We could pass objects like this one around:
>> > >>
>> > >> struct qemu_irq_msg {
>> > >>  void (*delivery_cb)(int result);
>> > >>  void *payload;
>> > >> };
>> > >>
>> > >> They would be valid within the scope of the IRQ handlers. Whoever
>> > >> terminates or actually delivers the IRQ would invoke the callback. And
>> > >> platforms like sparc64 could evaluate the additional payload pointer in
>> > >> their irqchips or wherever they need it. IRQ routers on platforms that
>> > >> make use of these messages would have to replicate them when forwarding
>> > >> an event.
>> > >>
>> > >> OK?
>> > >>
>> > > Let me see if I understand you correctly. qemu_set_irq() will get
>> > > additional parameter qemu_irq_msg and if irq was not coalesced
>> > > delivery_cb is called, so there is a guaranty that if delivery_cb is
>> > > called it is done before qemu_set_irq() returns. Correct?
>> >
>> > If the side that triggers an IRQ passes a message object with a non-NULL
>> > callback, it is supposed to be called unconditionally, passing the
>> > result of the delivery (delivered, masked, coalesced). And yes, the
>> > callback will be invoked in the context of the irq handler, so before
>> > qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>> >
>> Looks fine to me.
>>
> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
> parameter.

I'd like to also support EOI handling. When the guest clears the
interrupt condition, the EOI callback would be called. This could occur
much later than the IRQ delivery time. I'm not sure if we need the
result code in that case.

If any intermediate device (IOAPIC?) needs to be informed about either
delivery or EOI also, it could create a proxy message with its
callbacks in place. But we need then a separate opaque field (in
addition to payload) to store the original message.

struct IRQMsg {
 DeviceState *src;
 void (*delivery_cb)(IRQMsg *msg, int result);
 void (*eoi_cb)(IRQMsg *msg, int result);
 void *src_opaque;
 void *payload;
};
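
As a sketch of the proxy idea only: an intermediate device could wrap the
incoming message, hook its own delivery callback, and chain to the original
through src_opaque. The toy_* functions, the result value and the choice to
copy src and payload into the proxy are assumptions; the struct is the one
above with the typedefs C needs to compile it.

```c
#include <stddef.h>

typedef struct DeviceState DeviceState;   /* left opaque in this sketch */
typedef struct IRQMsg IRQMsg;

struct IRQMsg {
    DeviceState *src;
    void (*delivery_cb)(IRQMsg *msg, int result);
    void (*eoi_cb)(IRQMsg *msg, int result);
    void *src_opaque;
    void *payload;
};

#define TOY_IRQ_DELIVERED 0   /* hypothetical result code */

static int toy_proxy_seen;    /* deliveries observed by the intermediate */

/* Proxy callback: note the event, then pass it on to the original message. */
static void toy_proxy_delivery_cb(IRQMsg *proxy, int result)
{
    IRQMsg *orig = proxy->src_opaque;

    toy_proxy_seen++;
    if (orig && orig->delivery_cb) {
        orig->delivery_cb(orig, result);
    }
}

/* Intermediate device: replace the message with a proxy when forwarding. */
static void toy_router_forward(IRQMsg *orig, void (*next)(IRQMsg *msg))
{
    IRQMsg proxy = {
        .src = orig->src,
        .delivery_cb = toy_proxy_delivery_cb,
        .eoi_cb = NULL,
        .src_opaque = orig,        /* link back to the original message */
        .payload = orig->payload,
    };

    next(&proxy);
}

/* Final destination: accepts the interrupt and reports the outcome. */
static void toy_cpu_accept(IRQMsg *msg)
{
    if (msg->delivery_cb) {
        msg->delivery_cb(msg, TOY_IRQ_DELIVERED);
    }
}

static int toy_source_result = -1;

static void toy_source_delivery_cb(IRQMsg *msg, int result)
{
    (void)msg;
    toy_source_result = result;
}
```

The proxy here lives on the router's stack, which only works as long as the
destination reports back before returning. An EOI arriving later would need
a longer-lived object.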

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-04 19:05                                                                       ` Blue Swirl
@ 2010-06-05  0:04                                                                         ` Jan Kiszka
  2010-06-05  7:20                                                                           ` Blue Swirl
  2010-06-06  7:15                                                                           ` Gleb Natapov
  0 siblings, 2 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-06-05  0:04 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 3211 bytes --]

Blue Swirl wrote:
> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>> Gleb Natapov wrote:
>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>> Blue Swirl wrote:
>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>> superior. We could pass objects like this one around:
>>>>>>
>>>>>> struct qemu_irq_msg {
>>>>>>  void (*delivery_cb)(int result);
>>>>>>  void *payload;
>>>>>> };
>>>>>>
>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>> an event.
>>>>>>
>>>>>> OK?
>>>>>>
>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>> callback, it is supposed to be called unconditionally, passing the
>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>> callback will be invoked in the context of the irq handler, so before
>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>
>>> Looks fine to me.
>>>
>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>> parameter.
> 
> I'd like to also support EOI handling. When the guest clears the
> interrupt condtion, the EOI callback would be called. This could occur
> much later than the IRQ delivery time. I'm not sure if we need the
> result code in that case.
> 
> If any intermediate device (IOAPIC?) needs to be informed about either
> delivery or EOI also, it could create a proxy message with its
> callbacks in place. But we need then a separate opaque field (in
> addition to payload) to store the original message.
> 
> struct IRQMsg {
>  DeviceState *src;
>  void (*delivery_cb)(IRQMsg *msg, int result);
>  void (*eoi_cb)(IRQMsg *msg, int result);
>  void *src_opaque;
>  void *payload;
> };

Extending the lifetime of IRQMsg objects beyond the delivery call stack
means qemu_malloc/free for every delivery. I think it takes a _very_
appealing reason to justify this. But so far I do not see any use case
for eoi_cb at all.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-05  0:04                                                                         ` Jan Kiszka
@ 2010-06-05  7:20                                                                           ` Blue Swirl
  2010-06-05  8:27                                                                             ` Jan Kiszka
  2010-06-06  7:15                                                                           ` Gleb Natapov
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-05  7:20 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>> Gleb Natapov wrote:
>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>> Blue Swirl wrote:
>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>> superior. We could pass objects like this one around:
>>>>>>>
>>>>>>> struct qemu_irq_msg {
>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>  void *payload;
>>>>>>> };
>>>>>>>
>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>> an event.
>>>>>>>
>>>>>>> OK?
>>>>>>>
>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>> callback will be invoked in the context of the irq handler, so before
>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>
>>>> Looks fine to me.
>>>>
>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>> parameter.
>>
>> I'd like to also support EOI handling. When the guest clears the
>> interrupt condtion, the EOI callback would be called. This could occur
>> much later than the IRQ delivery time. I'm not sure if we need the
>> result code in that case.
>>
>> If any intermediate device (IOAPIC?) needs to be informed about either
>> delivery or EOI also, it could create a proxy message with its
>> callbacks in place. But we need then a separate opaque field (in
>> addition to payload) to store the original message.
>>
>> struct IRQMsg {
>>  DeviceState *src;
>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>  void *src_opaque;
>>  void *payload;
>> };
>
> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> means qemu_malloc/free for every delivery. I think it takes a _very_
> appealing reason to justify this. But so far I do not see any use case
> for eio_cb at all.

I think it's safer to use an allocation model anyway because this will
be generic code. For example, an intermediate device may want to queue
the IRQs. Alternatively, the callbacks could use DeviceState and some
opaque which can be used as the callback context:
  void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);

EOI can be added later if needed; QEMU seems to work fine now without
it. But based on the IOAPIC data sheet, I'd suppose EOI needs to be
passed from the LAPIC to the IOAPIC. Coalescing could use it as
another opportunity to inject IRQs, though I guess the guest will ack
the IRQ at the same time for both RTC and APIC.
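
To illustrate the queueing case, a small sketch under assumptions: the
intermediate device copies the callback context (DeviceState, opaque,
callback) into a heap entry, so nothing points into the sender's stack once
the sender has returned. The queue type and toy_* helpers are invented for
this example, and plain malloc/free stand in for qemu_malloc/qemu_free.

```c
#include <stdlib.h>

typedef struct DeviceState DeviceState;   /* left opaque in this sketch */

/* Callback signature from the alternative suggested above. */
typedef void (*toy_delivery_cb)(DeviceState *src, void *src_opaque, int result);

/* One deferred interrupt, owned by the queue. */
struct toy_queued_irq {
    DeviceState *src;
    void *src_opaque;
    toy_delivery_cb delivery_cb;
    struct toy_queued_irq *next;
};

/* Simple FIFO; both pointers are NULL when the queue is empty. */
struct toy_irq_queue {
    struct toy_queued_irq *head;
    struct toy_queued_irq *tail;
};

/* Copy the context into a heap entry; returns 0 on success, -1 on OOM. */
static int toy_queue_irq(struct toy_irq_queue *q, DeviceState *src,
                         void *src_opaque, toy_delivery_cb cb)
{
    struct toy_queued_irq *e = malloc(sizeof(*e));

    if (!e) {
        return -1;
    }
    e->src = src;
    e->src_opaque = src_opaque;
    e->delivery_cb = cb;
    e->next = NULL;
    if (q->tail) {
        q->tail->next = e;
    } else {
        q->head = e;
    }
    q->tail = e;
    return 0;
}

/* Deliver everything queued so far; returns the number of entries handled. */
static int toy_flush_queue(struct toy_irq_queue *q, int result)
{
    int n = 0;

    while (q->head) {
        struct toy_queued_irq *e = q->head;

        q->head = e->next;
        if (e->delivery_cb) {
            e->delivery_cb(e->src, e->src_opaque, result);
        }
        free(e);
        n++;
    }
    q->tail = NULL;
    return n;
}

/* Example callback: src_opaque points at a counter owned by the source. */
static void toy_count_delivery(DeviceState *src, void *src_opaque, int result)
{
    (void)src;
    (void)result;
    ++*(int *)src_opaque;
}
```

The cost is one allocation per queued interrupt, and the source must keep
whatever src_opaque points at alive until the queue has been flushed.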

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-05  7:20                                                                           ` Blue Swirl
@ 2010-06-05  8:27                                                                             ` Jan Kiszka
  2010-06-05  9:23                                                                               ` Blue Swirl
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-05  8:27 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 4423 bytes --]

Blue Swirl wrote:
> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>> Gleb Natapov wrote:
>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>> Blue Swirl wrote:
>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>
>>>>>>>> struct qemu_irq_msg {
>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>  void *payload;
>>>>>>>> };
>>>>>>>>
>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>> an event.
>>>>>>>>
>>>>>>>> OK?
>>>>>>>>
>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>
>>>>> Looks fine to me.
>>>>>
>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>> parameter.
>>> I'd like to also support EOI handling. When the guest clears the
>>> interrupt condtion, the EOI callback would be called. This could occur
>>> much later than the IRQ delivery time. I'm not sure if we need the
>>> result code in that case.
>>>
>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>> delivery or EOI also, it could create a proxy message with its
>>> callbacks in place. But we need then a separate opaque field (in
>>> addition to payload) to store the original message.
>>>
>>> struct IRQMsg {
>>>  DeviceState *src;
>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>  void *src_opaque;
>>>  void *payload;
>>> };
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
> 
> I think it's safer to use allocation model anyway because this will be
> generic code. For example, an intermediate device may want to queue
> the IRQs. Alternatively, the callbacks could use DeviceState and some
> opaque which can be used as the callback context:
>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
> 
> EOI can be added later if needed, QEMU seems to work fine now without
> it. But based on IOAPIC data sheet, I'd suppose it should be need to
> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
> another opportunity to inject IRQs though I guess the guest will ack
> the IRQ at the same time for both RTC and APIC.

Let's wait for a real use case for an extended IRQMsg lifetime. For now
we are fine with stack-allocated messages which are way simpler to
handle. I'm already drafting a first prototype based on this model.
Switching to dynamic allocation may still happen later on once an
urgent need shows up.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-05  8:27                                                                             ` Jan Kiszka
@ 2010-06-05  9:23                                                                               ` Blue Swirl
  2010-06-05 12:14                                                                                 ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-05  9:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> Blue Swirl wrote:
>> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>> Blue Swirl wrote:
>>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>>> Gleb Natapov wrote:
>>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>>
>>>>>>>>> struct qemu_irq_msg {
>>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>>  void *payload;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>>> an event.
>>>>>>>>>
>>>>>>>>> OK?
>>>>>>>>>
>>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>>
>>>>>> Looks fine to me.
>>>>>>
>>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>>> parameter.
>>>> I'd like to also support EOI handling. When the guest clears the
>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>> result code in that case.
>>>>
>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>> delivery or EOI also, it could create a proxy message with its
>>>> callbacks in place. But we need then a separate opaque field (in
>>>> addition to payload) to store the original message.
>>>>
>>>> struct IRQMsg {
>>>>  DeviceState *src;
>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>  void *src_opaque;
>>>>  void *payload;
>>>> };
>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>> appealing reason to justify this. But so far I do not see any use case
>>> for eio_cb at all.
>>
>> I think it's safer to use allocation model anyway because this will be
>> generic code. For example, an intermediate device may want to queue
>> the IRQs. Alternatively, the callbacks could use DeviceState and some
>> opaque which can be used as the callback context:
>>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
>>
>> EOI can be added later if needed, QEMU seems to work fine now without
>> it. But based on IOAPIC data sheet, I'd suppose it should be need to
>> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
>> another opportunity to inject IRQs though I guess the guest will ack
>> the IRQ at the same time for both RTC and APIC.
>
> Let's wait for a real use case for an extended IRQMsg lifetime. For now
> we are fine with stack-allocated messages which are way simpler to
> handle. I'm already drafting a first prototype based on this model.
> Switching to dynamic allocation may still happen later on once the
> urgent need shows up.

Passing around stack-allocated objects is asking for trouble. I'd much
rather use the DeviceState/opaque version then, so that at least the
destination should not need to use IRQMsg for anything.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-05  9:23                                                                               ` Blue Swirl
@ 2010-06-05 12:14                                                                                 ` Jan Kiszka
  0 siblings, 0 replies; 122+ messages in thread
From: Jan Kiszka @ 2010-06-05 12:14 UTC (permalink / raw)
  To: Blue Swirl; +Cc: qemu-devel, Gleb Natapov, Juan Quintela

[-- Attachment #1: Type: text/plain, Size: 5161 bytes --]

Blue Swirl wrote:
> On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> Blue Swirl wrote:
>>> On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>>>> Blue Swirl wrote:
>>>>> On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov <gleb@redhat.com> wrote:
>>>>>> On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
>>>>>>> On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
>>>>>>>> Gleb Natapov wrote:
>>>>>>>>> On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
>>>>>>>>>> Blue Swirl wrote:
>>>>>>>>>>> But how about if we introduced instead a message based IRQ? Then the
>>>>>>>>>>> message could specify the originator device, maybe ACK/coalesce/NACK
>>>>>>>>>>> callbacks and a bigger payload than just 1 bit of level. I think that
>>>>>>>>>>> could achieve the same coalescing effect as what the bidirectional
>>>>>>>>>>> IRQ. The payload could be useful for other purposes, for example
>>>>>>>>>>> Sparc64 IRQ messages contain three 64 bit words.
>>>>>>>>>> If there are more users than just IRQ de-coalescing, this indeed sounds
>>>>>>>>>> superior. We could pass objects like this one around:
>>>>>>>>>>
>>>>>>>>>> struct qemu_irq_msg {
>>>>>>>>>>  void (*delivery_cb)(int result);
>>>>>>>>>>  void *payload;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>> They would be valid within the scope of the IRQ handlers. Whoever
>>>>>>>>>> terminates or actually delivers the IRQ would invoke the callback. And
>>>>>>>>>> platforms like sparc64 could evaluate the additional payload pointer in
>>>>>>>>>> their irqchips or wherever they need it. IRQ routers on platforms that
>>>>>>>>>> make use of these messages would have to replicate them when forwarding
>>>>>>>>>> an event.
>>>>>>>>>>
>>>>>>>>>> OK?
>>>>>>>>>>
>>>>>>>>> Let me see if I understand you correctly. qemu_set_irq() will get
>>>>>>>>> additional parameter qemu_irq_msg and if irq was not coalesced
>>>>>>>>> delivery_cb is called, so there is a guaranty that if delivery_cb is
>>>>>>>>> called it is done before qemu_set_irq() returns. Correct?
>>>>>>>> If the side that triggers an IRQ passes a message object with a non-NULL
>>>>>>>> callback, it is supposed to be called unconditionally, passing the
>>>>>>>> result of the delivery (delivered, masked, coalesced). And yes, the
>>>>>>>> callback will be invoked in the context of the irq handler, so before
>>>>>>>> qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
>>>>>>>>
>>>>>>> Looks fine to me.
>>>>>>>
>>>>>> Except that delivery_cb should probably get pointer to qemu_irq_msg as a
>>>>>> parameter.
>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>> result code in that case.
>>>>>
>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>> delivery or EOI also, it could create a proxy message with its
>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>> addition to payload) to store the original message.
>>>>>
>>>>> struct IRQMsg {
>>>>>  DeviceState *src;
>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>  void *src_opaque;
>>>>>  void *payload;
>>>>> };
>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>> appealing reason to justify this. But so far I do not see any use case
>>>> for eio_cb at all.
>>> I think it's safer to use allocation model anyway because this will be
>>> generic code. For example, an intermediate device may want to queue
>>> the IRQs. Alternatively, the callbacks could use DeviceState and some
>>> opaque which can be used as the callback context:
>>>   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
>>>
>>> EOI can be added later if needed, QEMU seems to work fine now without
>>> it. But based on IOAPIC data sheet, I'd suppose it should be need to
>>> pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
>>> another opportunity to inject IRQs though I guess the guest will ack
>>> the IRQ at the same time for both RTC and APIC.
>> Let's wait for a real use case for an extended IRQMsg lifetime. For now
>> we are fine with stack-allocated messages which are way simpler to
>> handle. I'm already drafting a first prototype based on this model.
>> Switching to dynamic allocation may still happen later on once the
>> urgent need shows up.
> 
> Passing around stack allocated objects is asking for trouble. I'd much
> rather use the DeviceState/opaque version then, so at least
> destination should not need to use IRQMsg for anything.

Right, I've hidden the IRQMsg object from the target handler, temporarily
storing it in qemu_irq instead. qemu_irq_handler had to be touched
anyway, so I'm now passing the IRQ object to it so that it can invoke
services to trigger the delivery callback or obtain the payload.
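
For readers following along, here is a minimal sketch of the scheme
described above. It is not the posted patch: every name in it (IRQState,
irq_set_msg, irq_fire_delivery_cb, latch_handler, record_result) is
hypothetical, and the callback uses the src_opaque/result signature
suggested earlier in the thread. The message lives on the caller's stack
and is only reachable through the IRQ object while the delivery is in
progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the actual QEMU API. */
enum irq_result { IRQ_DELIVERED, IRQ_MASKED, IRQ_COALESCED };

typedef struct IRQMsg {
    void (*delivery_cb)(void *src_opaque, int result);
    void *src_opaque;
} IRQMsg;

typedef struct IRQState IRQState;
typedef void (*irq_handler)(IRQState *irq, void *opaque, int n, int level);

struct IRQState {
    irq_handler handler;
    void *opaque;
    int n;
    IRQMsg *msg;    /* valid only while a delivery is in progress */
};

/* Service for target handlers: report the outcome if the source asked. */
static void irq_fire_delivery_cb(IRQState *irq, int result)
{
    if (irq->msg != NULL && irq->msg->delivery_cb != NULL) {
        irq->msg->delivery_cb(irq->msg->src_opaque, result);
    }
}

/* Set the line; msg may be NULL and may live on the caller's stack. */
static void irq_set_msg(IRQState *irq, int level, IRQMsg *msg)
{
    assert(irq != NULL && irq->handler != NULL);
    irq->msg = msg;
    irq->handler(irq, irq->opaque, irq->n, level);
    irq->msg = NULL;    /* the message must not outlive this call */
}

/* Example target: a one-bit pending latch that coalesces repeated raises. */
static void latch_handler(IRQState *irq, void *opaque, int n, int level)
{
    int *pending = opaque;

    (void)n;
    if (!level) {
        return;         /* lowering the line is not reported in this sketch */
    }
    if (*pending) {
        irq_fire_delivery_cb(irq, IRQ_COALESCED);
    } else {
        *pending = 1;
        irq_fire_delivery_cb(irq, IRQ_DELIVERED);
    }
}

/* Example source callback: remember the last reported result. */
static void record_result(void *src_opaque, int result)
{
    *(int *)src_opaque = result;
}
```

Because the target only ever sees the IRQ object, it cannot keep a
pointer to the message beyond the call, which is the property that makes
stack allocation tolerable here.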

Jan


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-05  0:04                                                                         ` Jan Kiszka
  2010-06-05  7:20                                                                           ` Blue Swirl
@ 2010-06-06  7:15                                                                           ` Gleb Natapov
  2010-06-06  7:39                                                                             ` Jan Kiszka
  2010-06-06  7:39                                                                             ` Blue Swirl
  1 sibling, 2 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-06-06  7:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> > I'd like to also support EOI handling. When the guest clears the
> > interrupt condtion, the EOI callback would be called. This could occur
> > much later than the IRQ delivery time. I'm not sure if we need the
> > result code in that case.
> > 
> > If any intermediate device (IOAPIC?) needs to be informed about either
> > delivery or EOI also, it could create a proxy message with its
> > callbacks in place. But we need then a separate opaque field (in
> > addition to payload) to store the original message.
> > 
> > struct IRQMsg {
> >  DeviceState *src;
> >  void (*delivery_cb)(IRQMsg *msg, int result);
> >  void (*eoi_cb)(IRQMsg *msg, int result);
> >  void *src_opaque;
> >  void *payload;
> > };
> 
> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> means qemu_malloc/free for every delivery. I think it takes a _very_
> appealing reason to justify this. But so far I do not see any use case
> for eio_cb at all.
> 
I dislike the use of EOI for reinjecting missed interrupts since
it eliminates use of the internal PIC/APIC queue of not yet delivered
interrupts. PIC and APIC have an internal queue that can hold two elements:
one is the delivered but not yet acked interrupt in ISR, and the other is
the pending interrupt in IRR. Using an EOI callback (or ack notifier, as it's
called inside the kernel), an interrupt will be considered coalesced even if
IRR is cleared but no ack was received for the previously delivered interrupt.
But ack notifiers actually have another use: device assignment. There is
a plan to move device assignment from kernel to userspace, and for that
ack notifiers will have to be extended to userspace too. If so, we can
use them to do IRQ de-coalescing as well. I doubt they should be part
of IRQMsg though. Why not do what the kernel does: have a globally registered
notifier based on irqchip/pin?
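
The difference between the two coalescing criteria can be illustrated
with a deliberately simplified, single-vector model. This is only a
sketch: PicLine and the pic_* helpers are hypothetical names, and
priorities, masking and multiple vectors are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified single-vector model of an interrupt controller (assumption). */
typedef struct {
    bool irr;   /* request pending, not yet taken by the CPU */
    bool isr;   /* taken by the CPU, not yet acknowledged via EOI */
} PicLine;

/* Raise the line: true if queued in IRR, false if coalesced. */
static bool pic_raise(PicLine *l)
{
    if (l->irr) {
        return false;
    }
    l->irr = true;
    return true;
}

/* CPU takes the pending interrupt: IRR moves to ISR if ISR is free. */
static bool pic_accept(PicLine *l)
{
    if (!l->irr || l->isr) {
        return false;
    }
    l->irr = false;
    l->isr = true;
    return true;
}

/* Guest acknowledges the in-service interrupt. */
static void pic_eoi(PicLine *l)
{
    l->isr = false;
}

/* EOI-based policy: anything raised before the ack counts as coalesced. */
static bool eoi_policy_would_coalesce(const PicLine *l)
{
    return l->irr || l->isr;
}
```

With one interrupt in service and IRR free, pic_raise() still succeeds,
whereas eoi_policy_would_coalesce() already reports a coalesced
interrupt. That is the lost queue slot the paragraph above refers to.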

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  7:15                                                                           ` Gleb Natapov
@ 2010-06-06  7:39                                                                             ` Jan Kiszka
  2010-06-06  7:49                                                                               ` Gleb Natapov
  2010-06-06  7:39                                                                             ` Blue Swirl
  1 sibling, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-06  7:39 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>> I'd like to also support EOI handling. When the guest clears the
>>> interrupt condtion, the EOI callback would be called. This could occur
>>> much later than the IRQ delivery time. I'm not sure if we need the
>>> result code in that case.
>>>
>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>> delivery or EOI also, it could create a proxy message with its
>>> callbacks in place. But we need then a separate opaque field (in
>>> addition to payload) to store the original message.
>>>
>>> struct IRQMsg {
>>>  DeviceState *src;
>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>  void *src_opaque;
>>>  void *payload;
>>> };
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
>>
> I dislike use of eoi for reinfecting missing interrupts since
> it eliminates use of internal PIC/APIC queue of not yet delivered
> interrupts. PIC and APIC has internal queue that can handle two elements:
> one is delivered, but not yet acked interrupt in isr and another is
> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> called inside kernel) interrupt will be considered coalesced even if irr
> is cleared, but no ack was received for previously delivered interrupt.
> But ack notifiers actually has another use: device assignment. There is
> a plan to move device assignment from kernel to userspace and for that
> ack notifiers will have to be extended to userspace too. If so we can
> use them to do irq decoalescing as well. I doubt they should be part
> of IRQMsg though. Why not do what kernel does: have globally registered
> notifier based on irqchip/pin.

I read this twice but I still don't get your plan. Do you like or
dislike using EOI for de-coalescing? And how should these notifiers work?

Jan


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  7:15                                                                           ` Gleb Natapov
  2010-06-06  7:39                                                                             ` Jan Kiszka
@ 2010-06-06  7:39                                                                             ` Blue Swirl
  2010-06-06  8:07                                                                               ` Gleb Natapov
  1 sibling, 1 reply; 122+ messages in thread
From: Blue Swirl @ 2010-06-06  7:39 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>> > I'd like to also support EOI handling. When the guest clears the
>> > interrupt condtion, the EOI callback would be called. This could occur
>> > much later than the IRQ delivery time. I'm not sure if we need the
>> > result code in that case.
>> >
>> > If any intermediate device (IOAPIC?) needs to be informed about either
>> > delivery or EOI also, it could create a proxy message with its
>> > callbacks in place. But we need then a separate opaque field (in
>> > addition to payload) to store the original message.
>> >
>> > struct IRQMsg {
>> >  DeviceState *src;
>> >  void (*delivery_cb)(IRQMsg *msg, int result);
>> >  void (*eoi_cb)(IRQMsg *msg, int result);
>> >  void *src_opaque;
>> >  void *payload;
>> > };
>>
>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>> means qemu_malloc/free for every delivery. I think it takes a _very_
>> appealing reason to justify this. But so far I do not see any use case
>> for eio_cb at all.
>>
> I dislike use of eoi for reinfecting missing interrupts since
> it eliminates use of internal PIC/APIC queue of not yet delivered
> interrupts. PIC and APIC has internal queue that can handle two elements:
> one is delivered, but not yet acked interrupt in isr and another is
> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> called inside kernel) interrupt will be considered coalesced even if irr
> is cleared, but no ack was received for previously delivered interrupt.
> But ack notifiers actually has another use: device assignment. There is
> a plan to move device assignment from kernel to userspace and for that
> ack notifiers will have to be extended to userspace too. If so we can
> use them to do irq decoalescing as well. I doubt they should be part
> of IRQMsg though. Why not do what kernel does: have globally registered
> notifier based on irqchip/pin.

Because translation at the IOAPIC may be lossy, with IRQs from many devices
pointing to the same vector? With IRQMsg you know where a specific
message came from. The situation is different inside the kernel: it
manages both translation and registration, whereas in QEMU we could
only control registration.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  7:39                                                                             ` Jan Kiszka
@ 2010-06-06  7:49                                                                               ` Gleb Natapov
  2010-06-06  8:07                                                                                 ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-06  7:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>> I'd like to also support EOI handling. When the guest clears the
> >>> interrupt condtion, the EOI callback would be called. This could occur
> >>> much later than the IRQ delivery time. I'm not sure if we need the
> >>> result code in that case.
> >>>
> >>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>> delivery or EOI also, it could create a proxy message with its
> >>> callbacks in place. But we need then a separate opaque field (in
> >>> addition to payload) to store the original message.
> >>>
> >>> struct IRQMsg {
> >>>  DeviceState *src;
> >>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>  void *src_opaque;
> >>>  void *payload;
> >>> };
> >> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >> means qemu_malloc/free for every delivery. I think it takes a _very_
> >> appealing reason to justify this. But so far I do not see any use case
> >> for eio_cb at all.
> >>
> > I dislike use of eoi for reinfecting missing interrupts since
> > it eliminates use of internal PIC/APIC queue of not yet delivered
> > interrupts. PIC and APIC has internal queue that can handle two elements:
> > one is delivered, but not yet acked interrupt in isr and another is
> > pending interrupt in irr. Using eoi callback (or ack notifier as it's
> > called inside kernel) interrupt will be considered coalesced even if irr
> > is cleared, but no ack was received for previously delivered interrupt.
> > But ack notifiers actually has another use: device assignment. There is
> > a plan to move device assignment from kernel to userspace and for that
> > ack notifiers will have to be extended to userspace too. If so we can
> > use them to do irq decoalescing as well. I doubt they should be part
> > of IRQMsg though. Why not do what kernel does: have globally registered
> > notifier based on irqchip/pin.
> 
> I read this twice but I still don't get your plan. Do you like or
> dislike using EIO for de-coalescing? And how should these notifiers work?
> 
That's because I confused myself :) I _dislike_ them being used, but
since device assignment requires ack notifiers anyway, maybe it is better
to introduce one mechanism for device assignment + de-coalescing instead
of introducing two different mechanisms. Using ack notifiers should be
easy: the RTC registers an ack notifier and keeps track of delivered
interrupts. If the timer triggers after the previous IRQ was set but before
it was acked, the coalesced counter is incremented. In the ack notifier
callback the coalesced counter is checked, and if it is not zero a new IRQ
is set.
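
A minimal sketch of the bookkeeping just described, assuming a single
RTC line. RtcModel and the rtc_* functions are hypothetical names; how
the notifier would be registered and routed is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical RTC-side state for ack-notifier based de-coalescing. */
typedef struct {
    bool in_flight;      /* an IRQ was set and has not been acked yet */
    unsigned coalesced;  /* timer ticks that fired while in_flight */
    unsigned injected;   /* IRQs actually raised so far */
} RtcModel;

/* Raise the IRQ line towards the interrupt controller (modelled only). */
static void rtc_raise(RtcModel *s)
{
    s->in_flight = true;
    s->injected++;
}

/* Periodic timer fired: raise, or count the tick if the last one is unacked. */
static void rtc_timer_tick(RtcModel *s)
{
    if (s->in_flight) {
        s->coalesced++;
    } else {
        rtc_raise(s);
    }
}

/* Ack notifier callback: reinject one missed tick if any are outstanding. */
static void rtc_ack_notifier(RtcModel *s)
{
    assert(s->in_flight);
    s->in_flight = false;
    if (s->coalesced > 0) {
        s->coalesced--;
        rtc_raise(s);
    }
}
```

Three ticks with no ack in between yield one injected IRQ and a counter
of two; each later ack then reinjects one tick until the counter drains.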

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  7:39                                                                             ` Blue Swirl
@ 2010-06-06  8:07                                                                               ` Gleb Natapov
  0 siblings, 0 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-06-06  8:07 UTC (permalink / raw)
  To: Blue Swirl; +Cc: Jan Kiszka, qemu-devel, Juan Quintela

On Sun, Jun 06, 2010 at 07:39:49AM +0000, Blue Swirl wrote:
> On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >> > I'd like to also support EOI handling. When the guest clears the
> >> > interrupt condtion, the EOI callback would be called. This could occur
> >> > much later than the IRQ delivery time. I'm not sure if we need the
> >> > result code in that case.
> >> >
> >> > If any intermediate device (IOAPIC?) needs to be informed about either
> >> > delivery or EOI also, it could create a proxy message with its
> >> > callbacks in place. But we need then a separate opaque field (in
> >> > addition to payload) to store the original message.
> >> >
> >> > struct IRQMsg {
> >> >  DeviceState *src;
> >> >  void (*delivery_cb)(IRQMsg *msg, int result);
> >> >  void (*eoi_cb)(IRQMsg *msg, int result);
> >> >  void *src_opaque;
> >> >  void *payload;
> >> > };
> >>
> >> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >> means qemu_malloc/free for every delivery. I think it takes a _very_
> >> appealing reason to justify this. But so far I do not see any use case
> >> for eio_cb at all.
> >>
> > I dislike use of eoi for reinfecting missing interrupts since
> > it eliminates use of internal PIC/APIC queue of not yet delivered
> > interrupts. PIC and APIC has internal queue that can handle two elements:
> > one is delivered, but not yet acked interrupt in isr and another is
> > pending interrupt in irr. Using eoi callback (or ack notifier as it's
> > called inside kernel) interrupt will be considered coalesced even if irr
> > is cleared, but no ack was received for previously delivered interrupt.
> > But ack notifiers actually has another use: device assignment. There is
> > a plan to move device assignment from kernel to userspace and for that
> > ack notifiers will have to be extended to userspace too. If so we can
> > use them to do irq decoalescing as well. I doubt they should be part
> > of IRQMsg though. Why not do what kernel does: have globally registered
> > notifier based on irqchip/pin.
> 
> Because translation at IOAPIC may be lossy, IRQs from many devices
> pointing to the same vector? With IRQMsg you know where a specific
> message came from. The situation is different inside the kernel: it
> manages both translation and registration, whereas in QEMU we could
> only control registration.
Configuring the IOAPIC like that is against the x86 architecture. The OS
would not be able to map from the interrupt vector back to the device.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  7:49                                                                               ` Gleb Natapov
@ 2010-06-06  8:07                                                                                 ` Jan Kiszka
  2010-06-06  9:23                                                                                   ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-06  8:07 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>> result code in that case.
>>>>>
>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>> delivery or EOI also, it could create a proxy message with its
>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>> addition to payload) to store the original message.
>>>>>
>>>>> struct IRQMsg {
>>>>>  DeviceState *src;
>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>  void *src_opaque;
>>>>>  void *payload;
>>>>> };
>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>> appealing reason to justify this. But so far I do not see any use case
>>>> for eio_cb at all.
>>>>
>>> I dislike use of eoi for reinfecting missing interrupts since
>>> it eliminates use of internal PIC/APIC queue of not yet delivered
>>> interrupts. PIC and APIC has internal queue that can handle two elements:
>>> one is delivered, but not yet acked interrupt in isr and another is
>>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
>>> called inside kernel) interrupt will be considered coalesced even if irr
>>> is cleared, but no ack was received for previously delivered interrupt.
>>> But ack notifiers actually has another use: device assignment. There is
>>> a plan to move device assignment from kernel to userspace and for that
>>> ack notifiers will have to be extended to userspace too. If so we can
>>> use them to do irq decoalescing as well. I doubt they should be part
>>> of IRQMsg though. Why not do what kernel does: have globally registered
>>> notifier based on irqchip/pin.
>> I read this twice but I still don't get your plan. Do you like or
>> dislike using EIO for de-coalescing? And how should these notifiers work?
>>
> That's because I confused myself :) I _dislike_ them to be used, but
> since device assignment requires ack notifiers anyway may be it is better
> to introduce one mechanism for device assignmen + de-coalescing instead
> of introducing two different mechanism. Using ack notifiers should be
> easy: RTC registers ack notifier and keep track of delivered interrupts.
> If timer triggers after previews irq was set, but before it was acked
> coalesced counter is incremented. In ack notifier callback coalesced
> counter is checked and if it is not zero new irq is set.

Ack notifier registrations and event deliveries still need to be routed.
Piggy-backing this on IRQ messages may be unavoidable for that reason.

Anyway, I'm going to post my HPET updates with the infrastructure for
IRQMsg now. Maybe it's helpful to see the other option in reality.

Jan


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  8:07                                                                                 ` Jan Kiszka
@ 2010-06-06  9:23                                                                                   ` Gleb Natapov
  2010-06-06 10:10                                                                                     ` Jan Kiszka
  0 siblings, 1 reply; 122+ messages in thread
From: Gleb Natapov @ 2010-06-06  9:23 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> >> Gleb Natapov wrote:
> >>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>>>> I'd like to also support EOI handling. When the guest clears the
> >>>>> interrupt condtion, the EOI callback would be called. This could occur
> >>>>> much later than the IRQ delivery time. I'm not sure if we need the
> >>>>> result code in that case.
> >>>>>
> >>>>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>>>> delivery or EOI also, it could create a proxy message with its
> >>>>> callbacks in place. But we need then a separate opaque field (in
> >>>>> addition to payload) to store the original message.
> >>>>>
> >>>>> struct IRQMsg {
> >>>>>  DeviceState *src;
> >>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>>>  void *src_opaque;
> >>>>>  void *payload;
> >>>>> };
> >>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >>>> means qemu_malloc/free for every delivery. I think it takes a _very_
> >>>> appealing reason to justify this. But so far I do not see any use case
> >>>> for eio_cb at all.
> >>>>
> >>> I dislike use of eoi for reinfecting missing interrupts since
> >>> it eliminates use of internal PIC/APIC queue of not yet delivered
> >>> interrupts. PIC and APIC has internal queue that can handle two elements:
> >>> one is delivered, but not yet acked interrupt in isr and another is
> >>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
> >>> called inside kernel) interrupt will be considered coalesced even if irr
> >>> is cleared, but no ack was received for previously delivered interrupt.
> >>> But ack notifiers actually has another use: device assignment. There is
> >>> a plan to move device assignment from kernel to userspace and for that
> >>> ack notifiers will have to be extended to userspace too. If so we can
> >>> use them to do irq decoalescing as well. I doubt they should be part
> >>> of IRQMsg though. Why not do what kernel does: have globally registered
> >>> notifier based on irqchip/pin.
> >> I read this twice but I still don't get your plan. Do you like or
> >> dislike using EIO for de-coalescing? And how should these notifiers work?
> >>
> > That's because I confused myself :) I _dislike_ them to be used, but
> > since device assignment requires ack notifiers anyway may be it is better
> > to introduce one mechanism for device assignmen + de-coalescing instead
> > of introducing two different mechanism. Using ack notifiers should be
> > easy: RTC registers ack notifier and keep track of delivered interrupts.
> > If timer triggers after previews irq was set, but before it was acked
> > coalesced counter is incremented. In ack notifier callback coalesced
> > counter is checked and if it is not zero new irq is set.
> 
> Ack notifier registrations and event deliveries still need to be routed.
> Piggy-backing this on IRQ messages may be unavoidable for that reason.
It is done in the kernel without piggy-backing.

> 
> Anyway, I'm going to post my HPET updates with the infrastructure for
> IRQMsg now. Maybe it's helpful to see the other option in reality.
> 
One other thing to consider: the current approach does not always work.
Win2K3-64bit-smp and Win2k8-64bit-smp configure the RTC interrupt to be
broadcast to all CPUs, but only the boot CPU does time calculation. With
the current approach, if the interrupt is delivered to at least one vCPU
it will not be considered coalesced, but if the CPU it was delivered to is
not the CPU that does time accounting then the clock will drift.

--
			Gleb.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06  9:23                                                                                   ` Gleb Natapov
@ 2010-06-06 10:10                                                                                     ` Jan Kiszka
  2010-06-06 10:27                                                                                       ` Gleb Natapov
  0 siblings, 1 reply; 122+ messages in thread
From: Jan Kiszka @ 2010-06-06 10:10 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Blue Swirl, qemu-devel, Juan Quintela

Gleb Natapov wrote:
> On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
>> Gleb Natapov wrote:
>>> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
>>>> Gleb Natapov wrote:
>>>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
>>>>>>> I'd like to also support EOI handling. When the guest clears the
>>>>>>> interrupt condtion, the EOI callback would be called. This could occur
>>>>>>> much later than the IRQ delivery time. I'm not sure if we need the
>>>>>>> result code in that case.
>>>>>>>
>>>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
>>>>>>> delivery or EOI also, it could create a proxy message with its
>>>>>>> callbacks in place. But we need then a separate opaque field (in
>>>>>>> addition to payload) to store the original message.
>>>>>>>
>>>>>>> struct IRQMsg {
>>>>>>>  DeviceState *src;
>>>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
>>>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
>>>>>>>  void *src_opaque;
>>>>>>>  void *payload;
>>>>>>> };
>>>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
>>>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
>>>>>> appealing reason to justify this. But so far I do not see any use case
>>>>>> for eio_cb at all.
>>>>>>
>>>>> I dislike use of eoi for reinfecting missing interrupts since
>>>>> it eliminates use of internal PIC/APIC queue of not yet delivered
>>>>> interrupts. PIC and APIC has internal queue that can handle two elements:
>>>>> one is delivered, but not yet acked interrupt in isr and another is
>>>>> pending interrupt in irr. Using eoi callback (or ack notifier as it's
>>>>> called inside kernel) interrupt will be considered coalesced even if irr
>>>>> is cleared, but no ack was received for previously delivered interrupt.
>>>>> But ack notifiers actually has another use: device assignment. There is
>>>>> a plan to move device assignment from kernel to userspace and for that
>>>>> ack notifiers will have to be extended to userspace too. If so we can
>>>>> use them to do irq decoalescing as well. I doubt they should be part
>>>>> of IRQMsg though. Why not do what kernel does: have globally registered
>>>>> notifier based on irqchip/pin.
>>>> I read this twice but I still don't get your plan. Do you like or
>>>> dislike using EIO for de-coalescing? And how should these notifiers work?
>>>>
>>> That's because I confused myself :) I _dislike_ them to be used, but
>>> since device assignment requires ack notifiers anyway may be it is better
>>> to introduce one mechanism for device assignmen + de-coalescing instead
>>> of introducing two different mechanism. Using ack notifiers should be
>>> easy: RTC registers ack notifier and keep track of delivered interrupts.
>>> If timer triggers after previews irq was set, but before it was acked
>>> coalesced counter is incremented. In ack notifier callback coalesced
>>> counter is checked and if it is not zero new irq is set.
>> Ack notifier registrations and event deliveries still need to be routed.
>> Piggy-backing this on IRQ messages may be unavoidable for that reason.
> It is done in the kernel without piggy-backing.

As it does not include any IRQ routers in front of the interrupt
controller. Maybe it works for x86, but it is no generic solution.

Also, periodic timer sources get no information about the fact that
their interrupt is masked somewhere along the path to the VCPUs and will
possibly replay countless IRQs when the masking ends, no?

> 
>> Anyway, I'm going to post my HPET updates with the infrastructure for
>> IRQMsg now. Maybe it's helpful to see the other option in reality.
>>
> One other think to consider current approach does not always work.
> Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
> broadcasted to all cpus, but only boot cpu does time calculation. With
> current approach if interrupt is delivered to at least one vcpu
> it will not be considered coalesced, but if cpu it was delivered to is
> not cpu that does time accounting then clock will drift.

That means we would have to fire callbacks per receiving CPU and report
its number back. Is there a way to find out if we are running such a
guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?
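
To make the per-CPU idea concrete, here is a small sketch that assumes
the delivery result is reported as a bitmask of receiving vCPUs. That
representation and both function names are hypothetical; nothing in the
current code reports this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit i of delivered_mask is set if vCPU i accepted the interrupt. */

/* Current criterion: coalesced only if no vCPU at all took the interrupt. */
static bool coalesced_any_cpu(uint32_t delivered_mask)
{
    return delivered_mask == 0;
}

/* Per-CPU criterion: coalesced if the time-keeping vCPU missed it. */
static bool coalesced_for_cpu(uint32_t delivered_mask, unsigned cpu)
{
    assert(cpu < 32);
    return (delivered_mask & (UINT32_C(1) << cpu)) == 0;
}
```

If only vCPU 1 takes a broadcast tick, coalesced_any_cpu() reports a
successful delivery while coalesced_for_cpu(mask, 0) reports a miss for
the boot CPU, which is the drift case described above.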

Jan


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
  2010-06-06 10:10                                                                                     ` Jan Kiszka
@ 2010-06-06 10:27                                                                                       ` Gleb Natapov
  0 siblings, 0 replies; 122+ messages in thread
From: Gleb Natapov @ 2010-06-06 10:27 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Blue Swirl, qemu-devel, Juan Quintela

On Sun, Jun 06, 2010 at 12:10:07PM +0200, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
> >> Gleb Natapov wrote:
> >>> On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
> >>>> Gleb Natapov wrote:
> >>>>> On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
> >>>>>>> I'd like to also support EOI handling. When the guest clears the
> >>>>>>> interrupt condtion, the EOI callback would be called. This could occur
> >>>>>>> much later than the IRQ delivery time. I'm not sure if we need the
> >>>>>>> result code in that case.
> >>>>>>>
> >>>>>>> If any intermediate device (IOAPIC?) needs to be informed about either
> >>>>>>> delivery or EOI also, it could create a proxy message with its
> >>>>>>> callbacks in place. But we need then a separate opaque field (in
> >>>>>>> addition to payload) to store the original message.
> >>>>>>>
> >>>>>>> struct IRQMsg {
> >>>>>>>  DeviceState *src;
> >>>>>>>  void (*delivery_cb)(IRQMsg *msg, int result);
> >>>>>>>  void (*eoi_cb)(IRQMsg *msg, int result);
> >>>>>>>  void *src_opaque;
> >>>>>>>  void *payload;
> >>>>>>> };
> >>>>>> Extending the lifetime of IRQMsg objects beyond the delivery call stack
> >>>>>> means qemu_malloc/free for every delivery. I think it takes a _very_
> >>>>>> appealing reason to justify this. But so far I do not see any use case
> >>>>>> for eoi_cb at all.
> >>>>>>
> >>>>> I dislike the use of EOI for reinjecting missed interrupts since
> >>>>> it eliminates use of the internal PIC/APIC queue of not yet delivered
> >>>>> interrupts. PIC and APIC have an internal queue that can handle two elements:
> >>>>> one is the delivered but not yet acked interrupt in isr and the other is
> >>>>> the pending interrupt in irr. Using an eoi callback (or ack notifier as it's
> >>>>> called inside the kernel), an interrupt will be considered coalesced even if irr
> >>>>> is cleared, but no ack was received for the previously delivered interrupt.
> >>>>> But ack notifiers actually have another use: device assignment. There is
> >>>>> a plan to move device assignment from kernel to userspace and for that
> >>>>> ack notifiers will have to be extended to userspace too. If so we can
> >>>>> use them to do irq decoalescing as well. I doubt they should be part
> >>>>> of IRQMsg though. Why not do what kernel does: have globally registered
> >>>>> notifier based on irqchip/pin.
> >>>> I read this twice but I still don't get your plan. Do you like or
> >>>> dislike using EOI for de-coalescing? And how should these notifiers work?
> >>>>
> >>> That's because I confused myself :) I _dislike_ them being used, but
> >>> since device assignment requires ack notifiers anyway, maybe it is better
> >>> to introduce one mechanism for device assignment + de-coalescing instead
> >>> of introducing two different mechanisms. Using ack notifiers should be
> >>> easy: RTC registers an ack notifier and keeps track of delivered interrupts.
> >>> If the timer triggers after the previous irq was set, but before it was acked,
> >>> the coalesced counter is incremented. In the ack notifier callback the coalesced
> >>> counter is checked and if it is not zero a new irq is set.
> >> Ack notifier registrations and event deliveries still need to be routed.
> >> Piggy-backing this on IRQ messages may be unavoidable for that reason.
> > It is done in the kernel without piggy-backing.
> 
> As it does not include any IRQ routers in front of the interrupt
> controller. Maybe it works for x86, but it is no generic solution.
> 
x86 has an IRQ router in front of the interrupt controller, inside the PCI
host bridge.

> Also, periodic timer sources get no information about the fact that
> their interrupt is masked somewhere along the path to the VCPUs and will
> possibly replay countless IRQs when the masking ends, no?
> 
Correct, for that we have mask notifiers in the kernel. It gets uglier by
the minute.
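
For reference, the ack-notifier de-coalescing scheme quoted above (count the
ticks that fire while an IRQ is still unacked, and re-raise from the ack
callback) could look roughly like the sketch below. This is a standalone
model only: RtcModel, rtc_timer_tick() and rtc_ack_notifier() are
hypothetical names, not existing QEMU or KVM interfaces, and mask handling
is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical standalone model, not QEMU or KVM code. */
typedef struct RtcModel {
    bool irq_in_flight;  /* IRQ was raised and the guest has not acked it */
    unsigned coalesced;  /* ticks that fired while an IRQ was in flight */
    unsigned injected;   /* IRQs actually raised towards the guest */
} RtcModel;

static void rtc_raise_irq(RtcModel *s)
{
    s->irq_in_flight = true;
    s->injected++;
}

/* Periodic timer tick: raise the IRQ, or remember it if one is unacked. */
static void rtc_timer_tick(RtcModel *s)
{
    assert(s != NULL);
    if (s->irq_in_flight) {
        s->coalesced++;
    } else {
        rtc_raise_irq(s);
    }
}

/* Ack notifier callback: replay one remembered tick, if there is any. */
static void rtc_ack_notifier(RtcModel *s)
{
    assert(s != NULL);
    s->irq_in_flight = false;
    if (s->coalesced > 0) {
        s->coalesced--;
        rtc_raise_irq(s);
    }
}
```

Three ticks followed by three acks end with injected == 3 and coalesced == 0
in this model. A source that stays masked would keep accumulating ticks
here, which is the replay problem mentioned above.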

> > 
> >> Anyway, I'm going to post my HPET updates with the infrastructure for
> >> IRQMsg now. Maybe it's helpful to see the other option in reality.
> >>
> > One other thing to consider: the current approach does not always work.
> > Win2K3-64bit-smp and Win2k8-64bit-smp configure the RTC interrupt to be
> > broadcast to all cpus, but only the boot cpu does time calculation. With
> > the current approach, if the interrupt is delivered to at least one vcpu
> > it will not be considered coalesced, but if the cpu it was delivered to is
> > not the cpu that does time accounting then the clock will drift.
> 
> That means we would have to fire callbacks per receiving CPU and report
> its number back. Is there a way to find out if we are running such a
> guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?
> 
Not that I know of.
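
To make the difference concrete, here is a small sketch of the two delivery
rules for a broadcast interrupt. It assumes the feedback path reports a
bitmask of the CPUs that accepted the interrupt, which is purely an
assumption for illustration; the helper names are hypothetical, and the
index of the time-keeping CPU would have to come from somewhere (such as
the command line switch you mention).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not QEMU code. accepted_mask has bit N set if
 * CPU N accepted the broadcast interrupt.
 */

/* Current rule: delivery to any CPU means "not coalesced". */
static bool delivered_to_any(uint32_t accepted_mask)
{
    return accepted_mask != 0;
}

/* Stricter rule: only delivery to the time-keeping CPU counts. */
static bool delivered_to_timekeeper(uint32_t accepted_mask, unsigned cpu)
{
    assert(cpu < 32);
    return ((accepted_mask >> cpu) & 1u) != 0;
}
```

With accepted_mask == 0x2 (only CPU 1 took the interrupt) and CPU 0 doing
the time accounting, the first rule reports success while the second reports
a missed tick, which is the drift case described above.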

--
			Gleb.


end of thread, other threads:[~2010-06-06 10:27 UTC | newest]

Thread overview: 122+ messages (download: mbox.gz / follow: Atom feed)
2010-05-24 20:13 [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 01/15] hpet: Catch out-of-bounds timer access Jan Kiszka
2010-05-24 20:34   ` [Qemu-devel] " Juan Quintela
2010-05-24 20:36     ` Jan Kiszka
2010-05-24 20:50       ` Juan Quintela
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 02/15] hpet: Coding style cleanups and some refactorings Jan Kiszka
2010-05-24 20:37   ` [Qemu-devel] " Juan Quintela
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 03/15] hpet: Silence warning on write to running main counter Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 04/15] hpet: Move static timer field initialization Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 05/15] hpet: Convert to qdev Jan Kiszka
2010-05-25  9:37   ` Paul Brook
2010-05-25 10:14     ` Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 06/15] hpet: Start/stop timer when HPET_TN_ENABLE is modified Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback Jan Kiszka
2010-05-25  6:07   ` Gleb Natapov
2010-05-25  6:31     ` Jan Kiszka
2010-05-25  6:40       ` Gleb Natapov
2010-05-25  6:54         ` Jan Kiszka
2010-05-25 19:09   ` [Qemu-devel] " Blue Swirl
2010-05-25 20:16     ` Anthony Liguori
2010-05-25 21:44       ` Jan Kiszka
2010-05-26  8:08         ` Gleb Natapov
2010-05-26 20:14           ` Blue Swirl
2010-05-27  5:42             ` Gleb Natapov
2010-05-26 19:55         ` Blue Swirl
2010-05-26 20:09           ` Jan Kiszka
2010-05-26 20:35             ` Blue Swirl
2010-05-26 22:35               ` Jan Kiszka
2010-05-26 23:26               ` Paul Brook
2010-05-27 17:56                 ` Blue Swirl
2010-05-27 18:31                   ` Jan Kiszka
2010-05-27 18:53                     ` Blue Swirl
2010-05-27 19:08                       ` Jan Kiszka
2010-05-27 19:19                         ` Blue Swirl
2010-05-27 22:19                           ` Jan Kiszka
2010-05-28 19:00                             ` Blue Swirl
2010-05-30 12:00                             ` Avi Kivity
2010-05-27 22:21                           ` Paul Brook
2010-05-28 19:10                             ` Blue Swirl
2010-05-27 22:21                   ` Paul Brook
2010-05-27  6:13               ` Gleb Natapov
2010-05-27 18:37                 ` Blue Swirl
2010-05-28  7:31                   ` Gleb Natapov
2010-05-28 20:06                     ` Blue Swirl
2010-05-28 20:47                       ` Gleb Natapov
2010-05-29  7:58                         ` Jan Kiszka
2010-05-29  9:35                           ` Blue Swirl
2010-05-29  9:45                             ` Jan Kiszka
2010-05-29 10:04                               ` Blue Swirl
2010-05-29 10:16                                 ` Jan Kiszka
2010-05-29 10:26                                   ` Blue Swirl
2010-05-29 10:38                                     ` Jan Kiszka
2010-05-29 14:46                             ` Gleb Natapov
2010-05-29 16:13                               ` Blue Swirl
2010-05-29 16:37                                 ` Gleb Natapov
2010-05-29 21:21                                   ` Blue Swirl
2010-05-30  6:02                                     ` Gleb Natapov
2010-05-30 12:10                                       ` Blue Swirl
2010-05-30 12:24                                         ` Jan Kiszka
2010-05-30 12:58                                           ` Blue Swirl
2010-05-31  7:46                                             ` Jan Kiszka
2010-05-30 12:33                                         ` Gleb Natapov
2010-05-30 12:56                                           ` Blue Swirl
2010-05-30 13:49                                             ` Gleb Natapov
2010-05-30 16:54                                               ` Blue Swirl
2010-05-30 19:37                                               ` Blue Swirl
2010-05-30 20:07                                                 ` Gleb Natapov
2010-05-30 20:21                                                   ` Blue Swirl
2010-05-31  5:19                                                     ` Gleb Natapov
2010-06-01 18:00                                                       ` Blue Swirl
2010-06-01 18:30                                                         ` Gleb Natapov
2010-06-02 19:05                                                           ` Blue Swirl
2010-06-03  6:23                                                             ` Jan Kiszka
2010-06-03  6:34                                                               ` Gleb Natapov
2010-06-03  6:59                                                                 ` Jan Kiszka
2010-06-03  7:03                                                                   ` Gleb Natapov
2010-06-03  7:06                                                                     ` Gleb Natapov
2010-06-04 19:05                                                                       ` Blue Swirl
2010-06-05  0:04                                                                         ` Jan Kiszka
2010-06-05  7:20                                                                           ` Blue Swirl
2010-06-05  8:27                                                                             ` Jan Kiszka
2010-06-05  9:23                                                                               ` Blue Swirl
2010-06-05 12:14                                                                                 ` Jan Kiszka
2010-06-06  7:15                                                                           ` Gleb Natapov
2010-06-06  7:39                                                                             ` Jan Kiszka
2010-06-06  7:49                                                                               ` Gleb Natapov
2010-06-06  8:07                                                                                 ` Jan Kiszka
2010-06-06  9:23                                                                                   ` Gleb Natapov
2010-06-06 10:10                                                                                     ` Jan Kiszka
2010-06-06 10:27                                                                                       ` Gleb Natapov
2010-06-06  7:39                                                                             ` Blue Swirl
2010-06-06  8:07                                                                               ` Gleb Natapov
2010-05-30 13:22                                           ` Blue Swirl
2010-05-29  9:15                         ` Blue Swirl
2010-05-29  9:36                           ` Jan Kiszka
2010-05-29 14:38                           ` Gleb Natapov
2010-05-29 16:03                             ` Blue Swirl
2010-05-29 16:32                               ` Gleb Natapov
2010-05-29 20:52                                 ` Blue Swirl
2010-05-30  5:41                                   ` Gleb Natapov
2010-05-30 11:41                                     ` Blue Swirl
2010-05-30 11:52                                       ` Gleb Natapov
2010-05-30 12:05                           ` Avi Kivity
2010-05-27  5:58             ` Gleb Natapov
2010-05-26 19:49       ` Blue Swirl
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 08/15] x86: Refactor RTC IRQ coalescing workaround Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 09/15] hpet/rtc: Rework RTC IRQ replacement by HPET Jan Kiszka
2010-05-25  9:29   ` Paul Brook
2010-05-25 10:23     ` Jan Kiszka
2010-05-25 11:05       ` Paul Brook
2010-05-25 11:19         ` Jan Kiszka
2010-05-25 11:23           ` Paul Brook
2010-05-25 11:26             ` Jan Kiszka
2010-05-25 12:03               ` Paul Brook
2010-05-25 12:39                 ` Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 10/15] hpet: Drop static state Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 11/15] hpet: Add support for level-triggered interrupts Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 12/15] vmstate: Add VMSTATE_STRUCT_VARRAY_UINT8 Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 13/15] hpet: Make number of timers configurable Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 14/15] hpet: Add MSI support Jan Kiszka
2010-05-24 20:13 ` [Qemu-devel] [RFT][PATCH 15/15] monitor/QMP: Drop info hpet / query-hpet Jan Kiszka
2010-05-24 22:16 ` [Qemu-devel] [RFT][PATCH 00/15] HPET cleanups, fixes, enhancements Anthony Liguori
