* [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on
@ 2020-11-23 13:26 Jan Beulich
  2020-11-23 13:28 ` [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq() Jan Beulich
                   ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini

These are grouped into a series largely because of their origin,
not so much because there are heavy dependencies among them.
Compared to v2, there's a new patch resulting from review feedback,
and the last patch should be fully usable now. See also the
individual patches.

1: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
2: avoid access tearing for ->virq_to_evtchn[] accesses
3: convert vIRQ lock to an r/w one
4: convert domain event lock to an r/w one
5: don't call Xen consumer callback with per-channel lock held

Jan



* [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
  2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
@ 2020-11-23 13:28 ` Jan Beulich
  2020-12-02 19:03   ` Julien Grall
  2020-11-23 13:28 ` [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses Jan Beulich
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini

The per-vCPU virq_lock, which is held anyway, together with the fact
that evtchn_port_set_pending() is never called when v->virq_to_evtchn[]
is zero, provides sufficient guarantees. Undo the lock addition done
for XSA-343 (commit e045199c7c9c "evtchn: address races with
evtchn_reset()"). Update the description next to struct evtchn_port_ops
accordingly.
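
As an illustrative sketch (not part of the patch; simplified from the
hunks below and from evtchn_close(), declarations omitted), the pattern
being relied upon is:

    /* Sender, cf. send_guest_vcpu_virq(): */
    spin_lock_irqsave(&v->virq_lock, flags);
    port = v->virq_to_evtchn[virq];
    if ( port )    /* Zero means unbound or in the process of closing. */
        evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port));
    spin_unlock_irqrestore(&v->virq_lock, flags);

    /* Closer, cf. evtchn_close()'s ECS_VIRQ case: */
    v->virq_to_evtchn[virq] = 0;
    spin_barrier(&v->virq_lock);   /* Wait out in-flight senders. */
    /* Only past this point may the channel change state or be freed. */

A sender holding virq_lock thus either sees zero (and sends nothing) or
is guaranteed the channel stays valid until it drops the lock, making
the per-channel lock redundant here.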

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base.
v2: New.

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -809,7 +809,6 @@ void send_guest_vcpu_virq(struct vcpu *v
     unsigned long flags;
     int port;
     struct domain *d;
-    struct evtchn *chn;
 
     ASSERT(!virq_is_global(virq));
 
@@ -820,12 +819,7 @@ void send_guest_vcpu_virq(struct vcpu *v
         goto out;
 
     d = v->domain;
-    chn = evtchn_from_port(d, port);
-    if ( evtchn_read_trylock(chn) )
-    {
-        evtchn_port_set_pending(d, v->vcpu_id, chn);
-        evtchn_read_unlock(chn);
-    }
+    evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port));
 
  out:
     spin_unlock_irqrestore(&v->virq_lock, flags);
@@ -854,11 +848,7 @@ void send_guest_global_virq(struct domai
         goto out;
 
     chn = evtchn_from_port(d, port);
-    if ( evtchn_read_trylock(chn) )
-    {
-        evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
-        evtchn_read_unlock(chn);
-    }
+    evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
 
  out:
     spin_unlock_irqrestore(&v->virq_lock, flags);
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -192,9 +192,16 @@ int evtchn_reset(struct domain *d, bool
  * Low-level event channel port ops.
  *
  * All hooks have to be called with a lock held which prevents the channel
- * from changing state. This may be the domain event lock, the per-channel
- * lock, or in the case of sending interdomain events also the other side's
- * per-channel lock. Exceptions apply in certain cases for the PV shim.
+ * from changing state. This may be
+ * - the domain event lock,
+ * - the per-channel lock,
+ * - in the case of sending interdomain events the other side's per-channel
+ *   lock,
+ * - in the case of sending non-global vIRQ-s the per-vCPU virq_lock (in
+ *   combination with the ordering enforced through how the vCPU's
+ *   virq_to_evtchn[] gets updated),
+ * - in the case of sending global vIRQ-s vCPU 0's virq_lock.
+ * Exceptions apply in certain cases for the PV shim.
  */
 struct evtchn_port_ops {
     void (*init)(struct domain *d, struct evtchn *evtchn);




* [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses
  2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
  2020-11-23 13:28 ` [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq() Jan Beulich
@ 2020-11-23 13:28 ` Jan Beulich
  2020-12-02 21:14   ` Julien Grall
  2020-11-23 13:28 ` [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one Jan Beulich
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini

Use {read,write}_atomic() to rule out any load/store tearing, in
particular since the accesses don't all happen under one consistent
lock.
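
For reference, these helpers behave like single volatile accesses,
conceptually similar to Linux's READ_ONCE()/WRITE_ONCE() (simplified
sketch, not Xen's actual implementation):

    #define read_atomic(p)     (*(const volatile typeof(*(p)) *)(p))
    #define write_atomic(p, x) (*(volatile typeof(*(p)) *)(p) = (x))

This forces the compiler to emit exactly one full-width load or store,
so a concurrent reader can never observe a half-updated port value.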

Requested-by: Julien Grall <julien@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -446,7 +446,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
 
     spin_lock(&d->event_lock);
 
-    if ( v->virq_to_evtchn[virq] != 0 )
+    if ( read_atomic(&v->virq_to_evtchn[virq]) )
         ERROR_EXIT(-EEXIST);
 
     if ( port != 0 )
@@ -474,7 +474,8 @@ int evtchn_bind_virq(evtchn_bind_virq_t
 
     evtchn_write_unlock(chn);
 
-    v->virq_to_evtchn[virq] = bind->port = port;
+    bind->port = port;
+    write_atomic(&v->virq_to_evtchn[virq], port);
 
  out:
     spin_unlock(&d->event_lock);
@@ -660,9 +661,9 @@ int evtchn_close(struct domain *d1, int
     case ECS_VIRQ:
         for_each_vcpu ( d1, v )
         {
-            if ( v->virq_to_evtchn[chn1->u.virq] != port1 )
+            if ( read_atomic(&v->virq_to_evtchn[chn1->u.virq]) != port1 )
                 continue;
-            v->virq_to_evtchn[chn1->u.virq] = 0;
+            write_atomic(&v->virq_to_evtchn[chn1->u.virq], 0);
             spin_barrier(&v->virq_lock);
         }
         break;
@@ -801,7 +802,7 @@ bool evtchn_virq_enabled(const struct vc
     if ( virq_is_global(virq) && v->vcpu_id )
         v = domain_vcpu(v->domain, 0);
 
-    return v->virq_to_evtchn[virq];
+    return read_atomic(&v->virq_to_evtchn[virq]);
 }
 
 void send_guest_vcpu_virq(struct vcpu *v, uint32_t virq)
@@ -814,7 +815,7 @@ void send_guest_vcpu_virq(struct vcpu *v
 
     spin_lock_irqsave(&v->virq_lock, flags);
 
-    port = v->virq_to_evtchn[virq];
+    port = read_atomic(&v->virq_to_evtchn[virq]);
     if ( unlikely(port == 0) )
         goto out;
 
@@ -843,7 +844,7 @@ void send_guest_global_virq(struct domai
 
     spin_lock_irqsave(&v->virq_lock, flags);
 
-    port = v->virq_to_evtchn[virq];
+    port = read_atomic(&v->virq_to_evtchn[virq]);
     if ( unlikely(port == 0) )
         goto out;
 




* [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one
  2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
  2020-11-23 13:28 ` [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq() Jan Beulich
  2020-11-23 13:28 ` [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses Jan Beulich
@ 2020-11-23 13:28 ` Jan Beulich
  2020-12-09 11:16   ` Julien Grall
  2020-11-23 13:29 ` [PATCH v3 4/5] evtchn: convert domain event " Jan Beulich
  2020-11-23 13:30 ` [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held Jan Beulich
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini

There's no need to serialize all sending of vIRQ-s; all that's needed
is serialization against the closing of the respective event channels
(so far achieved by means of a barrier). To facilitate the conversion,
switch to an ordinary write-locked region in evtchn_close().
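
As a sketch (simplified from the hunks below, declarations omitted),
senders become readers and the closer becomes a writer, which subsumes
the former barrier:

    /* Senders may now run concurrently with one another: */
    read_lock_irqsave(&v->virq_lock, flags);
    port = read_atomic(&v->virq_to_evtchn[virq]);
    if ( port )
        evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port));
    read_unlock_irqrestore(&v->virq_lock, flags);

    /* The closer excludes all senders, as spin_barrier() used to: */
    write_lock_irqsave(&v->virq_lock, flags);
    if ( read_atomic(&v->virq_to_evtchn[virq]) == port1 )
        write_atomic(&v->virq_to_evtchn[virq], 0);
    write_unlock_irqrestore(&v->virq_lock, flags);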

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base over added new earlier patch.
v2: Don't introduce/use rw_barrier() here. Add comment to
    evtchn_bind_virq(). Re-base.

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -160,7 +160,7 @@ struct vcpu *vcpu_create(struct domain *
     v->vcpu_id = vcpu_id;
     v->dirty_cpu = VCPU_CPU_CLEAN;
 
-    spin_lock_init(&v->virq_lock);
+    rwlock_init(&v->virq_lock);
 
     tasklet_init(&v->continue_hypercall_tasklet, NULL, NULL);
 
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -475,6 +475,13 @@ int evtchn_bind_virq(evtchn_bind_virq_t
     evtchn_write_unlock(chn);
 
     bind->port = port;
+    /*
+     * If anything, the update of virq_to_evtchn[] would want guarding by
+     * virq_lock, but since this is the last action here there's no strict
+     * need to acquire the lock. Hence holding event_lock isn't helpful
+     * anymore at this point, but utilize that its unlocking acts as the
+     * otherwise necessary smp_wmb() here.
+     */
     write_atomic(&v->virq_to_evtchn[virq], port);
 
  out:
@@ -661,10 +668,12 @@ int evtchn_close(struct domain *d1, int
     case ECS_VIRQ:
         for_each_vcpu ( d1, v )
         {
-            if ( read_atomic(&v->virq_to_evtchn[chn1->u.virq]) != port1 )
-                continue;
-            write_atomic(&v->virq_to_evtchn[chn1->u.virq], 0);
-            spin_barrier(&v->virq_lock);
+            unsigned long flags;
+
+            write_lock_irqsave(&v->virq_lock, flags);
+            if ( read_atomic(&v->virq_to_evtchn[chn1->u.virq]) == port1 )
+                write_atomic(&v->virq_to_evtchn[chn1->u.virq], 0);
+            write_unlock_irqrestore(&v->virq_lock, flags);
         }
         break;
 
@@ -813,7 +822,7 @@ void send_guest_vcpu_virq(struct vcpu *v
 
     ASSERT(!virq_is_global(virq));
 
-    spin_lock_irqsave(&v->virq_lock, flags);
+    read_lock_irqsave(&v->virq_lock, flags);
 
     port = read_atomic(&v->virq_to_evtchn[virq]);
     if ( unlikely(port == 0) )
@@ -823,7 +832,7 @@ void send_guest_vcpu_virq(struct vcpu *v
     evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port));
 
  out:
-    spin_unlock_irqrestore(&v->virq_lock, flags);
+    read_unlock_irqrestore(&v->virq_lock, flags);
 }
 
 void send_guest_global_virq(struct domain *d, uint32_t virq)
@@ -842,7 +851,7 @@ void send_guest_global_virq(struct domai
     if ( unlikely(v == NULL) )
         return;
 
-    spin_lock_irqsave(&v->virq_lock, flags);
+    read_lock_irqsave(&v->virq_lock, flags);
 
     port = read_atomic(&v->virq_to_evtchn[virq]);
     if ( unlikely(port == 0) )
@@ -852,7 +861,7 @@ void send_guest_global_virq(struct domai
     evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
 
  out:
-    spin_unlock_irqrestore(&v->virq_lock, flags);
+    read_unlock_irqrestore(&v->virq_lock, flags);
 }
 
 void send_guest_pirq(struct domain *d, const struct pirq *pirq)
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -238,7 +238,7 @@ struct vcpu
 
     /* IRQ-safe virq_lock protects against delivering VIRQ to stale evtchn. */
     evtchn_port_t    virq_to_evtchn[NR_VIRQS];
-    spinlock_t       virq_lock;
+    rwlock_t         virq_lock;
 
     /* Tasklet for continue_hypercall_on_cpu(). */
     struct tasklet   continue_hypercall_tasklet;




* [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
                   ` (2 preceding siblings ...)
  2020-11-23 13:28 ` [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one Jan Beulich
@ 2020-11-23 13:29 ` Jan Beulich
  2020-12-09 11:54   ` Julien Grall
  2020-11-23 13:30 ` [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held Jan Beulich
  4 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini

Especially for the use in evtchn_move_pirqs() (called when moving a
vCPU across pCPU-s) and the uses in the EOI handling of the PCI
pass-through code, potentially serializing an entire domain isn't
helpful when no state changes beyond what is, e.g., already protected
by the per-channel lock.

Unfortunately this implies dropping lock profiling for this lock until
r/w locks gain support for such functionality.

While ->notify_vcpu_id is now meant to be consistently updated with the
per-channel lock held, an extension applies to ECS_PIRQ: the field is
also guaranteed not to change while the per-domain event lock is held
for writing. Therefore the link_pirq_port() call from evtchn_bind_pirq()
could in principle be moved out of the per-channel locked region, but
this further code churn didn't seem worth it.
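
For illustration (hypothetical helper, not part of the patch), the
two-domain ordering evtchn_bind_interdomain() ends up with is:

    static void bind_interdomain_lock(struct domain *ld, struct domain *rd)
    {
        /*
         * The local side allocates a port (write lock needed); the
         * remote side is only read. Taking the locks in a global order
         * avoids ABBA deadlocks between two such binds.
         */
        if ( ld < rd )
        {
            write_lock(&ld->event_lock);
            read_lock(&rd->event_lock);
        }
        else
        {
            if ( ld != rd )
                read_lock(&rd->event_lock);
            write_lock(&ld->event_lock);
        }
    }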

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Re-base.
v2: Consistently lock for writing in evtchn_reset(). Fix error path in
    pci_clean_dpci_irqs(). Lock for writing in pt_irq_time_out(),
    hvm_dirq_assist(), hvm_dpci_eoi(), and hvm_dpci_isairq_eoi(). Move
    rw_barrier() introduction here. Re-base over changes earlier in the
    series.
---
RFC: Wouldn't flask_get_peer_sid() better use the per-channel lock?

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -917,7 +917,7 @@ int arch_domain_soft_reset(struct domain
     if ( !is_hvm_domain(d) )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     for ( i = 0; i < d->nr_pirqs ; i++ )
     {
         if ( domain_pirq_to_emuirq(d, i) != IRQ_UNBOUND )
@@ -927,7 +927,7 @@ int arch_domain_soft_reset(struct domain
                 break;
         }
     }
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     if ( ret )
         return ret;
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -528,9 +528,9 @@ void hvm_migrate_pirqs(struct vcpu *v)
     if ( !is_iommu_enabled(d) || !hvm_domain_irq(d)->dpci )
        return;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     pt_pirq_iterate(d, migrate_pirq, v);
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static bool hvm_get_pending_event(struct vcpu *v, struct x86_event *info)
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -404,9 +404,9 @@ int hvm_inject_msi(struct domain *d, uin
             {
                 int rc;
 
-                spin_lock(&d->event_lock);
+                write_lock(&d->event_lock);
                 rc = map_domain_emuirq_pirq(d, pirq, IRQ_MSI_EMU);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 if ( rc )
                     return rc;
                 info = pirq_info(d, pirq);
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -203,9 +203,9 @@ static int vioapic_hwdom_map_gsi(unsigne
     {
         gprintk(XENLOG_WARNING, "vioapic: error binding GSI %u: %d\n",
                 gsi, ret);
-        spin_lock(&currd->event_lock);
+        write_lock(&currd->event_lock);
         unmap_domain_pirq(currd, pirq);
-        spin_unlock(&currd->event_lock);
+        write_unlock(&currd->event_lock);
     }
     pcidevs_unlock();
 
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -465,7 +465,7 @@ int msixtbl_pt_register(struct domain *d
     int r = -EINVAL;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
         return -ENODEV;
@@ -535,7 +535,7 @@ void msixtbl_pt_unregister(struct domain
     struct msixtbl_entry *entry;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !msixtbl_initialised(d) )
         return;
@@ -589,13 +589,13 @@ void msixtbl_pt_cleanup(struct domain *d
     if ( !msixtbl_initialised(d) )
         return;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     list_for_each_entry_safe( entry, temp,
                               &d->arch.hvm.msixtbl_list, list )
         del_msixtbl_entry(entry);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 }
 
 void msix_write_completion(struct vcpu *v)
@@ -719,9 +719,9 @@ int vpci_msi_arch_update(struct vpci_msi
                          msi->arch.pirq, msi->mask);
     if ( rc )
     {
-        spin_lock(&pdev->domain->event_lock);
+        write_lock(&pdev->domain->event_lock);
         unmap_domain_pirq(pdev->domain, msi->arch.pirq);
-        spin_unlock(&pdev->domain->event_lock);
+        write_unlock(&pdev->domain->event_lock);
         pcidevs_unlock();
         msi->arch.pirq = INVALID_PIRQ;
         return rc;
@@ -760,9 +760,9 @@ static int vpci_msi_enable(const struct
     rc = vpci_msi_update(pdev, data, address, vectors, pirq, mask);
     if ( rc )
     {
-        spin_lock(&pdev->domain->event_lock);
+        write_lock(&pdev->domain->event_lock);
         unmap_domain_pirq(pdev->domain, pirq);
-        spin_unlock(&pdev->domain->event_lock);
+        write_unlock(&pdev->domain->event_lock);
         pcidevs_unlock();
         return rc;
     }
@@ -807,9 +807,9 @@ static void vpci_msi_disable(const struc
         ASSERT(!rc);
     }
 
-    spin_lock(&pdev->domain->event_lock);
+    write_lock(&pdev->domain->event_lock);
     unmap_domain_pirq(pdev->domain, pirq);
-    spin_unlock(&pdev->domain->event_lock);
+    write_unlock(&pdev->domain->event_lock);
     pcidevs_unlock();
 }
 
--- a/xen/arch/x86/io_apic.c
+++ b/xen/arch/x86/io_apic.c
@@ -2413,10 +2413,10 @@ int ioapic_guest_write(unsigned long phy
     }
     if ( pirq >= 0 )
     {
-        spin_lock(&hardware_domain->event_lock);
+        write_lock(&hardware_domain->event_lock);
         ret = map_domain_pirq(hardware_domain, pirq, irq,
                               MAP_PIRQ_TYPE_GSI, NULL);
-        spin_unlock(&hardware_domain->event_lock);
+        write_unlock(&hardware_domain->event_lock);
         if ( ret < 0 )
             return ret;
     }
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1536,7 +1536,7 @@ int pirq_guest_bind(struct vcpu *v, stru
     irq_guest_action_t *action, *newaction = NULL;
     int                 rc = 0;
 
-    WARN_ON(!spin_is_locked(&v->domain->event_lock));
+    WARN_ON(!rw_is_write_locked(&v->domain->event_lock));
     BUG_ON(!local_irq_is_enabled());
 
  retry:
@@ -1756,7 +1756,7 @@ void pirq_guest_unbind(struct domain *d,
     struct irq_desc *desc;
     int irq = 0;
 
-    WARN_ON(!spin_is_locked(&d->event_lock));
+    WARN_ON(!rw_is_write_locked(&d->event_lock));
 
     BUG_ON(!local_irq_is_enabled());
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -1793,7 +1793,7 @@ static bool pirq_guest_force_unbind(stru
     unsigned int i;
     bool bound = false;
 
-    WARN_ON(!spin_is_locked(&d->event_lock));
+    WARN_ON(!rw_is_write_locked(&d->event_lock));
 
     BUG_ON(!local_irq_is_enabled());
     desc = pirq_spin_lock_irq_desc(pirq, NULL);
@@ -2037,7 +2037,7 @@ int get_free_pirq(struct domain *d, int
 {
     int i;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( type == MAP_PIRQ_TYPE_GSI )
     {
@@ -2062,7 +2062,7 @@ int get_free_pirqs(struct domain *d, uns
 {
     unsigned int i, found = 0;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     for ( i = d->nr_pirqs - 1; i >= nr_irqs_gsi; --i )
         if ( is_free_pirq(d, pirq_info(d, i)) )
@@ -2090,7 +2090,7 @@ int map_domain_pirq(
     DECLARE_BITMAP(prepared, MAX_MSI_IRQS) = {};
     DECLARE_BITMAP(granted, MAX_MSI_IRQS) = {};
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !irq_access_permitted(current->domain, irq))
         return -EPERM;
@@ -2309,7 +2309,7 @@ int unmap_domain_pirq(struct domain *d,
         return -EINVAL;
 
     ASSERT(pcidevs_locked());
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     info = pirq_info(d, pirq);
     if ( !info || (irq = info->arch.irq) <= 0 )
@@ -2436,13 +2436,13 @@ void free_domain_pirqs(struct domain *d)
     int i;
 
     pcidevs_lock();
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     for ( i = 0; i < d->nr_pirqs; i++ )
         if ( domain_pirq_to_irq(d, i) > 0 )
             unmap_domain_pirq(d, i);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
 }
 
@@ -2683,7 +2683,7 @@ int map_domain_emuirq_pirq(struct domain
     int old_emuirq = IRQ_UNBOUND, old_pirq = IRQ_UNBOUND;
     struct pirq *info;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     if ( !is_hvm_domain(d) )
         return -EINVAL;
@@ -2749,7 +2749,7 @@ int unmap_domain_pirq_emuirq(struct doma
     if ( (pirq < 0) || (pirq >= d->nr_pirqs) )
         return -EINVAL;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     emuirq = domain_pirq_to_emuirq(d, pirq);
     if ( emuirq == IRQ_UNBOUND )
@@ -2797,7 +2797,7 @@ static int allocate_pirq(struct domain *
 {
     int current_pirq;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
     current_pirq = domain_irq_to_pirq(d, irq);
     if ( pirq < 0 )
     {
@@ -2869,7 +2869,7 @@ int allocate_and_map_gsi_pirq(struct dom
     }
 
     /* Verify or get pirq. */
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, MAP_PIRQ_TYPE_GSI, NULL);
     if ( pirq < 0 )
     {
@@ -2882,7 +2882,7 @@ int allocate_and_map_gsi_pirq(struct dom
         *pirq_p = pirq;
 
  done:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return ret;
 }
@@ -2923,7 +2923,7 @@ int allocate_and_map_msi_pirq(struct dom
 
     pcidevs_lock();
     /* Verify or get pirq. */
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     pirq = allocate_pirq(d, index, *pirq_p, irq, type, &msi->entry_nr);
     if ( pirq < 0 )
     {
@@ -2936,7 +2936,7 @@ int allocate_and_map_msi_pirq(struct dom
         *pirq_p = pirq;
 
  done:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
     if ( ret )
     {
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -34,7 +34,7 @@ static int physdev_hvm_map_pirq(
 
     ASSERT(!is_hardware_domain(d));
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     switch ( type )
     {
     case MAP_PIRQ_TYPE_GSI: {
@@ -84,7 +84,7 @@ static int physdev_hvm_map_pirq(
         break;
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return ret;
 }
 
@@ -154,18 +154,18 @@ int physdev_unmap_pirq(domid_t domid, in
 
     if ( is_hvm_domain(d) && has_pirq(d) )
     {
-        spin_lock(&d->event_lock);
+        write_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
             ret = unmap_domain_pirq_emuirq(d, pirq);
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         if ( domid == DOMID_SELF || ret )
             goto free_domain;
     }
 
     pcidevs_lock();
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     ret = unmap_domain_pirq(d, pirq);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     pcidevs_unlock();
 
  free_domain:
@@ -192,10 +192,10 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         ret = -EINVAL;
         if ( eoi.irq >= currd->nr_pirqs )
             break;
-        spin_lock(&currd->event_lock);
+        read_lock(&currd->event_lock);
         pirq = pirq_info(currd, eoi.irq);
         if ( !pirq ) {
-            spin_unlock(&currd->event_lock);
+            read_unlock(&currd->event_lock);
             break;
         }
         if ( currd->arch.auto_unmask )
@@ -214,7 +214,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
                     && hvm_irq->gsi_assert_count[gsi] )
                 send_guest_pirq(currd, pirq);
         }
-        spin_unlock(&currd->event_lock);
+        read_unlock(&currd->event_lock);
         ret = 0;
         break;
     }
@@ -626,7 +626,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         if ( copy_from_guest(&out, arg, 1) != 0 )
             break;
 
-        spin_lock(&currd->event_lock);
+        write_lock(&currd->event_lock);
 
         ret = get_free_pirq(currd, out.type);
         if ( ret >= 0 )
@@ -639,7 +639,7 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
                 ret = -ENOMEM;
         }
 
-        spin_unlock(&currd->event_lock);
+        write_unlock(&currd->event_lock);
 
         if ( ret >= 0 )
         {
--- a/xen/arch/x86/pv/shim.c
+++ b/xen/arch/x86/pv/shim.c
@@ -448,7 +448,7 @@ static long pv_shim_event_channel_op(int
         if ( rc )                                                           \
             break;                                                          \
                                                                             \
-        spin_lock(&d->event_lock);                                          \
+        write_lock(&d->event_lock);                                         \
         rc = evtchn_allocate_port(d, op.port_field);                        \
         if ( rc )                                                           \
         {                                                                   \
@@ -457,7 +457,7 @@ static long pv_shim_event_channel_op(int
         }                                                                   \
         else                                                                \
             evtchn_reserve(d, op.port_field);                               \
-        spin_unlock(&d->event_lock);                                        \
+        write_unlock(&d->event_lock);                                       \
                                                                             \
         if ( !rc && __copy_to_guest(arg, &op, 1) )                          \
             rc = -EFAULT;                                                   \
@@ -585,11 +585,11 @@ static long pv_shim_event_channel_op(int
         if ( rc )
             break;
 
-        spin_lock(&d->event_lock);
+        write_lock(&d->event_lock);
         rc = evtchn_allocate_port(d, ipi.port);
         if ( rc )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
 
             close.port = ipi.port;
             BUG_ON(xen_hypercall_event_channel_op(EVTCHNOP_close, &close));
@@ -598,7 +598,7 @@ static long pv_shim_event_channel_op(int
 
         evtchn_assign_vcpu(d, ipi.port, ipi.vcpu);
         evtchn_reserve(d, ipi.port);
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         if ( __copy_to_guest(arg, &ipi, 1) )
             rc = -EFAULT;
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -294,7 +294,7 @@ static long evtchn_alloc_unbound(evtchn_
     if ( d == NULL )
         return -ESRCH;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( (port = get_free_port(d)) < 0 )
         ERROR_EXIT_DOM(port, d);
@@ -317,7 +317,7 @@ static long evtchn_alloc_unbound(evtchn_
 
  out:
     check_free_port(d, port);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     rcu_unlock_domain(d);
 
     return rc;
@@ -363,14 +363,14 @@ static long evtchn_bind_interdomain(evtc
     /* Avoid deadlock by first acquiring lock of domain with smaller id. */
     if ( ld < rd )
     {
-        spin_lock(&ld->event_lock);
-        spin_lock(&rd->event_lock);
+        write_lock(&ld->event_lock);
+        read_lock(&rd->event_lock);
     }
     else
     {
         if ( ld != rd )
-            spin_lock(&rd->event_lock);
-        spin_lock(&ld->event_lock);
+            read_lock(&rd->event_lock);
+        write_lock(&ld->event_lock);
     }
 
     if ( (lport = get_free_port(ld)) < 0 )
@@ -411,9 +411,9 @@ static long evtchn_bind_interdomain(evtc
 
  out:
     check_free_port(ld, lport);
-    spin_unlock(&ld->event_lock);
+    write_unlock(&ld->event_lock);
     if ( ld != rd )
-        spin_unlock(&rd->event_lock);
+        read_unlock(&rd->event_lock);
     
     rcu_unlock_domain(rd);
 
@@ -444,7 +444,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
     if ( (v = domain_vcpu(d, vcpu)) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( read_atomic(&v->virq_to_evtchn[virq]) )
         ERROR_EXIT(-EEXIST);
@@ -485,7 +485,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
     write_atomic(&v->virq_to_evtchn[virq], port);
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -501,7 +501,7 @@ static long evtchn_bind_ipi(evtchn_bind_
     if ( domain_vcpu(d, vcpu) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( (port = get_free_port(d)) < 0 )
         ERROR_EXIT(port);
@@ -519,7 +519,7 @@ static long evtchn_bind_ipi(evtchn_bind_
     bind->port = port;
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -565,7 +565,7 @@ static long evtchn_bind_pirq(evtchn_bind
     if ( !is_hvm_domain(d) && !pirq_access_permitted(d, pirq) )
         return -EPERM;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( pirq_to_evtchn(d, pirq) != 0 )
         ERROR_EXIT(-EEXIST);
@@ -605,7 +605,7 @@ static long evtchn_bind_pirq(evtchn_bind
 
  out:
     check_free_port(d, port);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
     long           rc = 0;
 
  again:
-    spin_lock(&d1->event_lock);
+    write_lock(&d1->event_lock);
 
     if ( !port_is_valid(d1, port1) )
     {
@@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
                 BUG();
 
             if ( d1 < d2 )
-            {
-                spin_lock(&d2->event_lock);
-            }
+                read_lock(&d2->event_lock);
             else if ( d1 != d2 )
             {
-                spin_unlock(&d1->event_lock);
-                spin_lock(&d2->event_lock);
+                write_unlock(&d1->event_lock);
+                read_lock(&d2->event_lock);
                 goto again;
             }
         }
@@ -743,11 +741,11 @@ int evtchn_close(struct domain *d1, int
     if ( d2 != NULL )
     {
         if ( d1 != d2 )
-            spin_unlock(&d2->event_lock);
+            read_unlock(&d2->event_lock);
         put_domain(d2);
     }
 
-    spin_unlock(&d1->event_lock);
+    write_unlock(&d1->event_lock);
 
     return rc;
 }
@@ -963,7 +961,7 @@ int evtchn_status(evtchn_status_t *statu
     if ( d == NULL )
         return -ESRCH;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     if ( !port_is_valid(d, port) )
     {
@@ -1016,7 +1014,7 @@ int evtchn_status(evtchn_status_t *statu
     status->vcpu = chn->notify_vcpu_id;
 
  out:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
     rcu_unlock_domain(d);
 
     return rc;
@@ -1034,7 +1032,7 @@ long evtchn_bind_vcpu(unsigned int port,
     if ( (v = domain_vcpu(d, vcpu_id)) == NULL )
         return -ENOENT;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     if ( !port_is_valid(d, port) )
     {
@@ -1078,7 +1076,7 @@ long evtchn_bind_vcpu(unsigned int port,
     }
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -1124,7 +1122,7 @@ int evtchn_reset(struct domain *d, bool
     if ( d != current->domain && !d->controller_pause_count )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     /*
      * If we are resuming, then start where we stopped. Otherwise, check
@@ -1135,7 +1133,7 @@ int evtchn_reset(struct domain *d, bool
     if ( i > d->next_evtchn )
         d->next_evtchn = i;
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     if ( !i )
         return -EBUSY;
@@ -1147,14 +1145,14 @@ int evtchn_reset(struct domain *d, bool
         /* NB: Choice of frequency is arbitrary. */
         if ( !(i & 0x3f) && hypercall_preempt_check() )
         {
-            spin_lock(&d->event_lock);
+            write_lock(&d->event_lock);
             d->next_evtchn = i;
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -ERESTART;
         }
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     d->next_evtchn = 0;
 
@@ -1167,7 +1165,7 @@ int evtchn_reset(struct domain *d, bool
         evtchn_2l_init(d);
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
@@ -1357,7 +1355,7 @@ int alloc_unbound_xen_event_channel(
     struct evtchn *chn;
     int            port, rc;
 
-    spin_lock(&ld->event_lock);
+    write_lock(&ld->event_lock);
 
     port = rc = get_free_port(ld);
     if ( rc < 0 )
@@ -1385,7 +1383,7 @@ int alloc_unbound_xen_event_channel(
 
  out:
     check_free_port(ld, port);
-    spin_unlock(&ld->event_lock);
+    write_unlock(&ld->event_lock);
 
     return rc < 0 ? rc : port;
 }
@@ -1473,7 +1471,8 @@ int evtchn_init(struct domain *d, unsign
         return -ENOMEM;
     d->valid_evtchns = EVTCHNS_PER_BUCKET;
 
-    spin_lock_init_prof(d, event_lock);
+    rwlock_init(&d->event_lock);
+
     if ( get_free_port(d) != 0 )
     {
         free_evtchn_bucket(d, d->evtchn);
@@ -1500,7 +1499,7 @@ int evtchn_destroy(struct domain *d)
 
     /* After this barrier no new event-channel allocations can occur. */
     BUG_ON(!d->is_dying);
-    spin_barrier(&d->event_lock);
+    rw_barrier(&d->event_lock);
 
     /* Close all existing event channels. */
     for ( i = d->valid_evtchns; --i; )
@@ -1558,13 +1557,13 @@ void evtchn_move_pirqs(struct vcpu *v)
     unsigned int port;
     struct evtchn *chn;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     for ( port = v->pirq_evtchn_head; port; port = chn->u.pirq.next_port )
     {
         chn = evtchn_from_port(d, port);
         pirq_set_affinity(d, chn->u.pirq.irq, mask);
     }
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 
@@ -1577,7 +1576,7 @@ static void domain_dump_evtchn_info(stru
            "Polling vCPUs: {%*pbl}\n"
            "    port [p/m/s]\n", d->domain_id, d->max_vcpus, d->poll_mask);
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     for ( port = 1; port_is_valid(d, port); ++port )
     {
@@ -1624,7 +1623,7 @@ static void domain_dump_evtchn_info(stru
         }
     }
 
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static void dump_evtchn_info(unsigned char key)
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -561,7 +561,7 @@ int evtchn_fifo_init_control(struct evtc
     if ( offset & (8 - 1) )
         return -EINVAL;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     /*
      * If this is the first control block, setup an empty event array
@@ -593,13 +593,13 @@ int evtchn_fifo_init_control(struct evtc
     else
         rc = map_control_block(v, gfn, offset);
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 
  error:
     evtchn_fifo_destroy(d);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return rc;
 }
 
@@ -652,9 +652,9 @@ int evtchn_fifo_expand_array(const struc
     if ( !d->evtchn_fifo )
         return -EOPNOTSUPP;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     rc = add_page_to_event_array(d, expand_array->array_gfn);
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     return rc;
 }
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -105,7 +105,7 @@ static void pt_pirq_softirq_reset(struct
 {
     struct domain *d = pirq_dpci->dom;
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_write_locked(&d->event_lock));
 
     switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
     {
@@ -162,7 +162,7 @@ static void pt_irq_time_out(void *data)
     const struct hvm_irq_dpci *dpci;
     const struct dev_intx_gsi_link *digl;
 
-    spin_lock(&irq_map->dom->event_lock);
+    write_lock(&irq_map->dom->event_lock);
 
     if ( irq_map->flags & HVM_IRQ_DPCI_IDENTITY_GSI )
     {
@@ -177,7 +177,7 @@ static void pt_irq_time_out(void *data)
         hvm_gsi_deassert(irq_map->dom, dpci_pirq(irq_map)->pirq);
         irq_map->flags |= HVM_IRQ_DPCI_EOI_LATCH;
         pt_irq_guest_eoi(irq_map->dom, irq_map, NULL);
-        spin_unlock(&irq_map->dom->event_lock);
+        write_unlock(&irq_map->dom->event_lock);
         return;
     }
 
@@ -185,7 +185,7 @@ static void pt_irq_time_out(void *data)
     if ( unlikely(!dpci) )
     {
         ASSERT_UNREACHABLE();
-        spin_unlock(&irq_map->dom->event_lock);
+        write_unlock(&irq_map->dom->event_lock);
         return;
     }
     list_for_each_entry ( digl, &irq_map->digl_list, list )
@@ -204,7 +204,7 @@ static void pt_irq_time_out(void *data)
 
     pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL);
 
-    spin_unlock(&irq_map->dom->event_lock);
+    write_unlock(&irq_map->dom->event_lock);
 }
 
 struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *d)
@@ -288,7 +288,7 @@ int pt_irq_create_bind(
         return -EINVAL;
 
  restart:
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( !hvm_irq_dpci && !is_hardware_domain(d) )
@@ -304,7 +304,7 @@ int pt_irq_create_bind(
         hvm_irq_dpci = xzalloc(struct hvm_irq_dpci);
         if ( hvm_irq_dpci == NULL )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -ENOMEM;
         }
         for ( i = 0; i < NR_HVM_DOMU_IRQS; i++ )
@@ -316,7 +316,7 @@ int pt_irq_create_bind(
     info = pirq_get_info(d, pirq);
     if ( !info )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
@@ -331,7 +331,7 @@ int pt_irq_create_bind(
      */
     if ( pt_pirq_softirq_active(pirq_dpci) )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         cpu_relax();
         goto restart;
     }
@@ -389,7 +389,7 @@ int pt_irq_create_bind(
                 pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 return rc;
             }
         }
@@ -399,7 +399,7 @@ int pt_irq_create_bind(
 
             if ( (pirq_dpci->flags & mask) != mask )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 return -EBUSY;
             }
 
@@ -423,7 +423,7 @@ int pt_irq_create_bind(
 
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         pirq_dpci->gmsi.posted = false;
         vcpu = (dest_vcpu_id >= 0) ? d->vcpu[dest_vcpu_id] : NULL;
@@ -483,7 +483,7 @@ int pt_irq_create_bind(
 
             if ( !digl || !girq )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 xfree(girq);
                 xfree(digl);
                 return -ENOMEM;
@@ -510,7 +510,7 @@ int pt_irq_create_bind(
             if ( pt_irq_bind->irq_type != PT_IRQ_TYPE_PCI ||
                  pirq >= hvm_domain_irq(d)->nr_gsis )
             {
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
 
                 return -EINVAL;
             }
@@ -546,7 +546,7 @@ int pt_irq_create_bind(
 
                     if ( mask < 0 || trigger_mode < 0 )
                     {
-                        spin_unlock(&d->event_lock);
+                        write_unlock(&d->event_lock);
 
                         ASSERT_UNREACHABLE();
                         return -EINVAL;
@@ -594,14 +594,14 @@ int pt_irq_create_bind(
                 }
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
-                spin_unlock(&d->event_lock);
+                write_unlock(&d->event_lock);
                 xfree(girq);
                 xfree(digl);
                 return rc;
             }
         }
 
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
 
         if ( iommu_verbose )
         {
@@ -619,7 +619,7 @@ int pt_irq_create_bind(
     }
 
     default:
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -EOPNOTSUPP;
     }
 
@@ -672,13 +672,13 @@ int pt_irq_destroy_bind(
         return -EOPNOTSUPP;
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
 
     if ( !hvm_irq_dpci && !is_hardware_domain(d) )
     {
-        spin_unlock(&d->event_lock);
+        write_unlock(&d->event_lock);
         return -EINVAL;
     }
 
@@ -711,7 +711,7 @@ int pt_irq_destroy_bind(
 
         if ( girq )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return -EINVAL;
         }
 
@@ -755,7 +755,7 @@ int pt_irq_destroy_bind(
         pirq_cleanup_check(pirq, d);
     }
 
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 
     if ( what && iommu_verbose )
     {
@@ -799,7 +799,7 @@ int pt_pirq_iterate(struct domain *d,
     unsigned int pirq = 0, n, i;
     struct pirq *pirqs[8];
 
-    ASSERT(spin_is_locked(&d->event_lock));
+    ASSERT(rw_is_locked(&d->event_lock));
 
     do {
         n = radix_tree_gang_lookup(&d->pirq_tree, (void **)pirqs, pirq,
@@ -880,9 +880,9 @@ void hvm_dpci_msi_eoi(struct domain *d,
          (!hvm_domain_irq(d)->dpci && !is_hardware_domain(d)) )
        return;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
     pt_pirq_iterate(d, _hvm_dpci_msi_eoi, (void *)(long)vector);
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
 }
 
 static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
@@ -893,7 +893,7 @@ static void hvm_dirq_assist(struct domai
         return;
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -947,7 +947,7 @@ static void hvm_dirq_assist(struct domai
     }
 
  out:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 }
 
 static void hvm_pirq_eoi(struct pirq *pirq,
@@ -1012,7 +1012,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
 
     if ( is_hardware_domain(d) )
     {
-        spin_lock(&d->event_lock);
+        write_lock(&d->event_lock);
         hvm_gsi_eoi(d, guest_gsi, ent);
         goto unlock;
     }
@@ -1023,7 +1023,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
         return;
     }
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
 
     if ( !hvm_irq_dpci )
@@ -1033,7 +1033,7 @@ void hvm_dpci_eoi(struct domain *d, unsi
         __hvm_dpci_eoi(d, girq, ent);
 
 unlock:
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 }
 
 /*
--- a/xen/common/rwlock.c
+++ b/xen/common/rwlock.c
@@ -102,6 +102,14 @@ void queue_write_lock_slowpath(rwlock_t
     spin_unlock(&lock->lock);
 }
 
+void _rw_barrier(rwlock_t *lock)
+{
+    check_barrier(&lock->lock.debug);
+    smp_mb();
+    while ( _rw_is_locked(lock) )
+        arch_lock_relax();
+    smp_mb();
+}
 
 static DEFINE_PER_CPU(cpumask_t, percpu_rwlock_readers);
 
--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -65,7 +65,7 @@ void check_lock(union lock_debug *debug,
     }
 }
 
-static void check_barrier(union lock_debug *debug)
+void check_barrier(union lock_debug *debug)
 {
     if ( unlikely(atomic_read(&spin_debug) <= 0) )
         return;
@@ -108,7 +108,6 @@ void spin_debug_disable(void)
 
 #else /* CONFIG_DEBUG_LOCKS */
 
-#define check_barrier(l) ((void)0)
 #define got_lock(l) ((void)0)
 #define rel_lock(l) ((void)0)
 
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -878,7 +878,7 @@ static int pci_clean_dpci_irqs(struct do
     if ( !is_hvm_domain(d) )
         return 0;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
@@ -896,14 +896,14 @@ static int pci_clean_dpci_irqs(struct do
             ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
         if ( ret )
         {
-            spin_unlock(&d->event_lock);
+            write_unlock(&d->event_lock);
             return ret;
         }
 
         hvm_domain_irq(d)->dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/x86/hvm.c
+++ b/xen/drivers/passthrough/vtd/x86/hvm.c
@@ -54,7 +54,7 @@ void hvm_dpci_isairq_eoi(struct domain *
     if ( !is_iommu_enabled(d) )
         return;
 
-    spin_lock(&d->event_lock);
+    write_lock(&d->event_lock);
 
     dpci = domain_get_irq_dpci(d);
 
@@ -63,5 +63,5 @@ void hvm_dpci_isairq_eoi(struct domain *
         /* Multiple mirq may be mapped to one isa irq */
         pt_pirq_iterate(d, _hvm_dpci_isairq_eoi, (void *)(long)isairq);
     }
-    spin_unlock(&d->event_lock);
+    write_unlock(&d->event_lock);
 }
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -247,6 +247,8 @@ static inline int _rw_is_write_locked(rw
     return (atomic_read(&lock->cnts) & _QW_WMASK) == _QW_LOCKED;
 }
 
+void _rw_barrier(rwlock_t *lock);
+
 #define read_lock(l)                  _read_lock(l)
 #define read_lock_irq(l)              _read_lock_irq(l)
 #define read_lock_irqsave(l, f)                                 \
@@ -276,6 +278,7 @@ static inline int _rw_is_write_locked(rw
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
 
+#define rw_barrier(l)                 _rw_barrier(l)
 
 typedef struct percpu_rwlock percpu_rwlock_t;
 
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -376,7 +376,7 @@ struct domain
     unsigned int     xen_evtchns;
     /* Port to resume from in evtchn_reset(), when in a continuation. */
     unsigned int     next_evtchn;
-    spinlock_t       event_lock;
+    rwlock_t         event_lock;
     const struct evtchn_port_ops *evtchn_port_ops;
     struct evtchn_fifo_domain *evtchn_fifo;
 
--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -22,12 +22,14 @@ union lock_debug {
 };
 #define _LOCK_DEBUG { LOCK_DEBUG_INITVAL }
 void check_lock(union lock_debug *debug, bool try);
+void check_barrier(union lock_debug *debug);
 void spin_debug_enable(void);
 void spin_debug_disable(void);
 #else
 union lock_debug { };
 #define _LOCK_DEBUG { }
 #define check_lock(l, t) ((void)0)
+#define check_barrier(l) ((void)0)
 #define spin_debug_enable() ((void)0)
 #define spin_debug_disable() ((void)0)
 #endif
--- a/xen/xsm/flask/flask_op.c
+++ b/xen/xsm/flask/flask_op.c
@@ -555,7 +555,7 @@ static int flask_get_peer_sid(struct xen
     struct evtchn *chn;
     struct domain_security_struct *dsec;
 
-    spin_lock(&d->event_lock);
+    read_lock(&d->event_lock);
 
     if ( !port_is_valid(d, arg->evtchn) )
         goto out;
@@ -573,7 +573,7 @@ static int flask_get_peer_sid(struct xen
     rv = 0;
 
  out:
-    spin_unlock(&d->event_lock);
+    read_unlock(&d->event_lock);
     return rv;
 }
 




* [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
                   ` (3 preceding siblings ...)
  2020-11-23 13:29 ` [PATCH v3 4/5] evtchn: convert domain event " Jan Beulich
@ 2020-11-23 13:30 ` Jan Beulich
  2020-11-30 10:39   ` Isaila Alexandru
  2020-12-02 21:10   ` Julien Grall
  4 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2020-11-23 13:30 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila

While this doesn't look to cause any problems right now, the lock order
implications of holding the lock across the callback can be very
difficult to follow (and may be easy to violate unknowingly). The
present callbacks don't have any need for the lock to be held (and no
such callback should).

However, vm_event_disable() frees the structures used by the respective
callbacks and isn't otherwise synchronized with invocations of these
callbacks, so maintain a count of in-progress calls, which
evtchn_close() waits to drop to zero before freeing the port (and
dropping the lock).
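
In outline (a condensed sketch of the hunks below):

    /* evtchn_send(): account for the call and drop the lock around it. */
    atomic_inc(&rchn->u.interdomain.active_calls);
    evtchn_read_unlock(lchn);
    xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
    atomic_dec(&rchn->u.interdomain.active_calls);

    /* evtchn_close(): drain in-progress calls before freeing the port. */
    while ( atomic_read(&chn1->u.interdomain.active_calls) )
        cpu_relax();
    evtchn_free(d1, chn1);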

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Should we make this accounting optional, to be requested through a new
parameter to alloc_unbound_xen_event_channel(), or derived from a
non-default callback being requested?
---
v3: Drain callbacks before proceeding with closing. Re-base.

--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
     
     rchn->u.interdomain.remote_dom  = ld;
     rchn->u.interdomain.remote_port = lport;
+    atomic_set(&rchn->u.interdomain.active_calls, 0);
     rchn->state                     = ECS_INTERDOMAIN;
 
     /*
@@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
 
         double_evtchn_lock(chn1, chn2);
 
+        if ( consumer_is_xen(chn1) )
+            while ( atomic_read(&chn1->u.interdomain.active_calls) )
+                cpu_relax();
+
         evtchn_free(d1, chn1);
 
         chn2->state = ECS_UNBOUND;
@@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
         rport = lchn->u.interdomain.remote_port;
         rchn  = evtchn_from_port(rd, rport);
         if ( consumer_is_xen(rchn) )
+        {
+            /* Don't keep holding the lock for the call below. */
+            atomic_inc(&rchn->u.interdomain.active_calls);
+            evtchn_read_unlock(lchn);
             xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
-        else
-            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
+            atomic_dec(&rchn->u.interdomain.active_calls);
+            return 0;
+        }
+        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
         break;
     case ECS_IPI:
         evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -104,6 +104,7 @@ struct evtchn
         } unbound;     /* state == ECS_UNBOUND */
         struct {
             evtchn_port_t  remote_port;
+            atomic_t       active_calls;
             struct domain *remote_dom;
         } interdomain; /* state == ECS_INTERDOMAIN */
         struct {




* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-11-23 13:30 ` [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held Jan Beulich
@ 2020-11-30 10:39   ` Isaila Alexandru
  2020-12-02 21:10   ` Julien Grall
  1 sibling, 0 replies; 48+ messages in thread
From: Isaila Alexandru @ 2020-11-30 10:39 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Julien Grall, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU

On 23.11.2020 15:30, Jan Beulich wrote:
> While this doesn't look to cause any problems right now, the lock order
> implications of holding the lock across the callback can be very
> difficult to follow (and may be easy to violate unknowingly). The
> present callbacks don't have any need for the lock to be held (and no
> such callback should).
> 
> However, vm_event_disable() frees the structures used by the respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, which
> evtchn_close() waits to drop to zero before freeing the port (and
> dropping the lock).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>


* Re: [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
  2020-11-23 13:28 ` [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq() Jan Beulich
@ 2020-12-02 19:03   ` Julien Grall
  2020-12-03  9:46     ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-02 19:03 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu, Stefano Stabellini

Hi Jan,

On 23/11/2020 13:28, Jan Beulich wrote:
> The per-vCPU virq_lock, which is held anyway, together with the fact
> that evtchn_port_set_pending() is never called when v->virq_to_evtchn[]
> is zero, provides sufficient guarantees.

I agree that the per-vCPU virq_lock is going to be sufficient; however,
dropping the lock also means the event channel locking is more complex
to understand (the long comment that was added proves it).

In fact, the locking in the event channel code has already proven to be
quite fragile, so I think this patch is not worth the risk.

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-11-23 13:30 ` [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held Jan Beulich
  2020-11-30 10:39   ` Isaila Alexandru
@ 2020-12-02 21:10   ` Julien Grall
  2020-12-03 10:09     ` Jan Beulich
  1 sibling, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-02 21:10 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila

Hi Jan,

On 23/11/2020 13:30, Jan Beulich wrote:
> While there don't look to be any problems with this right now, the lock
> order implications from holding the lock can be very difficult to follow
> (and may be easy to violate unknowingly). The present callbacks don't
> (and no such callback should) have any need for the lock to be held.
> 
> However, vm_event_disable() frees the structures used by respective
> callbacks and isn't otherwise synchronized with invocations of these
> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> to wait to drop to zero before freeing the port (and dropping the lock).

AFAICT, this callback is not the only place where the synchronization is 
missing in the VM event code.

For instance, vm_event_put_request() can also race against 
vm_event_disable().

So shouldn't we handle this issue properly in VM event?

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Should we make this accounting optional, to be requested through a new
> parameter to alloc_unbound_xen_event_channel(), or derived from other
> than the default callback being requested?

Aside from the VM event, do you see any value for the other caller?

> ---
> v3: Drain callbacks before proceeding with closing. Re-base.
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -397,6 +397,7 @@ static long evtchn_bind_interdomain(evtc
>       
>       rchn->u.interdomain.remote_dom  = ld;
>       rchn->u.interdomain.remote_port = lport;
> +    atomic_set(&rchn->u.interdomain.active_calls, 0);
>       rchn->state                     = ECS_INTERDOMAIN;
>   
>       /*
> @@ -720,6 +721,10 @@ int evtchn_close(struct domain *d1, int
>   
>           double_evtchn_lock(chn1, chn2);
>   
> +        if ( consumer_is_xen(chn1) )
> +            while ( atomic_read(&chn1->u.interdomain.active_calls) )
> +                cpu_relax();
> +
>           evtchn_free(d1, chn1);
>   
>           chn2->state = ECS_UNBOUND;
> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>           rport = lchn->u.interdomain.remote_port;
>           rchn  = evtchn_from_port(rd, rport);
>           if ( consumer_is_xen(rchn) )
> +        {
> +            /* Don't keep holding the lock for the call below. */
> +            atomic_inc(&rchn->u.interdomain.active_calls);
> +            evtchn_read_unlock(lchn);
>               xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
> -        else
> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);

atomic_dec() doesn't contain any memory barrier, so we will want one 
between xen_notification_fn() and atomic_dec() to avoid re-ordering.

> +            atomic_dec(&rchn->u.interdomain.active_calls);
> +            return 0;
> +        }
> +        evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>           break;
>       case ECS_IPI:
>           evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn);
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -104,6 +104,7 @@ struct evtchn
>           } unbound;     /* state == ECS_UNBOUND */
>           struct {
>               evtchn_port_t  remote_port;
> +            atomic_t       active_calls;
>               struct domain *remote_dom;
>           } interdomain; /* state == ECS_INTERDOMAIN */
>           struct {
> 

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses
  2020-11-23 13:28 ` [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses Jan Beulich
@ 2020-12-02 21:14   ` Julien Grall
  0 siblings, 0 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-02 21:14 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu, Stefano Stabellini

Hi Jan,

On 23/11/2020 13:28, Jan Beulich wrote:
> Use {read,write}_atomic() to exclude any eventualities, in particular
> observing that accesses aren't all happening under a consistent lock.
> 
> Requested-by: Julien Grall <julien@xen.org>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,

> ---
> v3: New.
> 
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -446,7 +446,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t
>   
>       spin_lock(&d->event_lock);
>   
> -    if ( v->virq_to_evtchn[virq] != 0 )
> +    if ( read_atomic(&v->virq_to_evtchn[virq]) )
>           ERROR_EXIT(-EEXIST);
>   
>       if ( port != 0 )
> @@ -474,7 +474,8 @@ int evtchn_bind_virq(evtchn_bind_virq_t
>   
>       evtchn_write_unlock(chn);
>   
> -    v->virq_to_evtchn[virq] = bind->port = port;
> +    bind->port = port;
> +    write_atomic(&v->virq_to_evtchn[virq], port);
>   
>    out:
>       spin_unlock(&d->event_lock);
> @@ -660,9 +661,9 @@ int evtchn_close(struct domain *d1, int
>       case ECS_VIRQ:
>           for_each_vcpu ( d1, v )
>           {
> -            if ( v->virq_to_evtchn[chn1->u.virq] != port1 )
> +            if ( read_atomic(&v->virq_to_evtchn[chn1->u.virq]) != port1 )
>                   continue;
> -            v->virq_to_evtchn[chn1->u.virq] = 0;
> +            write_atomic(&v->virq_to_evtchn[chn1->u.virq], 0);
>               spin_barrier(&v->virq_lock);
>           }
>           break;
> @@ -801,7 +802,7 @@ bool evtchn_virq_enabled(const struct vc
>       if ( virq_is_global(virq) && v->vcpu_id )
>           v = domain_vcpu(v->domain, 0);
>   
> -    return v->virq_to_evtchn[virq];
> +    return read_atomic(&v->virq_to_evtchn[virq]);
>   }
>   
>   void send_guest_vcpu_virq(struct vcpu *v, uint32_t virq)
> @@ -814,7 +815,7 @@ void send_guest_vcpu_virq(struct vcpu *v
>   
>       spin_lock_irqsave(&v->virq_lock, flags);
>   
> -    port = v->virq_to_evtchn[virq];
> +    port = read_atomic(&v->virq_to_evtchn[virq]);
>       if ( unlikely(port == 0) )
>           goto out;
>   
> @@ -843,7 +844,7 @@ void send_guest_global_virq(struct domai
>   
>       spin_lock_irqsave(&v->virq_lock, flags);
>   
> -    port = v->virq_to_evtchn[virq];
> +    port = read_atomic(&v->virq_to_evtchn[virq]);
>       if ( unlikely(port == 0) )
>           goto out;
>   
> 

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
  2020-12-02 19:03   ` Julien Grall
@ 2020-12-03  9:46     ` Jan Beulich
  2020-12-09  9:53       ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-03  9:46 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 02.12.2020 20:03, Julien Grall wrote:
> On 23/11/2020 13:28, Jan Beulich wrote:
>> The per-vCPU virq_lock, which is being held anyway, together with there
>> not being any call to evtchn_port_set_pending() when v->virq_to_evtchn[]
>> is zero, provide sufficient guarantees. 
> 
> I agree that the per-vCPU virq_lock is going to be sufficient, however 
> dropping the lock also means the event channel locking is more complex 
> to understand (the long comment that was added proves it).
> 
> In fact, the locking in the event channel code was already proven to be 
> quite fragile, therefore I think this patch is not worth the risk.

I agree this is a very reasonable position to take. I probably
would even have remained silent if in the meantime the
spin_lock()s there hadn't changed to read_trylock()s. I really
think we want to limit this unusual locking model to where we
strictly need it. And this change eliminates 50% of them, with
the added benefit of making the paths more lightweight.

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-02 21:10   ` Julien Grall
@ 2020-12-03 10:09     ` Jan Beulich
  2020-12-03 14:40       ` Tamas K Lengyel
  2020-12-04 11:28       ` Julien Grall
  0 siblings, 2 replies; 48+ messages in thread
From: Jan Beulich @ 2020-12-03 10:09 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 02.12.2020 22:10, Julien Grall wrote:
> On 23/11/2020 13:30, Jan Beulich wrote:
>> While there don't look to be any problems with this right now, the lock
>> order implications from holding the lock can be very difficult to follow
>> (and may be easy to violate unknowingly). The present callbacks don't
>> (and no such callback should) have any need for the lock to be held.
>>
>> However, vm_event_disable() frees the structures used by respective
>> callbacks and isn't otherwise synchronized with invocations of these
>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>> to wait to drop to zero before freeing the port (and dropping the lock).
> 
> AFAICT, this callback is not the only place where the synchronization is 
> missing in the VM event code.
> 
> For instance, vm_event_put_request() can also race against 
> vm_event_disable().
> 
> So shouldn't we handle this issue properly in VM event?

I suppose that's a question to the VM event folks rather than me?

>> ---
>> Should we make this accounting optional, to be requested through a new
>> parameter to alloc_unbound_xen_event_channel(), or derived from other
>> than the default callback being requested?
> 
> Aside the VM event, do you see any value for the other caller?

No (albeit I'm not entirely certain about vpl011_notification()'s
needs), hence the consideration. It's unnecessary overhead in
those cases.

>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>>           rport = lchn->u.interdomain.remote_port;
>>           rchn  = evtchn_from_port(rd, rport);
>>           if ( consumer_is_xen(rchn) )
>> +        {
>> +            /* Don't keep holding the lock for the call below. */
>> +            atomic_inc(&rchn->u.interdomain.active_calls);
>> +            evtchn_read_unlock(lchn);
>>               xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
>> -        else
>> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
> 
> atomic_dec() doesn't contain any memory barrier, so we will want one 
> between xen_notification_fn() and atomic_dec() to avoid re-ordering.

Oh, indeed. But smp_mb() is too heavy-handed here - x86 doesn't
really need any barrier, yet would gain a full MFENCE that way.
Actually - looks like I forgot we gained smp_mb__before_atomic()
a little over half a year ago.
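
I.e. the tail of the interdomain case in evtchn_send() would then
become something like this (just a sketch, assuming v4 otherwise keeps
the v3 shape of the code):

            /* Don't keep holding the lock for the call below. */
            atomic_inc(&rchn->u.interdomain.active_calls);
            evtchn_read_unlock(lchn);
            xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
            /*
             * Order the callback's effects before the decrement which
             * permits evtchn_close() to proceed with freeing the port.
             */
            smp_mb__before_atomic();
            atomic_dec(&rchn->u.interdomain.active_calls);
            return 0;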

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-03 10:09     ` Jan Beulich
@ 2020-12-03 14:40       ` Tamas K Lengyel
  2020-12-04 11:28       ` Julien Grall
  1 sibling, 0 replies; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-03 14:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Julien Grall, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Thu, Dec 3, 2020 at 5:09 AM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 02.12.2020 22:10, Julien Grall wrote:
> > On 23/11/2020 13:30, Jan Beulich wrote:
> >> While there don't look to be any problems with this right now, the lock
> >> order implications from holding the lock can be very difficult to follow
> >> (and may be easy to violate unknowingly). The present callbacks don't
> >> (and no such callback should) have any need for the lock to be held.
> >>
> >> However, vm_event_disable() frees the structures used by respective
> >> callbacks and isn't otherwise synchronized with invocations of these
> >> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >> to wait to drop to zero before freeing the port (and dropping the lock).
> >
> > AFAICT, this callback is not the only place where the synchronization is
> > missing in the VM event code.
> >
> > For instance, vm_event_put_request() can also race against
> > vm_event_disable().
> >
> > So shouldn't we handle this issue properly in VM event?
>
> I suppose that's a question to the VM event folks rather than me?

IMHO it would obviously be better if the Xen side could handle
situations like these gracefully. OTOH it is also reasonable to expect
the privileged toolstack to perform its own sanity checks before
disabling. Right now, for example, disabling vm_event first pauses the
VM, processes all remaining events that were on the ring, and only then
disables the interface. This avoids the race condition mentioned above
(among other issues). It's not perfect - we ran into problems where
event replies don't have the desired effect because the VM is paused -
but for the most part doing this is sufficient. So I don't consider
this to be a priority at the moment. That said, if anyone is so
inclined to fix this up, I would be happy to review & ack.
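
For reference, that teardown order, roughly as a monitoring application
implements it (a simplified sketch on top of libxc's xc_domain_pause(),
xc_monitor_disable(), and xc_domain_unpause(); the actual draining of
the ring and all error handling are elided):

    /* No new requests can be generated once the domain is paused. */
    xc_domain_pause(xch, domid);

    /* ... fetch and reply to any requests still on the ring here ... */

    /* Tear down the ring and the associated event channel. */
    xc_monitor_disable(xch, domid);

    xc_domain_unpause(xch, domid);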

Tamas


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-03 10:09     ` Jan Beulich
  2020-12-03 14:40       ` Tamas K Lengyel
@ 2020-12-04 11:28       ` Julien Grall
  2020-12-04 11:48         ` Jan Beulich
  2020-12-04 15:21         ` Tamas K Lengyel
  1 sibling, 2 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-04 11:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

Hi Jan,

On 03/12/2020 10:09, Jan Beulich wrote:
> On 02.12.2020 22:10, Julien Grall wrote:
>> On 23/11/2020 13:30, Jan Beulich wrote:
>>> While there don't look to be any problems with this right now, the lock
>>> order implications from holding the lock can be very difficult to follow
>>> (and may be easy to violate unknowingly). The present callbacks don't
>>> (and no such callback should) have any need for the lock to be held.
>>>
>>> However, vm_event_disable() frees the structures used by respective
>>> callbacks and isn't otherwise synchronized with invocations of these
>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>
>> AFAICT, this callback is not the only place where the synchronization is
>> missing in the VM event code.
>>
>> For instance, vm_event_put_request() can also race against
>> vm_event_disable().
>>
>> So shouldn't we handle this issue properly in VM event?
> 
> I suppose that's a question to the VM event folks rather than me?

Yes. From my understanding of Tamas's e-mail, they are relying on the 
monitoring software to do the right thing.

I will refrain from commenting on this approach. However, given the race is 
much wider than the event channel, I would recommend not adding more 
code in the event channel to deal with such a problem.

Instead, this should be fixed in the VM event code when someone has time 
to harden the subsystem.

> 
>>> ---
>>> Should we make this accounting optional, to be requested through a new
>>> parameter to alloc_unbound_xen_event_channel(), or derived from other
>>> than the default callback being requested?
>>
>> Aside the VM event, do you see any value for the other caller?
> 
> No (albeit I'm not entirely certain about vpl011_notification()'s
> needs), hence the consideration. It's unnecessary overhead in
> those cases.

I had another look and I think there is a small race in VPL011. It 
should be easy to fix (I will try to have a look later today).

> 
>>> @@ -781,9 +786,15 @@ int evtchn_send(struct domain *ld, unsig
>>>            rport = lchn->u.interdomain.remote_port;
>>>            rchn  = evtchn_from_port(rd, rport);
>>>            if ( consumer_is_xen(rchn) )
>>> +        {
>>> +            /* Don't keep holding the lock for the call below. */
>>> +            atomic_inc(&rchn->u.interdomain.active_calls);
>>> +            evtchn_read_unlock(lchn);
>>>                xen_notification_fn(rchn)(rd->vcpu[rchn->notify_vcpu_id], rport);
>>> -        else
>>> -            evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn);
>>
>> atomic_dec() doesn't contain any memory barrier, so we will want one
>> between xen_notification_fn() and atomic_dec() to avoid re-ordering.
> 
> Oh, indeed. But smp_mb() is too heavy handed here - x86 doesn't
> really need any barrier, yet would gain a full MFENCE that way.
> Actually - looks like I forgot we gained smp_mb__before_atomic()
> a little over half a year ago.

Ah yes, I forgot that atomics instruction are ordered on x86.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 11:28       ` Julien Grall
@ 2020-12-04 11:48         ` Jan Beulich
  2020-12-04 11:51           ` Julien Grall
  2020-12-04 15:21         ` Tamas K Lengyel
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-04 11:48 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 04.12.2020 12:28, Julien Grall wrote:
> Hi Jan,
> 
> On 03/12/2020 10:09, Jan Beulich wrote:
>> On 02.12.2020 22:10, Julien Grall wrote:
>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>> While there don't look to be any problems with this right now, the lock
>>>> order implications from holding the lock can be very difficult to follow
>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>> (and no such callback should) have any need for the lock to be held.
>>>>
>>>> However, vm_event_disable() frees the structures used by respective
>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>
>>> AFAICT, this callback is not the only place where the synchronization is
>>> missing in the VM event code.
>>>
>>> For instance, vm_event_put_request() can also race against
>>> vm_event_disable().
>>>
>>> So shouldn't we handle this issue properly in VM event?
>>
>> I suppose that's a question to the VM event folks rather than me?
> 
> Yes. From my understanding of Tamas's e-mail, they are relying on the 
> monitoring software to do the right thing.
> 
> I will refrain to comment on this approach. However, given the race is 
> much wider than the event channel, I would recommend to not add more 
> code in the event channel to deal with such problem.
> 
> Instead, this should be fixed in the VM event code when someone has time 
> to harden the subsystem.

Are you effectively saying I should now undo the addition of the
refcounting, which was added in response to feedback from you?
Or else what exactly am I to take from your reply?

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 11:48         ` Jan Beulich
@ 2020-12-04 11:51           ` Julien Grall
  2020-12-04 12:01             ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-04 11:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel



On 04/12/2020 11:48, Jan Beulich wrote:
> On 04.12.2020 12:28, Julien Grall wrote:
>> Hi Jan,
>>
>> On 03/12/2020 10:09, Jan Beulich wrote:
>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>> While there don't look to be any problems with this right now, the lock
>>>>> order implications from holding the lock can be very difficult to follow
>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>
>>>>> However, vm_event_disable() frees the structures used by respective
>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>
>>>> AFAICT, this callback is not the only place where the synchronization is
>>>> missing in the VM event code.
>>>>
>>>> For instance, vm_event_put_request() can also race against
>>>> vm_event_disable().
>>>>
>>>> So shouldn't we handle this issue properly in VM event?
>>>
>>> I suppose that's a question to the VM event folks rather than me?
>>
>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>> monitoring software to do the right thing.
>>
>> I will refrain to comment on this approach. However, given the race is
>> much wider than the event channel, I would recommend to not add more
>> code in the event channel to deal with such problem.
>>
>> Instead, this should be fixed in the VM event code when someone has time
>> to harden the subsystem.
> 
> Are effectively saying I should now undo the addition of the
> refcounting, which was added in response to feedback from you?

Please point out where I made the request to use the refcounting...

I pointed out there was an issue with the VM event code. This was later 
analysed as a wider issue. The VM event folks don't seem to be very 
concerned about the race, so I don't see the reason to try to fix it in the 
event channel code.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 11:51           ` Julien Grall
@ 2020-12-04 12:01             ` Jan Beulich
  2020-12-04 15:09               ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-04 12:01 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 04.12.2020 12:51, Julien Grall wrote:
> 
> 
> On 04/12/2020 11:48, Jan Beulich wrote:
>> On 04.12.2020 12:28, Julien Grall wrote:
>>> Hi Jan,
>>>
>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>
>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>
>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>> missing in the VM event code.
>>>>>
>>>>> For instance, vm_event_put_request() can also race against
>>>>> vm_event_disable().
>>>>>
>>>>> So shouldn't we handle this issue properly in VM event?
>>>>
>>>> I suppose that's a question to the VM event folks rather than me?
>>>
>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>> monitoring software to do the right thing.
>>>
>>> I will refrain to comment on this approach. However, given the race is
>>> much wider than the event channel, I would recommend to not add more
>>> code in the event channel to deal with such problem.
>>>
>>> Instead, this should be fixed in the VM event code when someone has time
>>> to harden the subsystem.
>>
>> Are effectively saying I should now undo the addition of the
>> refcounting, which was added in response to feedback from you?
> 
> Please point out where I made the request to use the refcounting...

You didn't ask for this directly, sure, but ...

> I pointed out there was an issue with the VM event code.

... this has ultimately led to the decision to use refcounting
(iirc there was one alternative that I had proposed, besides
the option of doing nothing).

> This was latter 
> analysed as a wider issue. The VM event folks doesn't seem to be very 
> concerned on the race, so I don't see the reason to try to fix it in the 
> event channel code.

And you won't need the refcount for vpl011 then? I can certainly
drop it again, but it feels odd to go back to an earlier version
under the circumstances ...

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 12:01             ` Jan Beulich
@ 2020-12-04 15:09               ` Julien Grall
  2020-12-07  8:02                 ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-04 15:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

Hi,

On 04/12/2020 12:01, Jan Beulich wrote:
> On 04.12.2020 12:51, Julien Grall wrote:
>>
>>
>> On 04/12/2020 11:48, Jan Beulich wrote:
>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>> Hi Jan,
>>>>
>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>
>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>
>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>> missing in the VM event code.
>>>>>>
>>>>>> For instance, vm_event_put_request() can also race against
>>>>>> vm_event_disable().
>>>>>>
>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>
>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>
>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>> monitoring software to do the right thing.
>>>>
>>>> I will refrain to comment on this approach. However, given the race is
>>>> much wider than the event channel, I would recommend to not add more
>>>> code in the event channel to deal with such problem.
>>>>
>>>> Instead, this should be fixed in the VM event code when someone has time
>>>> to harden the subsystem.
>>>
>>> Are effectively saying I should now undo the addition of the
>>> refcounting, which was added in response to feedback from you?
>>
>> Please point out where I made the request to use the refcounting...
> 
> You didn't ask for this directly, sure, but ...
> 
>> I pointed out there was an issue with the VM event code.
> 
> ... this has ultimately led to the decision to use refcounting
> (iirc there was one alternative that I had proposed, besides
> the option of doing nothing).

One other option that was discussed (maybe only on security@xen.org) is 
to move the spinlock outside of the structure so it is always allocated.
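
Roughly like this (hypothetical names, purely to illustrate the idea -
not the actual Xen layout):

    struct vm_event_domain {
        void *ring_page;             /* freed by vm_event_disable() */
    };

    struct domain_vm_event {
        spinlock_t lock;             /* embedded, hence always valid */
        struct vm_event_domain *ved; /* NULL once disabled */
    };

That way teardown could still synchronize with in-flight users even
after the dynamically allocated state has been freed.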

> 
>> This was latter
>> analysed as a wider issue. The VM event folks doesn't seem to be very
>> concerned on the race, so I don't see the reason to try to fix it in the
>> event channel code.
> 
> And you won't need the refcount for vpl011 then?

I don't believe we need it for the vpl011 as the spin lock protecting 
the code should always be allocated. The problem today is the lock is 
initialized too late.
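
I.e. something along these lines should do (a sketch against
domain_vpl011_init(), with details assumed rather than checked - not an
actual patch):

    /*
     * Initialize the lock before the event channel is allocated, as
     * vpl011_notification() may fire (and take the lock) as soon as
     * the channel exists.
     */
    spin_lock_init(&vpl011->lock);

    rc = alloc_unbound_xen_event_channel(d, 0, info->console_domid,
                                         vpl011_notification);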

> I can certainly
> drop it again, but it feels odd to go back to an earlier version
> under the circumstances ...

The code introduced doesn't look necessary outside of the VM event code.
So I think it would be wrong to merge it if it is just papering over a 
bigger problem.

Cheers,



> 
> Jan
> 

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 11:28       ` Julien Grall
  2020-12-04 11:48         ` Jan Beulich
@ 2020-12-04 15:21         ` Tamas K Lengyel
  2020-12-04 15:29           ` Julien Grall
  1 sibling, 1 reply; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-04 15:21 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>
> Hi Jan,
>
> On 03/12/2020 10:09, Jan Beulich wrote:
> > On 02.12.2020 22:10, Julien Grall wrote:
> >> On 23/11/2020 13:30, Jan Beulich wrote:
> >>> While there don't look to be any problems with this right now, the lock
> >>> order implications from holding the lock can be very difficult to follow
> >>> (and may be easy to violate unknowingly). The present callbacks don't
> >>> (and no such callback should) have any need for the lock to be held.
> >>>
> >>> However, vm_event_disable() frees the structures used by respective
> >>> callbacks and isn't otherwise synchronized with invocations of these
> >>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>
> >> AFAICT, this callback is not the only place where the synchronization is
> >> missing in the VM event code.
> >>
> >> For instance, vm_event_put_request() can also race against
> >> vm_event_disable().
> >>
> >> So shouldn't we handle this issue properly in VM event?
> >
> > I suppose that's a question to the VM event folks rather than me?
>
> Yes. From my understanding of Tamas's e-mail, they are relying on the
> monitoring software to do the right thing.
>
> I will refrain to comment on this approach. However, given the race is
> much wider than the event channel, I would recommend to not add more
> code in the event channel to deal with such problem.
>
> Instead, this should be fixed in the VM event code when someone has time
> to harden the subsystem.

I double-checked and the disable route is actually more robust; we
don't just rely on the toolstack doing the right thing. The domain
gets paused before any calls to vm_event_disable. So I don't think
there is really a race condition here.

Tamas


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 15:21         ` Tamas K Lengyel
@ 2020-12-04 15:29           ` Julien Grall
  2020-12-04 19:15             ` Tamas K Lengyel
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-04 15:29 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel



On 04/12/2020 15:21, Tamas K Lengyel wrote:
> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>
>> Hi Jan,
>>
>> On 03/12/2020 10:09, Jan Beulich wrote:
>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>> While there don't look to be any problems with this right now, the lock
>>>>> order implications from holding the lock can be very difficult to follow
>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>
>>>>> However, vm_event_disable() frees the structures used by respective
>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>
>>>> AFAICT, this callback is not the only place where the synchronization is
>>>> missing in the VM event code.
>>>>
>>>> For instance, vm_event_put_request() can also race against
>>>> vm_event_disable().
>>>>
>>>> So shouldn't we handle this issue properly in VM event?
>>>
>>> I suppose that's a question to the VM event folks rather than me?
>>
>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>> monitoring software to do the right thing.
>>
>> I will refrain to comment on this approach. However, given the race is
>> much wider than the event channel, I would recommend to not add more
>> code in the event channel to deal with such problem.
>>
>> Instead, this should be fixed in the VM event code when someone has time
>> to harden the subsystem.
> 
> I double-checked and the disable route is actually more robust, we
> don't just rely on the toolstack doing the right thing. The domain
> gets paused before any calls to vm_event_disable. So I don't think
> there is really a race-condition here.

The code will *only* pause the monitored domain. I can see two issues:
    1) The toolstack is still sending events while destroy is happening. 
This is the race discussed here.
    2) The implementation of vm_event_put_request() suggests that it can be 
called with a non-current domain.

I don't see how just pausing the monitored domain is enough here.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 15:29           ` Julien Grall
@ 2020-12-04 19:15             ` Tamas K Lengyel
  2020-12-04 19:22               ` Julien Grall
  2020-12-07 15:28               ` Jan Beulich
  0 siblings, 2 replies; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-04 19:15 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>
>
>
> On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> >>
> >> Hi Jan,
> >>
> >> On 03/12/2020 10:09, Jan Beulich wrote:
> >>> On 02.12.2020 22:10, Julien Grall wrote:
> >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> >>>>> While there don't look to be any problems with this right now, the lock
> >>>>> order implications from holding the lock can be very difficult to follow
> >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> >>>>> (and no such callback should) have any need for the lock to be held.
> >>>>>
> >>>>> However, vm_event_disable() frees the structures used by respective
> >>>>> callbacks and isn't otherwise synchronized with invocations of these
> >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>>>
> >>>> AFAICT, this callback is not the only place where the synchronization is
> >>>> missing in the VM event code.
> >>>>
> >>>> For instance, vm_event_put_request() can also race against
> >>>> vm_event_disable().
> >>>>
> >>>> So shouldn't we handle this issue properly in VM event?
> >>>
> >>> I suppose that's a question to the VM event folks rather than me?
> >>
> >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> >> monitoring software to do the right thing.
> >>
> >> I will refrain to comment on this approach. However, given the race is
> >> much wider than the event channel, I would recommend to not add more
> >> code in the event channel to deal with such problem.
> >>
> >> Instead, this should be fixed in the VM event code when someone has time
> >> to harden the subsystem.
> >
> > I double-checked and the disable route is actually more robust, we
> > don't just rely on the toolstack doing the right thing. The domain
> > gets paused before any calls to vm_event_disable. So I don't think
> > there is really a race-condition here.
>
> The code will *only* pause the monitored domain. I can see two issues:
>     1) The toolstack is still sending event while destroy is happening.
> This is the race discussed here.
>     2) The implement of vm_event_put_request() suggests that it can be
> called with not-current domain.
>
> I don't see how just pausing the monitored domain is enough here.

Requests only get generated by the monitored domain. So if the domain
is not running you won't get more of them. The toolstack can only send
replies.

Tamas


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 19:15             ` Tamas K Lengyel
@ 2020-12-04 19:22               ` Julien Grall
  2020-12-04 21:23                 ` Tamas K Lengyel
  2020-12-07 15:28               ` Jan Beulich
  1 sibling, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-04 19:22 UTC (permalink / raw)
  To: Tamas K Lengyel
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote:
>
> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> >
> >
> >
> > On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> > >>
> > >> Hi Jan,
> > >>
> > >> On 03/12/2020 10:09, Jan Beulich wrote:
> > >>> On 02.12.2020 22:10, Julien Grall wrote:
> > >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> > >>>>> While there don't look to be any problems with this right now, the lock
> > >>>>> order implications from holding the lock can be very difficult to follow
> > >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> > >>>>> (and no such callback should) have any need for the lock to be held.
> > >>>>>
> > >>>>> However, vm_event_disable() frees the structures used by respective
> > >>>>> callbacks and isn't otherwise synchronized with invocations of these
> > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> > >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> > >>>>
> > >>>> AFAICT, this callback is not the only place where the synchronization is
> > >>>> missing in the VM event code.
> > >>>>
> > >>>> For instance, vm_event_put_request() can also race against
> > >>>> vm_event_disable().
> > >>>>
> > >>>> So shouldn't we handle this issue properly in VM event?
> > >>>
> > >>> I suppose that's a question to the VM event folks rather than me?
> > >>
> > >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> > >> monitoring software to do the right thing.
> > >>
> > >> I will refrain to comment on this approach. However, given the race is
> > >> much wider than the event channel, I would recommend to not add more
> > >> code in the event channel to deal with such problem.
> > >>
> > >> Instead, this should be fixed in the VM event code when someone has time
> > >> to harden the subsystem.
> > >
> > > I double-checked and the disable route is actually more robust, we
> > > don't just rely on the toolstack doing the right thing. The domain
> > > gets paused before any calls to vm_event_disable. So I don't think
> > > there is really a race-condition here.
> >
> > The code will *only* pause the monitored domain. I can see two issues:
> >     1) The toolstack is still sending event while destroy is happening.
> > This is the race discussed here.
> >     2) The implement of vm_event_put_request() suggests that it can be
> > called with not-current domain.
> >
> > I don't see how just pausing the monitored domain is enough here.
>
> Requests only get generated by the monitored domain.

If that's the case, then why is vm_event_put_request() able to
deal with a non-current domain?

I understand that you are possibly trusting whoever may call it, but this
looks quite fragile.

Cheers,

---


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 19:22               ` Julien Grall
@ 2020-12-04 21:23                 ` Tamas K Lengyel
  0 siblings, 0 replies; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-04 21:23 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Fri, Dec 4, 2020 at 2:22 PM Julien Grall <julien.grall.oss@gmail.com> wrote:
>
> On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@gmail.com> wrote:
> >
> > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> > >
> > >
> > >
> > > On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> > > >>
> > > >> Hi Jan,
> > > >>
> > > >> On 03/12/2020 10:09, Jan Beulich wrote:
> > > >>> On 02.12.2020 22:10, Julien Grall wrote:
> > > >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> > > >>>>> While there don't look to be any problems with this right now, the lock
> > > >>>>> order implications from holding the lock can be very difficult to follow
> > > >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> > > >>>>> (and no such callback should) have any need for the lock to be held.
> > > >>>>>
> > > >>>>> However, vm_event_disable() frees the structures used by respective
> > > >>>>> callbacks and isn't otherwise synchronized with invocations of these
> > > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> > > >>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> > > >>>>
> > > >>>> AFAICT, this callback is not the only place where the synchronization is
> > > >>>> missing in the VM event code.
> > > >>>>
> > > >>>> For instance, vm_event_put_request() can also race against
> > > >>>> vm_event_disable().
> > > >>>>
> > > >>>> So shouldn't we handle this issue properly in VM event?
> > > >>>
> > > >>> I suppose that's a question to the VM event folks rather than me?
> > > >>
> > > >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> > > >> monitoring software to do the right thing.
> > > >>
> > > >> I will refrain to comment on this approach. However, given the race is
> > > >> much wider than the event channel, I would recommend to not add more
> > > >> code in the event channel to deal with such problem.
> > > >>
> > > >> Instead, this should be fixed in the VM event code when someone has time
> > > >> to harden the subsystem.
> > > >
> > > > I double-checked and the disable route is actually more robust, we
> > > > don't just rely on the toolstack doing the right thing. The domain
> > > > gets paused before any calls to vm_event_disable. So I don't think
> > > > there is really a race-condition here.
> > >
> > > The code will *only* pause the monitored domain. I can see two issues:
> > >     1) The toolstack is still sending event while destroy is happening.
> > > This is the race discussed here.
> > >     2) The implement of vm_event_put_request() suggests that it can be
> > > called with not-current domain.
> > >
> > > I don't see how just pausing the monitored domain is enough here.
> >
> > Requests only get generated by the monitored domain.
>
> If that's the case, then why is vm_event_put_request() able to
> deal with a non-current domain?
>
> I understand that you are possibly trusting who may call it, but this
> looks quite fragile.

I didn't write the system. You probably want to ask that question of
the original author.

Tamas


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 15:09               ` Julien Grall
@ 2020-12-07  8:02                 ` Jan Beulich
  2020-12-07 17:22                   ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-07  8:02 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 04.12.2020 16:09, Julien Grall wrote:
> On 04/12/2020 12:01, Jan Beulich wrote:
>> On 04.12.2020 12:51, Julien Grall wrote:
>>> On 04/12/2020 11:48, Jan Beulich wrote:
>>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>
>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>
>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>> monitoring software to do the right thing.
>>>>>
>>>>> I will refrain to comment on this approach. However, given the race is
>>>>> much wider than the event channel, I would recommend to not add more
>>>>> code in the event channel to deal with such problem.
>>>>>
>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>> to harden the subsystem.
>>>>
>>>> Are effectively saying I should now undo the addition of the
>>>> refcounting, which was added in response to feedback from you?
>>>
>>> Please point out where I made the request to use the refcounting...
>>
>> You didn't ask for this directly, sure, but ...
>>
>>> I pointed out there was an issue with the VM event code.
>>
>> ... this has ultimately led to the decision to use refcounting
>> (iirc there was one alternative that I had proposed, besides
>> the option of doing nothing).
> 
> One other option that was discussed (maybe only on security@xen.org) is 
> to move the spinlock outside of the structure so it is always allocated.

Oh, right - I forgot about that one, because that's nothing I would
ever have taken it upon myself to actually carry out.

>>> This was latter
>>> analysed as a wider issue. The VM event folks doesn't seem to be very
>>> concerned on the race, so I don't see the reason to try to fix it in the
>>> event channel code.
>>
>> And you won't need the refcount for vpl011 then?
> 
> I don't believe we need it for the vpl011 as the spin lock protecting 
> the code should always be allocated. The problem today is the lock is 
> initialized too late.
> 
>> I can certainly
>> drop it again, but it feels odd to go back to an earlier version
>> under the circumstances ...
> 
> The code introduced doesn't look necessary outside of the VM event code.
> So I think it would be wrong to merge it if it is just papering over a 
> bigger problem.

So to translate this to a clear course of action: You want me to
go back to the earlier version by dropping the refcounting again?
(I don't view this as "papering over" btw, but a tiny step towards
a solution.)

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-04 19:15             ` Tamas K Lengyel
  2020-12-04 19:22               ` Julien Grall
@ 2020-12-07 15:28               ` Jan Beulich
  2020-12-07 17:30                 ` Julien Grall
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-07 15:28 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel, Tamas K Lengyel

On 04.12.2020 20:15, Tamas K Lengyel wrote:
> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>
>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>
>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>> missing in the VM event code.
>>>>>>
>>>>>> For instance, vm_event_put_request() can also race against
>>>>>> vm_event_disable().
>>>>>>
>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>
>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>
>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>> monitoring software to do the right thing.
>>>>
>>>> I will refrain to comment on this approach. However, given the race is
>>>> much wider than the event channel, I would recommend to not add more
>>>> code in the event channel to deal with such problem.
>>>>
>>>> Instead, this should be fixed in the VM event code when someone has time
>>>> to harden the subsystem.
>>>
>>> I double-checked and the disable route is actually more robust, we
>>> don't just rely on the toolstack doing the right thing. The domain
>>> gets paused before any calls to vm_event_disable. So I don't think
>>> there is really a race-condition here.
>>
>> The code will *only* pause the monitored domain. I can see two issues:
>>     1) The toolstack is still sending event while destroy is happening.
>> This is the race discussed here.
>>     2) The implement of vm_event_put_request() suggests that it can be
>> called with not-current domain.
>>
>> I don't see how just pausing the monitored domain is enough here.
> 
> Requests only get generated by the monitored domain. So if the domain
> is not running you won't get more of them. The toolstack can only send
> replies.

Julien,

does this change your view on the refcounting added by the patch
at the root of this sub-thread?

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-07  8:02                 ` Jan Beulich
@ 2020-12-07 17:22                   ` Julien Grall
  0 siblings, 0 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-07 17:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel



On 07/12/2020 08:02, Jan Beulich wrote:
> On 04.12.2020 16:09, Julien Grall wrote:
>> On 04/12/2020 12:01, Jan Beulich wrote:
>>> On 04.12.2020 12:51, Julien Grall wrote:
>>>> On 04/12/2020 11:48, Jan Beulich wrote:
>>>>> On 04.12.2020 12:28, Julien Grall wrote:
>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>
>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>
>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>> monitoring software to do the right thing.
>>>>>>
>>>>>> I will refrain to comment on this approach. However, given the race is
>>>>>> much wider than the event channel, I would recommend to not add more
>>>>>> code in the event channel to deal with such problem.
>>>>>>
>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>> to harden the subsystem.
>>>>>
>>>>> Are effectively saying I should now undo the addition of the
>>>>> refcounting, which was added in response to feedback from you?
>>>>
>>>> Please point out where I made the request to use the refcounting...
>>>
>>> You didn't ask for this directly, sure, but ...
>>>
>>>> I pointed out there was an issue with the VM event code.
>>>
>>> ... this has ultimately led to the decision to use refcounting
>>> (iirc there was one alternative that I had proposed, besides
>>> the option of doing nothing).
>>
>> One other option that was discussed (maybe only on security@xen.org) is
>> to move the spinlock outside of the structure so it is always allocated.
> 
> Oh, right - forgot about that one, because that's nothing I would
> ever have taken on actually carrying out.
> 
>>>> This was latter
>>>> analysed as a wider issue. The VM event folks doesn't seem to be very
>>>> concerned on the race, so I don't see the reason to try to fix it in the
>>>> event channel code.
>>>
>>> And you won't need the refcount for vpl011 then?
>>
>> I don't believe we need it for the vpl011 as the spin lock protecting
>> the code should always be allocated. The problem today is the lock is
>> initialized too late.
>>
>>> I can certainly
>>> drop it again, but it feels odd to go back to an earlier version
>>> under the circumstances ...
>>
>> The code introduced doesn't look necessary outside of the VM event code.
>> So I think it would be wrong to merge it if it is just papering over a
>> bigger problem.
> 
> So to translate this to a clear course of action: You want me to
> go back to the earlier version by dropping the refcounting again?

Yes.

> (I don't view this as "papering over" btw, but a tiny step towards
> a solution.)

This implies that the refcounting is part of the actual solution. I 
think you can solve it directly in the VM event code without touching 
the event channel code.

Furthermore, I see no point in adding code in common code if the 
maintainers of the affected subsystem think their code is safe (I don't 
believe it is).

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-07 15:28               ` Jan Beulich
@ 2020-12-07 17:30                 ` Julien Grall
  2020-12-07 17:35                   ` Tamas K Lengyel
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-07 17:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel, Tamas K Lengyel

Hi Jan,

On 07/12/2020 15:28, Jan Beulich wrote:
> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>
>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>
>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>> missing in the VM event code.
>>>>>>>
>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>> vm_event_disable().
>>>>>>>
>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>
>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>
>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>> monitoring software to do the right thing.
>>>>>
>>>>> I will refrain from commenting on this approach. However, given the race is
>>>>> much wider than the event channel, I would recommend not adding more
>>>>> code in the event channel to deal with such a problem.
>>>>>
>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>> to harden the subsystem.
>>>>
>>>> I double-checked and the disable route is actually more robust, we
>>>> don't just rely on the toolstack doing the right thing. The domain
>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>> there is really a race-condition here.
>>>
>>> The code will *only* pause the monitored domain. I can see two issues:
>>>      1) The toolstack is still sending events while destroy is happening.
>>> This is the race discussed here.
>>>      2) The implementation of vm_event_put_request() suggests that it can
>>> be called with a not-current domain.
>>>
>>> I don't see how just pausing the monitored domain is enough here.
>>
>> Requests only get generated by the monitored domain. So if the domain
>> is not running you won't get more of them. The toolstack can only send
>> replies.
> 
> Julien,
> 
> does this change your view on the refcounting added by the patch
> at the root of this sub-thread?

I still think the code is at best fragile. One example I can find is:

   -> guest_remove_page()
     -> p2m_mem_paging_drop_page()
      -> vm_event_put_request()

guest_remove_page() is not always called on the current domain. So there
is a possibility for vm_event_put_request() to happen on a foreign domain,
where it wouldn't be protected by the current hypercall.

Anyway, I don't think the refcounting should be part of the event
channel without any idea of how this would fit into fixing the VM event race.
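
A sketch of that shape (stub declarations only; the handler name is made
up and the real signatures differ):

  struct domain;
  void guest_remove_page(struct domain *d, unsigned long gfn);

  /* Some memory hypercall handler operating on a possibly foreign d: */
  void handle_decrease_reservation(struct domain *d, unsigned long gfn)
  {
      /*
       * Even with all of d's vCPUs paused, this path can reach
       *   guest_remove_page(d, gfn)
       *     -> p2m_mem_paging_drop_page(d, ...)
       *       -> vm_event_put_request(d, ...)
       * concurrently with a vm_event_disable() for d.
       */
      guest_remove_page(d, gfn);
  }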

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-07 17:30                 ` Julien Grall
@ 2020-12-07 17:35                   ` Tamas K Lengyel
  2020-12-23 13:12                     ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-07 17:35 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>
> Hi Jan,
>
> On 07/12/2020 15:28, Jan Beulich wrote:
> > On 04.12.2020 20:15, Tamas K Lengyel wrote:
> >> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
> >>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
> >>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
> >>>>> On 03/12/2020 10:09, Jan Beulich wrote:
> >>>>>> On 02.12.2020 22:10, Julien Grall wrote:
> >>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
> >>>>>>>> While there don't look to be any problems with this right now, the lock
> >>>>>>>> order implications from holding the lock can be very difficult to follow
> >>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
> >>>>>>>> (and no such callback should) have any need for the lock to be held.
> >>>>>>>>
> >>>>>>>> However, vm_event_disable() frees the structures used by respective
> >>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
> >>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> >>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
> >>>>>>>
> >>>>>>> AFAICT, this callback is not the only place where the synchronization is
> >>>>>>> missing in the VM event code.
> >>>>>>>
> >>>>>>> For instance, vm_event_put_request() can also race against
> >>>>>>> vm_event_disable().
> >>>>>>>
> >>>>>>> So shouldn't we handle this issue properly in VM event?
> >>>>>>
> >>>>>> I suppose that's a question to the VM event folks rather than me?
> >>>>>
> >>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
> >>>>> monitoring software to do the right thing.
> >>>>>
> >>>>> I will refrain from commenting on this approach. However, given the race is
> >>>>> much wider than the event channel, I would recommend not adding more
> >>>>> code in the event channel to deal with such a problem.
> >>>>>
> >>>>> Instead, this should be fixed in the VM event code when someone has time
> >>>>> to harden the subsystem.
> >>>>
> >>>> I double-checked and the disable route is actually more robust, we
> >>>> don't just rely on the toolstack doing the right thing. The domain
> >>>> gets paused before any calls to vm_event_disable. So I don't think
> >>>> there is really a race-condition here.
> >>>
> >>> The code will *only* pause the monitored domain. I can see two issues:
> >>>      1) The toolstack is still sending events while destroy is happening.
> >>> This is the race discussed here.
> >>>      2) The implementation of vm_event_put_request() suggests that it can
> >>> be called with a not-current domain.
> >>>
> >>> I don't see how just pausing the monitored domain is enough here.
> >>
> >> Requests only get generated by the monitored domain. So if the domain
> >> is not running you won't get more of them. The toolstack can only send
> >> replies.
> >
> > Julien,
> >
> > does this change your view on the refcounting added by the patch
> > at the root of this sub-thread?
>
> I still think the code is at best fragile. One example I can find is:
>
>    -> guest_remove_page()
>      -> p2m_mem_paging_drop_page()
>       -> vm_event_put_request()
>
> guest_remove_page() is not always called on the current domain. So there
> is a possibility for vm_event_put_request() to happen on a foreign domain,
> where it wouldn't be protected by the current hypercall.
>
> Anyway, I don't think the refcounting should be part of the event
> channel without any idea of how this would fit into fixing the VM event race.

If the problematic patterns only appear with mem_paging I would
suggest just removing the mem_paging code completely. It's been
abandoned for several years now.

Tamas



* Re: [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
  2020-12-03  9:46     ` Jan Beulich
@ 2020-12-09  9:53       ` Julien Grall
  2020-12-09 14:24         ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-09  9:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

Hi Jan,

On 03/12/2020 09:46, Jan Beulich wrote:
> On 02.12.2020 20:03, Julien Grall wrote:
>> On 23/11/2020 13:28, Jan Beulich wrote:
>>> The per-vCPU virq_lock, which is being held anyway, together with there
>>> not being any call to evtchn_port_set_pending() when v->virq_to_evtchn[]
>>> is zero, provide sufficient guarantees.
>>
>> I agree that the per-vCPU virq_lock is going to be sufficient, however
>> dropping the lock also means the event channel locking is more complex
>> to understand (the long comment that was added proves it).
>>
>> In fact, the locking in the event channel code was already proven to be
>> quite fragile, therefore I think this patch is not worth the risk.
> 
> I agree this is a very reasonable position to take. I probably
> would even have remained silent if in the meantime the
> spin_lock()s there hadn't changed to read_trylock()s. I really
> think we want to limit this unusual locking model to where we
> strictly need it.

While I appreciate that the current locking is unusual, we should follow
the same model everywhere rather than having a dozen ways to lock the
same structure.

The rationale is quite simple: if you have one way to lock a structure,
then there is less chance of screwing up.

The only reason I would be willing to diverge from that statement is if
performance is significantly improved.

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one
  2020-11-23 13:28 ` [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one Jan Beulich
@ 2020-12-09 11:16   ` Julien Grall
  0 siblings, 0 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-09 11:16 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu, Stefano Stabellini

Hi Jan,

On 23/11/2020 13:28, Jan Beulich wrote:
> There's no need to serialize all sending of vIRQ-s; all that's needed
> is serialization against the closing of the respective event channels
> (so far by means of a barrier). To facilitate the conversion, switch to
> an ordinary write locked region in evtchn_close().
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Julien Grall <jgrall@amazon.com>

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-11-23 13:29 ` [PATCH v3 4/5] evtchn: convert domain event " Jan Beulich
@ 2020-12-09 11:54   ` Julien Grall
  2020-12-11 10:32     ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-09 11:54 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu, Stefano Stabellini

Hi Jan,

On 23/11/2020 13:29, Jan Beulich wrote:
> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>       long           rc = 0;
>   
>    again:
> -    spin_lock(&d1->event_lock);
> +    write_lock(&d1->event_lock);
>   
>       if ( !port_is_valid(d1, port1) )
>       {
> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>                   BUG();
>   
>               if ( d1 < d2 )
> -            {
> -                spin_lock(&d2->event_lock);
> -            }
> +                read_lock(&d2->event_lock);

This change made me realize that I don't quite understand how the
rwlock is meant to work for event_lock. I was actually expecting this to
be a write_lock() given there is state changed in the d2 events.

Could you outline how a developer can find out whether he/she should
use read_lock or write_lock?

[...]

> --- a/xen/common/rwlock.c
> +++ b/xen/common/rwlock.c
> @@ -102,6 +102,14 @@ void queue_write_lock_slowpath(rwlock_t
>       spin_unlock(&lock->lock);
>   }
>   
> +void _rw_barrier(rwlock_t *lock)
> +{
> +    check_barrier(&lock->lock.debug);
> +    smp_mb();
> +    while ( _rw_is_locked(lock) )
> +        arch_lock_relax();
> +    smp_mb();
> +}

As I pointed out when this implementation was first proposed (see [1]),
there is a risk that the loop will never exit: new lock holders can keep
acquiring the lock between successive _rw_is_locked() checks, so the
condition may never be observed false.

I think the following implementation would be better (although it is ugly):

write_lock();
/* do nothing */
write_unlock();

This will act as a barrier between lock held before and after the call.

As an aside, I think the introduction of rw_barrier() deserves to be in a
separate patch to help the review.

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq()
  2020-12-09  9:53       ` Julien Grall
@ 2020-12-09 14:24         ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2020-12-09 14:24 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 09.12.2020 10:53, Julien Grall wrote:
> On 03/12/2020 09:46, Jan Beulich wrote:
>> On 02.12.2020 20:03, Julien Grall wrote:
>>> On 23/11/2020 13:28, Jan Beulich wrote:
>>>> The per-vCPU virq_lock, which is being held anyway, together with there
>>>> not being any call to evtchn_port_set_pending() when v->virq_to_evtchn[]
>>>> is zero, provide sufficient guarantees.
>>>
>>> I agree that the per-vCPU virq_lock is going to be sufficient, however
>>> dropping the lock also means the event channel locking is more complex
>>> to understand (the long comment that was added proves it).
>>>
>>> In fact, the locking in the event channel code was already proven to be
>>> quite fragile, therefore I think this patch is not worth the risk.
>>
>> I agree this is a very reasonable position to take. I probably
>> would even have remained silent if in the meantime the
>> spin_lock()s there hadn't changed to read_trylock()s. I really
>> think we want to limit this unusual locking model to where we
>> strictly need it.
> 
> While I appreciate that the current locking is unusual, we should follow
> the same model everywhere rather than having a dozen ways to lock the
> same structure.
> 
> The rationale is quite simple: if you have one way to lock a structure,
> then there is less chance of screwing up.

If only this all was consistent prior to this change. It's not, and
hence I don't see how things get so much more unusual than they were
before. In fact one "unusual" (the trylock) gets traded for another
one (the specific lock protecting the sending of VIRQ events).

Jan



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-09 11:54   ` Julien Grall
@ 2020-12-11 10:32     ` Jan Beulich
  2020-12-11 10:57       ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-11 10:32 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 09.12.2020 12:54, Julien Grall wrote:
> On 23/11/2020 13:29, Jan Beulich wrote:
>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>       long           rc = 0;
>>   
>>    again:
>> -    spin_lock(&d1->event_lock);
>> +    write_lock(&d1->event_lock);
>>   
>>       if ( !port_is_valid(d1, port1) )
>>       {
>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>                   BUG();
>>   
>>               if ( d1 < d2 )
>> -            {
>> -                spin_lock(&d2->event_lock);
>> -            }
>> +                read_lock(&d2->event_lock);
> 
> This change made me realize that I don't quite understand how the
> rwlock is meant to work for event_lock. I was actually expecting this to
> be a write_lock() given there is state changed in the d2 events.

Well, the protection needs to be against racing changes, i.e.
parallel invocations of this same function, or evtchn_close().
It is debatable whether evtchn_status() and
domain_dump_evtchn_info() would better also be locked out
(other read_lock() uses aren't applicable to interdomain
channels).

> Could you outline how a developer can find out whether he/she should
> use read_lock or write_lock?

I could try to, but it would again be a port type dependent
model, just like for the per-channel locks. So I'd like it to
be clarified first whether you aren't instead indirectly
asking for these to become write_lock().

>> --- a/xen/common/rwlock.c
>> +++ b/xen/common/rwlock.c
>> @@ -102,6 +102,14 @@ void queue_write_lock_slowpath(rwlock_t
>>       spin_unlock(&lock->lock);
>>   }
>>   
>> +void _rw_barrier(rwlock_t *lock)
>> +{
>> +    check_barrier(&lock->lock.debug);
>> +    smp_mb();
>> +    while ( _rw_is_locked(lock) )
>> +        arch_lock_relax();
>> +    smp_mb();
>> +}
> 
> As I pointed out when this implementation was first proposed (see [1]), 
> there is a risk that the loop will never exit.

The [1] reference was missing, but I recall you saying so.

> I think the following implementation would be better (although it is ugly):
> 
> write_lock();
> /* do nothing */
> write_unlock();
> 
> This will act as a barrier between lock held before and after the call.

Right, and back then I indicated agreement. When getting to
actually carry out the change, I realized though that then the less
restrictive check_barrier() can't be used anymore (or to be precise,
it could be used, but the stronger check_lock() would subsequently
still come into play). This isn't a problem here, but would be for
any IRQ-safe r/w lock that the barrier may want to be used on down
the road.

Thinking about it, a read_lock() / read_unlock() pair would suffice
though. But this would then still have check_lock() involved.

Given all of this, maybe it's better not to introduce the function
at all and instead open-code the read_lock() / read_unlock() pair at
the use site.
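
Concretely, the open-coded form at the use site would be no more than
(a sketch):

  /*
   * Taking and dropping the lock in read mode is sufficient as a
   * barrier here: it waits out a writer still holding the lock, and
   * fellow readers don't need excluding.
   */
  read_lock(&d->event_lock);
  read_unlock(&d->event_lock);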

> As an aside, I think the introduction of rw_barrier() deserves to be in a
> separate patch to help the review.

I'm aware there are differing views on this - to me, putting this in
a separate patch would be introduction of dead code. But as per
above maybe the function now won't get introduced anymore anyway.

Jan



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-11 10:32     ` Jan Beulich
@ 2020-12-11 10:57       ` Julien Grall
  2020-12-14  9:40         ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-11 10:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

Hi Jan,

On 11/12/2020 10:32, Jan Beulich wrote:
> On 09.12.2020 12:54, Julien Grall wrote:
>> On 23/11/2020 13:29, Jan Beulich wrote:
>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>        long           rc = 0;
>>>    
>>>     again:
>>> -    spin_lock(&d1->event_lock);
>>> +    write_lock(&d1->event_lock);
>>>    
>>>        if ( !port_is_valid(d1, port1) )
>>>        {
>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>                    BUG();
>>>    
>>>                if ( d1 < d2 )
>>> -            {
>>> -                spin_lock(&d2->event_lock);
>>> -            }
>>> +                read_lock(&d2->event_lock);
>>
>> This change made me realize that I don't quite understand how the
>> rwlock is meant to work for event_lock. I was actually expecting this to
>> be a write_lock() given there is state changed in the d2 events.
> 
> Well, the protection needs to be against racing changes, i.e.
> parallel invocations of this same function, or evtchn_close().
> It is debatable whether evtchn_status() and
> domain_dump_evtchn_info() would better also be locked out
> (other read_lock() uses aren't applicable to interdomain
> channels).
> 
>> Could you outline how a developer can find out whether he/she should
>> use read_lock or write_lock?
> 
> I could try to, but it would again be a port type dependent
> model, just like for the per-channel locks.

It is quite important to have a clear locking strategy (in particular
for rwlocks) so we can make the correct decision on when to use
read_lock or write_lock.

> So I'd like it to
> be clarified first whether you aren't instead indirectly
> asking for these to become write_lock()

Well, I don't understand why this is a read_lock() (even with your 
previous explanation). I am not suggesting to switch to a write_lock(), 
but instead asking for the reasoning behind the decision.

>>> --- a/xen/common/rwlock.c
>>> +++ b/xen/common/rwlock.c
>>> @@ -102,6 +102,14 @@ void queue_write_lock_slowpath(rwlock_t
>>>        spin_unlock(&lock->lock);
>>>    }
>>>    
>>> +void _rw_barrier(rwlock_t *lock)
>>> +{
>>> +    check_barrier(&lock->lock.debug);
>>> +    smp_mb();
>>> +    while ( _rw_is_locked(lock) )
>>> +        arch_lock_relax();
>>> +    smp_mb();
>>> +}
>>
>> As I pointed out when this implementation was first proposed (see [1]),
>> there is a risk that the loop will never exit.
> 
> The [1] reference was missing, but I recall you saying so.
> 
>> I think the following implementation would be better (although it is ugly):
>>
>> write_lock();
>> /* do nothing */
>> write_unlock();
>>
>> This will act as a barrier between lock held before and after the call.
> 
> Right, and back then I indicated agreement. When getting to
> actually carry out the change, I realized though that then the less
> restrictive check_barrier() can't be used anymore (or to be precise,
> it could be used, but the stronger check_lock() would subsequently
> still come into play). This isn't a problem here, but would be for
> any IRQ-safe r/w lock that the barrier may want to be used on down
> the road.
> 
> Thinking about it, a read_lock() / read_unlock() pair would suffice
> though. But this would then still have check_lock() involved.
> 
> Given all of this, maybe it's better not to introduce the function
> at all and instead open-code the read_lock() / read_unlock() pair at
> the use site.

IIUC, the read_lock() would be sufficient because we only care about 
"write" side and not read. Is that correct?

> 
>> As an aside, I think the introduction of rw_barrier() deserves to be in a
>> separate patch to help the review.
> 
> I'm aware there are differing views on this - to me, putting this in
> a separate patch would be introduction of dead code. 

This is only dead code if we decide to not use rw_barrier() :).

The idea behind introducing rw_barrier() in its own patch is so you can
explain why it was implemented like that. Arguably, this explanation
can be added in the same patch...

There are other added benefits such as giving a hint to the reviewer
that this part will require more careful review. I am sure one will say
that a reviewer should always be careful...

But, personally, my level of carefulness will depend on the author and 
the type of the patch.

Anyway, I am happy with the open-coded version with an explanation in 
the code/commit message.

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-11 10:57       ` Julien Grall
@ 2020-12-14  9:40         ` Jan Beulich
  2020-12-21 17:45           ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-14  9:40 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 11.12.2020 11:57, Julien Grall wrote:
> On 11/12/2020 10:32, Jan Beulich wrote:
>> On 09.12.2020 12:54, Julien Grall wrote:
>>> On 23/11/2020 13:29, Jan Beulich wrote:
>>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>>        long           rc = 0;
>>>>    
>>>>     again:
>>>> -    spin_lock(&d1->event_lock);
>>>> +    write_lock(&d1->event_lock);
>>>>    
>>>>        if ( !port_is_valid(d1, port1) )
>>>>        {
>>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>>                    BUG();
>>>>    
>>>>                if ( d1 < d2 )
>>>> -            {
>>>> -                spin_lock(&d2->event_lock);
>>>> -            }
>>>> +                read_lock(&d2->event_lock);
>>>
>>> This change made me realize that I don't quite understand how the
>>> rwlock is meant to work for event_lock. I was actually expecting this to
>>> be a write_lock() given there is state changed in the d2 events.
>>
>> Well, the protection needs to be against racing changes, i.e.
>> parallel invocations of this same function, or evtchn_close().
>> It is debatable whether evtchn_status() and
>> domain_dump_evtchn_info() would better also be locked out
>> (other read_lock() uses aren't applicable to interdomain
>> channels).
>>
>>> Could you outline how a developer can find out whether he/she should
>>> use read_lock or write_lock?
>>
>> I could try to, but it would again be a port type dependent
>> model, just like for the per-channel locks.
> 
> It is quite important to have a clear locking strategy (in particular
> for rwlocks) so we can make the correct decision on when to use
> read_lock or write_lock.
> 
>> So I'd like it to
>> be clarified first whether you aren't instead indirectly
>> asking for these to become write_lock()
> 
> Well, I don't understand why this is a read_lock() (even with your 
> previous explanation). I am not suggesting to switch to a write_lock(), 
> but instead asking for the reasoning behind the decision.

So if what I've said in my previous reply isn't enough (including the
argument towards using two write_lock() here), I'm struggling to
figure what else to say. The primary goal is to exclude changes to
the same ports. For this it is sufficient to hold just one of the two
locks in writer mode, as the other (racing) one will acquire that
same lock for at least reading. The question whether both need to use
writer mode can only be decided when looking at the sites acquiring
just one of the locks in reader mode (hence the reference to
evtchn_status() and domain_dump_evtchn_info()) - if races with them
are deemed to be a problem, switching to both-writers will be needed.
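
As a sketch of that exclusion (simplified from the evtchn_close() hunk
quoted above):

  write_lock(&d1->event_lock);    /* d1 owns the port being closed */
  if ( d1 < d2 )                  /* address order avoids ABBA deadlock */
      read_lock(&d2->event_lock);
  /*
   * A racing operation on the peer port takes the mirror image of
   * this: d2's lock as writer, d1's as at least reader. Each port's
   * owner thus being held in writer mode, the two cannot change the
   * same pair of ports concurrently.
   */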

>>>> --- a/xen/common/rwlock.c
>>>> +++ b/xen/common/rwlock.c
>>>> @@ -102,6 +102,14 @@ void queue_write_lock_slowpath(rwlock_t
>>>>        spin_unlock(&lock->lock);
>>>>    }
>>>>    
>>>> +void _rw_barrier(rwlock_t *lock)
>>>> +{
>>>> +    check_barrier(&lock->lock.debug);
>>>> +    smp_mb();
>>>> +    while ( _rw_is_locked(lock) )
>>>> +        arch_lock_relax();
>>>> +    smp_mb();
>>>> +}
>>>
>>> As I pointed out when this implementation was first proposed (see [1]),
>>> there is a risk that the loop will never exit.
>>
>> The [1] reference was missing, but I recall you saying so.
>>
>>> I think the following implementation would be better (although it is ugly):
>>>
>>> write_lock();
>>> /* do nothing */
>>> write_unlock();
>>>
>>> This will act as a barrier between lock held before and after the call.
>>
>> Right, and back then I indicated agreement. When getting to
>> actually carry out the change, I realized though that then the less
>> restrictive check_barrier() can't be used anymore (or to be precise,
>> it could be used, but the stronger check_lock() would subsequently
>> still come into play). This isn't a problem here, but would be for
>> any IRQ-safe r/w lock that the barrier may want to be used on down
>> the road.
>>
>> Thinking about it, a read_lock() / read_unlock() pair would suffice
>> though. But this would then still have check_lock() involved.
>>
>> Given all of this, maybe it's better not to introduce the function
>> at all and instead open-code the read_lock() / read_unlock() pair at
>> the use site.
> 
> IIUC, the read_lock() would be sufficient because we only care about 
> "write" side and not read. Is that correct?

Correct - as the comment says, what we need to guard against is only
the allocation of new ports (which isn't even all "write" sides, but
exactly one of them).

>>> As an aside, I think the introduction of rw_barrier() deserves to be in a
>>> separate patch to help the review.
>>
>> I'm aware there are differing views on this - to me, putting this in
>> a separate patch would be introduction of dead code. 
> 
> This is only dead code if we decide to not use rw_barrier() :).
> 
> The idea behind introducing rw_barrier() in its own patch is so you can
> explain why it was implemented like that. Arguably, this explanation
> can be added in the same patch...
> 
> There are other added benefits such as giving a hint to the reviewer
> that this part will require more careful review. I am sure one will say
> that a reviewer should always be careful...
> 
> But, personally, my level of carefulness will depend on the author and 
> the type of the patch.
> 
> Anyway, I am happy with the open-coded version with an explanation in 
> the code/commit message.

Okay, will change to that then.

Jan



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-14  9:40         ` Jan Beulich
@ 2020-12-21 17:45           ` Julien Grall
  2020-12-22  9:46             ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-21 17:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

Hi Jan,

On 14/12/2020 09:40, Jan Beulich wrote:
> On 11.12.2020 11:57, Julien Grall wrote:
>> On 11/12/2020 10:32, Jan Beulich wrote:
>>> On 09.12.2020 12:54, Julien Grall wrote:
>>>> On 23/11/2020 13:29, Jan Beulich wrote:
>>>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>>>         long           rc = 0;
>>>>>     
>>>>>      again:
>>>>> -    spin_lock(&d1->event_lock);
>>>>> +    write_lock(&d1->event_lock);
>>>>>     
>>>>>         if ( !port_is_valid(d1, port1) )
>>>>>         {
>>>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>>>                     BUG();
>>>>>     
>>>>>                 if ( d1 < d2 )
>>>>> -            {
>>>>> -                spin_lock(&d2->event_lock);
>>>>> -            }
>>>>> +                read_lock(&d2->event_lock);
>>>>
>>>> This change made me realize that I don't quite understand how the
>>>> rwlock is meant to work for event_lock. I was actually expecting this to
>>>> be a write_lock() given there is state changed in the d2 events.
>>>
>>> Well, the protection needs to be against racing changes, i.e.
>>> parallel invocations of this same function, or evtchn_close().
>>> It is debatable whether evtchn_status() and
>>> domain_dump_evtchn_info() would better also be locked out
>>> (other read_lock() uses aren't applicable to interdomain
>>> channels).
>>>
>>>> Could you outline how a developer can find out whether he/she should
>>>> use read_lock or write_lock?
>>>
>>> I could try to, but it would again be a port type dependent
>>> model, just like for the per-channel locks.
>>
>> It is quite important to have a clear locking strategy (in particular
>> for rwlocks) so we can make the correct decision on when to use
>> read_lock or write_lock.
>>
>>> So I'd like it to
>>> be clarified first whether you aren't instead indirectly
>>> asking for these to become write_lock()
>>
>> Well, I don't understand why this is a read_lock() (even with your
>> previous explanation). I am not suggesting to switch to a write_lock(),
>> but instead asking for the reasoning behind the decision.
> 
> So if what I've said in my previous reply isn't enough (including the
> argument towards using two write_lock() here), I'm struggling to
> figure what else to say. The primary goal is to exclude changes to
> the same ports. For this it is sufficient to hold just one of the two
> locks in writer mode, as the other (racing) one will acquire that
> same lock for at least reading. The question whether both need to use
> writer mode can only be decided when looking at the sites acquiring
> just one of the locks in reader mode (hence the reference to
> evtchn_status() and domain_dump_evtchn_info()) - if races with them
> are deemed to be a problem, switching to both-writers will be needed.

I had another look at the code based on your explanation. I don't think 
it is fine to allow evtchn_status() to be concurrently called with 
evtchn_close().

evtchn_close() contains the following code:

   chn2->state = ECS_UNBOUND;
   chn2->u.unbound.remote_domid = d1->domain_id;

Where chn2 is an event channel of the remote domain (d2). Your patch will
only hold the read lock for d2.

However evtchn_status() expects the event channel state to not change
behind its back. This assumption doesn't hold for d2, and you could
possibly see the new value of chn2->state while still reading the stale
chn2->u.unbound.remote_domid.

Thankfully, it doesn't look like chn2->u.interdomain.remote_domain
would be overwritten. Otherwise, this would be a straight dereference of
an invalid pointer.

So I think we need to hold the write event lock for both domains.
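
I.e. (a sketch, retaining the address ordering evtchn_close() already
uses to avoid deadlock):

  write_lock(&d1->event_lock);
  if ( d1 < d2 )
      write_lock(&d2->event_lock);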

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-21 17:45           ` Julien Grall
@ 2020-12-22  9:46             ` Jan Beulich
  2020-12-23 11:22               ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-22  9:46 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 21.12.2020 18:45, Julien Grall wrote:
> Hi Jan,
> 
> On 14/12/2020 09:40, Jan Beulich wrote:
>> On 11.12.2020 11:57, Julien Grall wrote:
>>> On 11/12/2020 10:32, Jan Beulich wrote:
>>>> On 09.12.2020 12:54, Julien Grall wrote:
>>>>> On 23/11/2020 13:29, Jan Beulich wrote:
>>>>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>>>>         long           rc = 0;
>>>>>>     
>>>>>>      again:
>>>>>> -    spin_lock(&d1->event_lock);
>>>>>> +    write_lock(&d1->event_lock);
>>>>>>     
>>>>>>         if ( !port_is_valid(d1, port1) )
>>>>>>         {
>>>>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>>>>                     BUG();
>>>>>>     
>>>>>>                 if ( d1 < d2 )
>>>>>> -            {
>>>>>> -                spin_lock(&d2->event_lock);
>>>>>> -            }
>>>>>> +                read_lock(&d2->event_lock);
>>>>>
>>>>> This change made me realize that I don't quite understand how the
>>>>> rwlock is meant to work for event_lock. I was actually expecting this to
>>>>> be a write_lock() given there is state changed in the d2 events.
>>>>
>>>> Well, the protection needs to be against racing changes, i.e.
>>>> parallel invocations of this same function, or evtchn_close().
>>>> It is debatable whether evtchn_status() and
>>>> domain_dump_evtchn_info() would better also be locked out
>>>> (other read_lock() uses aren't applicable to interdomain
>>>> channels).
>>>>
>>>>> Could you outline how a developer can find out whether he/she should
>>>>> use read_lock or write_lock?
>>>>
>>>> I could try to, but it would again be a port type dependent
>>>> model, just like for the per-channel locks.
>>>
>>> It is quite important to have a clear locking strategy (in particular
>>> for rwlocks) so we can make the correct decision on when to use
>>> read_lock or write_lock.
>>>
>>>> So I'd like it to
>>>> be clarified first whether you aren't instead indirectly
>>>> asking for these to become write_lock()
>>>
>>> Well, I don't understand why this is a read_lock() (even with your
>>> previous explanation). I am not suggesting to switch to a write_lock(),
>>> but instead asking for the reasoning behind the decision.
>>
>> So if what I've said in my previous reply isn't enough (including the
>> argument towards using two write_lock() here), I'm struggling to
>> figure what else to say. The primary goal is to exclude changes to
>> the same ports. For this it is sufficient to hold just one of the two
>> locks in writer mode, as the other (racing) one will acquire that
>> same lock for at least reading. The question whether both need to use
>> writer mode can only be decided when looking at the sites acquiring
>> just one of the locks in reader mode (hence the reference to
>> evtchn_status() and domain_dump_evtchn_info()) - if races with them
>> are deemed to be a problem, switching to both-writers will be needed.
> 
> I had another look at the code based on your explanation. I don't think 
> it is fine to allow evtchn_status() to be concurrently called with 
> evtchn_close().
> 
> evtchn_close() contains the following code:
> 
>    chn2->state = ECS_UNBOUND;
>    chn2->u.unbound.remote_domid = d1->domain_id;
> 
> Where chn2 is an event channel of the remote domain (d2). Your patch will
> only hold the read lock for d2.
> 
> However evtchn_status() expects the event channel state to not change
> behind its back. This assumption doesn't hold for d2, and you could
> possibly see the new value of chn2->state while still reading the stale
> chn2->u.unbound.remote_domid.
> 
> Thankfully, it doesn't look like chn2->u.interdomain.remote_domain
> would be overwritten. Otherwise, this would be a straight dereference of
> an invalid pointer.
> 
> So I think we need to hold the write event lock for both domains.

Well, okay. Three considerations though:

1) Neither evtchn_status() nor domain_dump_evtchn_info() appears to
have a real need to acquire the per-domain lock. They could as well
acquire the per-channel ones. (In the latter case this will then
also allow inserting the so far missing process_pending_softirqs()
call; it shouldn't be made with a lock held. A sketch follows at the
end of this message.)

2) With the double-locking changed and with 1) addressed, there's
going to be almost no read_lock() left: hvm_migrate_pirqs() and
do_physdev_op()'s PHYSDEVOP_eoi handling, evtchn_move_pirqs(), and
hvm_dpci_msi_eoi(). While for these it may still be helpful to be
able to run in parallel, I then nevertheless wonder whether the
change as a whole is still worthwhile.

3) With the per-channel double locking and with 1) addressed I
can't really see the need for the double per-domain locking in
evtchn_bind_interdomain() and evtchn_close(). The write lock is
needed for the domain allocating a new port or freeing one. But why
is there any need for holding the remote domain's lock, when its
side of the channel gets guarded by the per-channel lock anyway?
Granted the per-channel locks may then need acquiring a little
earlier, before checking the remote channel's state. But this
shouldn't be an issue.

I guess I'll make addressing 1) and 3) prereq patches to this one,
unless I learn of reasons why things need to remain the way they
are.
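
For 1), the dump loop could then look roughly like this (a sketch,
assuming per-channel helpers along the lines of the
evtchn_read_trylock() / evtchn_read_unlock() pair used elsewhere in
this series):

  for ( port = 1; port_is_valid(d, port); ++port )
  {
      struct evtchn *chn = evtchn_from_port(d, port);

      evtchn_read_lock(chn);      /* per-channel, not d->event_lock */
      /* ... dump this channel's state ... */
      evtchn_read_unlock(chn);

      /* Now legal, as no lock is held at this point: */
      process_pending_softirqs();
  }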

Jan



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-22  9:46             ` Jan Beulich
@ 2020-12-23 11:22               ` Julien Grall
  2020-12-23 12:57                 ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-23 11:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

Hi Jan,

On 22/12/2020 09:46, Jan Beulich wrote:
> On 21.12.2020 18:45, Julien Grall wrote:
>> On 14/12/2020 09:40, Jan Beulich wrote:
>>> On 11.12.2020 11:57, Julien Grall wrote:
>>>> On 11/12/2020 10:32, Jan Beulich wrote:
>>>>> On 09.12.2020 12:54, Julien Grall wrote:
>>>>>> On 23/11/2020 13:29, Jan Beulich wrote:
>>>>>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>>>>>          long           rc = 0;
>>>>>>>      
>>>>>>>       again:
>>>>>>> -    spin_lock(&d1->event_lock);
>>>>>>> +    write_lock(&d1->event_lock);
>>>>>>>      
>>>>>>>          if ( !port_is_valid(d1, port1) )
>>>>>>>          {
>>>>>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>>>>>                      BUG();
>>>>>>>      
>>>>>>>                  if ( d1 < d2 )
>>>>>>> -            {
>>>>>>> -                spin_lock(&d2->event_lock);
>>>>>>> -            }
>>>>>>> +                read_lock(&d2->event_lock);
>>>>>>
>>>>>> This change made me realize that I don't quite understand how the
>>>>>> rwlock is meant to work for event_lock. I was actually expecting this to
>>>>>> be a write_lock() given there is state changed in the d2 events.
>>>>>
>>>>> Well, the protection needs to be against racing changes, i.e.
>>>>> parallel invocations of this same function, or evtchn_close().
>>>>> It is debatable whether evtchn_status() and
>>>>> domain_dump_evtchn_info() would better also be locked out
>>>>> (other read_lock() uses aren't applicable to interdomain
>>>>> channels).
>>>>>
>>>>>> Could you outline how a developer can find out whether he/she should
>>>>>> use read_lock or write_lock?
>>>>>
>>>>> I could try to, but it would again be a port type dependent
>>>>> model, just like for the per-channel locks.
>>>>
>>>> It is quite important to have a clear locking strategy (in particular
>>>> for rwlocks) so we can make the correct decision on when to use
>>>> read_lock or write_lock.
>>>>
>>>>> So I'd like it to
>>>>> be clarified first whether you aren't instead indirectly
>>>>> asking for these to become write_lock()
>>>>
>>>> Well, I don't understand why this is a read_lock() (even with your
>>>> previous explanation). I am not suggesting to switch to a write_lock(),
>>>> but instead asking for the reasoning behind the decision.
>>>
>>> So if what I've said in my previous reply isn't enough (including the
>>> argument towards using two write_lock() here), I'm struggling to
>>> figure what else to say. The primary goal is to exclude changes to
>>> the same ports. For this it is sufficient to hold just one of the two
>>> locks in writer mode, as the other (racing) one will acquire that
>>> same lock for at least reading. The question whether both need to use
>>> writer mode can only be decided when looking at the sites acquiring
>>> just one of the locks in reader mode (hence the reference to
>>> evtchn_status() and domain_dump_evtchn_info()) - if races with them
>>> are deemed to be a problem, switching to both-writers will be needed.
>>
>> I had another look at the code based on your explanation. I don't think
>> it is fine to allow evtchn_status() to be concurrently called with
>> evtchn_close().
>>
>> evtchn_close() contains the following code:
>>
>>     chn2->state = ECS_UNBOUND;
>>     chn2->u.unbound.remote_domid = d1->domain_id;
>>
>> Where chn2 is an event channel of the remote domain (d2). Your patch will
>> only hold the read lock for d2.
>>
>> However evtchn_status() expects the event channel state to not change
>> behind its back. This assumption doesn't hold for d2, and you could
>> possibly see the new value of chn2->state while still reading the stale
>> chn2->u.unbound.remote_domid.
>>
>> Thankfully, it doesn't look like chn2->u.interdomain.remote_domain
>> would be overwritten. Otherwise, this would be a straight dereference of
>> an invalid pointer.
>>
>> So I think we need to hold the write event lock for both domains.
> 
> Well, okay. Three considerations though:
> 
> 1) Neither evtchn_status() nor domain_dump_evtchn_info() appear to
> have a real need to acquire the per-domain lock. They could as well
> acquire the per-channel ones. (In the latter case this will then
> also allow inserting the so far missing process_pending_softirqs()
> call; it shouldn't be made with a lock held.)

I agree that evtchn_status() doesn't need to acquire the per-domain 
lock. I am not entirely sure about domain_dump_evtchn_info() because 
AFAICT the PIRQ tree (used by domain_pirq_to_irq()) is protected with 
d->event_lock.

> 
> 2) With the double-locking changed and with 1) addressed, there's
> going to be almost no read_lock() left: hvm_migrate_pirqs() and
> do_physdev_op()'s PHYSDEVOP_eoi handling, evtchn_move_pirqs(), and
> hvm_dpci_msi_eoi(). While for these it may still be helpful to be
> able to run in parallel, I then nevertheless wonder whether the
> change as a whole is still worthwhile.

I can see some value in one future case. As we are going to support
non-cooperative migration of guests, we will need to restore event
channels (including PIRQs) for the domain. From my understanding, when
the vCPU is first scheduled we will end up migrating all the interrupts,
as the vCPU may not be created on the targeted pCPU.

So allowing evtchn_move_pirqs() and hvm_migrate_pirqs() to run in
parallel would slightly reduce the downtime. Although, I don't have any
numbers backing this statement.

> 3) With the per-channel double locking and with 1) addressed I
> can't really see the need for the double per-domain locking in
> evtchn_bind_interdomain() and evtchn_close(). The write lock is
> needed for the domain allocating a new port or freeing one. But why
> is there any need for holding the remote domain's lock, when its
> side of the channel gets guarded by the per-channel lock anyway?

If 1) is addressed, then I think it should be fine to just acquire the 
read event lock of the remote domain.

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-23 11:22               ` Julien Grall
@ 2020-12-23 12:57                 ` Jan Beulich
  2020-12-23 13:19                   ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-23 12:57 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 23.12.2020 12:22, Julien Grall wrote:
> Hi Jan,
> 
> On 22/12/2020 09:46, Jan Beulich wrote:
>> On 21.12.2020 18:45, Julien Grall wrote:
>>> On 14/12/2020 09:40, Jan Beulich wrote:
>>>> On 11.12.2020 11:57, Julien Grall wrote:
>>>>> On 11/12/2020 10:32, Jan Beulich wrote:
>>>>>> On 09.12.2020 12:54, Julien Grall wrote:
>>>>>>> On 23/11/2020 13:29, Jan Beulich wrote:
>>>>>>>> @@ -620,7 +620,7 @@ int evtchn_close(struct domain *d1, int
>>>>>>>>          long           rc = 0;
>>>>>>>>      
>>>>>>>>       again:
>>>>>>>> -    spin_lock(&d1->event_lock);
>>>>>>>> +    write_lock(&d1->event_lock);
>>>>>>>>      
>>>>>>>>          if ( !port_is_valid(d1, port1) )
>>>>>>>>          {
>>>>>>>> @@ -690,13 +690,11 @@ int evtchn_close(struct domain *d1, int
>>>>>>>>                      BUG();
>>>>>>>>      
>>>>>>>>                  if ( d1 < d2 )
>>>>>>>> -            {
>>>>>>>> -                spin_lock(&d2->event_lock);
>>>>>>>> -            }
>>>>>>>> +                read_lock(&d2->event_lock);
>>>>>>>
>>>>>>> This change made me realize that I don't quite understand how the
>>>>>>> rwlock is meant to work for event_lock. I was actually expecting this to
>>>>>>> be a write_lock() given there is state changed in the d2 events.
>>>>>>
>>>>>> Well, the protection needs to be against racing changes, i.e.
>>>>>> parallel invocations of this same function, or evtchn_close().
>>>>>> It is debatable whether evtchn_status() and
>>>>>> domain_dump_evtchn_info() would better also be locked out
>>>>>> (other read_lock() uses aren't applicable to interdomain
>>>>>> channels).
>>>>>>
>>>>>>> Could you outline how a developer can find out whether he/she should
>>>>>>> use read_lock or write_lock?
>>>>>>
>>>>>> I could try to, but it would again be a port type dependent
>>>>>> model, just like for the per-channel locks.
>>>>>
>>>>> It is quite important to have a clear locking strategy (in particular
>>>>> for rwlocks) so we can make the correct decision on when to use
>>>>> read_lock or write_lock.
>>>>>
>>>>>> So I'd like it to
>>>>>> be clarified first whether you aren't instead indirectly
>>>>>> asking for these to become write_lock()
>>>>>
>>>>> Well, I don't understand why this is a read_lock() (even with your
>>>>> previous explanation). I am not suggesting to switch to a write_lock(),
>>>>> but instead asking for the reasoning behind the decision.
>>>>
>>>> So if what I've said in my previous reply isn't enough (including the
>>>> argument towards using two write_lock() here), I'm struggling to
>>>> figure what else to say. The primary goal is to exclude changes to
>>>> the same ports. For this it is sufficient to hold just one of the two
>>>> locks in writer mode, as the other (racing) one will acquire that
>>>> same lock for at least reading. The question whether both need to use
>>>> writer mode can only be decided when looking at the sites acquiring
>>>> just one of the locks in reader mode (hence the reference to
>>>> evtchn_status() and domain_dump_evtchn_info()) - if races with them
>>>> are deemed to be a problem, switching to both-writers will be needed.
>>>
>>> I had another look at the code based on your explanation. I don't think
>>> it is fine to allow evtchn_status() to be concurrently called with
>>> evtchn_close().
>>>
>>> evtchn_close() contains the following code:
>>>
>>>     chn2->state = ECS_UNBOUND;
>>>     chn2->u.unbound.remote_domid = d1->domain_id;
>>>
>>> Where chn2 is an event channel of the remote domain (d2). Your patch will
>>> only hold the read lock for d2.
>>>
>>> However evtchn_status() expects the event channel state to not change
>>> behind its back. This assumption doesn't hold for d2, and you could
>>> possibly see the new value of chn2->state while still reading the stale
>>> chn2->u.unbound.remote_domid.
>>>
>>> Thankfully, it doesn't look like chn2->u.interdomain.remote_domain
>>> would be overwritten. Otherwise, this would be a straight dereference of
>>> an invalid pointer.
>>>
>>> So I think we need to hold the write event lock for both domains.
>>
>> Well, okay. Three considerations though:
>>
>> 1) Neither evtchn_status() nor domain_dump_evtchn_info() appear to
>> have a real need to acquire the per-domain lock. They could as well
>> acquire the per-channel ones. (In the latter case this will then
>> also allow inserting the so far missing process_pending_softirqs()
>> call; it shouldn't be made with a lock held.)
> I agree that evtchn_status() doesn't need to acquire the per-domain 
> lock. I am not entirely sure about domain_dump_evtchn_info() because 
> AFAICT the PIRQ tree (used by domain_pirq_to_irq()) is protected with 
> d->event_lock.

It is, but calling it without the lock just to display the IRQ
is not a problem afaict.

>> 3) With the per-channel double locking and with 1) addressed I
>> can't really see the need for the double per-domain locking in
>> evtchn_bind_interdomain() and evtchn_close(). The write lock is
>> needed for the domain allocating a new port or freeing one. But why
>> is there any need for holding the remote domain's lock, when its
>> side of the channel gets guarded by the per-channel lock anyway?
> 
> If 1) is addressed, then I think it should be fine to just acquire the 
> read event lock of the remote domain.

For bind-interdomain I've eliminated the double locking, so the
question goes away there altogether. While for close I thought
I had managed to eliminate it too, the change looks to be
causing a deadlock of some sort, which I'll have to figure out.
However, the change might be controversial anyway, because I
need to play games already prior to fixing that bug ...

All of this said - for the time being it'll be both write_lock()
in evtchn_close(), as I consider it risky to make the remote one
a read_lock() merely based on the observation that there is
currently (i.e. with 1) addressed) no conflict.

Jan



* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-07 17:35                   ` Tamas K Lengyel
@ 2020-12-23 13:12                     ` Jan Beulich
  2020-12-23 13:33                       ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-23 13:12 UTC (permalink / raw)
  To: Tamas K Lengyel, Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 07.12.2020 18:35, Tamas K Lengyel wrote:
> On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>>
>> Hi Jan,
>>
>> On 07/12/2020 15:28, Jan Beulich wrote:
>>> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>>>
>>>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>>>
>>>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>>>> missing in the VM event code.
>>>>>>>>>
>>>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>>>> vm_event_disable().
>>>>>>>>>
>>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>>
>>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>>
>>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>>> monitoring software to do the right thing.
>>>>>>>
>>>>>>> I will refrain from commenting on this approach. However, given the race is
>>>>>>> much wider than the event channel, I would recommend not adding more
>>>>>>> code in the event channel to deal with such a problem.
>>>>>>>
>>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>>> to harden the subsystem.
>>>>>>
>>>>>> I double-checked and the disable route is actually more robust, we
>>>>>> don't just rely on the toolstack doing the right thing. The domain
>>>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>>>> there is really a race-condition here.
>>>>>
>>>>> The code will *only* pause the monitored domain. I can see two issues:
>>>>>      1) The toolstack is still sending events while destroy is happening.
>>>>> This is the race discussed here.
>>>>>      2) The implementation of vm_event_put_request() suggests that it can
>>>>> be called with a not-current domain.
>>>>>
>>>>> I don't see how just pausing the monitored domain is enough here.
>>>>
>>>> Requests only get generated by the monitored domain. So if the domain
>>>> is not running you won't get more of them. The toolstack can only send
>>>> replies.
>>>
>>> Julien,
>>>
>>> does this change your view on the refcounting added by the patch
>>> at the root of this sub-thread?
>>
>> I still think the code is at best fragile. One example I can find is:
>>
>>    -> guest_remove_page()
>>      -> p2m_mem_paging_drop_page()
>>       -> vm_event_put_request()
>>
>> guest_remove_page() is not always called on the current domain. So there
>> is a possibility for vm_event_put_request() to happen on a foreign domain,
>> where it wouldn't be protected by the current hypercall.
>>
>> Anyway, I don't think the refcounting should be part of the event
>> channel without any idea of how this would fit into fixing the VM event race.
> 
> If the problematic patterns only appear with mem_paging I would
> suggest just removing the mem_paging code completely. It's been
> abandoned for several years now.

Since this is nothing I'm fancying doing, the way forward here needs
to be a different one. From the input by both of you I still can't
conclude whether this patch should remain as is in v4, or revert
back to its v2 version. Please can we get this settled so I can get
v4 out?

Thanks, Jan



* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-23 12:57                 ` Jan Beulich
@ 2020-12-23 13:19                   ` Julien Grall
  2020-12-23 13:36                     ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-23 13:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel



On 23/12/2020 12:57, Jan Beulich wrote:
> On 23.12.2020 12:22, Julien Grall wrote:
>>> 1) Neither evtchn_status() nor domain_dump_evtchn_info() appear to
>>> have a real need to acquire the per-domain lock. They could as well
>>> acquire the per-channel ones. (In the latter case this will then
>>> also allow inserting the so far missing process_pending_softirqs()
>>> call; it shouldn't be made with a lock held.)
>> I agree that evtchn_status() doesn't need to acquire the per-domain
>> lock. I am not entirely sure about domain_dump_evtchn_info() because
>> AFAICT the PIRQ tree (used by domain_pirq_to_irq()) is protected with
>> d->event_lock.
> 
> It is, but calling it without the lock just to display the IRQ
> is not a problem afaict.

How so? Is the radix tree lookup safe against concurrent radix tree 
insertion/deletion?

Cheers,

-- 
Julien Grall



* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 13:12                     ` Jan Beulich
@ 2020-12-23 13:33                       ` Julien Grall
  2020-12-23 13:41                         ` Jan Beulich
  0 siblings, 1 reply; 48+ messages in thread
From: Julien Grall @ 2020-12-23 13:33 UTC (permalink / raw)
  To: Jan Beulich, Tamas K Lengyel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel



On 23/12/2020 13:12, Jan Beulich wrote:
> On 07.12.2020 18:35, Tamas K Lengyel wrote:
>> On Mon, Dec 7, 2020 at 12:30 PM Julien Grall <julien@xen.org> wrote:
>>>
>>> Hi Jan,
>>>
>>> On 07/12/2020 15:28, Jan Beulich wrote:
>>>> On 04.12.2020 20:15, Tamas K Lengyel wrote:
>>>>> On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>> On 04/12/2020 15:21, Tamas K Lengyel wrote:
>>>>>>> On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xen.org> wrote:
>>>>>>>> On 03/12/2020 10:09, Jan Beulich wrote:
>>>>>>>>> On 02.12.2020 22:10, Julien Grall wrote:
>>>>>>>>>> On 23/11/2020 13:30, Jan Beulich wrote:
>>>>>>>>>>> While there don't look to be any problems with this right now, the lock
>>>>>>>>>>> order implications from holding the lock can be very difficult to follow
>>>>>>>>>>> (and may be easy to violate unknowingly). The present callbacks don't
>>>>>>>>>>> (and no such callback should) have any need for the lock to be held.
>>>>>>>>>>>
>>>>>>>>>>> However, vm_event_disable() frees the structures used by respective
>>>>>>>>>>> callbacks and isn't otherwise synchronized with invocations of these
>>>>>>>>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
>>>>>>>>>>> to wait to drop to zero before freeing the port (and dropping the lock).
>>>>>>>>>>
>>>>>>>>>> AFAICT, this callback is not the only place where the synchronization is
>>>>>>>>>> missing in the VM event code.
>>>>>>>>>>
>>>>>>>>>> For instance, vm_event_put_request() can also race against
>>>>>>>>>> vm_event_disable().
>>>>>>>>>>
>>>>>>>>>> So shouldn't we handle this issue properly in VM event?
>>>>>>>>>
>>>>>>>>> I suppose that's a question to the VM event folks rather than me?
>>>>>>>>
>>>>>>>> Yes. From my understanding of Tamas's e-mail, they are relying on the
>>>>>>>> monitoring software to do the right thing.
>>>>>>>>
>>>>>>>> I will refrain from commenting on this approach. However, given the race is
>>>>>>>> much wider than the event channel, I would recommend not adding more
>>>>>>>> code in the event channel to deal with such a problem.
>>>>>>>>
>>>>>>>> Instead, this should be fixed in the VM event code when someone has time
>>>>>>>> to harden the subsystem.
>>>>>>>
>>>>>>> I double-checked and the disable route is actually more robust; we
>>>>>>> don't just rely on the toolstack doing the right thing. The domain
>>>>>>> gets paused before any calls to vm_event_disable. So I don't think
>>>>>>> there is really a race condition here.
>>>>>>
>>>>>> The code will *only* pause the monitored domain. I can see two issues:
>>>>>>       1) The toolstack is still sending events while destroy is happening.
>>>>>> This is the race discussed here.
>>>>>>       2) The implementation of vm_event_put_request() suggests that it can be
>>>>>> called with a non-current domain.
>>>>>>
>>>>>> I don't see how just pausing the monitored domain is enough here.
>>>>>
>>>>> Requests only get generated by the monitored domain. So if the domain
>>>>> is not running you won't get more of them. The toolstack can only send
>>>>> replies.
>>>>
>>>> Julien,
>>>>
>>>> does this change your view on the refcounting added by the patch
>>>> at the root of this sub-thread?
>>>
>>> I still think the code is at best fragile. One example I can find is:
>>>
>>>     -> guest_remove_page()
>>>       -> p2m_mem_paging_drop_page()
>>>        -> vm_event_put_request()
>>>
>>> guest_remove_page() is not always called on the current domain. So there
>>> is a possibility for vm_event_put_request() to happen on a foreign domain,
>>> which therefore wouldn't be protected by the current hypercall.
>>>
>>> Anyway, I don't think the refcounting should be part of the event
>>> channel without any idea of how this would fit in fixing the VM event race.
>>
>> If the problematic patterns only appear with mem_paging I would
>> suggest just removing the mem_paging code completely. It's been
>> abandoned for several years now.
> 
> Since this is nothing I fancy doing, the way forward here needs
> to be a different one. From the input by both of you I still can't
> conclude whether this patch should remain as is in v4, or revert
> back to its v2 version. Please can we get this settled so I can get
> v4 out?

I haven't had time to investigate the rest of the VM event code to find 
other cases where this may happen. I still think there is a bigger 
problem in the VM event code, but the maintainer disagrees here.

At which point, I see limited reason to try to paper over this in the 
common code. So I would rather ack/merge v2 than v3.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 4/5] evtchn: convert domain event lock to an r/w one
  2020-12-23 13:19                   ` Julien Grall
@ 2020-12-23 13:36                     ` Jan Beulich
  0 siblings, 0 replies; 48+ messages in thread
From: Jan Beulich @ 2020-12-23 13:36 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, xen-devel

On 23.12.2020 14:19, Julien Grall wrote:
> On 23/12/2020 12:57, Jan Beulich wrote:
>> On 23.12.2020 12:22, Julien Grall wrote:
>>>> 1) Neither evtchn_status() nor domain_dump_evtchn_info() appears to
>>>> have a real need to acquire the per-domain lock. They could as well
>>>> acquire the per-channel ones. (In the latter case this will then
>>>> also allow inserting the so far missing process_pending_softirqs()
>>>> call; it shouldn't be made with a lock held.)
>>> I agree that evtchn_status() doesn't need to acquire the per-domain
>>> lock. I am not entirely sure about domain_dump_evtchn_info() because
>>> AFAICT the PIRQ tree (used by domain_pirq_to_irq()) is protected with
>>> d->event_lock.
>>
>> It is, but calling it without the lock just to display the IRQ
>> is not a problem afaict.
> 
> How so? Is the radix tree lookup safe against concurrent radix tree 
> insertion/deletion?

Hmm, looks like I was misled by pirq_field() tolerating NULL
coming back from radix_tree_lookup(). Indeed, if the tree
structure changed, there would be a problem. But this can't be
solved by holding the lock across the entire loop - as said
earlier, the loop body needs to gain a
process_pending_softirqs() invocation, and that needs to happen
with no locks held. The only way I can see us achieving this is
to drop the per-channel lock prior to calling
domain_pirq_to_irq().
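
Concretely, I'd see the loop take roughly this shape (just a sketch,
assuming the r/w event lock from patch 4, with the actual printing
omitted):

    unsigned int port;

    for ( port = 1; port_is_valid(d, port); ++port )
    {
        struct evtchn *chn = evtchn_from_port(d, port);
        int irq = -1;

        evtchn_read_lock(chn);
        if ( chn->state == ECS_PIRQ )
        {
            unsigned int pirq = chn->u.pirq.irq;

            /* Drop the per-channel lock before taking the event one. */
            evtchn_read_unlock(chn);
            read_lock(&d->event_lock);
            irq = domain_pirq_to_irq(d, pirq);
            read_unlock(&d->event_lock);
        }
        else
            evtchn_read_unlock(chn);

        /* ... print the state recorded above, including irq ... */

        /* No locks are held at this point. */
        process_pending_softirqs();
    }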

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 13:33                       ` Julien Grall
@ 2020-12-23 13:41                         ` Jan Beulich
  2020-12-23 14:44                           ` Julien Grall
  0 siblings, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-23 13:41 UTC (permalink / raw)
  To: Julien Grall, Tamas K Lengyel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On 23.12.2020 14:33, Julien Grall wrote:
> On 23/12/2020 13:12, Jan Beulich wrote:
>> From the input by both of you I still can't
>> conclude whether this patch should remain as is in v4, or revert
>> back to its v2 version. Please can we get this settled so I can get
>> v4 out?
> 
> I haven't had time to investigate the rest of the VM event code to find 
> other cases where this may happen. I still think there is a bigger 
> problem in the VM event code, but the maintainer disagrees here.
> 
> At which point, I see limited reason to try to paper over in the common 
> code. So I would rather ack/merge v2 rather than v3.

Since I expect Tamas and/or the Bitdefender folks to be of the
opposite opinion, there's still no way out, at least if "rather
ack" implies a nak for v3. Personally, if this expectation of
mine is correct, I'd prefer to keep the accounting but make it
optional (as suggested in a post-commit-message remark).
That'll eliminate the overhead you appear to be concerned about,
but of course it'll further complicate the logic (albeit just
slightly).
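
In code, the accounting amounts to roughly the following (a sketch with
an illustrative field name; the actual patch differs in its details):

    /* Illustrative sketch - the "active_evtchn_calls" name is made up. */
    static void notify_xen_consumer(struct domain *d, struct evtchn *chn,
                                    evtchn_port_t port)
    {
        atomic_inc(&d->active_evtchn_calls);
        evtchn_read_unlock(chn);      /* the callback runs unlocked */
        xen_notification_fn(chn)(d->vcpu[chn->notify_vcpu_id], port);
        atomic_dec(&d->active_evtchn_calls);
    }

    /* In evtchn_close(), before the port gets freed: */
    while ( atomic_read(&d->active_evtchn_calls) )
        cpu_relax();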

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 13:41                         ` Jan Beulich
@ 2020-12-23 14:44                           ` Julien Grall
  2020-12-23 14:56                             ` Jan Beulich
  2020-12-23 15:15                             ` Tamas K Lengyel
  0 siblings, 2 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-23 14:44 UTC (permalink / raw)
  To: Jan Beulich, Tamas K Lengyel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel



On 23/12/2020 13:41, Jan Beulich wrote:
> On 23.12.2020 14:33, Julien Grall wrote:
>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>  From the input by both of you I still can't
>>> conclude whether this patch should remain as is in v4, or revert
>>> back to its v2 version. Please can we get this settled so I can get
>>> v4 out?
>>
>> I haven't had time to investigate the rest of the VM event code to find
>> other cases where this may happen. I still think there is a bigger
>> problem in the VM event code, but the maintainer disagrees here.
>>
>> At which point, I see limited reason to try to paper over this in the
>> common code. So I would rather ack/merge v2 than v3.
> 
> Since I expect Tamas and/or the Bitdefender folks to be of the
> opposite opinion, there's still no way out, at least if "rather
> ack" implies a nak for v3.

The only way out here is for someone to justify why this patch is 
sufficient for the VM event race. I am not convinced it is (see more below).

> Personally, if this expectation of
> mine is correct, I'd prefer to keep the accounting but make it
> optional (as suggested in a post-commit-message remark).
> That'll eliminate the overhead you appear to be concerned of,
> but of course it'll further complicate the logic (albeit just
> slightly).

I am more concerned about adding overly complex code that would just 
paper over a bigger problem. I also can't see a use for it outside of 
the VM event discussion.

I had another look at the code. As I mentioned in the past, 
vm_event_put_request() is able to deal with d != current->domain (it 
will set VM_EVENT_FLAG_FOREIGN). There are 4 callers for the function:
    1) p2m_mem_paging_drop_page()
    2) p2m_mem_paging_populate()
    3) mem_sharing_notify_enomem()
    4) monitor_traps()

1) and 2) belong to the mem paging subsystem, which Tamas suggested 
has been abandoned.

4) can only be called with the current domain.

This leaves us with 3) in the mem sharing subsystem. As this is called 
from the memory hypercalls, it looks possible to me that 
d != current->domain. The code also seems to be maintained (there were 
recent non-trivial changes).
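
For reference, that handling takes roughly this shape (a simplified
sketch, not the exact vm_event code):

    static void put_request_sketch(struct domain *d, vm_event_request_t *req)
    {
        if ( current->domain != d )
        {
            /*
             * The request is being posted on behalf of a domain other
             * than the current one, i.e. from a foreign context.
             */
            req->flags |= VM_EVENT_FLAG_FOREIGN;
        }

        /* ... copy req onto the ring and kick the event channel ... */
    }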

Can one of the VM event developers come up with a justification for why 
this patch is enough to make the VM event subsystem safe?

FAOD, the justification should be based solely on the hypervisor code 
(IOW not on trusted external software).

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 14:44                           ` Julien Grall
@ 2020-12-23 14:56                             ` Jan Beulich
  2020-12-23 15:08                               ` Julien Grall
  2020-12-23 15:15                             ` Tamas K Lengyel
  1 sibling, 1 reply; 48+ messages in thread
From: Jan Beulich @ 2020-12-23 14:56 UTC (permalink / raw)
  To: Julien Grall
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel, Tamas K Lengyel

On 23.12.2020 15:44, Julien Grall wrote:
> On 23/12/2020 13:41, Jan Beulich wrote:
>> On 23.12.2020 14:33, Julien Grall wrote:
>>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>>  From the input by both of you I still can't
>>>> conclude whether this patch should remain as is in v4, or revert
>>>> back to its v2 version. Please can we get this settled so I can get
>>>> v4 out?
>>>
>>> I haven't had time to investigate the rest of the VM event code to find
>>> other cases where this may happen. I still think there is a bigger
>>> problem in the VM event code, but the maintainer disagrees here.
>>>
>>> At which point, I see limited reason to try to paper over this in the
>>> common code. So I would rather ack/merge v2 than v3.
>>
>> Since I expect Tamas and/or the Bitdefender folks to be of the
>> opposite opinion, there's still no way out, at least if "rather
>> ack" implies a nak for v3.
> 
> The only way out here is for someone to justify why this patch is 
> sufficient for the VM event race.

I think that is too much to demand. Moving in the right direction
without arriving at the final destination is still a worthwhile
thing to do imo.

>> Personally, if this expectation of
>> mine is correct, I'd prefer to keep the accounting but make it
>> optional (as suggested in a post-commit-message remark).
>> That'll eliminate the overhead you appear to be concerned about,
>> but of course it'll further complicate the logic (albeit just
>> slightly).
> 
> I am more concerned about adding overly complex code that would just 
> paper over a bigger problem. I also can't see a use for it outside of 
> the VM event discussion.

I think it is generally appropriate to wait for callbacks
to complete before tearing down their originating control
structure. There may be cases where code structure makes
this unnecessary, but I don't think this can be an
expectation for all users of the functionality. Hence my
suggestion to
possibly make this optional (driven directly or indirectly by
the user of the registration function).
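
One illustrative shape for such an optional mode (hypothetical names
throughout; nothing below exists in the tree as is):

    /* Hypothetical opt-in flag, passed at callback registration time. */
    #define EVTCHN_CB_SYNC_CLOSE (1u << 0)

    /* In evtchn_close(): only consumers that opted in pay the cost. */
    if ( chn->cb_flags & EVTCHN_CB_SYNC_CLOSE )
        while ( atomic_read(&chn->active_calls) )
            cpu_relax();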

Jan


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 14:56                             ` Jan Beulich
@ 2020-12-23 15:08                               ` Julien Grall
  0 siblings, 0 replies; 48+ messages in thread
From: Julien Grall @ 2020-12-23 15:08 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel, Tamas K Lengyel

Hi Jan,

On 23/12/2020 14:56, Jan Beulich wrote:
> On 23.12.2020 15:44, Julien Grall wrote:
>> On 23/12/2020 13:41, Jan Beulich wrote:
>>> On 23.12.2020 14:33, Julien Grall wrote:
>>>> On 23/12/2020 13:12, Jan Beulich wrote:
>>>>>   From the input by both of you I still can't
>>>>> conclude whether this patch should remain as is in v4, or revert
>>>>> back to its v2 version. Please can we get this settled so I can get
>>>>> v4 out?
>>>>
>>>> I haven't had time to investigate the rest of the VM event code to find
>>>> other cases where this may happen. I still think there is a bigger
>>>> problem in the VM event code, but the maintainer disagrees here.
>>>>
>>>> At which point, I see limited reason to try to paper over this in the
>>>> common code. So I would rather ack/merge v2 than v3.
>>>
>>> Since I expect Tamas and/or the Bitdefender folks to be of the
>>> opposite opinion, there's still no way out, at least if "rather
>>> ack" implies a nak for v3.
>>
>> The only way out here is for someone to justify why this patch is
>> sufficient for the VM event race.
> 
> I think that is too much to demand.

I guess you didn't notice that I did most of the job by providing an 
analysis in my e-mail... I don't think it is too demanding to read 
the analysis and say whether I am correct or not.

Do you really prefer to add code that would rot because it is unused?

> Moving in the right direction
> without arriving at the final destination is still a worthwhile
> thing to do imo.
> 
>>> Personally, if this expectation of
>>> mine is correct, I'd prefer to keep the accounting but make it
>>> optional (as suggested in a post-commit-message remark).
>>> That'll eliminate the overhead you appear to be concerned about,
>>> but of course it'll further complicate the logic (albeit just
>>> slightly).
>>
>> I am more concerned about adding overly complex code that would just
>> paper over a bigger problem. I also can't see a use for it outside of
>> the VM event discussion.
> 
> I think it is generally appropriate to wait for callbacks
> to complete before tearing down their originating control
> structure. There may be cases where code structure makes
> this unnecessary, but I don't think this can be an
> expectation for all users of the functionality. Hence my
> suggestion to
> possibly make this optional (driven directly or indirectly by
> the user of the registration function).

As you tend to say, we should not add code unless there is a user. So 
far the only possible user is dubious. If you find another user, then 
we can discuss whether this patch makes sense.

Cheers,

-- 
Julien Grall


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held
  2020-12-23 14:44                           ` Julien Grall
  2020-12-23 14:56                             ` Jan Beulich
@ 2020-12-23 15:15                             ` Tamas K Lengyel
  1 sibling, 0 replies; 48+ messages in thread
From: Tamas K Lengyel @ 2020-12-23 15:15 UTC (permalink / raw)
  To: Julien Grall
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Ian Jackson, Wei Liu,
	Stefano Stabellini, Tamas K Lengyel, Petre Ovidiu PIRCALABU,
	Alexandru Isaila, xen-devel

On Wed, Dec 23, 2020 at 9:44 AM Julien Grall <julien@xen.org> wrote:
>
>
>
> On 23/12/2020 13:41, Jan Beulich wrote:
> > On 23.12.2020 14:33, Julien Grall wrote:
> >> On 23/12/2020 13:12, Jan Beulich wrote:
> >>>  From the input by both of you I still can't
> >>> conclude whether this patch should remain as is in v4, or revert
> >>> back to its v2 version. Please can we get this settled so I can get
> >>> v4 out?
> >>
> >> I haven't had time to investigate the rest of the VM event code to find
> >> other cases where this may happen. I still think there is a bigger
> >> problem in the VM event code, but the maintainer disagrees here.
> >>
> >> At which point, I see limited reason to try to paper over this in the
> >> common code. So I would rather ack/merge v2 than v3.
> >
> > Since I expect Tamas and/or the Bitdefender folks to be of the
> > opposite opinion, there's still no way out, at least if "rather
> > ack" implies a nak for v3.
>
> The only way out here is for someone to justify why this patch is
> sufficient for the VM event race. I am not convinced it is (see more below).
>
> > Personally, if this expectation of
> > mine is correct, I'd prefer to keep the accounting but make it
> > optional (as suggested in a post-commit-message remark).
> > That'll eliminate the overhead you appear to be concerned about,
> > but of course it'll further complicate the logic (albeit just
> > slightly).
>
> I am more concerned about adding overly complex code that would just
> paper over a bigger problem. I also can't see a use for it outside of
> the VM event discussion.
>
> I had another look at the code. As I mentioned in the past,
> vm_event_put_request() is able to deal with d != current->domain (it
> will set VM_EVENT_FLAG_FOREIGN). There are 4 callers for the function:
>     1) p2m_mem_paging_drop_page()
>     2) p2m_mem_paging_populate()
>     3) mem_sharing_notify_enomem()
>     4) monitor_traps()
>
> 1) and 2) belong to the mem paging subsystem, which Tamas suggested
> has been abandoned.
>
> 4) can only be called with the current domain.
>
> This leaves us with 3) in the mem sharing subsystem. As this is called
> from the memory hypercalls, it looks possible to me that
> d != current->domain. The code also seems to be maintained (there were
> recent non-trivial changes).
>
> Can one of the VM event developers come up with a justification for why
> this patch is enough to make the VM event subsystem safe?

3) is an unused feature as well that should likely be dropped at some
point. It too can only be called with current->domain; it effectively
just signals an out-of-memory error to a vm_event listener in dom0
when populating an entry for an EPT-faulting VM has failed. I guess
the idea was that the dom0 agent would be able to make a decision on
how to proceed (i.e. which VM to kill to free up memory).
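
Roughly, that notification amounts to the following (a sketch,
simplified; details may differ from the actual mem_sharing code):

    int notify_enomem_sketch(struct domain *d, unsigned long gfn)
    {
        vm_event_request_t req = {
            .reason = VM_EVENT_REASON_MEM_SHARING,
            .u.mem_sharing.gfn = gfn,
        };

        /* Per the above, current->domain == d when we get here. */
        vm_event_put_request(d, d->vm_event_share, &req);

        return 0;
    }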


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2020-12-23 15:16 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-23 13:26 [PATCH v3 0/5] evtchn: (not so) recent XSAs follow-on Jan Beulich
2020-11-23 13:28 ` [PATCH v3 1/5] evtchn: drop acquiring of per-channel lock from send_guest_{global,vcpu}_virq() Jan Beulich
2020-12-02 19:03   ` Julien Grall
2020-12-03  9:46     ` Jan Beulich
2020-12-09  9:53       ` Julien Grall
2020-12-09 14:24         ` Jan Beulich
2020-11-23 13:28 ` [PATCH v3 2/5] evtchn: avoid access tearing for ->virq_to_evtchn[] accesses Jan Beulich
2020-12-02 21:14   ` Julien Grall
2020-11-23 13:28 ` [PATCH v3 3/5] evtchn: convert vIRQ lock to an r/w one Jan Beulich
2020-12-09 11:16   ` Julien Grall
2020-11-23 13:29 ` [PATCH v3 4/5] evtchn: convert domain event " Jan Beulich
2020-12-09 11:54   ` Julien Grall
2020-12-11 10:32     ` Jan Beulich
2020-12-11 10:57       ` Julien Grall
2020-12-14  9:40         ` Jan Beulich
2020-12-21 17:45           ` Julien Grall
2020-12-22  9:46             ` Jan Beulich
2020-12-23 11:22               ` Julien Grall
2020-12-23 12:57                 ` Jan Beulich
2020-12-23 13:19                   ` Julien Grall
2020-12-23 13:36                     ` Jan Beulich
2020-11-23 13:30 ` [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held Jan Beulich
2020-11-30 10:39   ` Isaila Alexandru
2020-12-02 21:10   ` Julien Grall
2020-12-03 10:09     ` Jan Beulich
2020-12-03 14:40       ` Tamas K Lengyel
2020-12-04 11:28       ` Julien Grall
2020-12-04 11:48         ` Jan Beulich
2020-12-04 11:51           ` Julien Grall
2020-12-04 12:01             ` Jan Beulich
2020-12-04 15:09               ` Julien Grall
2020-12-07  8:02                 ` Jan Beulich
2020-12-07 17:22                   ` Julien Grall
2020-12-04 15:21         ` Tamas K Lengyel
2020-12-04 15:29           ` Julien Grall
2020-12-04 19:15             ` Tamas K Lengyel
2020-12-04 19:22               ` Julien Grall
2020-12-04 21:23                 ` Tamas K Lengyel
2020-12-07 15:28               ` Jan Beulich
2020-12-07 17:30                 ` Julien Grall
2020-12-07 17:35                   ` Tamas K Lengyel
2020-12-23 13:12                     ` Jan Beulich
2020-12-23 13:33                       ` Julien Grall
2020-12-23 13:41                         ` Jan Beulich
2020-12-23 14:44                           ` Julien Grall
2020-12-23 14:56                             ` Jan Beulich
2020-12-23 15:08                               ` Julien Grall
2020-12-23 15:15                             ` Tamas K Lengyel
