[PATCH (V9) 0/2] Add V4V to Xen

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH (V9) 0/2] Add V4V to Xen
@ 2013-05-28 19:43 Ross Philipson
  2013-05-28 19:43 ` [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain Ross Philipson
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Ross Philipson @ 2013-05-28 19:43 UTC (permalink / raw)
  To: xen-devel

[PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
[PATCH (V9) 2/2] xen: Add V4V implementation

v9 changes:
        - Use of XEN_GUEST_HANDLE_PARAM where appropriate
        - Add v4v hypercall structs to xlat.lst
        - Bypass XSM for internal event channel creation
        - Use is_hardware_domain instead of IS_PRIV
        - Modification for rebase on staging after 8 months
        - General code cleanup/format as requested

v8 changes:
        - Move v4v private structures to v4v.c
        - fix padding
        - Fix some coding style issues
        - Drop the VIRQ_XC_RESERVED patch

v7 changes:
        - Remove v4v_send from the API (v4v_sendv should be
          used instead).
        - Remove internal_v4v_iov
        - Remove useless padding
        - Switch back to use uint64_t for the pfn list
          instead of xen_pfn_t (to avoid some compat code)
        - Rename v4v_iptable to v4v_table.
        - Use __copy_to/from_guest when possible

v6 changes:
        - Check compiler before using [0] or [].

v5 changes:
        - Fix prototypes in v4v.h for do_v4v_op
        - Add padding to make sure all the structures
          are 64 bits aligned
        - Replace [0] with []

v4 changes:
        - Stop using ssize_t, use long instead
        - Return -MSGSIZE for message bigger than 2GB
        - Fixup size handling
        - Rename protocol with message_type to avoid the
          confusion with the IP protocol flag
        - Replaced V4V_DOMID_ANY value with DOMID_INVALID
        - Replaced v4v_pfn_t with xen_pfn_t
        - Add padding to struct v4v_info
        - Fixup hypercall documentation
        - Move V4V_ROUNDUP to v4v.c
        - Remove v4v_utils.h (we could potentially add it later).

v3 changes:
        - Switch to event channel
                - Allocated a unbound event channel
                  per domain.
                - Add a new v4v call to share the
                  event channel port.
        - Public headers with actual type definition
        - Align all the v4v type to 64 bits
        - Modify v4v MAGIC numbers because we won't
          but backward compatible anymore
        - Merge insert and insertv
        - Merge send and sendv
        - Turn all the lock prerequisite from comment
          to ASSERT()
        - Make use or write_atomic instead of volatile pointers
        - Merge v4v_memcpy_to_guest_ring and
          v4v_memcpy_to_guest_ring_from_guest
                - Introduce copy_from_guest_maybe that can take
                  a void * and a handle as src address.
        - Replace 6 arguments hypercalls with 5 arguments hypercalls.

v2 changes:
        - Cleanup plugin header
        - Include basic access control
        - Use guest_handle_for_field

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
  2013-05-28 19:43 [PATCH (V9) 0/2] Add V4V to Xen Ross Philipson
@ 2013-05-28 19:43 ` Ross Philipson
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Ross Philipson @ 2013-05-28 19:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Ross Philipson

Exposes evtchn_alloc_unbound_domain to the rest of
Xen so we can create allocated unbound evtchn within Xen.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
---
 xen/common/event_channel.c |   45 +++++++++++++++++++++++++++++++-------------
 xen/include/xen/event.h    |    3 +++
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index 64c976b..d416f97 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -160,35 +160,54 @@ static int get_free_port(struct domain *d)
 
 static long evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc)
 {
-    struct evtchn *chn;
     struct domain *d;
-    int            port;
-    domid_t        dom = alloc->dom;
     long           rc;
 
-    d = rcu_lock_domain_by_any_id(dom);
+    d = rcu_lock_domain_by_any_id(alloc->dom);
     if ( d == NULL )
         return -ESRCH;
 
-    spin_lock(&d->event_lock);
+    rc = evtchn_alloc_unbound_domain(d, &alloc->port, alloc->remote_dom, 1);
+    if ( rc )
+        ERROR_EXIT_DOM((int)rc, d);
 
-    if ( (port = get_free_port(d)) < 0 )
-        ERROR_EXIT_DOM(port, d);
-    chn = evtchn_from_port(d, port);
+ out:
+    rcu_unlock_domain(d);
 
-    rc = xsm_evtchn_unbound(XSM_TARGET, d, chn, alloc->remote_dom);
-    if ( rc )
+    return rc;
+}
+
+int evtchn_alloc_unbound_domain(struct domain *d, evtchn_port_t *port,
+                                domid_t remote_domid, bool_t call_xsm)
+{
+    struct evtchn *chn;
+    int           rc;
+    int           free_port;
+
+    spin_lock(&d->event_lock);
+
+    rc = free_port = get_free_port(d);
+    if ( free_port < 0 )
         goto out;
 
+    chn = evtchn_from_port(d, free_port);
+    if ( call_xsm )
+    {
+        rc = xsm_evtchn_unbound(XSM_TARGET, d, chn, remote_domid);
+        if ( rc )
+            goto out;
+    }
+
     chn->state = ECS_UNBOUND;
-    if ( (chn->u.unbound.remote_domid = alloc->remote_dom) == DOMID_SELF )
+    if ( (chn->u.unbound.remote_domid = remote_domid) == DOMID_SELF )
         chn->u.unbound.remote_domid = current->domain->domain_id;
 
-    alloc->port = port;
+    *port = free_port;
+    /* Everything is fine, returns 0 */
+    rc = 0;
 
  out:
     spin_unlock(&d->event_lock);
-    rcu_unlock_domain(d);
 
     return rc;
 }
diff --git a/xen/include/xen/event.h b/xen/include/xen/event.h
index 4ac39ad..5edca53 100644
--- a/xen/include/xen/event.h
+++ b/xen/include/xen/event.h
@@ -69,6 +69,9 @@ int guest_enabled_event(struct vcpu *v, uint32_t virq);
 /* Notify remote end of a Xen-attached event channel.*/
 void notify_via_xen_event_channel(struct domain *ld, int lport);
 
+int evtchn_alloc_unbound_domain(struct domain *d, evtchn_port_t *port,
+                                domid_t remote_domid, bool_t call_xsm);
+
 /* Internal event channel object accessors */
 #define bucket_from_port(d,p) \
     ((d)->evtchn[(p)/EVTCHNS_PER_BUCKET])
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 [PATCH (V9) 0/2] Add V4V to Xen Ross Philipson
  2013-05-28 19:43 ` [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain Ross Philipson
@ 2013-05-28 19:43 ` Ross Philipson
  2013-05-29  0:43   ` Matt Wilson
                     ` (4 more replies)
  2013-05-30 11:57 ` [PATCH (V9) 0/2] Add V4V to Xen Ian Campbell
  2013-05-30 12:07 ` Ian Campbell
  3 siblings, 5 replies; 24+ messages in thread
From: Ross Philipson @ 2013-05-28 19:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Ross Philipson

Setup of v4v domains a domain gets created and cleanup
when a domain die. Wire up the v4v hypercall.

Include v4v internal and public headers.

Signed-off-by: Ross Philipson <ross.philipson@citrix.com>
---
 xen/arch/x86/hvm/hvm.c             |    6 +-
 xen/arch/x86/x86_64/compat/entry.S |    2 +
 xen/arch/x86/x86_64/entry.S        |    2 +
 xen/common/Makefile                |    1 +
 xen/common/domain.c                |   13 +-
 xen/common/v4v.c                   | 1923 ++++++++++++++++++++++++++++++++++++
 xen/include/Makefile               |    3 +-
 xen/include/public/v4v.h           |  304 ++++++
 xen/include/public/xen.h           |    2 +-
 xen/include/xen/sched.h            |    4 +
 xen/include/xen/v4v.h              |   49 +
 xen/include/xlat.lst               |   10 +
 12 files changed, 2314 insertions(+), 5 deletions(-)
 create mode 100644 xen/common/v4v.c
 create mode 100644 xen/include/public/v4v.h
 create mode 100644 xen/include/xen/v4v.h

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a962ce2..f5ed5a5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3292,7 +3292,8 @@ static hvm_hypercall_t *const hvm_hypercall64_table[NR_hypercalls] = {
     HYPERCALL(hvm_op),
     HYPERCALL(sysctl),
     HYPERCALL(domctl),
-    HYPERCALL(tmem_op)
+    HYPERCALL(tmem_op),
+    HYPERCALL(v4v_op)
 };
 
 #define COMPAT_CALL(x)                                        \
@@ -3311,7 +3312,8 @@ static hvm_hypercall_t *const hvm_hypercall32_table[NR_hypercalls] = {
     HYPERCALL(hvm_op),
     HYPERCALL(sysctl),
     HYPERCALL(domctl),
-    HYPERCALL(tmem_op)
+    HYPERCALL(tmem_op),
+    HYPERCALL(v4v_op)
 };
 
 int hvm_do_hypercall(struct cpu_user_regs *regs)
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index c0afe2c..70e88d0 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -413,6 +413,7 @@ ENTRY(compat_hypercall_table)
         .quad do_domctl
         .quad compat_kexec_op
         .quad do_tmem_op
+        .quad do_v4v_op
         .rept __HYPERVISOR_arch_0-((.-compat_hypercall_table)/8)
         .quad compat_ni_hypercall
         .endr
@@ -461,6 +462,7 @@ ENTRY(compat_hypercall_args_table)
         .byte 1 /* do_domctl                */
         .byte 2 /* compat_kexec_op          */
         .byte 1 /* do_tmem_op               */
+        .byte 5 /* do_v4v_op		    */
         .rept __HYPERVISOR_arch_0-(.-compat_hypercall_args_table)
         .byte 0 /* compat_ni_hypercall      */
         .endr
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 5beeccb..f957c1c 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -762,6 +762,7 @@ ENTRY(hypercall_table)
         .quad do_domctl
         .quad do_kexec_op
         .quad do_tmem_op
+        .quad do_v4v_op
         .rept __HYPERVISOR_arch_0-((.-hypercall_table)/8)
         .quad do_ni_hypercall
         .endr
@@ -810,6 +811,7 @@ ENTRY(hypercall_args_table)
         .byte 1 /* do_domctl            */
         .byte 2 /* do_kexec             */
         .byte 1 /* do_tmem_op           */
+        .byte 5 /* do_v4v_op		*/
         .rept __HYPERVISOR_arch_0-(.-hypercall_args_table)
         .byte 0 /* do_ni_hypercall      */
         .endr
diff --git a/xen/common/Makefile b/xen/common/Makefile
index 0dc2050..4389d03 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -47,6 +47,7 @@ obj-y += tmem_xen.o
 obj-y += radix-tree.o
 obj-y += rbtree.o
 obj-y += lzo.o
+obj-y += v4v.o
 
 obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma unlzo,$(n).init.o)
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index fac3470..d32171a 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -199,7 +199,8 @@ struct domain *domain_create(
 {
     struct domain *d, **pd;
     enum { INIT_xsm = 1u<<0, INIT_watchdog = 1u<<1, INIT_rangeset = 1u<<2,
-           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5 };
+           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5,
+           INIT_v4v = 1u<<6 };
     int err, init_status = 0;
     int poolid = CPUPOOLID_NONE;
 
@@ -312,6 +313,13 @@ struct domain *domain_create(
         spin_unlock(&domlist_update_lock);
     }
 
+    if ( !is_idle_domain(d) )
+    {
+        if ( (err = v4v_init(d)) != 0 )
+            goto fail;
+        init_status |= INIT_v4v;
+    }
+
     return d;
 
  fail:
@@ -320,6 +328,8 @@ struct domain *domain_create(
     xfree(d->mem_event);
     if ( init_status & INIT_arch )
         arch_domain_destroy(d);
+    if ( init_status & INIT_v4v )
+	v4v_destroy(d);
     if ( init_status & INIT_gnttab )
         grant_table_destroy(d);
     if ( init_status & INIT_evtchn )
@@ -527,6 +537,7 @@ int domain_kill(struct domain *d)
         domain_pause(d);
         d->is_dying = DOMDYING_dying;
         spin_barrier(&d->domain_lock);
+        v4v_destroy(d);
         evtchn_destroy(d);
         gnttab_release_mappings(d);
         tmem_destroy(d->tmem);
diff --git a/xen/common/v4v.c b/xen/common/v4v.c
new file mode 100644
index 0000000..701877d
--- /dev/null
+++ b/xen/common/v4v.c
@@ -0,0 +1,1923 @@
+/******************************************************************************
+ * V4V
+ *
+ * Version 2 of v2v (Virtual-to-Virtual)
+ *
+ * Copyright (c) 2010, Citrix Systems
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <xen/config.h>
+#include <xen/mm.h>
+#include <xen/compat.h>
+#include <xen/init.h>
+#include <xen/lib.h>
+#include <xen/errno.h>
+#include <xen/sched.h>
+#include <xen/domain.h>
+#include <xen/v4v.h>
+#include <xen/event.h>
+#include <xen/guest_access.h>
+#include <asm/paging.h>
+#include <asm/p2m.h>
+#include <xen/keyhandler.h>
+#include <asm/types.h>
+
+DEFINE_XEN_GUEST_HANDLE(v4v_iov_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_addr_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_send_addr_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_ring_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_ring_data_ent_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_ring_data_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_info_t);
+DEFINE_XEN_GUEST_HANDLE(v4v_pfn_t);
+DEFINE_XEN_GUEST_HANDLE(v4vtables_rule_t);
+DEFINE_XEN_GUEST_HANDLE(v4vtables_list_t);
+DEFINE_XEN_GUEST_HANDLE(uint8_t);
+
+struct v4v_pending_ent
+{
+    struct hlist_node node;
+    domid_t id;
+    uint32_t len;
+};
+
+struct v4vtables_rule_node
+{
+    struct list_head list;
+    v4vtables_rule_t rule;
+};
+
+struct v4v_ring_info
+{
+    /* next node in the hash, protected by L2  */
+    struct hlist_node node;
+    /* this ring's id, protected by L2 */
+    v4v_ring_id_t id;
+    /* L3 */
+    spinlock_t lock;
+    /* cached length of the ring (from ring->len), protected by L3 */
+    uint32_t len;
+    uint32_t npage;
+    /* cached tx pointer location, protected by L3 */
+    uint32_t tx_ptr;
+    /* guest ring, protected by L3 */
+    XEN_GUEST_HANDLE(v4v_ring_t) ring;
+    /* mapped ring pages protected by L3*/
+    uint8_t **mfn_mapping;
+    /* list of mfns of guest ring */
+    mfn_t *mfns;
+    /* list of struct v4v_pending_ent for this ring, L3 */
+    struct hlist_head pending;
+};
+
+/*
+ * The value of the v4v element in a struct domain is
+ * protected by the global lock L1
+ */
+#define V4V_HTABLE_SIZE 32
+struct v4v_domain
+{
+    /* L2 */
+    rwlock_t lock;
+    /* event channel */
+    evtchn_port_t evtchn_port;
+    /* protected by L2 */
+    struct hlist_head ring_hash[V4V_HTABLE_SIZE];
+};
+
+/*
+ * Messages on the ring are padded to 128 bits
+ * Len here refers to the exact length of the data not including the
+ * 128 bit header. The message uses
+ * ((len +0xf) & ~0xf) + sizeof(v4v_ring_message_header) bytes
+ */
+#define V4V_ROUNDUP(a) (((a) +0xf ) & ~0xf)
+
+/*
+ * Helper functions
+ */
+
+static inline uint16_t
+v4v_hash_fn(v4v_ring_id_t *id)
+{
+    uint16_t ret;
+    ret = (uint16_t)(id->addr.port >> 16);
+    ret ^= (uint16_t)id->addr.port;
+    ret ^= id->addr.domain;
+    ret ^= id->partner;
+
+    ret &= (V4V_HTABLE_SIZE - 1);
+
+    return ret;
+}
+
+static struct v4v_ring_info *v4v_ring_find_info(struct domain *d,
+                                                v4v_ring_id_t *id);
+
+static struct v4v_ring_info *v4v_ring_find_info_by_addr(struct domain *d,
+                                                        struct v4v_addr *a,
+                                                        domid_t p);
+
+struct list_head v4vtables_rules = LIST_HEAD_INIT(v4vtables_rules);
+
+/*
+ * locks
+ */
+
+/*
+ * locking is organized as follows:
+ *
+ * the global lock v4v_lock: L1 protects the v4v elements
+ * of all struct domain *d in the system, it does not
+ * protect any of the elements of d->v4v, just their
+ * addresses. By extension since the destruction of
+ * a domain with a non-NULL d->v4v will need to free
+ * the d->v4v pointer, holding this lock gauruntees
+ * that no domains pointers in which v4v is interested
+ * become invalid whilst this lock is held.
+ */
+
+static DEFINE_RWLOCK(v4v_lock); /* L1 */
+
+/*
+ * the lock d->v4v->lock: L2:  Read on protects the hash table and
+ * the elements in the hash_table d->v4v->ring_hash, and
+ * the node and id fields in struct v4v_ring_info in the
+ * hash table. Write on L2 protects all of the elements of
+ * struct v4v_ring_info. To take L2 you must already have R(L1)
+ * W(L1) implies W(L2) and L3
+ *
+ * the lock v4v_ring_info *ringinfo; ringinfo->lock: L3:
+ * protects len,tx_ptr the guest ring, the
+ * guest ring_data and the pending list. To take L3 you must
+ * already have R(L2). W(L2) implies L3
+ */
+
+/*
+ * lock to protect the filtering rules list: v4vtable_rules
+ *
+ * The write lock is held for viptables_del and viptables_add
+ * The read lock is held for viptable_list
+ */
+static DEFINE_RWLOCK(v4vtables_rules_lock);
+
+/*
+ * Debugs
+ */
+
+#ifdef V4V_DEBUG
+#define v4v_dprintk(format, args...)            \
+    do {                                        \
+        printk("%s:%d " format,                 \
+               __FILE__, __LINE__, ## args );   \
+    } while ( 1 == 0 )
+#else
+#define v4v_dprintk(format, ... ) (void)0
+#endif
+
+#ifdef V4V_DEBUG
+static void __attribute__((unused))
+v4v_hexdump(void *_p, int len)
+{
+    uint8_t *buf = (uint8_t *)_p;
+    int i, j, k;
+
+    for ( i = 0; i < len; i += 16 )
+    {
+        printk(KERN_ERR "%p:", &buf[i]);
+        for ( j = 0; j < 16; ++j )
+        {
+            k = i + j;
+            if ( k < len )
+                printk(" %02x", buf[k]);
+            else
+                printk("   ");
+        }
+        printk(" ");
+
+        for ( j = 0; j < 16; ++j )
+        {
+            k = i + j;
+            if ( k < len )
+                printk("%c", ((buf[k] > 32) && (buf[k] < 127)) ? buf[k] : '.');
+            else
+                printk(" ");
+        }
+        printk("\n");
+    }
+}
+#endif
+
+
+/*
+ * Event channel
+ */
+
+static void
+v4v_signal_domain(struct domain *d)
+{
+    v4v_dprintk("send guest VIRQ_V4V domid:%d\n", d->domain_id);
+
+    evtchn_send(d, d->v4v->evtchn_port);
+}
+
+static void
+v4v_signal_domid(domid_t id)
+{
+    struct domain *d = get_domain_by_id(id);
+
+    if ( !d )
+        return;
+    v4v_signal_domain(d);
+    put_domain(d);
+}
+
+
+/*
+ * ring buffer
+ */
+
+static void
+v4v_ring_unmap(struct v4v_ring_info *ring_info)
+{
+    int i;
+
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    for ( i = 0; i < ring_info->npage; ++i )
+    {
+        if ( !ring_info->mfn_mapping[i] )
+            continue;
+        v4v_dprintk("unmapping page %p from %p\n",
+                    (void*)mfn_x(ring_info->mfns[i]),
+                    ring_info->mfn_mapping[i]);
+
+        unmap_domain_page(ring_info->mfn_mapping[i]);
+        ring_info->mfn_mapping[i] = NULL;
+    }
+}
+
+static uint8_t *
+v4v_ring_map_page(struct v4v_ring_info *ring_info, int i)
+{
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    if ( i >= ring_info->npage )
+        return NULL;
+    if ( ring_info->mfn_mapping[i] )
+        return ring_info->mfn_mapping[i];
+    ring_info->mfn_mapping[i] = map_domain_page(mfn_x(ring_info->mfns[i]));
+
+    v4v_dprintk("mapping page %p to %p\n",
+                (void *)mfn_x(ring_info->mfns[i]),
+                ring_info->mfn_mapping[i]);
+    return ring_info->mfn_mapping[i];
+}
+
+static int
+v4v_memcpy_from_guest_ring(void *_dst, struct v4v_ring_info *ring_info,
+                           uint32_t offset, uint32_t len)
+{
+    int page = offset >> PAGE_SHIFT;
+    uint8_t *src;
+    uint8_t *dst = _dst;
+
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    offset &= PAGE_SIZE - 1;
+
+    while ( (offset + len) > PAGE_SIZE )
+    {
+        src = v4v_ring_map_page(ring_info, page);
+
+        if ( !src )
+            return -EFAULT;
+
+        v4v_dprintk("memcpy(%p,%p+%d,%d)\n",
+                    dst, src, offset,
+                    (int)(PAGE_SIZE - offset));
+        memcpy(dst, src + offset, PAGE_SIZE - offset);
+
+        page++;
+        len -= PAGE_SIZE - offset;
+        dst += PAGE_SIZE - offset;
+        offset = 0;
+    }
+
+    src = v4v_ring_map_page(ring_info, page);
+    if ( !src )
+        return -EFAULT;
+
+    v4v_dprintk("memcpy(%p,%p+%d,%d)\n", dst, src, offset, len);
+    memcpy(dst, src + offset, len);
+
+    return 0;
+}
+
+static int
+v4v_update_tx_ptr(struct v4v_ring_info *ring_info, uint32_t tx_ptr)
+{
+    uint8_t *dst = v4v_ring_map_page(ring_info, 0);
+    uint32_t *p = (uint32_t *)(dst + offsetof(v4v_ring_t, tx_ptr));
+
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    if ( !dst )
+        return -EFAULT;
+    write_atomic(p, tx_ptr);
+    mb();
+    return 0;
+}
+
+static int
+v4v_copy_from_guest_maybe(void *dst, void *src,
+                          XEN_GUEST_HANDLE(uint8_t) src_hnd,
+                          uint32_t len)
+{
+    int rc = 0;
+
+    if ( src )
+        memcpy(dst, src, len);
+    else
+        rc = __copy_from_guest(dst, src_hnd, len);
+    return rc;
+}
+
+static int
+v4v_memcpy_to_guest_ring(struct v4v_ring_info *ring_info,
+                         uint32_t offset,
+                         void *src,
+                         XEN_GUEST_HANDLE(uint8_t) src_hnd,
+                         uint32_t len)
+{
+    int page = offset >> PAGE_SHIFT;
+    uint8_t *dst;
+
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    offset &= PAGE_SIZE - 1;
+
+    while ( (offset + len) > PAGE_SIZE )
+    {
+        dst = v4v_ring_map_page(ring_info, page);
+        if ( !dst )
+            return -EFAULT;
+
+        if ( v4v_copy_from_guest_maybe(dst + offset, src, src_hnd,
+                                       PAGE_SIZE - offset) )
+            return -EFAULT;
+
+        page++;
+        len -= PAGE_SIZE - offset;
+        if ( src )
+            src += (PAGE_SIZE - offset);
+        else
+            guest_handle_add_offset(src_hnd, PAGE_SIZE - offset);
+        offset = 0;
+    }
+
+    dst = v4v_ring_map_page(ring_info, page);
+    if ( !dst )
+        return -EFAULT;
+
+    if ( v4v_copy_from_guest_maybe(dst + offset, src, src_hnd, len) )
+        return -EFAULT;
+
+    return 0;
+}
+
+static int
+v4v_ringbuf_get_rx_ptr(struct domain *d, struct v4v_ring_info *ring_info,
+                        uint32_t * rx_ptr)
+{
+    v4v_ring_t *ringp;
+
+    if ( ring_info->npage == 0 )
+        return -1;
+
+    ringp = map_domain_page(mfn_x(ring_info->mfns[0]));
+
+    v4v_dprintk("v4v_ringbuf_payload_space: mapped %p to %p\n",
+                (void *)mfn_x(ring_info->mfns[0]), ringp);
+    if ( !ringp )
+        return -1;
+
+    write_atomic(rx_ptr, ringp->rx_ptr);
+    mb();
+
+    unmap_domain_page(ringp);
+    return 0;
+}
+
+uint32_t
+v4v_ringbuf_payload_space(struct domain * d, struct v4v_ring_info * ring_info)
+{
+    v4v_ring_t ring;
+    int32_t ret;
+
+    ring.tx_ptr = ring_info->tx_ptr;
+    ring.len = ring_info->len;
+
+    if ( v4v_ringbuf_get_rx_ptr(d, ring_info, &ring.rx_ptr) )
+        return 0;
+
+    v4v_dprintk("v4v_ringbuf_payload_space:tx_ptr=%d rx_ptr=%d\n",
+                (int)ring.tx_ptr, (int)ring.rx_ptr);
+    if ( ring.rx_ptr == ring.tx_ptr )
+        return ring.len - sizeof(struct v4v_ring_message_header);
+
+    ret = ring.rx_ptr - ring.tx_ptr;
+    if ( ret < 0 )
+        ret += ring.len;
+
+    ret -= sizeof(struct v4v_ring_message_header);
+    ret -= V4V_ROUNDUP(1);
+
+    return (ret < 0) ? 0 : ret;
+}
+
+static long
+v4v_iov_count(XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs, int niov)
+{
+    v4v_iov_t iov;
+    size_t ret = 0;
+
+    while ( niov-- )
+    {
+        if ( copy_from_guest(&iov, iovs, 1) )
+            return -EFAULT;
+
+        ret += iov.iov_len;
+
+        /* message bigger than 2G can't be sent */
+        if (ret > 2L * 1024 * 1024 * 1024)
+            return -EMSGSIZE;
+
+        guest_handle_add_offset(iovs, 1);
+    }
+
+    return ret;
+}
+
+static long
+v4v_ringbuf_insertv(struct domain *d,
+                    struct v4v_ring_info *ring_info,
+                    v4v_ring_id_t *src_id, uint32_t message_type,
+                    XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs,
+                    uint32_t niov, size_t len)
+{
+    v4v_ring_t ring;
+    struct v4v_ring_message_header mh = { 0 };
+    int32_t sp;
+    long happy_ret;
+    int32_t ret = 0;
+    XEN_GUEST_HANDLE(uint8_t) empty_hnd = { 0 };
+
+    ASSERT(spin_is_locked(&ring_info->lock));
+
+    happy_ret = len;
+
+    if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header) ) >=
+            ring_info->len)
+        return -EMSGSIZE;
+
+    do {
+        if ( (ret = v4v_memcpy_from_guest_ring(&ring, ring_info, 0,
+                                               sizeof(ring))) )
+            break;
+
+        ring.tx_ptr = ring_info->tx_ptr;
+        ring.len = ring_info->len;
+
+        v4v_dprintk("ring.tx_ptr=%d ring.rx_ptr=%d ring.len=%d ring_info->tx_ptr=%d\n",
+                    ring.tx_ptr, ring.rx_ptr, ring.len, ring_info->tx_ptr);
+
+        if ( ring.rx_ptr == ring.tx_ptr )
+            sp = ring_info->len;
+        else
+        {
+            sp = ring.rx_ptr - ring.tx_ptr;
+            if ( sp < 0 )
+                sp += ring.len;
+        }
+
+        if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header)) >= sp )
+        {
+            v4v_dprintk("EAGAIN\n");
+            ret = -EAGAIN;
+            break;
+        }
+
+        mh.len = len + sizeof (struct v4v_ring_message_header);
+        mh.source.port = src_id->addr.port;
+        mh.source.domain = src_id->addr.domain;
+        mh.message_type = message_type;
+
+        if ( (ret = v4v_memcpy_to_guest_ring(ring_info,
+                                             ring.tx_ptr + sizeof(v4v_ring_t),
+                                             &mh, empty_hnd,
+                                             sizeof(mh))) )
+            break;
+
+        ring.tx_ptr += sizeof(mh);
+        if ( ring.tx_ptr == ring_info->len )
+            ring.tx_ptr = 0;
+
+        while ( niov-- )
+        {
+            XEN_GUEST_HANDLE_PARAM(uint8_t) bufp_hnd;
+            XEN_GUEST_HANDLE(uint8_t) buf_hnd;
+            v4v_iov_t iov;
+
+            if ( copy_from_guest(&iov, iovs, 1) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            bufp_hnd = guest_handle_from_ptr(iov.iov_base, uint8_t);
+            buf_hnd = guest_handle_from_param(bufp_hnd, uint8_t);
+            len = iov.iov_len;
+
+            if ( unlikely(!guest_handle_okay(buf_hnd, len)) )
+            {
+                ret = -EFAULT;
+                break;
+            }
+
+            sp = ring.len - ring.tx_ptr;
+
+            if ( len > sp )
+            {
+                ret = v4v_memcpy_to_guest_ring(ring_info,
+                        ring.tx_ptr + sizeof(v4v_ring_t),
+                        NULL, buf_hnd, sp);
+                if ( ret )
+                    break;
+
+                ring.tx_ptr = 0;
+                len -= sp;
+                guest_handle_add_offset(buf_hnd, sp);
+            }
+
+            ret = v4v_memcpy_to_guest_ring(ring_info,
+                    ring.tx_ptr + sizeof(v4v_ring_t),
+                    NULL, buf_hnd, len);
+            if ( ret )
+                break;
+
+            ring.tx_ptr += len;
+
+            if ( ring.tx_ptr == ring_info->len )
+                ring.tx_ptr = 0;
+
+            guest_handle_add_offset(iovs, 1);
+        }
+        if ( ret )
+            break;
+
+        ring.tx_ptr = V4V_ROUNDUP(ring.tx_ptr);
+
+        if ( ring.tx_ptr >= ring_info->len )
+            ring.tx_ptr -= ring_info->len;
+
+        mb();
+        ring_info->tx_ptr = ring.tx_ptr;
+        if ( (ret = v4v_update_tx_ptr(ring_info, ring.tx_ptr)) )
+            break;
+    } while ( 0 );
+
+    v4v_ring_unmap(ring_info);
+
+    return ret ? ret : happy_ret;
+}
+
+/* pending */
+static void
+v4v_pending_remove_ent(struct v4v_pending_ent *ent)
+{
+    hlist_del(&ent->node);
+    xfree(ent);
+}
+
+static void
+v4v_pending_remove_all(struct v4v_ring_info *info)
+{
+    struct hlist_node *node, *next;
+    struct v4v_pending_ent *pending_ent;
+
+    ASSERT(spin_is_locked(&info->lock));
+    hlist_for_each_entry_safe(pending_ent, node, next, &info->pending,
+            node) v4v_pending_remove_ent(pending_ent);
+}
+
+static void
+v4v_pending_notify(struct domain *caller_d, struct hlist_head *to_notify)
+{
+    struct hlist_node *node, *next;
+    struct v4v_pending_ent *pending_ent;
+
+    ASSERT(rw_is_locked(&v4v_lock));
+
+    hlist_for_each_entry_safe(pending_ent, node, next, to_notify, node)
+    {
+        hlist_del(&pending_ent->node);
+        v4v_signal_domid(pending_ent->id);
+        xfree(pending_ent);
+    }
+
+}
+
+static void
+v4v_pending_find(struct domain *d, struct v4v_ring_info *ring_info,
+                 uint32_t payload_space, struct hlist_head *to_notify)
+{
+    struct hlist_node *node, *next;
+    struct v4v_pending_ent *ent;
+
+    ASSERT(rw_is_locked(&d->v4v->lock));
+
+    spin_lock(&ring_info->lock);
+    hlist_for_each_entry_safe(ent, node, next, &ring_info->pending, node)
+    {
+        if ( payload_space >= ent->len )
+        {
+            hlist_del(&ent->node);
+            hlist_add_head(&ent->node, to_notify);
+        }
+    }
+    spin_unlock(&ring_info->lock);
+}
+
+/*caller must have L3 */
+static int
+v4v_pending_queue(struct v4v_ring_info *ring_info, domid_t src_id, int len)
+{
+    struct v4v_pending_ent *ent = xmalloc(struct v4v_pending_ent);
+
+    if ( !ent )
+    {
+        v4v_dprintk("ENOMEM\n");
+        return -ENOMEM;
+    }
+
+    ent->len = len;
+    ent->id = src_id;
+
+    hlist_add_head(&ent->node, &ring_info->pending);
+
+    return 0;
+}
+
+/* L3 */
+static int
+v4v_pending_requeue(struct v4v_ring_info *ring_info, domid_t src_id, int len)
+{
+    struct hlist_node *node;
+    struct v4v_pending_ent *ent;
+
+    hlist_for_each_entry(ent, node, &ring_info->pending, node)
+    {
+        if ( ent->id == src_id )
+        {
+            if ( ent->len < len )
+                ent->len = len;
+            return 0;
+        }
+    }
+
+    return v4v_pending_queue(ring_info, src_id, len);
+}
+
+
+/* L3 */
+static void
+v4v_pending_cancel(struct v4v_ring_info *ring_info, domid_t src_id)
+{
+    struct hlist_node *node, *next;
+    struct v4v_pending_ent *ent;
+
+    hlist_for_each_entry_safe(ent, node, next, &ring_info->pending, node)
+    {
+        if ( ent->id == src_id)
+        {
+            hlist_del(&ent->node);
+            xfree(ent);
+        }
+    }
+}
+
+/*
+ * ring data
+ */
+
+/*Caller should hold R(L1)*/
+static int
+v4v_fill_ring_data(struct domain *src_d,
+                   XEN_GUEST_HANDLE(v4v_ring_data_ent_t) data_ent_hnd)
+{
+    v4v_ring_data_ent_t ent;
+    struct domain *dst_d;
+    struct v4v_ring_info *ring_info;
+
+    if ( copy_from_guest(&ent, data_ent_hnd, 1) )
+    {
+        v4v_dprintk("EFAULT\n");
+        return -EFAULT;
+    }
+
+    v4v_dprintk("v4v_fill_ring_data: ent.ring.domain=%d,ent.ring.port=%u\n",
+                (int)ent.ring.domain, (int)ent.ring.port);
+
+    ent.flags = 0;
+
+    dst_d = get_domain_by_id(ent.ring.domain);
+
+    if ( dst_d && dst_d->v4v )
+    {
+        read_lock(&dst_d->v4v->lock);
+        ring_info = v4v_ring_find_info_by_addr(dst_d, &ent.ring,
+                                               src_d->domain_id);
+
+        if ( ring_info )
+        {
+            uint32_t space_avail;
+
+            ent.flags |= V4V_RING_DATA_F_EXISTS;
+            ent.max_message_size =
+                ring_info->len - sizeof(struct v4v_ring_message_header) -
+                V4V_ROUNDUP(1);
+            spin_lock(&ring_info->lock);
+
+            space_avail = v4v_ringbuf_payload_space(dst_d, ring_info);
+
+            if ( space_avail >= ent.space_required )
+            {
+                v4v_pending_cancel(ring_info, src_d->domain_id);
+                ent.flags |= V4V_RING_DATA_F_SUFFICIENT;
+            }
+            else
+            {
+                v4v_pending_requeue(ring_info, src_d->domain_id,
+                        ent.space_required);
+                ent.flags |= V4V_RING_DATA_F_PENDING;
+            }
+
+            spin_unlock(&ring_info->lock);
+
+            if ( space_avail == ent.max_message_size )
+                ent.flags |= V4V_RING_DATA_F_EMPTY;
+
+        }
+        read_unlock(&dst_d->v4v->lock);
+    }
+
+    if ( dst_d )
+        put_domain(dst_d);
+
+    if ( copy_field_to_guest(data_ent_hnd, &ent, flags) )
+    {
+        v4v_dprintk("EFAULT\n");
+        return -EFAULT;
+    }
+    return 0;
+}
+
+/*Called should hold no more than R(L1) */
+static int
+v4v_fill_ring_datas(struct domain *d, int nent,
+                     XEN_GUEST_HANDLE(v4v_ring_data_ent_t) data_ent_hnd)
+{
+    int ret = 0;
+
+    read_lock(&v4v_lock);
+    while ( !ret && nent-- )
+    {
+        ret = v4v_fill_ring_data(d, data_ent_hnd);
+        guest_handle_add_offset(data_ent_hnd, 1);
+    }
+    read_unlock(&v4v_lock);
+    return ret;
+}
+
+/*
+ * ring
+ */
+static int
+v4v_find_ring_mfns(struct domain *d, struct v4v_ring_info *ring_info,
+                   uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
+{
+    int i,j;
+    mfn_t *mfns;
+    uint8_t **mfn_mapping;
+    unsigned long mfn;
+    p2m_type_t t;
+    struct page_info *page;
+    int ret = 0;
+
+    if ( (npage << PAGE_SHIFT) < ring_info->len )
+    {
+        v4v_dprintk("EINVAL\n");
+        return -EINVAL;
+    }
+
+    mfns = xmalloc_array(mfn_t, npage);
+    if ( !mfns )
+    {
+        v4v_dprintk("ENOMEM\n");
+        return -ENOMEM;
+    }
+
+    mfn_mapping = xmalloc_array(uint8_t *, npage);
+    if ( !mfn_mapping )
+    {
+        xfree(mfns);
+        return -ENOMEM;
+    }
+
+    for ( i = 0; i < npage; ++i )
+    {
+        unsigned long pfn;
+
+        if ( copy_from_guest_offset(&pfn, pfn_hnd, i, 1) )
+        {
+            ret = -EFAULT;
+            v4v_dprintk("EFAULT\n");
+            break;
+        }
+
+        mfn = mfn_x(get_gfn(d, pfn, &t));
+        if ( !mfn_valid(mfn) )
+        {
+            put_gfn(d, pfn);
+            printk(KERN_ERR "v4v domain %d passed invalid mfn %"PRI_mfn" ring %p seq %d\n",
+                    d->domain_id, mfn, ring_info, i);
+            ret = -EINVAL;
+            break;
+        }
+        page = mfn_to_page(mfn);
+        if ( !get_page_and_type(page, d, PGT_writable_page) )
+        {
+            put_gfn(d, pfn);
+            printk(KERN_ERR "v4v domain %d passed wrong type mfn %"PRI_mfn" ring %p seq %d\n",
+                    d->domain_id, mfn, ring_info, i);
+            ret = -EINVAL;
+            break;
+        }
+        mfns[i] = _mfn(mfn);
+        v4v_dprintk("v4v_find_ring_mfns: %d: %lx -> %lx\n",
+                    i, (unsigned long)pfn, (unsigned long)mfn_x(mfns[i]));
+        mfn_mapping[i] = NULL;
+        put_gfn(d, pfn);
+    }
+
+    if ( !ret )
+    {
+        ring_info->npage = npage;
+        ring_info->mfns = mfns;
+        ring_info->mfn_mapping = mfn_mapping;
+    }
+    else
+    {
+        j = i;
+        for ( i = 0; i < j; ++i )
+            if ( mfn_x(mfns[i]) != 0 )
+                put_page_and_type(mfn_to_page(mfn_x(mfns[i])));
+        xfree(mfn_mapping);
+        xfree(mfns);
+    }
+    return ret;
+}
+
+static struct v4v_ring_info *
+v4v_ring_find_info(struct domain *d, v4v_ring_id_t *id)
+{
+    uint16_t hash;
+    struct hlist_node *node;
+    struct v4v_ring_info *ring_info;
+
+    ASSERT(rw_is_locked(&d->v4v->lock));
+
+    hash = v4v_hash_fn(id);
+
+    v4v_dprintk("ring_find_info: d->v4v=%p, d->v4v->ring_hash[%d]=%p id=%p\n",
+                d->v4v, (int)hash, d->v4v->ring_hash[hash].first, id);
+    v4v_dprintk("ring_find_info: id.addr.port=%d id.addr.domain=%d id.addr.partner=%d\n",
+                id->addr.port, id->addr.domain, id->partner);
+
+    hlist_for_each_entry(ring_info, node, &d->v4v->ring_hash[hash], node)
+    {
+        v4v_ring_id_t *cmpid = &ring_info->id;
+
+        if ( cmpid->addr.port == id->addr.port &&
+             cmpid->addr.domain == id->addr.domain &&
+             cmpid->partner == id->partner )
+        {
+            v4v_dprintk("ring_find_info: ring_info=%p\n", ring_info);
+            return ring_info;
+        }
+    }
+    v4v_dprintk("ring_find_info: no ring_info found\n");
+    return NULL;
+}
+
+static struct v4v_ring_info *
+v4v_ring_find_info_by_addr(struct domain *d, struct v4v_addr *a, domid_t p)
+{
+    v4v_ring_id_t id;
+    struct v4v_ring_info *ret;
+
+    ASSERT(rw_is_locked(&d->v4v->lock));
+
+    if ( !a )
+        return NULL;
+
+    id.addr.port = a->port;
+    id.addr.domain = d->domain_id;
+    id.partner = p;
+
+    ret = v4v_ring_find_info(d, &id);
+    if ( ret )
+        return ret;
+
+    id.partner = V4V_DOMID_ANY;
+
+    return v4v_ring_find_info(d, &id);
+}
+
+static void v4v_ring_remove_mfns(struct domain *d, struct v4v_ring_info *ring_info)
+{
+    int i;
+
+    ASSERT(rw_is_write_locked(&d->v4v->lock));
+
+    if ( ring_info->mfns )
+    {
+        for ( i = 0; i < ring_info->npage; ++i )
+            if ( mfn_x(ring_info->mfns[i]) != 0 )
+                put_page_and_type(mfn_to_page(mfn_x(ring_info->mfns[i])));
+        xfree(ring_info->mfns);
+    }
+    if ( ring_info->mfn_mapping )
+        xfree(ring_info->mfn_mapping);
+    ring_info->mfns = NULL;
+
+}
+
+static void
+v4v_ring_remove_info(struct domain *d, struct v4v_ring_info *ring_info)
+{
+    ASSERT(rw_is_write_locked(&d->v4v->lock));
+
+    spin_lock(&ring_info->lock);
+
+    v4v_pending_remove_all(ring_info);
+    hlist_del(&ring_info->node);
+    v4v_ring_remove_mfns(d, ring_info);
+
+    spin_unlock(&ring_info->lock);
+
+    xfree(ring_info);
+}
+
+/* Call from guest to unpublish a ring */
+static long
+v4v_ring_remove(struct domain *d, XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd)
+{
+    struct v4v_ring ring;
+    struct v4v_ring_info *ring_info;
+    int ret = 0;
+
+    read_lock(&v4v_lock);
+
+    do {
+        if ( !d->v4v )
+        {
+            v4v_dprintk("EINVAL\n");
+            ret = -EINVAL;
+            break;
+        }
+
+        if ( copy_from_guest(&ring, ring_hnd, 1) )
+        {
+            v4v_dprintk("EFAULT\n");
+            ret = -EFAULT;
+            break;
+        }
+
+        if ( ring.magic != V4V_RING_MAGIC )
+        {
+            v4v_dprintk("ring.magic(%"PRIx64") != V4V_RING_MAGIC(%"PRIx64"), EINVAL\n",
+                    ring.magic, V4V_RING_MAGIC);
+            ret = -EINVAL;
+            break;
+        }
+
+        ring.id.addr.domain = d->domain_id;
+
+        write_lock(&d->v4v->lock);
+        ring_info = v4v_ring_find_info(d, &ring.id);
+
+        if ( ring_info )
+            v4v_ring_remove_info(d, ring_info);
+
+        write_unlock(&d->v4v->lock);
+
+        if ( !ring_info )
+        {
+            v4v_dprintk("ENOENT\n");
+            ret = -ENOENT;
+            break;
+        }
+
+    } while ( 0 );
+
+    read_unlock(&v4v_lock);
+    return ret;
+}
+
+/* call from guest to publish a ring */
+static long
+v4v_ring_add(struct domain *d, XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd,
+             uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
+{
+    struct v4v_ring ring;
+    struct v4v_ring_info *ring_info;
+    int need_to_insert = 0;
+    int ret = 0;
+
+    if ( (long)ring_hnd.p & (PAGE_SIZE - 1) )
+    {
+        v4v_dprintk("EINVAL\n");
+        return -EINVAL;
+    }
+
+    read_lock(&v4v_lock);
+    do {
+        if ( !d->v4v )
+        {
+            v4v_dprintk(" !d->v4v, EINVAL\n");
+            ret = -EINVAL;
+            break;
+        }
+
+        if ( copy_from_guest(&ring, ring_hnd, 1) )
+        {
+            v4v_dprintk(" copy_from_guest failed, EFAULT\n");
+            ret = -EFAULT;
+            break;
+        }
+
+        if ( ring.magic != V4V_RING_MAGIC )
+        {
+            v4v_dprintk("ring.magic(%lx) != V4V_RING_MAGIC(%lx), EINVAL\n",
+                        ring.magic, V4V_RING_MAGIC);
+            ret = -EINVAL;
+            break;
+        }
+
+        if ( (ring.len <
+                    (sizeof(struct v4v_ring_message_header) + V4V_ROUNDUP(1) +
+                     V4V_ROUNDUP(1))) || (V4V_ROUNDUP(ring.len) != ring.len) )
+        {
+            v4v_dprintk("EINVAL\n");
+            ret = -EINVAL;
+            break;
+        }
+
+        ring.id.addr.domain = d->domain_id;
+        if ( copy_field_to_guest(ring_hnd, &ring, id) )
+        {
+            v4v_dprintk("EFAULT\n");
+            ret = -EFAULT;
+            break;
+        }
+
+        /*
+         * no need for a lock yet, because only we know about this
+         * set the tx pointer if it looks bogus (we don't reset it
+         * because this might be a re-register after S4)
+         */
+        if ( (ring.tx_ptr >= ring.len)
+                || (V4V_ROUNDUP(ring.tx_ptr) != ring.tx_ptr) )
+        {
+            ring.tx_ptr = ring.rx_ptr;
+        }
+        copy_field_to_guest(ring_hnd, &ring, tx_ptr);
+
+        read_lock(&d->v4v->lock);
+        ring_info = v4v_ring_find_info(d, &ring.id);
+
+        if ( !ring_info )
+        {
+            read_unlock(&d->v4v->lock);
+            ring_info = xmalloc(struct v4v_ring_info);
+            if ( !ring_info )
+            {
+                v4v_dprintk("ENOMEM\n");
+                ret = -ENOMEM;
+                break;
+            }
+            need_to_insert++;
+            spin_lock_init(&ring_info->lock);
+            INIT_HLIST_HEAD(&ring_info->pending);
+            ring_info->mfns = NULL;
+        }
+        else
+        {
+            /*
+             * Ring info already existed.
+             */
+            printk(KERN_INFO "v4v: dom%d ring already registered\n",
+                    current->domain->domain_id);
+            ret = -EEXIST;
+            break;
+        }
+
+        spin_lock(&ring_info->lock);
+        ring_info->id = ring.id;
+        ring_info->len = ring.len;
+        ring_info->tx_ptr = ring.tx_ptr;
+        ring_info->ring = ring_hnd;
+        if ( ring_info->mfns )
+            xfree(ring_info->mfns);
+        ret = v4v_find_ring_mfns(d, ring_info, npage, pfn_hnd);
+        spin_unlock(&ring_info->lock);
+        if ( ret )
+            break;
+
+        if ( !need_to_insert )
+        {
+            read_unlock(&d->v4v->lock);
+        }
+        else
+        {
+            uint16_t hash = v4v_hash_fn(&ring.id);
+
+            write_lock(&d->v4v->lock);
+            hlist_add_head(&ring_info->node, &d->v4v->ring_hash[hash]);
+            write_unlock(&d->v4v->lock);
+        }
+    } while ( 0 );
+
+    read_unlock(&v4v_lock);
+    return ret;
+}
+
+
+/*
+ * io
+ */
+
+static void
+v4v_notify_ring(struct domain *d, struct v4v_ring_info *ring_info,
+                struct hlist_head *to_notify)
+{
+    uint32_t space;
+
+    ASSERT(rw_is_locked(&v4v_lock));
+    ASSERT(rw_is_locked(&d->v4v->lock));
+
+    spin_lock(&ring_info->lock);
+    space = v4v_ringbuf_payload_space(d, ring_info);
+    spin_unlock(&ring_info->lock);
+
+    v4v_pending_find(d, ring_info, space, to_notify);
+}
+
+/*notify hypercall*/
+static long
+v4v_notify(struct domain *d,
+           XEN_GUEST_HANDLE_PARAM(v4v_ring_data_t) ring_data_hnd)
+{
+    v4v_ring_data_t ring_data;
+    HLIST_HEAD(to_notify);
+    int i;
+    int ret = 0;
+
+    read_lock(&v4v_lock);
+
+    if ( !d->v4v )
+    {
+        read_unlock(&v4v_lock);
+        v4v_dprintk("!d->v4v, ENODEV\n");
+        return -ENODEV;
+    }
+
+    read_lock(&d->v4v->lock);
+    for ( i = 0; i < V4V_HTABLE_SIZE; ++i )
+    {
+        struct hlist_node *node, *next;
+        struct v4v_ring_info *ring_info;
+
+        hlist_for_each_entry_safe(ring_info, node,
+                next, &d->v4v->ring_hash[i],
+                node)
+        {
+            v4v_notify_ring(d, ring_info, &to_notify);
+        }
+    }
+    read_unlock(&d->v4v->lock);
+
+    if ( !hlist_empty(&to_notify) )
+        v4v_pending_notify(d, &to_notify);
+
+    do {
+        if ( !guest_handle_is_null(ring_data_hnd) )
+        {
+            /* Quick sanity check on ring_data_hnd */
+            if ( copy_field_from_guest(&ring_data, ring_data_hnd, magic) )
+            {
+                v4v_dprintk("copy_field_from_guest failed\n");
+                ret = -EFAULT;
+                break;
+            }
+
+            if ( ring_data.magic != V4V_RING_DATA_MAGIC )
+            {
+                v4v_dprintk("ring.magic(%lx) != V4V_RING_MAGIC(%lx), EINVAL\n",
+                        ring_data.magic, V4V_RING_MAGIC);
+                ret = -EINVAL;
+                break;
+            }
+
+            if ( copy_from_guest(&ring_data, ring_data_hnd, 1) )
+            {
+                v4v_dprintk("copy_from_guest failed\n");
+                ret = -EFAULT;
+                break;
+            }
+
+            {
+                /* This is a guest pointer passed as a field in a struct
+                 * so XEN_GUEST_HANDLE is used. */
+                XEN_GUEST_HANDLE(v4v_ring_data_ent_t) ring_data_ent_hnd;
+                ring_data_ent_hnd =
+                    guest_handle_for_field(ring_data_hnd, v4v_ring_data_ent_t, data[0]);
+                ret = v4v_fill_ring_datas(d, ring_data.nent, ring_data_ent_hnd);
+            }
+        }
+    } while ( 0 );
+
+    read_unlock(&v4v_lock);
+
+    return ret;
+}
+
+#ifdef V4V_DEBUG
+void
+v4vtables_print_rule(struct v4vtables_rule_node *node)
+{
+    v4vtables_rule_t *rule;
+
+    if ( node == NULL )
+    {
+        printk("(null)\n");
+        return;
+    }
+
+    rule = &node->rule;
+
+    if ( rule->accept == 1 )
+        printk("ACCEPT");
+    else
+        printk("REJECT");
+
+    printk(" ");
+
+    if ( rule->src.domain == V4V_DOMID_ANY )
+        printk("*");
+    else
+        printk("%i", rule->src.domain);
+
+    printk(":");
+
+    if ( rule->src.port == -1 )
+        printk("*");
+    else
+        printk("%i", rule->src.port);
+
+    printk(" -> ");
+
+    if ( rule->dst.domain == V4V_DOMID_ANY )
+        printk("*");
+    else
+        printk("%i", rule->dst.domain);
+
+    printk(":");
+
+    if ( rule->dst.port == -1 )
+        printk("*");
+    else
+        printk("%i", rule->dst.port);
+
+    printk("\n");
+}
+#endif /* V4V_DEBUG */
+
+int
+v4vtables_add(struct domain *src_d,
+              XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule,
+              int32_t position)
+{
+    struct v4vtables_rule_node* new = NULL;
+    struct list_head* tmp;
+
+    ASSERT(rw_is_write_locked(&v4vtables_rules_lock));
+
+    /* First rule is n.1 */
+    position--;
+
+    new = xmalloc(struct v4vtables_rule_node);
+    if ( new == NULL )
+        return -ENOMEM;
+
+    if ( copy_from_guest(&new->rule, rule, 1) )
+    {
+        xfree(new);
+        return -EFAULT;
+    }
+
+#ifdef V4V_DEBUG
+    printk(KERN_ERR "VIPTables: ");
+    v4vtables_print_rule(new);
+#endif /* V4V_DEBUG */
+
+    tmp = &v4vtables_rules;
+    while ( position != 0 && tmp->next != &v4vtables_rules)
+    {
+        tmp = tmp->next;
+        position--;
+    }
+    list_add(&new->list, tmp);
+
+    return 0;
+}
+
+int
+v4vtables_del(struct domain *src_d,
+              XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd,
+              int32_t position)
+{
+    struct list_head *tmp = NULL;
+    struct list_head *to_delete = NULL;
+    struct list_head *next = NULL;
+    struct v4vtables_rule_node *node;
+
+    ASSERT(rw_is_write_locked(&v4vtables_rules_lock));
+
+    v4v_dprintk("v4vtables_del position:%d\n", position);
+
+    if ( position != -1 )
+    {
+        /* We want to delete the rule number <position> */
+        list_for_each(tmp, &v4vtables_rules)
+        {
+            to_delete = tmp;
+            if (position == 0)
+                break;
+            position--;
+        }
+        /* Can't find the position */
+        if (position != 0)
+            to_delete = NULL;
+    }
+    else if ( !guest_handle_is_null(rule_hnd) )
+    {
+        struct v4vtables_rule r;
+
+        if ( copy_from_guest(&r, rule_hnd, 1) )
+            return -EFAULT;
+
+        list_for_each(tmp, &v4vtables_rules)
+        {
+            node = list_entry(tmp, struct v4vtables_rule_node, list);
+
+            if ( (node->rule.src.domain == r.src.domain) &&
+                 (node->rule.src.port   == r.src.port)   &&
+                 (node->rule.dst.domain == r.dst.domain) &&
+                 (node->rule.dst.port   == r.dst.port))
+            {
+                to_delete = tmp;
+                break;
+            }
+        }
+    }
+    else
+    {
+        /* We want to flush the rules! */
+        printk(KERN_ERR "VIPTables: flushing rules\n");
+        list_for_each_safe(tmp, next, &v4vtables_rules)
+        {
+            node = list_entry(tmp, struct v4vtables_rule_node, list);
+            list_del(tmp);
+            xfree(node);
+        }
+    }
+
+    if ( to_delete )
+    {
+        node = list_entry(to_delete, struct v4vtables_rule_node, list);
+#ifdef V4V_DEBUG
+        printk(KERN_ERR "VIPTables: deleting rule: ");
+        v4vtables_print_rule(node);
+#endif /* V4V_DEBUG */
+        list_del(to_delete);
+        xfree(node);
+    }
+
+    return 0;
+}
+
+static size_t
+v4vtables_list(struct domain *src_d,
+               XEN_GUEST_HANDLE_PARAM(v4vtables_list_t) list_hnd)
+{
+    struct list_head *ptr;
+    struct v4vtables_rule_node *node;
+    struct v4vtables_list rules_list;
+    uint32_t nbrules;
+    XEN_GUEST_HANDLE(v4vtables_rule_t) guest_rules;
+
+    ASSERT(rw_is_locked(&v4vtables_rules_lock));
+
+    memset(&rules_list, 0, sizeof (rules_list));
+    if ( copy_from_guest(&rules_list, list_hnd, 1) )
+        return -EFAULT;
+
+    ptr = v4vtables_rules.next;
+    while ( rules_list.start_rule != 0 && ptr->next != &v4vtables_rules )
+    {
+        ptr = ptr->next;
+        rules_list.start_rule--;
+    }
+
+    if ( rules_list.nb_rules == 0 )
+        return -EINVAL;
+
+    guest_rules = guest_handle_for_field(list_hnd, v4vtables_rule_t, rules[0]);
+
+    nbrules = 0;
+    while ( nbrules < rules_list.nb_rules && ptr != &v4vtables_rules )
+    {
+        node = list_entry(ptr, struct v4vtables_rule_node, list);
+
+        if ( copy_to_guest(guest_rules, &node->rule, 1) )
+            break;
+
+        guest_handle_add_offset(guest_rules, 1);
+
+        nbrules++;
+        ptr = ptr->next;
+    }
+
+    rules_list.nb_rules = nbrules;
+    if ( copy_field_to_guest(list_hnd, &rules_list, nb_rules) )
+        return -EFAULT;
+
+    return 0;
+}
+
+static size_t
+v4vtables_check(v4v_addr_t *src, v4v_addr_t *dst)
+{
+    struct list_head *ptr;
+    struct v4vtables_rule_node *node;
+    size_t ret = 0; /* Defaulting to ACCEPT */
+
+    read_lock(&v4vtables_rules_lock);
+
+    list_for_each(ptr, &v4vtables_rules)
+    {
+        node = list_entry(ptr, struct v4vtables_rule_node, list);
+
+        if ( (node->rule.src.domain == V4V_DOMID_ANY ||
+              node->rule.src.domain == src->domain) &&
+             (node->rule.src.port == V4V_PORT_ANY ||
+              node->rule.src.port == src->port) &&
+             (node->rule.dst.domain == V4V_DOMID_ANY ||
+              node->rule.dst.domain == dst->domain) &&
+             (node->rule.dst.port == V4V_PORT_ANY ||
+              node->rule.dst.port == dst->port) )
+        {
+            ret = !node->rule.accept;
+            break;
+        }
+    }
+
+    read_unlock(&v4vtables_rules_lock);
+    return ret;
+}
+
+/*
+ * Hypercall to do the send
+ */
+static long
+v4v_sendv(struct domain *src_d, v4v_addr_t *src_addr,
+          v4v_addr_t *dst_addr, uint32_t message_type,
+          XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs, size_t niov)
+{
+    struct domain *dst_d;
+    v4v_ring_id_t src_id;
+    struct v4v_ring_info *ring_info;
+    long ret = 0;
+
+    if ( !dst_addr )
+    {
+        v4v_dprintk("!dst_addr, EINVAL\n");
+        return -EINVAL;
+    }
+
+    read_lock(&v4v_lock);
+    if ( !src_d->v4v )
+    {
+        read_unlock(&v4v_lock);
+        v4v_dprintk("!src_d->v4v, EINVAL\n");
+        return -EINVAL;
+    }
+
+    src_id.addr.pad = 0;
+    src_id.addr.port = src_addr->port;
+    src_id.addr.domain = src_d->domain_id;
+    src_id.partner = dst_addr->domain;
+
+    dst_d = get_domain_by_id(dst_addr->domain);
+    if ( !dst_d )
+    {
+        read_unlock(&v4v_lock);
+        v4v_dprintk("!dst_d, ECONNREFUSED\n");
+        return -ECONNREFUSED;
+    }
+
+    if ( v4vtables_check(src_addr, dst_addr) != 0 )
+    {
+        read_unlock(&v4v_lock);
+        gdprintk(XENLOG_WARNING,
+                 "V4V: VIPTables REJECTED %i:%i -> %i:%i\n",
+                 src_addr->domain, src_addr->port,
+                 dst_addr->domain, dst_addr->port);
+        return -ECONNREFUSED;
+    }
+
+    do {
+        if ( !dst_d->v4v )
+        {
+            v4v_dprintk("dst_d->v4v, ECONNREFUSED\n");
+            ret = -ECONNREFUSED;
+            break;
+        }
+
+        read_lock(&dst_d->v4v->lock);
+        ring_info =
+            v4v_ring_find_info_by_addr(dst_d, dst_addr, src_addr->domain);
+
+        if ( !ring_info )
+        {
+            ret = -ECONNREFUSED;
+            v4v_dprintk(" !ring_info, ECONNREFUSED\n");
+        }
+        else
+        {
+            long len = v4v_iov_count(iovs, niov);
+
+            if ( len < 0 )
+            {
+                ret = len;
+                break;
+            }
+
+            spin_lock(&ring_info->lock);
+            ret =
+                v4v_ringbuf_insertv(dst_d, ring_info, &src_id, message_type,
+                        iovs, niov, len);
+            if ( ret == -EAGAIN )
+            {
+                v4v_dprintk("v4v_ringbuf_insertv failed, EAGAIN\n");
+                /* Schedule a wake up on the event channel when space is there */
+                if ( v4v_pending_requeue(ring_info, src_d->domain_id, len) )
+                {
+                    v4v_dprintk("v4v_pending_requeue failed, ENOMEM\n");
+                    ret = -ENOMEM;
+                }
+            }
+            spin_unlock(&ring_info->lock);
+
+            if ( ret >= 0 )
+            {
+                v4v_signal_domain(dst_d);
+            }
+
+        }
+        read_unlock(&dst_d->v4v->lock);
+
+    } while ( 0 );
+
+    put_domain(dst_d);
+    read_unlock(&v4v_lock);
+    return ret;
+}
+
+static void
+v4v_info(struct domain *d, v4v_info_t *info)
+{
+    read_lock(&d->v4v->lock);
+    info->ring_magic = V4V_RING_MAGIC;
+    info->data_magic = V4V_RING_DATA_MAGIC;
+    info->evtchn = d->v4v->evtchn_port;
+    read_unlock(&d->v4v->lock);
+}
+
+/*
+ * hypercall glue
+ */
+long
+do_v4v_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg1,
+          XEN_GUEST_HANDLE_PARAM(void) arg2,
+          uint32_t arg3, uint32_t arg4)
+{
+    struct domain *d = current->domain;
+    long rc = -EFAULT;
+
+    v4v_dprintk("->do_v4v_op(%d,%p,%p,%d,%d)\n", cmd,
+                (void *)arg1.p, (void *)arg2.p, (int) arg3, (int) arg4);
+
+    domain_lock(d);
+    switch (cmd)
+    {
+    case V4VOP_register_ring:
+    {
+         XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd =
+                 guest_handle_cast(arg1, v4v_ring_t);
+         XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd =
+                 guest_handle_cast(arg2, v4v_pfn_t);
+         uint32_t npage = arg3;
+
+         if ( unlikely(!guest_handle_okay(pfn_hnd, npage)) )
+             goto out;
+         rc = v4v_ring_add(d, ring_hnd, npage, pfn_hnd);
+         break;
+    }
+    case V4VOP_unregister_ring:
+    {
+         XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd =
+                 guest_handle_cast(arg1, v4v_ring_t);
+         rc = v4v_ring_remove(d, ring_hnd);
+         break;
+    }
+    case V4VOP_sendv:
+    {
+        uint32_t niov = arg3;
+        uint32_t message_type = arg4;
+
+        XEN_GUEST_HANDLE_PARAM(v4v_send_addr_t) addr_hnd =
+                guest_handle_cast(arg1, v4v_send_addr_t);
+        v4v_send_addr_t addr;
+
+        if ( copy_from_guest(&addr, addr_hnd, 1) )
+            goto out;
+
+        rc = v4v_sendv(d, &addr.src, &addr.dst, message_type,
+                guest_handle_cast(arg2, v4v_iov_t), niov);
+        break;
+    }
+    case V4VOP_notify:
+    {
+        XEN_GUEST_HANDLE_PARAM(v4v_ring_data_t) ring_data_hnd =
+                   guest_handle_cast(arg1, v4v_ring_data_t);
+        rc = v4v_notify(d, ring_data_hnd);
+        break;
+    }
+    case V4VOP_info:
+    {
+        XEN_GUEST_HANDLE_PARAM(v4v_info_t) info_hnd =
+               guest_handle_cast(arg1, v4v_info_t);
+        v4v_info_t info;
+
+        if ( unlikely(!guest_handle_okay(info_hnd, 1)) )
+            goto out;
+        v4v_info(d, &info);
+        if ( __copy_to_guest(info_hnd, &info, 1) )
+            goto out;
+        rc = 0;
+        break;
+    }
+    case V4VOP_tables_add:
+    {
+        uint32_t position = arg3;
+
+        XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd =
+               guest_handle_cast(arg1, v4vtables_rule_t);
+        rc = -EPERM;
+        if ( !is_hardware_domain(d) )
+            goto out;
+
+        write_lock(&v4vtables_rules_lock);
+        rc = v4vtables_add(d, rule_hnd, position);
+        write_unlock(&v4vtables_rules_lock);
+        break;
+    }
+    case V4VOP_tables_del:
+    {
+        uint32_t position = arg3;
+
+        XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd =
+               guest_handle_cast(arg1, v4vtables_rule_t);
+        rc = -EPERM;
+        if ( !is_hardware_domain(d) )
+             goto out;
+
+        write_lock(&v4vtables_rules_lock);
+        rc = v4vtables_del(d, rule_hnd, position);
+        write_unlock(&v4vtables_rules_lock);
+        break;
+    }
+    case V4VOP_tables_list:
+    {
+        XEN_GUEST_HANDLE_PARAM(v4vtables_list_t) rules_list_hnd =
+               guest_handle_cast(arg1, v4vtables_list_t);
+        rc = -EPERM;
+        if ( !is_hardware_domain(d) )
+            goto out;
+
+        read_lock(&v4vtables_rules_lock);
+        rc = v4vtables_list(d, rules_list_hnd);
+        read_unlock(&v4vtables_rules_lock);
+        break;
+    }
+    default:
+        rc = -ENOSYS;
+        break;
+    }
+out:
+    domain_unlock(d);
+    v4v_dprintk("<-do_v4v_op()=%d\n", (int)rc);
+    return rc;
+}
+
+/*
+ * init
+ */
+
+void
+v4v_destroy(struct domain *d)
+{
+    int i;
+
+    BUG_ON(!d->is_dying);
+    write_lock(&v4v_lock);
+
+    v4v_dprintk("d->v=%p\n", d->v4v);
+
+    if ( d->v4v )
+    {
+        write_lock(&d->v4v->lock);
+        for ( i = 0; i < V4V_HTABLE_SIZE; ++i )
+        {
+            struct hlist_node *node, *next;
+            struct v4v_ring_info *ring_info;
+
+            hlist_for_each_entry_safe(ring_info, node,
+                    next, &d->v4v->ring_hash[i],
+                    node)
+            {
+                v4v_ring_remove_info(d, ring_info);
+            }
+        }
+        write_unlock(&d->v4v->lock);
+    }
+
+    d->v4v = NULL;
+    write_unlock(&v4v_lock);
+}
+
+int
+v4v_init(struct domain *d)
+{
+    struct v4v_domain *v4v;
+    evtchn_port_t port;
+    int i;
+    int rc;
+
+    v4v = xmalloc(struct v4v_domain);
+    if ( !v4v )
+        return -ENOMEM;
+
+    rc = evtchn_alloc_unbound_domain(d, &port, d->domain_id, 0);
+    if ( rc )
+        return rc;
+
+    rwlock_init(&v4v->lock);
+
+    v4v->evtchn_port = port;
+    for ( i = 0; i < V4V_HTABLE_SIZE; ++i )
+        INIT_HLIST_HEAD(&v4v->ring_hash[i]);
+
+    write_lock(&v4v_lock);
+    d->v4v = v4v;
+    write_unlock(&v4v_lock);
+
+    return 0;
+}
+
+
+/*
+ * debug
+ */
+
+static void
+dump_domain_ring(struct domain *d, struct v4v_ring_info *ring_info)
+{
+    uint32_t rx_ptr;
+
+    printk(KERN_ERR "  ring: domid=%d port=0x%08x partner=%d npage=%d\n",
+           (int)d->domain_id, (int)ring_info->id.addr.port,
+           (int)ring_info->id.partner, (int)ring_info->npage);
+
+    if ( v4v_ringbuf_get_rx_ptr(d, ring_info, &rx_ptr) )
+    {
+        printk(KERN_ERR "   Failed to read rx_ptr\n");
+        return;
+    }
+
+    printk(KERN_ERR "   tx_ptr=%d rx_ptr=%d len=%d\n",
+           (int)ring_info->tx_ptr, (int)rx_ptr, (int)ring_info->len);
+}
+
+static void
+dump_domain(struct domain *d)
+{
+    int i;
+
+    printk(KERN_ERR " domain %d:\n", (int)d->domain_id);
+
+    read_lock(&d->v4v->lock);
+
+    for ( i = 0; i < V4V_HTABLE_SIZE; ++i )
+    {
+        struct hlist_node *node;
+        struct v4v_ring_info *ring_info;
+
+        hlist_for_each_entry(ring_info, node, &d->v4v->ring_hash[i], node)
+            dump_domain_ring(d, ring_info);
+    }
+
+    printk(KERN_ERR "  event channel: %d\n",  d->v4v->evtchn_port);
+    read_unlock(&d->v4v->lock);
+
+    printk(KERN_ERR "\n");
+    v4v_signal_domain(d);
+}
+
+static void
+dump_state(unsigned char key)
+{
+    struct domain *d;
+
+    printk(KERN_ERR "\n\nV4V:\n");
+    read_lock(&v4v_lock);
+
+    rcu_read_lock(&domlist_read_lock);
+
+    for_each_domain(d)
+        dump_domain(d);
+
+    rcu_read_unlock(&domlist_read_lock);
+
+    read_unlock(&v4v_lock);
+}
+
+struct keyhandler v4v_info_keyhandler =
+{
+    .diagnostic = 1,
+    .u.fn = dump_state,
+    .desc = "dump v4v states and interupt"
+};
+
+static int __init
+setup_dump_rings(void)
+{
+    register_keyhandler('4', &v4v_info_keyhandler);
+    return 0;
+}
+
+__initcall(setup_dump_rings);
+
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/Makefile b/xen/include/Makefile
index 62846a1..dbcc44c 100644
--- a/xen/include/Makefile
+++ b/xen/include/Makefile
@@ -22,7 +22,8 @@ headers-y := \
     compat/version.h \
     compat/xen.h \
     compat/xencomm.h \
-    compat/xenoprof.h
+    compat/xenoprof.h \
+    compat/v4v.h
 headers-$(CONFIG_X86)     += compat/arch-x86/xen-mca.h
 headers-$(CONFIG_X86)     += compat/arch-x86/xen.h
 headers-$(CONFIG_X86)     += compat/arch-x86/xen-$(compat-arch-y).h
diff --git a/xen/include/public/v4v.h b/xen/include/public/v4v.h
new file mode 100644
index 0000000..e32e2b5
--- /dev/null
+++ b/xen/include/public/v4v.h
@@ -0,0 +1,304 @@
+/******************************************************************************
+ * V4V
+ *
+ * Version 2 of v2v (Virtual-to-Virtual)
+ *
+ * Copyright (c) 2010, Citrix Systems
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __XEN_PUBLIC_V4V_H__
+#define __XEN_PUBLIC_V4V_H__
+
+#include "xen.h"
+#include "event_channel.h"
+
+/*
+ * Structure definitions
+ */
+
+#define V4V_RING_MAGIC          0xa822f72bb0b9d8ccUL
+#define V4V_RING_DATA_MAGIC	0x45fe852220b801d4UL
+
+#define V4V_MESSAGE_DGRAM       0x3c2c1db8
+#define V4V_MESSAGE_STREAM 	0x70f6a8e5
+
+#define V4V_DOMID_ANY           DOMID_INVALID
+#define V4V_PORT_ANY            0
+
+typedef uint64_t v4v_pfn_t;
+
+typedef struct v4v_iov
+{
+    uint64_t iov_base;
+    uint32_t iov_len;
+    uint32_t pad;
+} v4v_iov_t;
+
+typedef struct v4v_addr
+{
+    uint32_t port;
+    domid_t domain;
+    uint16_t pad;
+} v4v_addr_t;
+
+typedef struct v4v_ring_id
+{
+    v4v_addr_t addr;
+    domid_t partner;
+    uint16_t pad;
+} v4v_ring_id_t;
+
+typedef struct v4v_send_addr
+{
+    v4v_addr_t src;
+    v4v_addr_t dst;
+} v4v_send_addr_t;
+
+/*
+ * v4v_ring
+ * id: xen only looks at this during register/unregister
+ *     and will fill in id.addr.domain
+ * rx_ptr: rx pointer, modified by domain
+ * tx_ptr: tx pointer, modified by xen
+ *
+ */
+typedef struct v4v_ring
+{
+    uint64_t magic;
+    v4v_ring_id_t id;
+    uint32_t len;
+    uint32_t rx_ptr;
+    uint32_t tx_ptr;
+    uint8_t reserved[32];
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+    uint8_t ring[];
+#elif defined(__GNUC__)
+    uint8_t ring[0];
+#endif
+} v4v_ring_t;
+
+#define V4V_RING_DATA_F_EMPTY       (1U << 0) /* Ring is empty */
+#define V4V_RING_DATA_F_EXISTS      (1U << 1) /* Ring exists */
+#define V4V_RING_DATA_F_PENDING     (1U << 2) /* Pending interrupt exists - do not
+                                               * rely on this field - for
+                                               * profiling only */
+#define V4V_RING_DATA_F_SUFFICIENT  (1U << 3) /* Sufficient space to queue
+                                               * space_required bytes exists */
+
+typedef struct v4v_ring_data_ent
+{
+    v4v_addr_t ring;
+    uint16_t flags;
+    uint16_t pad;
+    uint32_t space_required;
+    uint32_t max_message_size;
+} v4v_ring_data_ent_t;
+
+typedef struct v4v_ring_data
+{
+    uint64_t magic;
+    uint32_t nent;
+    uint32_t pad;
+    uint64_t reserved[4];
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+    v4v_ring_data_ent_t data[];
+#elif defined(__GNUC__)
+    v4v_ring_data_ent_t data[0];
+#endif
+} v4v_ring_data_t;
+
+typedef struct v4v_info
+{
+    uint64_t ring_magic;
+    uint64_t data_magic;
+    evtchn_port_t evtchn;
+    uint32_t pad;
+} v4v_info_t;
+
+#define V4V_SHF_SYN		(1 << 0)
+#define V4V_SHF_ACK		(1 << 1)
+#define V4V_SHF_RST		(1 << 2)
+
+#define V4V_SHF_PING		(1 << 8)
+#define V4V_SHF_PONG		(1 << 9)
+
+struct v4v_stream_header
+{
+    uint32_t flags;
+    uint32_t conid;
+};
+
+struct v4v_ring_message_header
+{
+    uint32_t len;
+    uint32_t message_type;
+    v4v_addr_t source;
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+    uint8_t data[];
+#elif defined(__GNUC__)
+    uint8_t data[0];
+#endif
+};
+
+typedef struct v4vtables_rule
+{
+    v4v_addr_t src;
+    v4v_addr_t dst;
+    uint32_t accept;
+} v4vtables_rule_t;
+
+typedef struct v4vtables_list
+{
+    uint32_t start_rule;
+    uint32_t nb_rules;
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+    struct v4vtables_rule rules[];
+#elif defined(__GNUC__)
+    struct v4vtables_rule rules[0];
+#endif
+} v4vtables_list_t;
+
+/*
+ * HYPERCALLS
+ */
+
+/*
+ * V4VOP_register_ring
+ *
+ * Registers a ring with Xen. If a ring with the same v4v_ring_id exists,
+ * the hypercall will return -EEXIST.
+ *
+ * do_v4v_op(V4VOP_register_ring,
+ *           XEN_GUEST_HANDLE(v4v_ring_t), XEN_GUEST_HANDLE(v4v_pfn_t),
+ *           npage, 0)
+ */
+#define V4VOP_register_ring 	1
+
+/*
+ * V4VOP_unregister_ring
+ *
+ * Unregister a ring.
+ *
+ * do_v4v_op(V4VOP_unregister_ring,
+ *           XEN_GUEST_HANDLE(v4v_ring_t),
+ *           NULL, 0, 0)
+ */
+#define V4VOP_unregister_ring 	2
+
+/*
+ * V4VOP_notify
+ *
+ * Asks xen for information about other rings in the system
+ *
+ * ent->ring is the v4v_addr_t of the ring you want information on
+ * the same matching rules are used as for V4VOP_send.
+ *
+ * ent->space_required  if this field is not null xen will check
+ * that there is space in the destination ring for this many bytes
+ * of payload. If there is it will set the V4V_RING_DATA_F_SUFFICIENT
+ * and CANCEL any pending interrupt for that ent->ring, if insufficient
+ * space is available it will schedule an interrupt and the flag will
+ * not be set.
+ *
+ * The flags are set by xen when notify replies
+ * V4V_RING_DATA_F_EMPTY	ring is empty
+ * V4V_RING_DATA_F_PENDING	interrupt is pending - don't rely on this
+ * V4V_RING_DATA_F_SUFFICIENT	sufficient space for space_required is there
+ * V4V_RING_DATA_F_EXISTS	ring exists
+ *
+ * do_v4v_op(V4VOP_notify,
+ *           XEN_GUEST_HANDLE(v4v_ring_data_ent_t) ent,
+ *           NULL, 0, 0)
+ */
+#define V4VOP_notify 		4
+
+/*
+ * V4VOP_sendv
+ *
+ * Sends of list of buffer contained in iov.
+ *
+ * For each iov entry send iov_len bytes of iov_base to addr.dst, giving
+ * src as the source address (xen will ignore src->domain and put your
+ * domain in the actually message), xen first looks for a ring with id.addr==dst
+ * and id.partner==sending_domain if that fails it looks for id.addr==dst and
+ * id.partner==DOMID_ANY.
+ * message_type is the 32 bit number used from the message
+ * most likely V4V_MESSAGE_DGRAM or V4V_MESSAGE_STREAM. If insufficient space exists
+ * it will return -EAGAIN and xen will twing the V4V_INTERRUPT when
+ * sufficient space becomes available
+ *
+ * do_v4v_op(V4VOP_sendv,
+ *           XEN_GUEST_HANDLE(v4v_send_addr_t) addr,
+ *           XEN_GUEST_HANDLE(v4v_iov_t) iov,
+ *           uint32_t niov,
+ *           uint32_t message_type)
+ */
+#define V4VOP_sendv		5
+
+/*
+ * V4VOP_info
+ *
+ * Returns v4v info for the current domain (domain that issued the hypercall).
+ *      - V4V magic number
+ *      - event channel port (for current domain)
+ *
+ * do_v4v_op(V4VOP_info,
+ *           XEN_GUEST_HANDLE(v4v_info_t) info,
+ *           NULL, 0, 0)
+ */
+#define V4VOP_info              6
+
+/*
+ * V4VOP_tables_add
+ *
+ * Insert a filtering rules after a given position.
+ *
+ * do_v4v_op(V4VOP_tables_add,
+ *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
+ *           NULL,
+ *           uint32_t position, 0)
+ */
+#define V4VOP_tables_add        7
+
+/*
+ * V4VOP_tables_del
+ *
+ * Delete a filtering rules at a position or the rule
+ * that matches "rule".
+ *
+ * do_v4v_op(V4VOP_tables_del,
+ *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
+ *           NULL,
+ *           uint32_t position, 0)
+ */
+#define V4VOP_tables_del       8
+
+/*
+ * V4VOP_tables_list
+ *
+ * do_v4v_op(V4VOP_tables_list,
+ *           XEN_GUEST_HANDLE(v4vtpables_list_t) list,
+ *           NULL, 0, 0)
+ */
+#define V4VOP_tables_list      9
+
+#endif /* __XEN_PUBLIC_V4V_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index 3cab74f..0750152 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -100,7 +100,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
 #define __HYPERVISOR_domctl               36
 #define __HYPERVISOR_kexec_op             37
 #define __HYPERVISOR_tmem_op              38
-#define __HYPERVISOR_xc_reserved_op       39 /* reserved for XenClient */
+#define __HYPERVISOR_v4v_op               39
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index ae6a3b8..5eeeca9 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -24,6 +24,7 @@
 #include <public/sysctl.h>
 #include <public/vcpu.h>
 #include <public/mem_event.h>
+#include <xen/v4v.h>
 
 #ifdef CONFIG_COMPAT
 #include <compat/vcpu.h>
@@ -377,6 +378,9 @@ struct domain
     nodemask_t node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
+
+    /* v4v */
+    struct v4v_domain *v4v;
 };
 
 struct domain_setup_info
diff --git a/xen/include/xen/v4v.h b/xen/include/xen/v4v.h
new file mode 100644
index 0000000..a8a050b
--- /dev/null
+++ b/xen/include/xen/v4v.h
@@ -0,0 +1,49 @@
+/******************************************************************************
+ * V4V
+ *
+ * Version 2 of v2v (Virtual-to-Virtual)
+ *
+ * Copyright (c) 2010, Citrix Systems
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __V4V_PRIVATE_H__
+#define __V4V_PRIVATE_H__
+
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/spinlock.h>
+#include <xen/smp.h>
+#include <xen/shared.h>
+#include <xen/list.h>
+#include <public/v4v.h>
+
+struct v4v_domain;
+
+void v4v_destroy(struct domain *d);
+int v4v_init(struct domain *d);
+long do_v4v_op(int cmd,
+               XEN_GUEST_HANDLE_PARAM(void) arg1,
+               XEN_GUEST_HANDLE_PARAM(void) arg2,
+               uint32_t arg3,
+               uint32_t arg4);
+
+#endif /* __V4V_PRIVATE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index d832110..cacc8d3 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -97,3 +97,13 @@
 !	vcpu_set_singleshot_timer	vcpu.h
 ?	xenoprof_init			xenoprof.h
 ?	xenoprof_passive		xenoprof.h
+?	v4v_iov				v4v.h
+?	v4v_addr			v4v.h
+?	v4v_ring_id			v4v.h
+?	v4v_send_addr			v4v.h
+?	v4v_ring			v4v.h
+?	v4v_ring_data_ent		v4v.h
+?	v4v_ring_data			v4v.h
+?	v4v_info			v4v.h
+?	v4vtables_rule			v4v.h
+?	v4vtables_list			v4v.h
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
@ 2013-05-29  0:43   ` Matt Wilson
  2013-05-29 19:28     ` Ross Philipson
  2013-05-29  8:34   ` Jan Beulich
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Matt Wilson @ 2013-05-29  0:43 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

On Tue, May 28, 2013 at 03:43:31PM -0400, Ross Philipson wrote:
> Setup of v4v domains a domain gets created and cleanup
> when a domain die. Wire up the v4v hypercall.
> 
> Include v4v internal and public headers.
> 
> Signed-off-by: Ross Philipson <ross.philipson@citrix.com>

Would you consider making a boot-time option to disable v4v entirely?

> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -199,7 +199,8 @@ struct domain *domain_create(
>  {
>      struct domain *d, **pd;
>      enum { INIT_xsm = 1u<<0, INIT_watchdog = 1u<<1, INIT_rangeset = 1u<<2,
> -           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5 };
> +           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5,
> +           INIT_v4v = 1u<<6 };
>      int err, init_status = 0;
>      int poolid = CPUPOOLID_NONE;
>  
> @@ -312,6 +313,13 @@ struct domain *domain_create(
>          spin_unlock(&domlist_update_lock);
>      }

Perhaps an opt_v4v check here?
  
> +    if ( !is_idle_domain(d) )
> +    {
> +        if ( (err = v4v_init(d)) != 0 )
> +            goto fail;
> +        init_status |= INIT_v4v;
> +    }
> +
>      return d;
>  
>   fail:

[...]

> new file mode 100644
> index 0000000..701877d
> --- /dev/null
> +++ b/xen/common/v4v.c

[...]

> +/*
> + * hypercall glue
> + */
> +long
> +do_v4v_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg1,
> +          XEN_GUEST_HANDLE_PARAM(void) arg2,
> +          uint32_t arg3, uint32_t arg4)
> +{
> +    struct domain *d = current->domain;
> +    long rc = -EFAULT;
> +
> +    v4v_dprintk("->do_v4v_op(%d,%p,%p,%d,%d)\n", cmd,
> +                (void *)arg1.p, (void *)arg2.p, (int) arg3, (int) arg4);
> +

and check for d->v4v != NULL here, bailing with -ENOSYS?

> +    domain_lock(d);
> +    switch (cmd)
> +    {
> +    case V4VOP_register_ring:

[...]

--msw

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
  2013-05-29  0:43   ` Matt Wilson
@ 2013-05-29  8:34   ` Jan Beulich
  2013-05-29 19:26     ` Ross Philipson
  2013-05-29  9:56   ` Vincent Hanquez
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jan Beulich @ 2013-05-29  8:34 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

>>> On 28.05.13 at 21:43, Ross Philipson <ross.philipson@citrix.com> wrote:
> @@ -320,6 +328,8 @@ struct domain *domain_create(
>      xfree(d->mem_event);
>      if ( init_status & INIT_arch )
>          arch_domain_destroy(d);
> +    if ( init_status & INIT_v4v )
> +	v4v_destroy(d);

Hard tab.

> +static long
> +v4v_ringbuf_insertv(struct domain *d,
> +                    struct v4v_ring_info *ring_info,
> +                    v4v_ring_id_t *src_id, uint32_t message_type,
> +                    XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs,
> +                    uint32_t niov, size_t len)
> +{
>...
> +    do {
>...
> +    } while ( 0 );
> +
> +    v4v_ring_unmap(ring_info);

With the unmapping of all pages getting deferred till here - is there
a sensible upper limit on the number of accumulated mappings? Or
are you blindly running the host out of mapping space?

> +static int
> +v4v_find_ring_mfns(struct domain *d, struct v4v_ring_info *ring_info,
> +                   uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
> +{
>...
> +    for ( i = 0; i < npage; ++i )
> +    {
> +        unsigned long pfn;

If you use scope restricted local variables (which I appreciate),
please do so consistently at least within functions. I'm particularly
thinking of "t" here...

> +
> +        if ( copy_from_guest_offset(&pfn, pfn_hnd, i, 1) )
> +        {
> +            ret = -EFAULT;
> +            v4v_dprintk("EFAULT\n");
> +            break;
> +        }
> +
> +        mfn = mfn_x(get_gfn(d, pfn, &t));
> +        if ( !mfn_valid(mfn) )
> +        {
> +            put_gfn(d, pfn);
> +            printk(KERN_ERR "v4v domain %d passed invalid mfn %"PRI_mfn" ring %p seq %d\n",
> +                    d->domain_id, mfn, ring_info, i);
> +            ret = -EINVAL;
> +            break;

No handling of paged out or shared pages here, and not even a
comment saying this needs further adjustment?

Also in code that isn't a Linux clone, please use XENLOG_ instead
of KERN_ for message levels. And you absolutely have to use
XENLOG_G_ for messages concerning guest activities.

And then you properly use PRI_mfn here, yet elsewhere I saw MFNs
getting printed using bogus (void *) casts. This also calls for being
done consistently.

Finally, also printing the PFN here might aid guest side debugging.

> +static void v4v_ring_remove_mfns(struct domain *d, struct v4v_ring_info *ring_info)
> +{
>...
> +    if ( ring_info->mfn_mapping )
> +        xfree(ring_info->mfn_mapping);

Pointless if().

> +#define V4V_RING_MAGIC          0xa822f72bb0b9d8ccUL
> +#define V4V_RING_DATA_MAGIC	0x45fe852220b801d4UL

Hard tab.

> +#define V4V_MESSAGE_DGRAM       0x3c2c1db8
> +#define V4V_MESSAGE_STREAM 	0x70f6a8e5

Again. Was this really cleaned up, as the overview patch said?

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -24,6 +24,7 @@
>  #include <public/sysctl.h>
>  #include <public/vcpu.h>
>  #include <public/mem_event.h>
> +#include <xen/v4v.h>

Please don't, or if you absolutely have to, not after the public
headers, but along with the other xen/ ones.

> --- /dev/null
> +++ b/xen/include/xen/v4v.h
>...
> +struct v4v_domain;

What is this good for? There's no use in any of the following
function declarations.

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
  2013-05-29  0:43   ` Matt Wilson
  2013-05-29  8:34   ` Jan Beulich
@ 2013-05-29  9:56   ` Vincent Hanquez
  2013-05-30 16:20   ` Tim Deegan
  2013-06-10 15:06   ` David Vrabel
  4 siblings, 0 replies; 24+ messages in thread
From: Vincent Hanquez @ 2013-05-29  9:56 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

On 05/28/2013 08:43 PM, Ross Philipson wrote:
> +
> +typedef struct v4v_addr
> +{
> +    uint32_t port;
> +    domid_t domain;
> +    uint16_t pad;
> +} v4v_addr_t;

> +struct v4v_ring_message_header
> +{
> +    uint32_t len;
> +    uint32_t message_type;
> +    v4v_addr_t source;
> +#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
> +    uint8_t data[];
> +#elif defined(__GNUC__)
> +    uint8_t data[0];
> +#endif
> +};
> +

> +/*
> + * V4VOP_sendv
> + *
> + * Sends of list of buffer contained in iov.
> + *
> + * For each iov entry send iov_len bytes of iov_base to addr.dst, giving
> + * src as the source address (xen will ignore src->domain and put your
> + * domain in the actually message), xen first looks for a ring with id.addr==dst
> + * and id.partner==sending_domain if that fails it looks for id.addr==dst and
> + * id.partner==DOMID_ANY.
> + * message_type is the 32 bit number used from the message
> + * most likely V4V_MESSAGE_DGRAM or V4V_MESSAGE_STREAM. If insufficient space exists
> + * it will return -EAGAIN and xen will twing the V4V_INTERRUPT when
> + * sufficient space becomes available
> + *
> + * do_v4v_op(V4VOP_sendv,
> + *           XEN_GUEST_HANDLE(v4v_send_addr_t) addr,
> + *           XEN_GUEST_HANDLE(v4v_iov_t) iov,
> + *           uint32_t niov,
> + *           uint32_t message_type)
> + */
> +#define V4VOP_sendv		5

Instead of having the message_type part of each ring_message_header and 
in the sendv hypercall,
I wonder if this would make more sense to register the ring with a 
message type (as a "tag"),
so that ring are addressed by a 3-tuple (domain,port,message_type) 
instead of a 2-tuple (domain,port),
which make a ring only able to receive strictly one message type.

It could have the following benefits:

* v4v_addr's .pad becomes the message type.
* the message_type become less important for the hypervisor, it's just 
the same as a port.
* the message_header is 4 byte shorter and the guest doesn't have to 
sort message type itself.
* it could be useful to handle priority on rings at guest level for 
specific message type.
* guests doesn't receive message type that it doesn't expect on a ring.
* the source automatically get a connection refused if a guest is not 
handling a specific message type.
* the firewall can filter on message type as it's part of the v4v_addr.

-- 
Vincent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-29  8:34   ` Jan Beulich
@ 2013-05-29 19:26     ` Ross Philipson
  2013-05-30  5:16       ` Jan Beulich
  0 siblings, 1 reply; 24+ messages in thread
From: Ross Philipson @ 2013-05-29 19:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 05/29/2013 04:34 AM, Jan Beulich wrote:
>>>> On 28.05.13 at 21:43, Ross Philipson<ross.philipson@citrix.com>  wrote:
>> @@ -320,6 +328,8 @@ struct domain *domain_create(
>>       xfree(d->mem_event);
>>       if ( init_status&  INIT_arch )
>>           arch_domain_destroy(d);
>> +    if ( init_status&  INIT_v4v )
>> +	v4v_destroy(d);
>
> Hard tab.
>
>> +static long
>> +v4v_ringbuf_insertv(struct domain *d,
>> +                    struct v4v_ring_info *ring_info,
>> +                    v4v_ring_id_t *src_id, uint32_t message_type,
>> +                    XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs,
>> +                    uint32_t niov, size_t len)
>> +{
>> ...
>> +    do {
>> ...
>> +    } while ( 0 );
>> +
>> +    v4v_ring_unmap(ring_info);
>
> With the unmapping of all pages getting deferred till here - is there
> a sensible upper limit on the number of accumulated mappings? Or
> are you blindly running the host out of mapping space?

It is limited by the maximum size of the ring that was created. In 
theory that could be large. I could look at freeing them in the loop as 
it progresses.

>
>> +static int
>> +v4v_find_ring_mfns(struct domain *d, struct v4v_ring_info *ring_info,
>> +                   uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
>> +{
>> ...
>> +    for ( i = 0; i<  npage; ++i )
>> +    {
>> +        unsigned long pfn;
>
> If you use scope restricted local variables (which I appreciate),
> please do so consistently at least within functions. I'm particularly
> thinking of "t" here...

I will go through and make this consistent.

>
>> +
>> +        if ( copy_from_guest_offset(&pfn, pfn_hnd, i, 1) )
>> +        {
>> +            ret = -EFAULT;
>> +            v4v_dprintk("EFAULT\n");
>> +            break;
>> +        }
>> +
>> +        mfn = mfn_x(get_gfn(d, pfn,&t));
>> +        if ( !mfn_valid(mfn) )
>> +        {
>> +            put_gfn(d, pfn);
>> +            printk(KERN_ERR "v4v domain %d passed invalid mfn %"PRI_mfn" ring %p seq %d\n",
>> +                    d->domain_id, mfn, ring_info, i);
>> +            ret = -EINVAL;
>> +            break;
>
> No handling of paged out or shared pages here, and not even a
> comment saying this needs further adjustment?

Right, need to adddress this. I guess we should be looking for 
p2m_is_shared/p2m_is_paging and avoid those types. There may be other 
types to avoid too like p2m_is_grant etc.

>
> Also in code that isn't a Linux clone, please use XENLOG_ instead
> of KERN_ for message levels. And you absolutely have to use
> XENLOG_G_ for messages concerning guest activities.
>
> And then you properly use PRI_mfn here, yet elsewhere I saw MFNs
> getting printed using bogus (void *) casts. This also calls for being
> done consistently.
>
> Finally, also printing the PFN here might aid guest side debugging.

Wilco

>
>> +static void v4v_ring_remove_mfns(struct domain *d, struct v4v_ring_info *ring_info)
>> +{
>> ...
>> +    if ( ring_info->mfn_mapping )
>> +        xfree(ring_info->mfn_mapping);
>
> Pointless if().
>
>> +#define V4V_RING_MAGIC          0xa822f72bb0b9d8ccUL
>> +#define V4V_RING_DATA_MAGIC	0x45fe852220b801d4UL
>
> Hard tab.
>
>> +#define V4V_MESSAGE_DGRAM       0x3c2c1db8
>> +#define V4V_MESSAGE_STREAM 	0x70f6a8e5
>
> Again. Was this really cleaned up, as the overview patch said?

It was cleaned up. I inherited the patch set and went through it. In the 
diff from version 8 and the inherited set I saw quite a bit of cleanup. 
But that is not an excuse - I will clean the rest up, sorry.

>
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -24,6 +24,7 @@
>>   #include<public/sysctl.h>
>>   #include<public/vcpu.h>
>>   #include<public/mem_event.h>
>> +#include<xen/v4v.h>
>
> Please don't, or if you absolutely have to, not after the public
> headers, but along with the other xen/ ones.
>
>> --- /dev/null
>> +++ b/xen/include/xen/v4v.h
>> ...
>> +struct v4v_domain;
>
> What is this good for? There's no use in any of the following
> function declarations.

It is a forward declaration of the main internal v4v struct. It is used 
in sched.h in the domain struct - thus the #include you asked about 
earlier. I will of course move that include.

Thanks
Ross

>
> Jan
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-29  0:43   ` Matt Wilson
@ 2013-05-29 19:28     ` Ross Philipson
  0 siblings, 0 replies; 24+ messages in thread
From: Ross Philipson @ 2013-05-29 19:28 UTC (permalink / raw)
  To: Matt Wilson; +Cc: xen-devel

On 05/28/2013 08:43 PM, Matt Wilson wrote:
> On Tue, May 28, 2013 at 03:43:31PM -0400, Ross Philipson wrote:
>> Setup of v4v domains a domain gets created and cleanup
>> when a domain die. Wire up the v4v hypercall.
>>
>> Include v4v internal and public headers.
>>
>> Signed-off-by: Ross Philipson<ross.philipson@citrix.com>
>
> Would you consider making a boot-time option to disable v4v entirely?

I think that sounds reasonable since if you are not using it, it is 
consuming resources needlessly. I actually was thinking about this 
yesterday shortly after submitting the last patch set.

Thanks
Ross

>
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -199,7 +199,8 @@ struct domain *domain_create(
>>   {
>>       struct domain *d, **pd;
>>       enum { INIT_xsm = 1u<<0, INIT_watchdog = 1u<<1, INIT_rangeset = 1u<<2,
>> -           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5 };
>> +           INIT_evtchn = 1u<<3, INIT_gnttab = 1u<<4, INIT_arch = 1u<<5,
>> +           INIT_v4v = 1u<<6 };
>>       int err, init_status = 0;
>>       int poolid = CPUPOOLID_NONE;
>>
>> @@ -312,6 +313,13 @@ struct domain *domain_create(
>>           spin_unlock(&domlist_update_lock);
>>       }
>
> Perhaps an opt_v4v check here?
>
>> +    if ( !is_idle_domain(d) )
>> +    {
>> +        if ( (err = v4v_init(d)) != 0 )
>> +            goto fail;
>> +        init_status |= INIT_v4v;
>> +    }
>> +
>>       return d;
>>
>>    fail:
>
> [...]
>
>> new file mode 100644
>> index 0000000..701877d
>> --- /dev/null
>> +++ b/xen/common/v4v.c
>
> [...]
>
>> +/*
>> + * hypercall glue
>> + */
>> +long
>> +do_v4v_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg1,
>> +          XEN_GUEST_HANDLE_PARAM(void) arg2,
>> +          uint32_t arg3, uint32_t arg4)
>> +{
>> +    struct domain *d = current->domain;
>> +    long rc = -EFAULT;
>> +
>> +    v4v_dprintk("->do_v4v_op(%d,%p,%p,%d,%d)\n", cmd,
>> +                (void *)arg1.p, (void *)arg2.p, (int) arg3, (int) arg4);
>> +
>
> and check for d->v4v != NULL here, bailing with -ENOSYS?
>
>> +    domain_lock(d);
>> +    switch (cmd)
>> +    {
>> +    case V4VOP_register_ring:
>
> [...]
>
> --msw

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-29 19:26     ` Ross Philipson
@ 2013-05-30  5:16       ` Jan Beulich
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Beulich @ 2013-05-30  5:16 UTC (permalink / raw)
  To: ross.philipson; +Cc: xen-devel

>>>> Ross Philipson <ross.philipson@citrix.com> 05/29/13 9:26 PM >>>
>On 05/29/2013 04:34 AM, Jan Beulich wrote:
>>>>> On 28.05.13 at 21:43, Ross Philipson<ross.philipson@citrix.com>  wrote:
>>> --- a/xen/include/xen/sched.h
>>> +++ b/xen/include/xen/sched.h
>>> @@ -24,6 +24,7 @@
>>>   #include<public/sysctl.h>
>>>   #include<public/vcpu.h>
>>>   #include<public/mem_event.h>
>>> +#include<xen/v4v.h>
>>
>> Please don't, or if you absolutely have to, not after the public
>> headers, but along with the other xen/ ones.
>>
>>> --- /dev/null
>>> +++ b/xen/include/xen/v4v.h
>>> ...
>>> +struct v4v_domain;
>>
>> What is this good for? There's no use in any of the following
>> function declarations.
>
>It is a forward declaration of the main internal v4v struct. It is used 
>in sched.h in the domain struct - thus the #include you asked about 
>earlier. I will of course move that include.

Delete, not move - note my mentioning of function declarations. An incomplete
structure being used in another structure or union is no problem without
forward declaration - it serves as a forward declaration itself. This would only
be a problem in C++, but this is a non-public header, so C++ is of no concern.

Jan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-28 19:43 [PATCH (V9) 0/2] Add V4V to Xen Ross Philipson
  2013-05-28 19:43 ` [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain Ross Philipson
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
@ 2013-05-30 11:57 ` Ian Campbell
  2013-05-31  7:36   ` Vincent Hanquez
  2013-05-30 12:07 ` Ian Campbell
  3 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2013-05-30 11:57 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

On Tue, 2013-05-28 at 15:43 -0400, Ross Philipson wrote:
> [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
> [PATCH (V9) 2/2] xen: Add V4V implementation

What is happening on the tools/client side?

There have been a few "here's a quick hack" and "dump of some internal
thing" type things posted but AFAICT no real effort to get anything into
a shape where this functionality is useful to users at large.

At the very last I think we need to see a libvchan patch which uses this
mechanism when it is available. libvchan is the interface which we have
standardised on for this sort of communication and I'm reluctant to see
it simply discarded just because something apparently shiny has come
along.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-28 19:43 [PATCH (V9) 0/2] Add V4V to Xen Ross Philipson
                   ` (2 preceding siblings ...)
  2013-05-30 11:57 ` [PATCH (V9) 0/2] Add V4V to Xen Ian Campbell
@ 2013-05-30 12:07 ` Ian Campbell
  2013-05-30 16:08   ` David Vrabel
  3 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2013-05-30 12:07 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

On Tue, 2013-05-28 at 15:43 -0400, Ross Philipson wrote:
> [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
> [PATCH (V9) 2/2] xen: Add V4V implementation

 xen/common/event_channel.c |   45 +++++++++++++++++++++++++++++++-------------
 xen/include/xen/event.h    |    3 +++
 2 files changed, 35 insertions(+), 13 deletions(-)

 xen/arch/x86/hvm/hvm.c             |    6 +-
 xen/arch/x86/x86_64/compat/entry.S |    2 +
 xen/arch/x86/x86_64/entry.S        |    2 +
 xen/common/Makefile                |    1 +
 xen/common/domain.c                |   13 +-
 xen/common/v4v.c                   | 1923 ++++++++++++++++++++++++++++++++++++
 xen/include/Makefile               |    3 +-
 xen/include/public/v4v.h           |  304 ++++++
 xen/include/public/xen.h           |    2 +-
 xen/include/xen/sched.h            |    4 +
 xen/include/xen/v4v.h              |   49 +
 xen/include/xlat.lst               |   10 +
 12 files changed, 2314 insertions(+), 5 deletions(-)

No patch to docs/... at all? The hypercall interface docs have improved
(although they still aren't great IMHO) but what's really needed is an
overview of the design and a "how do I actually use this" type thing.

If you aren't going to provide that then I don't think a sum total of 5
lines of commit message is sufficient.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-30 12:07 ` Ian Campbell
@ 2013-05-30 16:08   ` David Vrabel
  2013-05-31  7:25     ` Vincent Hanquez
  0 siblings, 1 reply; 24+ messages in thread
From: David Vrabel @ 2013-05-30 16:08 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ross Philipson, xen-devel

On 30/05/13 13:07, Ian Campbell wrote:
> 
> No patch to docs/... at all? The hypercall interface docs have improved
> (although they still aren't great IMHO) but what's really needed is an
> overview of the design and a "how do I actually use this" type thing.

I agree.  I'm looking at inter-domain communication mechanisms for use
in XenServer and it's not obvious how to use v4v securely.

e.g., when a previously trusted domain (A) is compromised it may spam a
domain (B) with messages in a DoS attack.  The per source domain/port
receive rings help here as the domain A will not be able to block B from
receiving traffic from other domains.

But how are these per-connection rings created?  This seems to require
out-of-band signaling for connection setup.  I suppose this could be via
v4v and a connection manager service running in a known and trusted
domain. But how does a domain find the connection manager service and
how does it handle the connection management domain being restarted?

The other big question I have is why v4v?  v4v doesn't seem to offer any
advantages over using shared rings like libvchan.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
                     ` (2 preceding siblings ...)
  2013-05-29  9:56   ` Vincent Hanquez
@ 2013-05-30 16:20   ` Tim Deegan
  2013-06-04 18:01     ` Ross Philipson
  2013-06-10 15:06   ` David Vrabel
  4 siblings, 1 reply; 24+ messages in thread
From: Tim Deegan @ 2013-05-30 16:20 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

Hi,

Nice to see this series again. :)  Thanks for your work cleaning it up.

At 15:43 -0400 on 28 May (1369755811), Ross Philipson wrote:
> Setup of v4v domains a domain gets created and cleanup
> when a domain die. Wire up the v4v hypercall.
> 
> Include v4v internal and public headers.
> 
> Signed-off-by: Ross Philipson <ross.philipson@citrix.com>

I'm a bit worried about resource exhaustion attacks in this interface.
In particular:
 - It looks like a malicious guest can register large numbers of rings 
   (like, 2^48 of them), each of which causes Xen to xmalloc bookkeeping
   space (up to 2^22 bytes per ring). 
 - A malicious guest that has enough memory can send a very large vector
   (2^31 1-byte entries), which would cause Xen to read and check 2^37
   bytes of control structures before even starting the copy of the data.
   At a rough guess that could be >10seconds without checking for preemption.
   Maybe reduce the limits a bit (say, 32MB per hypercall, may 100 elements
   per vector)?
 - Something similar for V4VOP_notify, I guess, though that could perhaps
   be handled by the usual pending-events check + hypercall continuations.

And now, some detailed review:

> +/*
> + * Messages on the ring are padded to 128 bits
> + * Len here refers to the exact length of the data not including the
> + * 128 bit header. The message uses
> + * ((len +0xf) & ~0xf) + sizeof(v4v_ring_message_header) bytes
> + */
> +#define V4V_ROUNDUP(a) (((a) +0xf ) & ~0xf)

Even though this is always used on values <2^32 and so is safe,
it would be nice to have it as (((a) + 0xf) & ~(typeof (a))0xf)
so it's clear that it doesn't truncate any high-order bits.

[...]
> +/*
> + * lock to protect the filtering rules list: v4vtable_rules
> + *
> + * The write lock is held for viptables_del and viptables_add
> + * The read lock is held for viptable_list

s/viptables/v4vtables/g/

[...]
> +static int
> +v4v_ringbuf_get_rx_ptr(struct domain *d, struct v4v_ring_info *ring_info,
> +                        uint32_t * rx_ptr)
> +{
> +    v4v_ring_t *ringp;
> +
> +    if ( ring_info->npage == 0 )
> +        return -1;
> +
> +    ringp = map_domain_page(mfn_x(ring_info->mfns[0]));
> +
> +    v4v_dprintk("v4v_ringbuf_payload_space: mapped %p to %p\n",
> +                (void *)mfn_x(ring_info->mfns[0]), ringp);

The function name is wrong there, presumably after this code was
refctored.  You might add %s/__func__ to the definition of v4v_dprintk()
instead?

> +    if ( !ringp )
> +        return -1;
> +
> +    write_atomic(rx_ptr, ringp->rx_ptr);

AIUI ringp->rx_ptr is the shared variable here, so this should be
  *rx_ptr = read_atomic(ringp->rx_ptr);

Also, maybe comment in the hypercall interface that the guest should
also use atomic operations when accessing rx_ptr and tx_ptr.  (Though I
should say the hypercall comments are already pretty good compared to
some things we already have!)

> +    mb();

I'm not 100% convinced this barrier is needed (since we're not going to
read the 'free' memory in the ring and I don't think that writes to it
can possibly be hoisted past this read since they're gated on the
result).  OTOH I'm happy to err on the side of caution. :)

> +    unmap_domain_page(ringp);
> +    return 0;
> +}
> +
> +uint32_t
> +v4v_ringbuf_payload_space(struct domain * d, struct v4v_ring_info * ring_info)
> +{
> +    v4v_ring_t ring;
> +    int32_t ret;
> +
> +    ring.tx_ptr = ring_info->tx_ptr;
> +    ring.len = ring_info->len;
> +
> +    if ( v4v_ringbuf_get_rx_ptr(d, ring_info, &ring.rx_ptr) )
> +        return 0;
> +
> +    v4v_dprintk("v4v_ringbuf_payload_space:tx_ptr=%d rx_ptr=%d\n",
> +                (int)ring.tx_ptr, (int)ring.rx_ptr);
> +    if ( ring.rx_ptr == ring.tx_ptr )
> +        return ring.len - sizeof(struct v4v_ring_message_header);
> +
> +    ret = ring.rx_ptr - ring.tx_ptr;
> +    if ( ret < 0 )
> +        ret += ring.len;
> +
> +    ret -= sizeof(struct v4v_ring_message_header);
> +    ret -= V4V_ROUNDUP(1);
> +
> +    return (ret < 0) ? 0 : ret;

Sanity check for ret > ring.len as well?  I guess it's not clear what
you'd do about that.

[...]
> +static long
> +v4v_ringbuf_insertv(struct domain *d,
> +                    struct v4v_ring_info *ring_info,
> +                    v4v_ring_id_t *src_id, uint32_t message_type,
> +                    XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs,
> +                    uint32_t niov, size_t len)
> +{
> +    v4v_ring_t ring;
> +    struct v4v_ring_message_header mh = { 0 };
> +    int32_t sp;
> +    long happy_ret;
> +    int32_t ret = 0;
> +    XEN_GUEST_HANDLE(uint8_t) empty_hnd = { 0 };
> +
> +    ASSERT(spin_is_locked(&ring_info->lock));
> +
> +    happy_ret = len;
> +
> +    if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header) ) >=
> +            ring_info->len)
> +        return -EMSGSIZE;
> +
> +    do {
> +        if ( (ret = v4v_memcpy_from_guest_ring(&ring, ring_info, 0,
> +                                               sizeof(ring))) )
> +            break;

Should this be using an atomic operation to read ring.rx_ptr instead?
That's the only field we actually use from the guest copy anyway.

> +
> +        ring.tx_ptr = ring_info->tx_ptr;
> +        ring.len = ring_info->len;
> +
> +        v4v_dprintk("ring.tx_ptr=%d ring.rx_ptr=%d ring.len=%d ring_info->tx_ptr=%d\n",
> +                    ring.tx_ptr, ring.rx_ptr, ring.len, ring_info->tx_ptr);
> +
> +        if ( ring.rx_ptr == ring.tx_ptr )
> +            sp = ring_info->len;
> +        else
> +        {
> +            sp = ring.rx_ptr - ring.tx_ptr;
> +            if ( sp < 0 )
> +                sp += ring.len;
> +        }
> +
> +        if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header)) >= sp )
> +        {
> +            v4v_dprintk("EAGAIN\n");
> +            ret = -EAGAIN;
> +            break;
> +        }
> +
> +        mh.len = len + sizeof (struct v4v_ring_message_header);
> +        mh.source.port = src_id->addr.port;
> +        mh.source.domain = src_id->addr.domain;
> +        mh.message_type = message_type;
> +
> +        if ( (ret = v4v_memcpy_to_guest_ring(ring_info,
> +                                             ring.tx_ptr + sizeof(v4v_ring_t),
> +                                             &mh, empty_hnd,
> +                                             sizeof(mh))) )
> +            break;

OK, I think this is correct, because the tx_ptr is always 16-byte
aligned and the message header is 16 bytes long, but it's not
obvious. :) Maybe add a comment here, and a BUILD_BUG_ON() to assert
that sizeof (struct v4v_ring_message_header) <= V4V_ROUNDUP(1)?

> +        ring.tx_ptr += sizeof(mh);
> +        if ( ring.tx_ptr == ring_info->len )
> +            ring.tx_ptr = 0;
> +
> +        while ( niov-- )
> +        {
> +            XEN_GUEST_HANDLE_PARAM(uint8_t) bufp_hnd;
> +            XEN_GUEST_HANDLE(uint8_t) buf_hnd;
> +            v4v_iov_t iov;
> +
> +            if ( copy_from_guest(&iov, iovs, 1) )

Hrmn.  So here we re-copy the iov from the guest, but the guest might
have changed it since we looked at it last (when we calculated the total
length).  That means we need to keep a running count of how much we copy,
to be sure the guest doesn't swap short iovs for long ones and make
us overrun the ring's rx_ptr.

> +            {
> +                ret = -EFAULT;
> +                break;
> +            }
> +
> +            bufp_hnd = guest_handle_from_ptr(iov.iov_base, uint8_t);
> +            buf_hnd = guest_handle_from_param(bufp_hnd, uint8_t);
> +            len = iov.iov_len;
> +
> +            if ( unlikely(!guest_handle_okay(buf_hnd, len)) )
> +            {
> +                ret = -EFAULT;
> +                break;
> +            }
> +
> +            sp = ring.len - ring.tx_ptr;
> +
> +            if ( len > sp )
> +            {
> +                ret = v4v_memcpy_to_guest_ring(ring_info,
> +                        ring.tx_ptr + sizeof(v4v_ring_t),
> +                        NULL, buf_hnd, sp);
> +                if ( ret )
> +                    break;
> +
> +                ring.tx_ptr = 0;
> +                len -= sp;
> +                guest_handle_add_offset(buf_hnd, sp);
> +            }
> +
> +            ret = v4v_memcpy_to_guest_ring(ring_info,
> +                    ring.tx_ptr + sizeof(v4v_ring_t),
> +                    NULL, buf_hnd, len);
> +            if ( ret )
> +                break;
> +
> +            ring.tx_ptr += len;
> +
> +            if ( ring.tx_ptr == ring_info->len )
> +                ring.tx_ptr = 0;
> +
> +            guest_handle_add_offset(iovs, 1);
> +        }
> +        if ( ret )
> +            break;
> +
> +        ring.tx_ptr = V4V_ROUNDUP(ring.tx_ptr);
> +
> +        if ( ring.tx_ptr >= ring_info->len )
> +            ring.tx_ptr -= ring_info->len;
> +
> +        mb();
> +        ring_info->tx_ptr = ring.tx_ptr;
> +        if ( (ret = v4v_update_tx_ptr(ring_info, ring.tx_ptr)) )
> +            break;
> +    } while ( 0 );
> +
> +    v4v_ring_unmap(ring_info);
> +
> +    return ret ? ret : happy_ret;
> +}
[...]
> +static int
> +v4v_pending_requeue(struct v4v_ring_info *ring_info, domid_t src_id, int len)
> +{
> +    struct hlist_node *node;
> +    struct v4v_pending_ent *ent;
> +
> +    hlist_for_each_entry(ent, node, &ring_info->pending, node)
> +    {
> +        if ( ent->id == src_id )
> +        {
> +            if ( ent->len < len )
> +                ent->len = len;

Does this mean that if I call V4VOP_notify with some large space
requirement, and then again with a smaller one, I won't get a
notification until the original large amount is free?  Seems like this
should always overwrite the old length instead, and notify on the most
recently requested amount.

> +static int
> +v4v_fill_ring_data(struct domain *src_d,
> +                   XEN_GUEST_HANDLE(v4v_ring_data_ent_t) data_ent_hnd)
> +{
> +    v4v_ring_data_ent_t ent;
> +    struct domain *dst_d;
> +    struct v4v_ring_info *ring_info;
> +
> +    if ( copy_from_guest(&ent, data_ent_hnd, 1) )
> +    {
> +        v4v_dprintk("EFAULT\n");
> +        return -EFAULT;
> +    }
> +
> +    v4v_dprintk("v4v_fill_ring_data: ent.ring.domain=%d,ent.ring.port=%u\n",
> +                (int)ent.ring.domain, (int)ent.ring.port);
> +
> +    ent.flags = 0;
> +
> +    dst_d = get_domain_by_id(ent.ring.domain);
> +
> +    if ( dst_d && dst_d->v4v )
> +    {
> +        read_lock(&dst_d->v4v->lock);
> +        ring_info = v4v_ring_find_info_by_addr(dst_d, &ent.ring,
> +                                               src_d->domain_id);
> +

Should there be a call to v4vtables_check() here?  Otherwise we leak a
little bit of information to a guest who's not supposed to see the ring,
(and also it's a bit odd to say 'yes, you may send' and then bounce the
sendv call).

[...]
> +static void v4v_ring_remove_mfns(struct domain *d, struct v4v_ring_info *ring_info)
> +{
> +    int i;
> +
> +    ASSERT(rw_is_write_locked(&d->v4v->lock));
> +
> +    if ( ring_info->mfns )
> +    {
> +        for ( i = 0; i < ring_info->npage; ++i )
> +            if ( mfn_x(ring_info->mfns[i]) != 0 )

Here, and where the array is populated, it would be better to use
INVALID_MFN in any empty slots.  On x86, MFN 0 is never allocated to a
guest, but that may not always be true on all architectures. 

[...]
> +static long
> +v4v_ring_add(struct domain *d, XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd,
> +             uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
> +{
[...]
> +        /*
> +         * no need for a lock yet, because only we know about this
> +         * set the tx pointer if it looks bogus (we don't reset it
> +         * because this might be a re-register after S4)
> +         */
> +        if ( (ring.tx_ptr >= ring.len)
> +                || (V4V_ROUNDUP(ring.tx_ptr) != ring.tx_ptr) )
> +        {
> +            ring.tx_ptr = ring.rx_ptr;
> +        }

Maybe check here that the guest-supplied rx_ptr is correctly aligned?
I don't think an unaligned one would be a security risk, but it might
cause entertaining breakage on the first sendv().

> +        copy_field_to_guest(ring_hnd, &ring, tx_ptr);
> +
> +        read_lock(&d->v4v->lock);
> +        ring_info = v4v_ring_find_info(d, &ring.id);
> +
> +        if ( !ring_info )
> +        {
> +            read_unlock(&d->v4v->lock);
> +            ring_info = xmalloc(struct v4v_ring_info);
> +            if ( !ring_info )
> +            {
> +                v4v_dprintk("ENOMEM\n");
> +                ret = -ENOMEM;
> +                break;
> +            }
> +            need_to_insert++;
> +            spin_lock_init(&ring_info->lock);
> +            INIT_HLIST_HEAD(&ring_info->pending);
> +            ring_info->mfns = NULL;
> +        }
> +        else
> +        {
> +            /*
> +             * Ring info already existed.
> +             */
> +            printk(KERN_INFO "v4v: dom%d ring already registered\n",
> +                    current->domain->domain_id);
> +            ret = -EEXIST;
> +            break;
> +        }

If you reshuffle these so that the already-exists case is first, then
the not-already-exists case doesn't need to be indented, i.e. 

    if ( ring_info )
    { 
        printk();
        ret = -EEXIST;
        break;
    }
    /* Normal case goes here */

Then I think it becomes clear that you don't need a 'need_to_insert'
flag -- and that the read_unlock() below is dead code, so the error path
doesn't unlock at all!

> +
> +        spin_lock(&ring_info->lock);
> +        ring_info->id = ring.id;
> +        ring_info->len = ring.len;
> +        ring_info->tx_ptr = ring.tx_ptr;
> +        ring_info->ring = ring_hnd;

I can't see where this field is used, but AFAICT the ring_hnd isn't
guaranteed to be valid after this hypercall finishes (and certainly not
after, say, the next context switch), so storing it doesn't seem useful.

> +        if ( ring_info->mfns )
> +            xfree(ring_info->mfns);
> +        ret = v4v_find_ring_mfns(d, ring_info, npage, pfn_hnd);
> +        spin_unlock(&ring_info->lock);
> +        if ( ret )
> +            break;
> +
> +        if ( !need_to_insert )
> +        {
> +            read_unlock(&d->v4v->lock);
> +        }
> +        else
> +        {
> +            uint16_t hash = v4v_hash_fn(&ring.id);
> +
> +            write_lock(&d->v4v->lock);
> +            hlist_add_head(&ring_info->node, &d->v4v->ring_hash[hash]);
> +            write_unlock(&d->v4v->lock);
> +        }
> +    } while ( 0 );
> +
> +    read_unlock(&v4v_lock);
> +    return ret;
> +}
[...]
> +void
> +v4vtables_print_rule(struct v4vtables_rule_node *node)
> +{
[...]
> +    if ( rule->accept == 1 )
> +        printk("ACCEPT");

The logic in the actual ACL is (effectively) 'if ( rule->accept != 0 )';
so this probably ought to follow suit. :)

[...]
> +int
> +v4vtables_del(struct domain *src_d,
> +              XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd,
> +              int32_t position)
> +{
> +    struct list_head *tmp = NULL;
> +    struct list_head *to_delete = NULL;
> +    struct list_head *next = NULL;
> +    struct v4vtables_rule_node *node;
> +
> +    ASSERT(rw_is_write_locked(&v4vtables_rules_lock));
> +
> +    v4v_dprintk("v4vtables_del position:%d\n", position);
> +
> +    if ( position != -1 )
> +    {
> +        /* We want to delete the rule number <position> */
> +        list_for_each(tmp, &v4vtables_rules)
> +        {
> +            to_delete = tmp;
> +            if (position == 0)
> +                break;
> +            position--;
> +        }

Hmmm.  If this is called with position = 1 with a multi-element list, it
selects the second element (which conflicts with the semantics of the
add operation, where '1' is the first element).  And if it's called with
'1' on a single-element list, it selects the first (and only) element.

> +        /* Can't find the position */
> +        if (position != 0)
> +            to_delete = NULL;
> +    }
> +    else if ( !guest_handle_is_null(rule_hnd) )
> +    {
> +        struct v4vtables_rule r;
> +
> +        if ( copy_from_guest(&r, rule_hnd, 1) )
> +            return -EFAULT;
> +
> +        list_for_each(tmp, &v4vtables_rules)
> +        {
> +            node = list_entry(tmp, struct v4vtables_rule_node, list);
> +
> +            if ( (node->rule.src.domain == r.src.domain) &&
> +                 (node->rule.src.port   == r.src.port)   &&
> +                 (node->rule.dst.domain == r.dst.domain) &&
> +                 (node->rule.dst.port   == r.dst.port))

Not matching by 'accept' too?  That should be mentioned in the hypercall
interface, I think.

> +            {
> +                to_delete = tmp;
> +                break;
> +            }
> +        }
> +    }
> +    else
> +    {
> +        /* We want to flush the rules! */
> +        printk(KERN_ERR "VIPTables: flushing rules\n");
> +        list_for_each_safe(tmp, next, &v4vtables_rules)
> +        {
> +            node = list_entry(tmp, struct v4vtables_rule_node, list);
> +            list_del(tmp);
> +            xfree(node);
> +        }
> +    }
> +
> +    if ( to_delete )
> +    {
> +        node = list_entry(to_delete, struct v4vtables_rule_node, list);
> +#ifdef V4V_DEBUG
> +        printk(KERN_ERR "VIPTables: deleting rule: ");
> +        v4vtables_print_rule(node);
> +#endif /* V4V_DEBUG */
> +        list_del(to_delete);
> +        xfree(node);
> +    }
> +
> +    return 0;
> +}
> +
> +static size_t
> +v4vtables_list(struct domain *src_d,
> +               XEN_GUEST_HANDLE_PARAM(v4vtables_list_t) list_hnd)
> +{
> +    struct list_head *ptr;
> +    struct v4vtables_rule_node *node;
> +    struct v4vtables_list rules_list;
> +    uint32_t nbrules;
> +    XEN_GUEST_HANDLE(v4vtables_rule_t) guest_rules;
> +
> +    ASSERT(rw_is_locked(&v4vtables_rules_lock));
> +
> +    memset(&rules_list, 0, sizeof (rules_list));
> +    if ( copy_from_guest(&rules_list, list_hnd, 1) )
> +        return -EFAULT;
> +
> +    ptr = v4vtables_rules.next;
> +    while ( rules_list.start_rule != 0 && ptr->next != &v4vtables_rules )
> +    {
> +        ptr = ptr->next;
> +        rules_list.start_rule--;
> +    }

Again, that seems to conflict with the 'add' semantics that the first
rule is rule 1.

And what if the supplied start_rule was too large?  As it stands I think
this will return the last rule in the list rather than returning no
rules (or an error).

> +
> +    if ( rules_list.nb_rules == 0 )
> +        return -EINVAL;
> +
> +    guest_rules = guest_handle_for_field(list_hnd, v4vtables_rule_t, rules[0]);
> +
> +    nbrules = 0;
> +    while ( nbrules < rules_list.nb_rules && ptr != &v4vtables_rules )
> +    {
> +        node = list_entry(ptr, struct v4vtables_rule_node, list);
> +
> +        if ( copy_to_guest(guest_rules, &node->rule, 1) )
> +            break;

Maybe also arrange to return -EFAULT?

> +
> +        guest_handle_add_offset(guest_rules, 1);
> +
> +        nbrules++;
> +        ptr = ptr->next;
> +    }
> +
> +    rules_list.nb_rules = nbrules;
> +    if ( copy_field_to_guest(list_hnd, &rules_list, nb_rules) )
> +        return -EFAULT;
> +
> +    return 0;
> +}
> +
> +static size_t
> +v4vtables_check(v4v_addr_t *src, v4v_addr_t *dst)
> +{
> +    struct list_head *ptr;
> +    struct v4vtables_rule_node *node;
> +    size_t ret = 0; /* Defaulting to ACCEPT */

size_t?  bool_t would be better.  Also, having this function return the
inverse of the rule encoding is just perverse.

[...]
> +    case V4VOP_tables_add:
> +    {
> +        uint32_t position = arg3;
> +
> +        XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd =
> +               guest_handle_cast(arg1, v4vtables_rule_t);
> +        rc = -EPERM;
> +        if ( !is_hardware_domain(d) )
> +            goto out;

I think this ought to be an XSM check now, with XSM_PRIV to indicate
that the default policy is to check IS_PRIV().  (Likewise for the
other table ops.)

[...]
> +struct keyhandler v4v_info_keyhandler =
> +{
> +    .diagnostic = 1,
> +    .u.fn = dump_state,
> +    .desc = "dump v4v states and interupt"

s/interupt/interrupt/

[...]
> +/*
> + * V4VOP_register_ring
> + *
> + * Registers a ring with Xen. If a ring with the same v4v_ring_id exists,
> + * the hypercall will return -EEXIST.
> + *
> + * do_v4v_op(V4VOP_register_ring,
> + *           XEN_GUEST_HANDLE(v4v_ring_t), XEN_GUEST_HANDLE(v4v_pfn_t),
> + *           npage, 0)

Maybe add a 'uint32_t ' before npage, to match the same style below?

[...]
> +/*
> + * V4VOP_sendv
> + *
> + * Sends of list of buffer contained in iov.
> + *
> + * For each iov entry send iov_len bytes of iov_base to addr.dst, giving
> + * src as the source address (xen will ignore src->domain and put your
> + * domain in the actually message), xen first looks for a ring with id.addr==dst
> + * and id.partner==sending_domain if that fails it looks for id.addr==dst and

These lines are longer than 80 chars - can you wrap a little narrower please?
Also s/actually/actual/.

> + * id.partner==DOMID_ANY.
> + * message_type is the 32 bit number used from the message
> + * most likely V4V_MESSAGE_DGRAM or V4V_MESSAGE_STREAM. If insufficient space exists
> + * it will return -EAGAIN and xen will twing the V4V_INTERRUPT when

'twing'? :)

[...]
> +/*
> + * V4VOP_tables_add
> + *
> + * Insert a filtering rules after a given position.
> + *

Can you add a description of the semantics of the rules?
E.g. does the first match or the last match or the tightest match dominate?

> + * do_v4v_op(V4VOP_tables_add,
> + *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
> + *           NULL,
> + *           uint32_t position, 0)
> + */
> +#define V4VOP_tables_add        7
> +
> +/*
> + * V4VOP_tables_del
> + *
> + * Delete a filtering rules at a position or the rule
> + * that matches "rule".
> + *
> + * do_v4v_op(V4VOP_tables_del,
> + *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
> + *           NULL,
> + *           uint32_t position, 0)
> + */
> +#define V4VOP_tables_del       8

Can you describe the calling convention a bit more fully?  In particular
that position should == -1 if you want to match by rule (though I think
== 0 would make more sense), and that position == -1 && rule == NULL
means to flush everything.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-30 16:08   ` David Vrabel
@ 2013-05-31  7:25     ` Vincent Hanquez
  2013-05-31 10:21       ` David Vrabel
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Hanquez @ 2013-05-31  7:25 UTC (permalink / raw)
  To: David Vrabel; +Cc: Ian Campbell, Ross Philipson, xen-devel

On 05/30/2013 05:08 PM, David Vrabel wrote:
> On 30/05/13 13:07, Ian Campbell wrote:
>> No patch to docs/... at all? The hypercall interface docs have improved
>> (although they still aren't great IMHO) but what's really needed is an
>> overview of the design and a "how do I actually use this" type thing.
> I agree.  I'm looking at inter-domain communication mechanisms for use
> in XenServer and it's not obvious how to use v4v securely.
>
> e.g., when a previously trusted domain (A) is compromised it may spam a
> domain (B) with messages in a DoS attack.  The per source domain/port
> receive rings help here as the domain A will not be able to block B from
> receiving traffic from other domains.
It's really up to the guest to take active measure to prevent this to 
happens.
B have multiple ways to handle this scenario:

* unregister his ring: A can't communicate with B anymore
* throttle his ring processing: if B doens't process his ring, 
eventually the ring is full
and A can't send any more spam.
* use stream message type, which has the same semantic to tcp 
(LISTENING/CONNECTING/CONNECTED/..), where a stream need to be connected 
before data is processed.

There's also the v4v firewall where connection can be blocked.
I'm not sure at the moment that a guest can set anything in it itself, 
but if not
i think it would be a good idea for a guest to proactively set blocking 
rules for
ring it owns.

> But how are these per-connection rings created?  This seems to require
> out-of-band signaling for connection setup.  I suppose this could be via
> v4v and a connection manager service running in a known and trusted
> domain. But how does a domain find the connection manager service and
> how does it handle the connection management domain being restarted?
Rings are created by a guest listening to v4v.
This is similar to how the ip stack works, as in some service may be 
listening on
some port, and ip doesn't provide any mechanism to find those service 
either.

IIRC, we use xenstore to provide connection parameters for services.

-- 
Vincent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-30 11:57 ` [PATCH (V9) 0/2] Add V4V to Xen Ian Campbell
@ 2013-05-31  7:36   ` Vincent Hanquez
  2013-05-31  7:50     ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Hanquez @ 2013-05-31  7:36 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ross Philipson, xen-devel

On 05/30/2013 12:57 PM, Ian Campbell wrote:
> On Tue, 2013-05-28 at 15:43 -0400, Ross Philipson wrote:
>> [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
>> [PATCH (V9) 2/2] xen: Add V4V implementation
> What is happening on the tools/client side?
I'm not sure what kind of tool you want to see here; the only thing i 
can think of is
a way to manipulate the v4v firewall in dom0.

 From a client side PoV, this is up to the guest kernel to expose the 
capability.

For example on linux, the idea is to exposes it is using a new socket 
family, so that
everything is handled by usual syscalls (socket(2), recvmsg(3), 
sendmsg(3), connect(2), listen(2), ioctl(2), ...), and the userspace 
doens't need any library,
and specially just a trivial change to already written code that use ip 
socket without changing any semantics.

-- 
Vincent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  7:36   ` Vincent Hanquez
@ 2013-05-31  7:50     ` Ian Campbell
  2013-05-31  8:56       ` Vincent Hanquez
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2013-05-31  7:50 UTC (permalink / raw)
  To: Vincent Hanquez; +Cc: Ross Philipson, xen-devel

On Fri, 2013-05-31 at 08:36 +0100, Vincent Hanquez wrote:
> On 05/30/2013 12:57 PM, Ian Campbell wrote:
> > On Tue, 2013-05-28 at 15:43 -0400, Ross Philipson wrote:
> >> [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain
> >> [PATCH (V9) 2/2] xen: Add V4V implementation
> > What is happening on the tools/client side?
> I'm not sure what kind of tool you want to see here; the only thing i 
> can think of is a way to manipulate the v4v firewall in dom0.
> 
>  From a client side PoV, this is up to the guest kernel to expose the 
> capability.

Or a userspace library, as I clearly said in my original mail. That
means libvchan.

libvchan is the current interface for guest to guest communication in
upstream Xen and I am not in favour of just throwing it away because
something new has come along. IOW I think v4v should be exposed as an
incremental improvement to the existing interfaces. (Whether it is also
exposed via other means such as kernel plugins is a different question).

> For example on linux, the idea is to exposes it is using a new socket 
> family, so that
> everything is handled by usual syscalls (socket(2), recvmsg(3), 
> sendmsg(3), connect(2), listen(2), ioctl(2), ...), and the userspace 
> doens't need any library,
> and specially just a trivial change to already written code that use ip 
> socket without changing any semantics.

So where is this code? Has it been posted upstream? Is anyone working on
getting it in a state such that it acceptable upstream? Would it even be
acceptable to upstream?

So far I haven't seen any realistic attempt at actually making this v4v
hypervisor patch useful to general members of the Xen community.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  7:50     ` Ian Campbell
@ 2013-05-31  8:56       ` Vincent Hanquez
  2013-05-31  9:01         ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Hanquez @ 2013-05-31  8:56 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ross Philipson, xen-devel

On 05/31/2013 08:50 AM, Ian Campbell wrote:
> So where is this code? Has it been posted upstream? Is anyone working on
> getting it in a state such that it acceptable upstream? Would it even be
> acceptable to upstream?
yes, we're in the process to do so.

we're currently re-writing our current driver to use vmware's vsocket, 
which is upstream and also has been split up into generic hypervisor 
agnostic part (net/vmw_vsock/af_vsock.c) for other hypervisor to use.

As such, it's quite likely it will be acceptable upstream, as we would 
fill the xen part for this new linux "standard". This of course would 
depends on the xen part being in place too.

-- 
Vincent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  8:56       ` Vincent Hanquez
@ 2013-05-31  9:01         ` Ian Campbell
  2013-05-31  9:26           ` Vincent Hanquez
  0 siblings, 1 reply; 24+ messages in thread
From: Ian Campbell @ 2013-05-31  9:01 UTC (permalink / raw)
  To: Vincent Hanquez; +Cc: Ross Philipson, xen-devel

On Fri, 2013-05-31 at 09:56 +0100, Vincent Hanquez wrote:
> On 05/31/2013 08:50 AM, Ian Campbell wrote:
> > So where is this code? Has it been posted upstream? Is anyone working on
> > getting it in a state such that it acceptable upstream? Would it even be
> > acceptable to upstream?
> yes, we're in the process to do so.
> 
> we're currently re-writing our current driver to use vmware's vsocket, 
> which is upstream and also has been split up into generic hypervisor 
> agnostic part (net/vmw_vsock/af_vsock.c) for other hypervisor to use.
> 
> As such, it's quite likely it will be acceptable upstream, as we would 
> fill the xen part for this new linux "standard". This of course would 
> depends on the xen part being in place too.

I think we need to see and agree on both the client and hypervisor side
before either can be committed, since review on either could cause ABI
changes.

What is your upgrade plan for libvchan users? Is it to add vsock support
to vchan?

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  9:01         ` Ian Campbell
@ 2013-05-31  9:26           ` Vincent Hanquez
  2013-05-31 16:29             ` Ross Philipson
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Hanquez @ 2013-05-31  9:26 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ross Philipson, xen-devel

On 05/31/2013 10:01 AM, Ian Campbell wrote:
> I think we need to see and agree on both the client and hypervisor side
> before either can be committed, since review on either could cause ABI
> changes.
Yes, of course. Once some more code is ready, but as of now, we still 
need review and comments on
the hypervisor side to make sure the hypervisor part itself is ready 
(soundness,design,security,etc).

I want to note however that the v4v capability provided is modeled like 
a simplified "network" stack,
so that the design is suppose to stand by itself without any client code.

> What is your upgrade plan for libvchan users? Is it to add vsock support
> to vchan?
I'm not aware of such a plan, nor i'm aware of libvchan or what it 
does/doesn't provides in the first place.
If that make sense, then a plan could be devised.

-- 
Vincent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  7:25     ` Vincent Hanquez
@ 2013-05-31 10:21       ` David Vrabel
  0 siblings, 0 replies; 24+ messages in thread
From: David Vrabel @ 2013-05-31 10:21 UTC (permalink / raw)
  To: Vincent Hanquez; +Cc: Ian Campbell, Ross Philipson, xen-devel

On 31/05/13 08:25, Vincent Hanquez wrote:
> On 05/30/2013 05:08 PM, David Vrabel wrote:
>> On 30/05/13 13:07, Ian Campbell wrote:
>>> No patch to docs/... at all? The hypercall interface docs have improved
>>> (although they still aren't great IMHO) but what's really needed is an
>>> overview of the design and a "how do I actually use this" type thing.
>> I agree.  I'm looking at inter-domain communication mechanisms for use
>> in XenServer and it's not obvious how to use v4v securely.
>>
>> e.g., when a previously trusted domain (A) is compromised it may spam a
>> domain (B) with messages in a DoS attack.  The per source domain/port
>> receive rings help here as the domain A will not be able to block B from
>> receiving traffic from other domains.
> It's really up to the guest to take active measure to prevent this to
> happens.
> B have multiple ways to handle this scenario:
> 
> * unregister his ring: A can't communicate with B anymore
> * throttle his ring processing: if B doens't process his ring,
> eventually the ring is full
> and A can't send any more spam.

These require the use of per-sender rings.

> * use stream message type, which has the same semantic to tcp
> (LISTENING/CONNECTING/CONNECTED/..), where a stream need to be connected
> before data is processed.

You would still need to handle connection request spam.

> There's also the v4v firewall where connection can be blocked.
> I'm not sure at the moment that a guest can set anything in it itself,
> but if not
> i think it would be a good idea for a guest to proactively set blocking
> rules for
> ring it owns.

At the moment it looks like only privileged guest can add/modify
v4vtable rules.

>> But how are these per-connection rings created?  This seems to require
>> out-of-band signaling for connection setup.  I suppose this could be via
>> v4v and a connection manager service running in a known and trusted
>> domain. But how does a domain find the connection manager service and
>> how does it handle the connection management domain being restarted?
> Rings are created by a guest listening to v4v.

A listener doesn't know in advance which domains might attempt to
connect so it must necessarily create a ring that any domain can put
messages on.

One solution would be to have the per-ring v4vtable rule chains that the
ring owner can modify. Or some mechanism by which a ring owner can pause
a sender and prevent it temporarily (or permenantly) from placing any
messages on the ring.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31  9:26           ` Vincent Hanquez
@ 2013-05-31 16:29             ` Ross Philipson
  2013-05-31 16:38               ` Ian Campbell
  0 siblings, 1 reply; 24+ messages in thread
From: Ross Philipson @ 2013-05-31 16:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Vincent Hanquez, Ian Campbell

On 05/31/2013 05:26 AM, Vincent Hanquez wrote:
> On 05/31/2013 10:01 AM, Ian Campbell wrote:
>> I think we need to see and agree on both the client and hypervisor side
>> before either can be committed, since review on either could cause ABI
>> changes.
> Yes, of course. Once some more code is ready, but as of now, we still
> need review and comments on
> the hypervisor side to make sure the hypervisor part itself is ready
> (soundness,design,security,etc).
>
> I want to note however that the v4v capability provided is modeled like
> a simplified "network" stack,
> so that the design is suppose to stand by itself without any client code.
>
>> What is your upgrade plan for libvchan users? Is it to add vsock support
>> to vchan?
> I'm not aware of such a plan, nor i'm aware of libvchan or what it
> does/doesn't provides in the first place.
> If that make sense, then a plan could be devised.
>

I am not sure where this requirement to support the libvchan interface 
or library (?) came from. The v4v client code in both Linux and Windows 
avoids having libraries that you have to use. On the Linux side, as 
Vincent pointed out, we implement a socket family and you use it as you 
would any other socket. On the Windows side, aside from a small header 
file that helps you send the right IOCTLs to initialize things, you use 
Windows APIs like ReadFile/WriteFile to interact with v4v.

Of course I know this would all be more obvious if we had released the 
"client" or guest bits along with the xen part. We are trying to do that 
now. We have someone (Vincent) working on the driver that will be 
upstreamed to Linux. We can drop a pre-release of that when it is 
functional and ready for posting. That work is happening now.

There is a plan to open source all of our Citrix PV Windows drivers. I 
do not have good details on that plan or its schedule but the Windows 
v4v driver would be part of that overall package. Again, we can drop a 
pre-release version of the Windows v4v driver ahead of the other work.

I had intended to send an email like this but responses came in quite 
rapidly to the v4v posting.

Thanks
Ross

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 0/2] Add V4V to Xen
  2013-05-31 16:29             ` Ross Philipson
@ 2013-05-31 16:38               ` Ian Campbell
  0 siblings, 0 replies; 24+ messages in thread
From: Ian Campbell @ 2013-05-31 16:38 UTC (permalink / raw)
  To: Ross Philipson; +Cc: Vincent Hanquez, xen-devel

On Fri, 2013-05-31 at 12:29 -0400, Ross Philipson wrote:
> On 05/31/2013 05:26 AM, Vincent Hanquez wrote:
> > On 05/31/2013 10:01 AM, Ian Campbell wrote:
> >> I think we need to see and agree on both the client and hypervisor side
> >> before either can be committed, since review on either could cause ABI
> >> changes.
> > Yes, of course. Once some more code is ready, but as of now, we still
> > need review and comments on
> > the hypervisor side to make sure the hypervisor part itself is ready
> > (soundness,design,security,etc).
> >
> > I want to note however that the v4v capability provided is modeled like
> > a simplified "network" stack,
> > so that the design is suppose to stand by itself without any client code.
> >
> >> What is your upgrade plan for libvchan users? Is it to add vsock support
> >> to vchan?
> > I'm not aware of such a plan, nor i'm aware of libvchan or what it
> > does/doesn't provides in the first place.
> > If that make sense, then a plan could be devised.
> >
> 
> I am not sure where this requirement to support the libvchan interface 
> or library (?) came from.

libvchan is the existing upstream interface for guest to guest
communication. It has been part of upstream Xen for at least one if not
two releases and has existing users.

I don't really think we want to have two totally separate such
interfaces, and neither do I think it is right to simply throw away an
existing interface just because something new and supposedly more shiny
has come along.

IMHO libvchan needs to support v4v, either directly or via vsock and
fallback to its existing mechanisms if v4v isn't available.

I believe I've said this in response to various v4v postings in the
past.

Ian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-30 16:20   ` Tim Deegan
@ 2013-06-04 18:01     ` Ross Philipson
  0 siblings, 0 replies; 24+ messages in thread
From: Ross Philipson @ 2013-06-04 18:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan

On 05/30/2013 12:20 PM, Tim Deegan wrote:
> Hi,
>
> Nice to see this series again. :)  Thanks for your work cleaning it up.

Thanks, and thanks for your feedback.

>
> At 15:43 -0400 on 28 May (1369755811), Ross Philipson wrote:
>> Setup of v4v domains a domain gets created and cleanup
>> when a domain die. Wire up the v4v hypercall.
>>
>> Include v4v internal and public headers.
>>
>> Signed-off-by: Ross Philipson<ross.philipson@citrix.com>
>
> I'm a bit worried about resource exhaustion attacks in this interface.
> In particular:
>   - It looks like a malicious guest can register large numbers of rings
>     (like, 2^48 of them), each of which causes Xen to xmalloc bookkeeping
>     space (up to 2^22 bytes per ring).
>   - A malicious guest that has enough memory can send a very large vector
>     (2^31 1-byte entries), which would cause Xen to read and check 2^37
>     bytes of control structures before even starting the copy of the data.
>     At a rough guess that could be>10seconds without checking for preemption.
>     Maybe reduce the limits a bit (say, 32MB per hypercall, may 100 elements
>     per vector)?
>   - Something similar for V4VOP_notify, I guess, though that could perhaps
>     be handled by the usual pending-events check + hypercall continuations.

Yes we agree. I think the plan is to have some reasonable default limits 
and allow the values to be adjusted during domain creation if that is 
needed. This would be done through a new V4V op for the hypercall.

>
> And now, some detailed review:
>
>> +/*
>> + * Messages on the ring are padded to 128 bits
>> + * Len here refers to the exact length of the data not including the
>> + * 128 bit header. The message uses
>> + * ((len +0xf)&  ~0xf) + sizeof(v4v_ring_message_header) bytes
>> + */
>> +#define V4V_ROUNDUP(a) (((a) +0xf )&  ~0xf)
>
> Even though this is always used on values<2^32 and so is safe,
> it would be nice to have it as (((a) + 0xf)&  ~(typeof (a))0xf)
> so it's clear that it doesn't truncate any high-order bits.

Ack

>
> [...]
>> +/*
>> + * lock to protect the filtering rules list: v4vtable_rules
>> + *
>> + * The write lock is held for viptables_del and viptables_add
>> + * The read lock is held for viptable_list
>
> s/viptables/v4vtables/g/

For all the minor ones I don't address directly, just assume I will fix 
it per your suggestions.

>
> [...]
>> +static int
>> +v4v_ringbuf_get_rx_ptr(struct domain *d, struct v4v_ring_info *ring_info,
>> +                        uint32_t * rx_ptr)
>> +{
>> +    v4v_ring_t *ringp;
>> +
>> +    if ( ring_info->npage == 0 )
>> +        return -1;
>> +
>> +    ringp = map_domain_page(mfn_x(ring_info->mfns[0]));
>> +
>> +    v4v_dprintk("v4v_ringbuf_payload_space: mapped %p to %p\n",
>> +                (void *)mfn_x(ring_info->mfns[0]), ringp);
>
> The function name is wrong there, presumably after this code was
> refctored.  You might add %s/__func__ to the definition of v4v_dprintk()
> instead?
>
>> +    if ( !ringp )
>> +        return -1;
>> +
>> +    write_atomic(rx_ptr, ringp->rx_ptr);
>
> AIUI ringp->rx_ptr is the shared variable here, so this should be
>    *rx_ptr = read_atomic(ringp->rx_ptr);
>
> Also, maybe comment in the hypercall interface that the guest should
> also use atomic operations when accessing rx_ptr and tx_ptr.  (Though I
> should say the hypercall comments are already pretty good compared to
> some things we already have!)

What you say sounds right. I will address this.

>
>> +    mb();
>
> I'm not 100% convinced this barrier is needed (since we're not going to
> read the 'free' memory in the ring and I don't think that writes to it
> can possibly be hoisted past this read since they're gated on the
> result).  OTOH I'm happy to err on the side of caution. :)

This could be cargo. There have been quite a number of modifications to 
this code since its birth in our product 3+ years ago. But if I cannot 
be sure, I will probably leave it be.

>
>> +    unmap_domain_page(ringp);
>> +    return 0;
>> +}
>> +
>> +uint32_t
>> +v4v_ringbuf_payload_space(struct domain * d, struct v4v_ring_info * ring_info)
>> +{
>> +    v4v_ring_t ring;
>> +    int32_t ret;
>> +
>> +    ring.tx_ptr = ring_info->tx_ptr;
>> +    ring.len = ring_info->len;
>> +
>> +    if ( v4v_ringbuf_get_rx_ptr(d, ring_info,&ring.rx_ptr) )
>> +        return 0;
>> +
>> +    v4v_dprintk("v4v_ringbuf_payload_space:tx_ptr=%d rx_ptr=%d\n",
>> +                (int)ring.tx_ptr, (int)ring.rx_ptr);
>> +    if ( ring.rx_ptr == ring.tx_ptr )
>> +        return ring.len - sizeof(struct v4v_ring_message_header);
>> +
>> +    ret = ring.rx_ptr - ring.tx_ptr;
>> +    if ( ret<  0 )
>> +        ret += ring.len;
>> +
>> +    ret -= sizeof(struct v4v_ring_message_header);
>> +    ret -= V4V_ROUNDUP(1);
>> +
>> +    return (ret<  0) ? 0 : ret;
>
> Sanity check for ret>  ring.len as well?  I guess it's not clear what
> you'd do about that.
>
> [...]
>> +static long
>> +v4v_ringbuf_insertv(struct domain *d,
>> +                    struct v4v_ring_info *ring_info,
>> +                    v4v_ring_id_t *src_id, uint32_t message_type,
>> +                    XEN_GUEST_HANDLE_PARAM(v4v_iov_t) iovs,
>> +                    uint32_t niov, size_t len)
>> +{
>> +    v4v_ring_t ring;
>> +    struct v4v_ring_message_header mh = { 0 };
>> +    int32_t sp;
>> +    long happy_ret;
>> +    int32_t ret = 0;
>> +    XEN_GUEST_HANDLE(uint8_t) empty_hnd = { 0 };
>> +
>> +    ASSERT(spin_is_locked(&ring_info->lock));
>> +
>> +    happy_ret = len;
>> +
>> +    if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header) )>=
>> +            ring_info->len)
>> +        return -EMSGSIZE;
>> +
>> +    do {
>> +        if ( (ret = v4v_memcpy_from_guest_ring(&ring, ring_info, 0,
>> +                                               sizeof(ring))) )
>> +            break;
>
> Should this be using an atomic operation to read ring.rx_ptr instead?
> That's the only field we actually use from the guest copy anyway.

Probably. I will look at this as well as other uses of the t/rx_ptr 
usage mentioned above.

>
>> +
>> +        ring.tx_ptr = ring_info->tx_ptr;
>> +        ring.len = ring_info->len;
>> +
>> +        v4v_dprintk("ring.tx_ptr=%d ring.rx_ptr=%d ring.len=%d ring_info->tx_ptr=%d\n",
>> +                    ring.tx_ptr, ring.rx_ptr, ring.len, ring_info->tx_ptr);
>> +
>> +        if ( ring.rx_ptr == ring.tx_ptr )
>> +            sp = ring_info->len;
>> +        else
>> +        {
>> +            sp = ring.rx_ptr - ring.tx_ptr;
>> +            if ( sp<  0 )
>> +                sp += ring.len;
>> +        }
>> +
>> +        if ( (V4V_ROUNDUP(len) + sizeof(struct v4v_ring_message_header))>= sp )
>> +        {
>> +            v4v_dprintk("EAGAIN\n");
>> +            ret = -EAGAIN;
>> +            break;
>> +        }
>> +
>> +        mh.len = len + sizeof (struct v4v_ring_message_header);
>> +        mh.source.port = src_id->addr.port;
>> +        mh.source.domain = src_id->addr.domain;
>> +        mh.message_type = message_type;
>> +
>> +        if ( (ret = v4v_memcpy_to_guest_ring(ring_info,
>> +                                             ring.tx_ptr + sizeof(v4v_ring_t),
>> +&mh, empty_hnd,
>> +                                             sizeof(mh))) )
>> +            break;
>
> OK, I think this is correct, because the tx_ptr is always 16-byte
> aligned and the message header is 16 bytes long, but it's not
> obvious. :) Maybe add a comment here, and a BUILD_BUG_ON() to assert
> that sizeof (struct v4v_ring_message_header)<= V4V_ROUNDUP(1)?

Ack

>
>> +        ring.tx_ptr += sizeof(mh);
>> +        if ( ring.tx_ptr == ring_info->len )
>> +            ring.tx_ptr = 0;
>> +
>> +        while ( niov-- )
>> +        {
>> +            XEN_GUEST_HANDLE_PARAM(uint8_t) bufp_hnd;
>> +            XEN_GUEST_HANDLE(uint8_t) buf_hnd;
>> +            v4v_iov_t iov;
>> +
>> +            if ( copy_from_guest(&iov, iovs, 1) )
>
> Hrmn.  So here we re-copy the iov from the guest, but the guest might
> have changed it since we looked at it last (when we calculated the total
> length).  That means we need to keep a running count of how much we copy,
> to be sure the guest doesn't swap short iovs for long ones and make
> us overrun the ring's rx_ptr.

Yea I was starting to have concerns about just this right after I posted 
patch set 9. I really don't like that we read it twice. Your suggestion 
sounds reasonable.

>
>> +            {
>> +                ret = -EFAULT;
>> +                break;
>> +            }
>> +
>> +            bufp_hnd = guest_handle_from_ptr(iov.iov_base, uint8_t);
>> +            buf_hnd = guest_handle_from_param(bufp_hnd, uint8_t);
>> +            len = iov.iov_len;
>> +
>> +            if ( unlikely(!guest_handle_okay(buf_hnd, len)) )
>> +            {
>> +                ret = -EFAULT;
>> +                break;
>> +            }
>> +
>> +            sp = ring.len - ring.tx_ptr;
>> +
>> +            if ( len>  sp )
>> +            {
>> +                ret = v4v_memcpy_to_guest_ring(ring_info,
>> +                        ring.tx_ptr + sizeof(v4v_ring_t),
>> +                        NULL, buf_hnd, sp);
>> +                if ( ret )
>> +                    break;
>> +
>> +                ring.tx_ptr = 0;
>> +                len -= sp;
>> +                guest_handle_add_offset(buf_hnd, sp);
>> +            }
>> +
>> +            ret = v4v_memcpy_to_guest_ring(ring_info,
>> +                    ring.tx_ptr + sizeof(v4v_ring_t),
>> +                    NULL, buf_hnd, len);
>> +            if ( ret )
>> +                break;
>> +
>> +            ring.tx_ptr += len;
>> +
>> +            if ( ring.tx_ptr == ring_info->len )
>> +                ring.tx_ptr = 0;
>> +
>> +            guest_handle_add_offset(iovs, 1);
>> +        }
>> +        if ( ret )
>> +            break;
>> +
>> +        ring.tx_ptr = V4V_ROUNDUP(ring.tx_ptr);
>> +
>> +        if ( ring.tx_ptr>= ring_info->len )
>> +            ring.tx_ptr -= ring_info->len;
>> +
>> +        mb();
>> +        ring_info->tx_ptr = ring.tx_ptr;
>> +        if ( (ret = v4v_update_tx_ptr(ring_info, ring.tx_ptr)) )
>> +            break;
>> +    } while ( 0 );
>> +
>> +    v4v_ring_unmap(ring_info);
>> +
>> +    return ret ? ret : happy_ret;
>> +}
> [...]
>> +static int
>> +v4v_pending_requeue(struct v4v_ring_info *ring_info, domid_t src_id, int len)
>> +{
>> +    struct hlist_node *node;
>> +    struct v4v_pending_ent *ent;
>> +
>> +    hlist_for_each_entry(ent, node,&ring_info->pending, node)
>> +    {
>> +        if ( ent->id == src_id )
>> +        {
>> +            if ( ent->len<  len )
>> +                ent->len = len;
>
> Does this mean that if I call V4VOP_notify with some large space
> requirement, and then again with a smaller one, I won't get a
> notification until the original large amount is free?  Seems like this
> should always overwrite the old length instead, and notify on the most
> recently requested amount.

It has been a while since I have looked closely at the queuing code so I 
will have to study this some. I believe it does the right thing but I 
will check.

>
>> +static int
>> +v4v_fill_ring_data(struct domain *src_d,
>> +                   XEN_GUEST_HANDLE(v4v_ring_data_ent_t) data_ent_hnd)
>> +{
>> +    v4v_ring_data_ent_t ent;
>> +    struct domain *dst_d;
>> +    struct v4v_ring_info *ring_info;
>> +
>> +    if ( copy_from_guest(&ent, data_ent_hnd, 1) )
>> +    {
>> +        v4v_dprintk("EFAULT\n");
>> +        return -EFAULT;
>> +    }
>> +
>> +    v4v_dprintk("v4v_fill_ring_data: ent.ring.domain=%d,ent.ring.port=%u\n",
>> +                (int)ent.ring.domain, (int)ent.ring.port);
>> +
>> +    ent.flags = 0;
>> +
>> +    dst_d = get_domain_by_id(ent.ring.domain);
>> +
>> +    if ( dst_d&&  dst_d->v4v )
>> +    {
>> +        read_lock(&dst_d->v4v->lock);
>> +        ring_info = v4v_ring_find_info_by_addr(dst_d,&ent.ring,
>> +                                               src_d->domain_id);
>> +
>
> Should there be a call to v4vtables_check() here?  Otherwise we leak a
> little bit of information to a guest who's not supposed to see the ring,
> (and also it's a bit odd to say 'yes, you may send' and then bounce the
> sendv call).

Yea I think so. Concerning all the v4vtables issues, we are now looking 
at reworking a lot of this. We just started discussing options so I 
don't have a lot to reply with yet.

>
> [...]
>> +static void v4v_ring_remove_mfns(struct domain *d, struct v4v_ring_info *ring_info)
>> +{
>> +    int i;
>> +
>> +    ASSERT(rw_is_write_locked(&d->v4v->lock));
>> +
>> +    if ( ring_info->mfns )
>> +    {
>> +        for ( i = 0; i<  ring_info->npage; ++i )
>> +            if ( mfn_x(ring_info->mfns[i]) != 0 )
>
> Here, and where the array is populated, it would be better to use
> INVALID_MFN in any empty slots.  On x86, MFN 0 is never allocated to a
> guest, but that may not always be true on all architectures.

Ack.

>
> [...]
>> +static long
>> +v4v_ring_add(struct domain *d, XEN_GUEST_HANDLE_PARAM(v4v_ring_t) ring_hnd,
>> +             uint32_t npage, XEN_GUEST_HANDLE_PARAM(v4v_pfn_t) pfn_hnd)
>> +{
> [...]
>> +        /*
>> +         * no need for a lock yet, because only we know about this
>> +         * set the tx pointer if it looks bogus (we don't reset it
>> +         * because this might be a re-register after S4)
>> +         */
>> +        if ( (ring.tx_ptr>= ring.len)
>> +                || (V4V_ROUNDUP(ring.tx_ptr) != ring.tx_ptr) )
>> +        {
>> +            ring.tx_ptr = ring.rx_ptr;
>> +        }
>
> Maybe check here that the guest-supplied rx_ptr is correctly aligned?
> I don't think an unaligned one would be a security risk, but it might
> cause entertaining breakage on the first sendv().

Ack.

>
>> +        copy_field_to_guest(ring_hnd,&ring, tx_ptr);
>> +
>> +        read_lock(&d->v4v->lock);
>> +        ring_info = v4v_ring_find_info(d,&ring.id);
>> +
>> +        if ( !ring_info )
>> +        {
>> +            read_unlock(&d->v4v->lock);
>> +            ring_info = xmalloc(struct v4v_ring_info);
>> +            if ( !ring_info )
>> +            {
>> +                v4v_dprintk("ENOMEM\n");
>> +                ret = -ENOMEM;
>> +                break;
>> +            }
>> +            need_to_insert++;
>> +            spin_lock_init(&ring_info->lock);
>> +            INIT_HLIST_HEAD(&ring_info->pending);
>> +            ring_info->mfns = NULL;
>> +        }
>> +        else
>> +        {
>> +            /*
>> +             * Ring info already existed.
>> +             */
>> +            printk(KERN_INFO "v4v: dom%d ring already registered\n",
>> +                    current->domain->domain_id);
>> +            ret = -EEXIST;
>> +            break;
>> +        }
>
> If you reshuffle these so that the already-exists case is first, then
> the not-already-exists case doesn't need to be indented, i.e.
>
>      if ( ring_info )
>      {
>          printk();
>          ret = -EEXIST;
>          break;
>      }
>      /* Normal case goes here */
>
> Then I think it becomes clear that you don't need a 'need_to_insert'
> flag -- and that the read_unlock() below is dead code, so the error path
> doesn't unlock at all!
>
>> +
>> +        spin_lock(&ring_info->lock);
>> +        ring_info->id = ring.id;
>> +        ring_info->len = ring.len;
>> +        ring_info->tx_ptr = ring.tx_ptr;
>> +        ring_info->ring = ring_hnd;
>
> I can't see where this field is used, but AFAICT the ring_hnd isn't
> guaranteed to be valid after this hypercall finishes (and certainly not
> after, say, the next context switch), so storing it doesn't seem useful.

Yes I don't understand that one either. It will go away.

>
>> +        if ( ring_info->mfns )
>> +            xfree(ring_info->mfns);
>> +        ret = v4v_find_ring_mfns(d, ring_info, npage, pfn_hnd);
>> +        spin_unlock(&ring_info->lock);
>> +        if ( ret )
>> +            break;
>> +
>> +        if ( !need_to_insert )
>> +        {
>> +            read_unlock(&d->v4v->lock);
>> +        }
>> +        else
>> +        {
>> +            uint16_t hash = v4v_hash_fn(&ring.id);
>> +
>> +            write_lock(&d->v4v->lock);
>> +            hlist_add_head(&ring_info->node,&d->v4v->ring_hash[hash]);
>> +            write_unlock(&d->v4v->lock);
>> +        }
>> +    } while ( 0 );
>> +
>> +    read_unlock(&v4v_lock);
>> +    return ret;
>> +}
> [...]
>> +void
>> +v4vtables_print_rule(struct v4vtables_rule_node *node)
>> +{
> [...]
>> +    if ( rule->accept == 1 )
>> +        printk("ACCEPT");
>
> The logic in the actual ACL is (effectively) 'if ( rule->accept != 0 )';
> so this probably ought to follow suit. :)

Ack

>
> [...]
>> +int
>> +v4vtables_del(struct domain *src_d,
>> +              XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd,
>> +              int32_t position)
>> +{
>> +    struct list_head *tmp = NULL;
>> +    struct list_head *to_delete = NULL;
>> +    struct list_head *next = NULL;
>> +    struct v4vtables_rule_node *node;
>> +
>> +    ASSERT(rw_is_write_locked(&v4vtables_rules_lock));
>> +
>> +    v4v_dprintk("v4vtables_del position:%d\n", position);
>> +
>> +    if ( position != -1 )
>> +    {
>> +        /* We want to delete the rule number<position>  */
>> +        list_for_each(tmp,&v4vtables_rules)
>> +        {
>> +            to_delete = tmp;
>> +            if (position == 0)
>> +                break;
>> +            position--;
>> +        }
>
> Hmmm.  If this is called with position = 1 with a multi-element list, it
> selects the second element (which conflicts with the semantics of the
> add operation, where '1' is the first element).  And if it's called with
> '1' on a single-element list, it selects the first (and only) element.

Needs fixing along with other problems in the v4vtables.

>
>> +        /* Can't find the position */
>> +        if (position != 0)
>> +            to_delete = NULL;
>> +    }
>> +    else if ( !guest_handle_is_null(rule_hnd) )
>> +    {
>> +        struct v4vtables_rule r;
>> +
>> +        if ( copy_from_guest(&r, rule_hnd, 1) )
>> +            return -EFAULT;
>> +
>> +        list_for_each(tmp,&v4vtables_rules)
>> +        {
>> +            node = list_entry(tmp, struct v4vtables_rule_node, list);
>> +
>> +            if ( (node->rule.src.domain == r.src.domain)&&
>> +                 (node->rule.src.port   == r.src.port)&&
>> +                 (node->rule.dst.domain == r.dst.domain)&&
>> +                 (node->rule.dst.port   == r.dst.port))
>
> Not matching by 'accept' too?  That should be mentioned in the hypercall
> interface, I think.
>
>> +            {
>> +                to_delete = tmp;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    else
>> +    {
>> +        /* We want to flush the rules! */
>> +        printk(KERN_ERR "VIPTables: flushing rules\n");
>> +        list_for_each_safe(tmp, next,&v4vtables_rules)
>> +        {
>> +            node = list_entry(tmp, struct v4vtables_rule_node, list);
>> +            list_del(tmp);
>> +            xfree(node);
>> +        }
>> +    }
>> +
>> +    if ( to_delete )
>> +    {
>> +        node = list_entry(to_delete, struct v4vtables_rule_node, list);
>> +#ifdef V4V_DEBUG
>> +        printk(KERN_ERR "VIPTables: deleting rule: ");
>> +        v4vtables_print_rule(node);
>> +#endif /* V4V_DEBUG */
>> +        list_del(to_delete);
>> +        xfree(node);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static size_t
>> +v4vtables_list(struct domain *src_d,
>> +               XEN_GUEST_HANDLE_PARAM(v4vtables_list_t) list_hnd)
>> +{
>> +    struct list_head *ptr;
>> +    struct v4vtables_rule_node *node;
>> +    struct v4vtables_list rules_list;
>> +    uint32_t nbrules;
>> +    XEN_GUEST_HANDLE(v4vtables_rule_t) guest_rules;
>> +
>> +    ASSERT(rw_is_locked(&v4vtables_rules_lock));
>> +
>> +    memset(&rules_list, 0, sizeof (rules_list));
>> +    if ( copy_from_guest(&rules_list, list_hnd, 1) )
>> +        return -EFAULT;
>> +
>> +    ptr = v4vtables_rules.next;
>> +    while ( rules_list.start_rule != 0&&  ptr->next !=&v4vtables_rules )
>> +    {
>> +        ptr = ptr->next;
>> +        rules_list.start_rule--;
>> +    }
>
> Again, that seems to conflict with the 'add' semantics that the first
> rule is rule 1.
>
> And what if the supplied start_rule was too large?  As it stands I think
> this will return the last rule in the list rather than returning no
> rules (or an error).
>
>> +
>> +    if ( rules_list.nb_rules == 0 )
>> +        return -EINVAL;
>> +
>> +    guest_rules = guest_handle_for_field(list_hnd, v4vtables_rule_t, rules[0]);
>> +
>> +    nbrules = 0;
>> +    while ( nbrules<  rules_list.nb_rules&&  ptr !=&v4vtables_rules )
>> +    {
>> +        node = list_entry(ptr, struct v4vtables_rule_node, list);
>> +
>> +        if ( copy_to_guest(guest_rules,&node->rule, 1) )
>> +            break;
>
> Maybe also arrange to return -EFAULT?
>
>> +
>> +        guest_handle_add_offset(guest_rules, 1);
>> +
>> +        nbrules++;
>> +        ptr = ptr->next;
>> +    }
>> +
>> +    rules_list.nb_rules = nbrules;
>> +    if ( copy_field_to_guest(list_hnd,&rules_list, nb_rules) )
>> +        return -EFAULT;
>> +
>> +    return 0;
>> +}
>> +
>> +static size_t
>> +v4vtables_check(v4v_addr_t *src, v4v_addr_t *dst)
>> +{
>> +    struct list_head *ptr;
>> +    struct v4vtables_rule_node *node;
>> +    size_t ret = 0; /* Defaulting to ACCEPT */
>
> size_t?  bool_t would be better.  Also, having this function return the
> inverse of the rule encoding is just perverse.
>
> [...]
>> +    case V4VOP_tables_add:
>> +    {
>> +        uint32_t position = arg3;
>> +
>> +        XEN_GUEST_HANDLE_PARAM(v4vtables_rule_t) rule_hnd =
>> +               guest_handle_cast(arg1, v4vtables_rule_t);
>> +        rc = -EPERM;
>> +        if ( !is_hardware_domain(d) )
>> +            goto out;
>
> I think this ought to be an XSM check now, with XSM_PRIV to indicate
> that the default policy is to check IS_PRIV().  (Likewise for the
> other table ops.)

Ok. Again, we have v4vtables high on our list of needs overhauling.

>
> [...]
>> +struct keyhandler v4v_info_keyhandler =
>> +{
>> +    .diagnostic = 1,
>> +    .u.fn = dump_state,
>> +    .desc = "dump v4v states and interupt"
>
> s/interupt/interrupt/
>
> [...]
>> +/*
>> + * V4VOP_register_ring
>> + *
>> + * Registers a ring with Xen. If a ring with the same v4v_ring_id exists,
>> + * the hypercall will return -EEXIST.
>> + *
>> + * do_v4v_op(V4VOP_register_ring,
>> + *           XEN_GUEST_HANDLE(v4v_ring_t), XEN_GUEST_HANDLE(v4v_pfn_t),
>> + *           npage, 0)
>
> Maybe add a 'uint32_t ' before npage, to match the same style below?
>
> [...]
>> +/*
>> + * V4VOP_sendv
>> + *
>> + * Sends of list of buffer contained in iov.
>> + *
>> + * For each iov entry send iov_len bytes of iov_base to addr.dst, giving
>> + * src as the source address (xen will ignore src->domain and put your
>> + * domain in the actually message), xen first looks for a ring with id.addr==dst
>> + * and id.partner==sending_domain if that fails it looks for id.addr==dst and
>
> These lines are longer than 80 chars - can you wrap a little narrower please?
> Also s/actually/actual/.
>
>> + * id.partner==DOMID_ANY.
>> + * message_type is the 32 bit number used from the message
>> + * most likely V4V_MESSAGE_DGRAM or V4V_MESSAGE_STREAM. If insufficient space exists
>> + * it will return -EAGAIN and xen will twing the V4V_INTERRUPT when
>
> 'twing'? :)

Not my "verb" :)

>
> [...]
>> +/*
>> + * V4VOP_tables_add
>> + *
>> + * Insert a filtering rules after a given position.
>> + *
>
> Can you add a description of the semantics of the rules?
> E.g. does the first match or the last match or the tightest match dominate?
>
>> + * do_v4v_op(V4VOP_tables_add,
>> + *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
>> + *           NULL,
>> + *           uint32_t position, 0)
>> + */
>> +#define V4VOP_tables_add        7
>> +
>> +/*
>> + * V4VOP_tables_del
>> + *
>> + * Delete a filtering rules at a position or the rule
>> + * that matches "rule".
>> + *
>> + * do_v4v_op(V4VOP_tables_del,
>> + *           XEN_GUEST_HANDLE(v4vtables_rule_t) rule,
>> + *           NULL,
>> + *           uint32_t position, 0)
>> + */
>> +#define V4VOP_tables_del       8
>
> Can you describe the calling convention a bit more fully?  In particular
> that position should == -1 if you want to match by rule (though I think
> == 0 would make more sense), and that position == -1&&  rule == NULL
> means to flush everything.

Yes, I will take all your v4vtables concerns into account when we work 
on it.

Again, thanks a bunch.
Ross

>
> Cheers,
>
> Tim.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH (V9) 2/2] xen: Add V4V implementation
  2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
                     ` (3 preceding siblings ...)
  2013-05-30 16:20   ` Tim Deegan
@ 2013-06-10 15:06   ` David Vrabel
  4 siblings, 0 replies; 24+ messages in thread
From: David Vrabel @ 2013-06-10 15:06 UTC (permalink / raw)
  To: Ross Philipson; +Cc: xen-devel

On 28/05/13 20:43, Ross Philipson wrote:
> Setup of v4v domains a domain gets created and cleanup
> when a domain die. Wire up the v4v hypercall.
[...]
> +static void
> +v4v_pending_notify(struct domain *caller_d, struct hlist_head *to_notify)
> +{
> +    struct hlist_node *node, *next;
> +    struct v4v_pending_ent *pending_ent;
> +
> +    ASSERT(rw_is_locked(&v4v_lock));
> +
> +    hlist_for_each_entry_safe(pending_ent, node, next, to_notify, node)
> +    {
> +        hlist_del(&pending_ent->node);
> +        v4v_signal_domid(pending_ent->id);
> +        xfree(pending_ent);
> +    }
> +
> +}
> +
> +static void
> +v4v_pending_find(struct domain *d, struct v4v_ring_info *ring_info,
> +                 uint32_t payload_space, struct hlist_head *to_notify)
> +{
> +    struct hlist_node *node, *next;
> +    struct v4v_pending_ent *ent;
> +
> +    ASSERT(rw_is_locked(&d->v4v->lock));
> +
> +    spin_lock(&ring_info->lock);
> +    hlist_for_each_entry_safe(ent, node, next, &ring_info->pending, node)
> +    {
> +        if ( payload_space >= ent->len )
> +        {
> +            hlist_del(&ent->node);
> +            hlist_add_head(&ent->node, to_notify);
> +        }
> +    }
> +    spin_unlock(&ring_info->lock);
> +}

This "space in destination ring" notification does not seem like it
would scale very well for: a) many senders into the same ring; or b) a
single sender to many (full) rings.

In (a), all domains are notified, even though only one will win the race
and put a message on the ring.  In (b) the guest must retry all pending
transmits as it doesn't know which destination ring now has space.

David

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-06-10 15:06 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-28 19:43 [PATCH (V9) 0/2] Add V4V to Xen Ross Philipson
2013-05-28 19:43 ` [PATCH (V9) 1/2] xen: events, exposes evtchn_alloc_unbound_domain Ross Philipson
2013-05-28 19:43 ` [PATCH (V9) 2/2] xen: Add V4V implementation Ross Philipson
2013-05-29  0:43   ` Matt Wilson
2013-05-29 19:28     ` Ross Philipson
2013-05-29  8:34   ` Jan Beulich
2013-05-29 19:26     ` Ross Philipson
2013-05-30  5:16       ` Jan Beulich
2013-05-29  9:56   ` Vincent Hanquez
2013-05-30 16:20   ` Tim Deegan
2013-06-04 18:01     ` Ross Philipson
2013-06-10 15:06   ` David Vrabel
2013-05-30 11:57 ` [PATCH (V9) 0/2] Add V4V to Xen Ian Campbell
2013-05-31  7:36   ` Vincent Hanquez
2013-05-31  7:50     ` Ian Campbell
2013-05-31  8:56       ` Vincent Hanquez
2013-05-31  9:01         ` Ian Campbell
2013-05-31  9:26           ` Vincent Hanquez
2013-05-31 16:29             ` Ross Philipson
2013-05-31 16:38               ` Ian Campbell
2013-05-30 12:07 ` Ian Campbell
2013-05-30 16:08   ` David Vrabel
2013-05-31  7:25     ` Vincent Hanquez
2013-05-31 10:21       ` David Vrabel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.