* [PATCH 0/7] AlacrityVM guest drivers
@ 2009-08-03 17:17 Gregory Haskins
  2009-08-03 17:17 ` [PATCH 1/7] shm-signal: shared-memory signals Gregory Haskins
                   ` (7 more replies)
  0 siblings, 8 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

(Applies to v2.6.31-rc5, proposed for linux-next after review is complete)

This series implements the guest-side drivers for accelerated IO
when running on top of the AlacrityVM hypervisor, the details of
which you can find here:

http://developer.novell.com/wiki/index.php/AlacrityVM

This series includes the basic plumbing, as well as the driver for
accelerated 802.x (ethernet) networking.

Regards,
-Greg

---

Gregory Haskins (7):
      venet: add scatter-gather/GSO support
      net: Add vbus_enet driver
      ioq: add driver-side vbus helpers
      vbus-proxy: add a pci-to-vbus bridge
      vbus: add a "vbus-proxy" bus model for vbus_driver objects
      ioq: Add basic definitions for a shared-memory, lockless queue
      shm-signal: shared-memory signals


 arch/x86/Kconfig            |    2 
 drivers/Makefile            |    1 
 drivers/net/Kconfig         |   14 +
 drivers/net/Makefile        |    1 
 drivers/net/vbus-enet.c     |  899 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vbus/Kconfig        |   24 +
 drivers/vbus/Makefile       |    6 
 drivers/vbus/bus-proxy.c    |  216 ++++++++++
 drivers/vbus/pci-bridge.c   |  824 +++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild        |    4 
 include/linux/ioq.h         |  415 ++++++++++++++++++++
 include/linux/shm_signal.h  |  189 +++++++++
 include/linux/vbus_driver.h |   80 ++++
 include/linux/vbus_pci.h    |  127 ++++++
 include/linux/venet.h       |   84 ++++
 lib/Kconfig                 |   21 +
 lib/Makefile                |    2 
 lib/ioq.c                   |  294 ++++++++++++++
 lib/shm_signal.c            |  192 +++++++++
 19 files changed, 3395 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 drivers/vbus/Kconfig
 create mode 100644 drivers/vbus/Makefile
 create mode 100644 drivers/vbus/bus-proxy.c
 create mode 100644 drivers/vbus/pci-bridge.c
 create mode 100644 include/linux/ioq.h
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 include/linux/vbus_pci.h
 create mode 100644 include/linux/venet.h
 create mode 100644 lib/ioq.c
 create mode 100644 lib/shm_signal.c

-- 
Signature


* [PATCH 1/7] shm-signal: shared-memory signals
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
@ 2009-08-03 17:17 ` Gregory Haskins
  2009-08-06 13:56   ` Arnd Bergmann
  2009-08-03 17:17 ` [PATCH 2/7] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

shm-signal provides a generic shared-memory based bidirectional
signaling mechanism.  It is used in conjunction with an existing
signal transport (such as posix-signals, interrupts, pipes, etc) to
increase the efficiency of the transport, since the state information
is directly accessible to both sides of the link.  The shared-memory
design provides very cheap access to features such as event-masking
and spurious-delivery mitigation, and is useful for implementing
higher-level shared-memory constructs such as rings.

We will use this mechanism as the basis for a shared-memory interface
later in the series.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/Kbuild       |    1 
 include/linux/shm_signal.h |  189 +++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    9 ++
 lib/Makefile               |    1 
 lib/shm_signal.c           |  192 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 392 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 lib/shm_signal.c

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..01d67b6 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -331,6 +331,7 @@ unifdef-y += serial_core.h
 unifdef-y += serial.h
 unifdef-y += serio.h
 unifdef-y += shm.h
+unifdef-y += shm_signal.h
 unifdef-y += signal.h
 unifdef-y += smb_fs.h
 unifdef-y += smb.h
diff --git a/include/linux/shm_signal.h b/include/linux/shm_signal.h
new file mode 100644
index 0000000..21cf750
--- /dev/null
+++ b/include/linux/shm_signal.h
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SHM_SIGNAL_H
+#define _LINUX_SHM_SIGNAL_H
+
+#include <linux/types.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+
+#define SHM_SIGNAL_MAGIC 0x58fa39df
+#define SHM_SIGNAL_VER   1
+
+struct shm_signal_irq {
+	__u8                  enabled;
+	__u8                  pending;
+	__u8                  dirty;
+};
+
+enum shm_signal_locality {
+	shm_locality_north,
+	shm_locality_south,
+};
+
+struct shm_signal_desc {
+	__u32                 magic;
+	__u32                 ver;
+	struct shm_signal_irq irq[2];
+};
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/kref.h>
+#include <linux/interrupt.h>
+
+struct shm_signal_notifier {
+	void (*signal)(struct shm_signal_notifier *);
+};
+
+struct shm_signal;
+
+struct shm_signal_ops {
+	int      (*inject)(struct shm_signal *s);
+	void     (*fault)(struct shm_signal *s, const char *fmt, ...);
+	void     (*release)(struct shm_signal *s);
+};
+
+enum {
+	shm_signal_in_wakeup,
+};
+
+struct shm_signal {
+	struct kref                 kref;
+	spinlock_t                  lock;
+	enum shm_signal_locality    locale;
+	unsigned long               flags;
+	struct shm_signal_ops      *ops;
+	struct shm_signal_desc     *desc;
+	struct shm_signal_notifier *notifier;
+	struct tasklet_struct       deferred_notify;
+};
+
+#define SHM_SIGNAL_FAULT(s, fmt, args...)  \
+  ((s)->ops->fault ? (s)->ops->fault((s), fmt, ## args) : panic(fmt, ## args))
+
+ /*
+  * These functions should only be used internally
+  */
+void _shm_signal_release(struct kref *kref);
+void _shm_signal_wakeup(struct shm_signal *s);
+
+/**
+ * shm_signal_init() - initialize an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @locale:   end of the link this context represents (north or south)
+ * @ops:      implementation callbacks for this signal
+ * @desc:     shared-memory descriptor backing this signal
+ *
+ * Initializes SHM_SIGNAL context before first use
+ *
+ **/
+void shm_signal_init(struct shm_signal *s, enum shm_signal_locality locale,
+		     struct shm_signal_ops *ops, struct shm_signal_desc *desc);
+
+/**
+ * shm_signal_get() - acquire an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline struct shm_signal *shm_signal_get(struct shm_signal *s)
+{
+	kref_get(&s->kref);
+
+	return s;
+}
+
+/**
+ * shm_signal_put() - release an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline void shm_signal_put(struct shm_signal *s)
+{
+	kref_put(&s->kref, _shm_signal_release);
+}
+
+/**
+ * shm_signal_enable() - enables local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered notifier (if applicable) to receive wakeups
+ * whenever the remote side performs an shm_signal() operation. A notification
+ * will be dispatched immediately if any pending signals have already been
+ * issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_enable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_disable() - disable local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered shm_signal_notifier (if applicable) from
+ * receiving any further notifications.  Any subsequent calls to shm_signal()
+ * by the remote side will update the shm as dirty, but will not traverse the
+ * locale boundary and will not invoke the notifier callback.  Signals
+ * delivered while masked will be deferred until shm_signal_enable() is
+ * invoked.
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_disable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_inject() - notify the remote side about shm changes
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the shm state as "dirty" and, if enabled, will traverse
+ * a locale boundary to inject a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the shm_signal_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the shm_signal_ops->inject() interface and provided by a particular
+ * implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_inject(struct shm_signal *s, int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SHM_SIGNAL_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index bb1326d..136da19 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -200,4 +200,13 @@ config NLATTR
 config GENERIC_ATOMIC64
        bool
 
+config SHM_SIGNAL
+	tristate "SHM Signal - Generic shared-memory signaling mechanism"
+	default n
+	help
+	 Provides a shared-memory based signaling mechanism to indicate
+	 memory-dirty notifications between two end-points.
+
+	 If unsure, say N
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 2e78277..503bf7b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
+obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/shm_signal.c b/lib/shm_signal.c
new file mode 100644
index 0000000..fbba74f
--- /dev/null
+++ b/lib/shm_signal.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * See include/linux/shm_signal.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+
+int shm_signal_enable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	unsigned long iflags;
+
+	spin_lock_irqsave(&s->lock, iflags);
+
+	irq->enabled = 1;
+	wmb();
+
+	if ((irq->dirty || irq->pending)
+	    && !test_bit(shm_signal_in_wakeup, &s->flags)) {
+		rmb();
+		tasklet_schedule(&s->deferred_notify);
+	}
+
+	spin_unlock_irqrestore(&s->lock, iflags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_enable);
+
+int shm_signal_disable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+
+	irq->enabled = 0;
+	wmb();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_disable);
+
+/*
+ * signaling protocol:
+ *
+ * each side of the shm_signal has an "irq" structure with the following
+ * fields:
+ *
+ *    - enabled: controlled by shm_signal_enable/disable() to mask/unmask
+ *               the notification locally
+ *    - dirty:   indicates if the shared-memory is dirty or clean.  This
+ *               is updated regardless of the enabled/pending state so that
+ *               the state is always accurately tracked.
+ *    - pending: indicates if a signal is pending to the remote locale.
+ *               This allows us to determine if a remote-notification is
+ *               already in flight to optimize spurious notifications away.
+ */
+int shm_signal_inject(struct shm_signal *s, int flags)
+{
+	/* Load the irq structure from the other locale */
+	struct shm_signal_irq *irq = &s->desc->irq[!s->locale];
+
+	/*
+	 * We always mark the remote side as dirty regardless of whether
+	 * they need to be notified.
+	 */
+	irq->dirty = 1;
+	wmb();   /* dirty must be visible before we test the pending state */
+
+	if (irq->enabled && !irq->pending) {
+		rmb();
+
+		/*
+		 * If the remote side has enabled notifications, and we do
+		 * not see a notification pending, we must inject a new one.
+		 */
+		irq->pending = 1;
+		wmb(); /* make it visible before we do the injection */
+
+		s->ops->inject(s);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_inject);
+
+void _shm_signal_wakeup(struct shm_signal *s)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	int dirty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s->lock, flags);
+
+	__set_bit(shm_signal_in_wakeup, &s->flags);
+
+	/*
+	 * The outer loop protects against race conditions between
+	 * irq->dirty and irq->pending updates
+	 */
+	while (irq->enabled && (irq->dirty || irq->pending)) {
+
+		/*
+		 * Run until we completely exhaust irq->dirty (it may
+		 * be re-dirtied by the remote side while we are in the
+		 * callback).  We let "pending" remain untouched until we have
+		 * processed them all so that the remote side knows we do not
+		 * need a new notification (yet).
+		 */
+		do {
+			irq->dirty = 0;
+			/* the unlock is an implicit wmb() for dirty = 0 */
+			spin_unlock_irqrestore(&s->lock, flags);
+
+			if (s->notifier)
+				s->notifier->signal(s->notifier);
+
+			spin_lock_irqsave(&s->lock, flags);
+			dirty = irq->dirty;
+			rmb();
+
+		} while (irq->enabled && dirty);
+
+		barrier();
+
+		/*
+		 * We can finally acknowledge the notification by clearing
+		 * "pending" after all of the dirty memory has been processed
+		 * Races against this clearing are handled by the outer loop.
+		 * Subsequent iterations of this loop will execute with
+		 * pending=0 potentially leading to future spurious
+		 * notifications, but this is an acceptable tradeoff as this
+		 * will be rare and harmless.
+		 */
+		irq->pending = 0;
+		wmb();
+
+	}
+
+	__clear_bit(shm_signal_in_wakeup, &s->flags);
+	spin_unlock_irqrestore(&s->lock, flags);
+
+}
+EXPORT_SYMBOL_GPL(_shm_signal_wakeup);
+
+void _shm_signal_release(struct kref *kref)
+{
+	struct shm_signal *s = container_of(kref, struct shm_signal, kref);
+
+	s->ops->release(s);
+}
+EXPORT_SYMBOL_GPL(_shm_signal_release);
+
+static void
+deferred_notify(unsigned long data)
+{
+	struct shm_signal *s = (struct shm_signal *)data;
+
+	_shm_signal_wakeup(s);
+}
+
+void shm_signal_init(struct shm_signal *s, enum shm_signal_locality locale,
+		     struct shm_signal_ops *ops, struct shm_signal_desc *desc)
+{
+	memset(s, 0, sizeof(*s));
+	kref_init(&s->kref);
+	spin_lock_init(&s->lock);
+	tasklet_init(&s->deferred_notify,
+		     deferred_notify,
+		     (unsigned long)s);
+	s->locale   = locale;
+	s->ops      = ops;
+	s->desc     = desc;
+}
+EXPORT_SYMBOL_GPL(shm_signal_init);



* [PATCH 2/7] ioq: Add basic definitions for a shared-memory, lockless queue
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
  2009-08-03 17:17 ` [PATCH 1/7] shm-signal: shared-memory signals Gregory Haskins
@ 2009-08-03 17:17 ` Gregory Haskins
  2009-08-03 17:17 ` [PATCH 3/7] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

IOQ allows asynchronous communication between two end-points via a common
shared-memory region.  Memory is synchronized using pure barriers (i.e.
lockless), and updates are communicated via an embedded shm-signal.  The
design of the interface allows one code base to universally provide both
sides of a given channel.

We will use this mechanism later in the series to efficiently move data
in and out of a guest kernel from various sources, including both
infrastructure level and application level transports.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/Kbuild |    1 
 include/linux/ioq.h  |  415 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig          |   12 +
 lib/Makefile         |    1 
 lib/ioq.c            |  294 +++++++++++++++++++++++++++++++++++
 5 files changed, 723 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ioq.h
 create mode 100644 lib/ioq.c

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 01d67b6..32b3eb8 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -247,6 +247,7 @@ unifdef-y += in.h
 unifdef-y += in6.h
 unifdef-y += inotify.h
 unifdef-y += input.h
+unifdef-y += ioq.h
 unifdef-y += ip.h
 unifdef-y += ipc.h
 unifdef-y += ipmi.h
diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..f77e316
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,415 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory, lockless queue mechanism. It can be used
+ * in a variety of ways, though its intended purpose is to become the
+ * asynchronous communication path for virtual-bus drivers.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared-memory is always allocated on explicitly one side of the
+ *    link.  This typically would be the guest side in a VM/VMM scenario.
+ * #) Each IOQ has the concept of "north" and "south" locales, where
+ *    north denotes the memory-owner side (e.g. guest).
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) Provides a bi-directional signaling/notification infrastructure on
+ *    a per-queue basis, which includes an event mitigation strategy
+ *    to reduce boundary switching.
+ * #) The signaling path is abstracted so that various technologies and
+ *    topologies can define their own specific implementation while sharing
+ *    the basic structures and code.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_IOQ_H
+#define _LINUX_IOQ_H
+
+#include <linux/types.h>
+#include <linux/shm_signal.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+struct ioq_ring_desc {
+	__u64                 cookie; /* for arbitrary use by north-side */
+	__u64                 ptr;
+	__u64                 len;
+	__u8                  valid;
+	__u8                  sown; /* South owned = 1, North owned = 0 */
+};
+
+#define IOQ_RING_MAGIC 0x47fa2fe4
+#define IOQ_RING_VER   4
+
+struct ioq_ring_idx {
+	__u32                 head;    /* 0 based index to head of ptr array */
+	__u32                 tail;    /* 0 based index to tail of ptr array */
+	__u8                  full;
+};
+
+enum ioq_locality {
+	ioq_locality_north,
+	ioq_locality_south,
+};
+
+struct ioq_ring_head {
+	__u32                  magic;
+	__u32                  ver;
+	struct shm_signal_desc signal;
+	struct ioq_ring_idx    idx[2];
+	__u32                  count;
+	struct ioq_ring_desc   ring[1]; /* "count" elements will be allocated */
+};
+
+#define IOQ_HEAD_DESC_SIZE(count) \
+    (sizeof(struct ioq_ring_head) + sizeof(struct ioq_ring_desc) * (count - 1))
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+#include <linux/kref.h>
+
+enum ioq_idx_type {
+	ioq_idxtype_valid,
+	ioq_idxtype_inuse,
+	ioq_idxtype_both,
+	ioq_idxtype_invalid,
+};
+
+enum ioq_seek_type {
+	ioq_seek_tail,
+	ioq_seek_next,
+	ioq_seek_head,
+	ioq_seek_set
+};
+
+struct ioq_iterator {
+	struct ioq            *ioq;
+	struct ioq_ring_idx   *idx;
+	u32                    pos;
+	struct ioq_ring_desc  *desc;
+	int                    update:1;
+	int                    dualidx:1;
+	int                    flipowner:1;
+};
+
+struct ioq_notifier {
+	void (*signal)(struct ioq_notifier *);
+};
+
+struct ioq_ops {
+	void     (*release)(struct ioq *ioq);
+};
+
+struct ioq {
+	struct ioq_ops *ops;
+
+	struct kref            kref;
+	enum ioq_locality      locale;
+	struct ioq_ring_head  *head_desc;
+	struct ioq_ring_desc  *ring;
+	struct shm_signal     *signal;
+	wait_queue_head_t      wq;
+	struct ioq_notifier   *notifier;
+	size_t                 count;
+	struct shm_signal_notifier shm_notifier;
+};
+
+#define IOQ_ITER_AUTOUPDATE  (1 << 0)
+#define IOQ_ITER_NOFLIPOWNER (1 << 1)
+
+/**
+ * ioq_init() - initialize an IOQ
+ * @ioq:        IOQ context
+ * @ops:        implementation callbacks for this queue
+ * @locale:     end of the link this context represents (north or south)
+ * @head:       shared-memory ring descriptor backing this queue
+ * @signal:     SHM_SIGNAL context used for notifications
+ * @count:      number of descriptors in the ring
+ *
+ * Initializes IOQ context before first use
+ *
+ **/
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count);
+
+/**
+ * ioq_get() - acquire an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline struct ioq *ioq_get(struct ioq *ioq)
+{
+	kref_get(&ioq->kref);
+
+	return ioq;
+}
+
+static inline void _ioq_kref_release(struct kref *kref)
+{
+	struct ioq *ioq = container_of(kref, struct ioq, kref);
+
+	shm_signal_put(ioq->signal);
+	ioq->ops->release(ioq);
+}
+
+/**
+ * ioq_put() - release an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline void ioq_put(struct ioq *ioq)
+{
+	kref_put(&ioq->kref, _ioq_kref_release);
+}
+
+/**
+ * ioq_notify_enable() - enables local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered ioq_notifier (if applicable) and waitq to
+ * receive wakeups whenever the remote side performs an ioq_signal() operation.
+ * A notification will be dispatched immediately if any pending signals have
+ * already been issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_enable(struct ioq *ioq, int flags)
+{
+	return shm_signal_enable(ioq->signal, 0);
+}
+
+/**
+ * ioq_notify_disable() - disable local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered ioq_notifier (if applicable) and waitq
+ * from receiving any further notifications.  Any subsequent calls to
+ * ioq_signal() by the remote side will update the ring as dirty, but
+ * will not traverse the locale boundary and will not invoke the notifier
+ * callback or wakeup the waitq.  Signals delivered while masked will
+ * be deferred until ioq_notify_enable() is invoked
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_disable(struct ioq *ioq, int flags)
+{
+	return shm_signal_disable(ioq->signal, 0);
+}
+
+/**
+ * ioq_signal() - notify the remote side about ring changes
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the ring state as "dirty" and, if enabled, will traverse
+ * a locale boundary to invoke a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the ioq_notify_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the underlying shm_signal_ops->inject() interface and provided by a
+ * particular implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_signal(struct ioq *ioq, int flags)
+{
+	return shm_signal_inject(ioq->signal, 0);
+}
+
+/**
+ * ioq_count() - counts the number of outstanding descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors outstanding in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_remain() - counts the number of remaining descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * This is the converse of ioq_count().  This function returns the number
+ * of "free" descriptors left in a particular index
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors remaining in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_size() - counts the maximum number of descriptors in an ring
+ * @ioq:        IOQ context
+ *
+ * This function returns the maximum number of descriptors supported in
+ * a ring, regardless of their current state (free or inuse).
+ *
+ * Returns:
+ *  (*) >=0: total # of descriptors in the ring
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_size(struct ioq *ioq);
+
+/**
+ * ioq_full() - determines if a specific index is "full"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not full
+ *  (*) 1: index is full
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_empty() - determines if a specific index is "empty"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not empty
+ *  (*) 1: index is empty
+ *  (*) <0 = ERRNO
+ *
+ **/
+static inline int ioq_empty(struct ioq *ioq, enum ioq_idx_type type)
+{
+	return !ioq_count(ioq, type);
+}
+
+/**
+ * ioq_iter_init() - initialize an iterator for IOQ descriptor traversal
+ * @ioq:        IOQ context to iterate on
+ * @iter:	Iterator context to init (usually from stack)
+ * @type:	Specifies the index type to iterate against
+ *                 (*) valid: iterate against the "valid" index
+ *                 (*) inuse: iterate against the "inuse" index
+ *                 (*) both: iterate against both indexes simultaneously
+ * @flags:      Bitfield with 0 or more bits set to alter behavior
+ *                 (*) autoupdate: automatically signal the remote side
+ *                     whenever the iterator pushes/pops to a new desc
+ *                 (*) noflipowner: do not flip the ownership bit during
+ *                     a push/pop operation
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags);
+
+/**
+ * ioq_iter_seek() - seek to a specific location in the IOQ ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @type:	Specifies the type of seek operation
+ *                 (*) tail: seek to the absolute tail, offset is ignored
+ *                 (*) next: seek to the relative next, offset is ignored
+ *                 (*) head: seek to the absolute head, offset is ignored
+ *                 (*) set: seek to the absolute offset
+ * @offset:     Offset for ioq_seek_set operations
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		   long offset, int flags);
+
+/**
+ * ioq_iter_push() - push the tail pointer forward
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the tail ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pushes" a new pointer
+ * onto the tail of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_push(struct ioq_iterator *iter, int flags);
+
+/**
+ * ioq_iter_pop() - pop the head pointer from the ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the head ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pops" a pointer
+ * from the head of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_pop(struct ioq_iterator *iter,  int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_IOQ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 136da19..255778d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -209,4 +209,16 @@ config SHM_SIGNAL
 
 	 If unsure, say N
 
+config IOQ
+	tristate "IO-Queue library - Generic shared-memory queue"
+	select SHM_SIGNAL
+	default n
+	help
+	 IOQ is a generic shared-memory-queue mechanism that happens to be
+	 friendly to virtualization boundaries. It can be used in a variety
+	 of ways, though its intended purpose is to become a low-level
+	 communication path for paravirtualized drivers.
+
+	 If unsure, say N
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 503bf7b..215f0c9 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -77,6 +77,7 @@ obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
+obj-$(CONFIG_IOQ) += ioq.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/ioq.c b/lib/ioq.c
new file mode 100644
index 0000000..af3090f
--- /dev/null
+++ b/lib/ioq.c
@@ -0,0 +1,294 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/ioq.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+static int ioq_iter_setpos(struct ioq_iterator *iter, u32 pos)
+{
+	struct ioq *ioq = iter->ioq;
+
+	BUG_ON(pos >= ioq->count);
+
+	iter->pos  = pos;
+	iter->desc = &ioq->ring[pos];
+
+	return 0;
+}
+
+static inline u32 modulo_inc(u32 val, u32 mod)
+{
+	BUG_ON(val >= mod);
+
+	if (val == (mod - 1))
+		return 0;
+
+	return val + 1;
+}
+
+static inline int idx_full(struct ioq_ring_idx *idx)
+{
+	return idx->full && (idx->head == idx->tail);
+}
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		  long offset, int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	u32 pos;
+
+	switch (type) {
+	case ioq_seek_next:
+		pos = modulo_inc(iter->pos, iter->ioq->count);
+		break;
+	case ioq_seek_tail:
+		pos = idx->tail;
+		break;
+	case ioq_seek_head:
+		pos = idx->head;
+		break;
+	case ioq_seek_set:
+		if (offset >= iter->ioq->count)
+			return -EINVAL;
+		pos = offset;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ioq_iter_setpos(iter, pos);
+}
+EXPORT_SYMBOL_GPL(ioq_iter_seek);
+
+static int ioq_ring_count(struct ioq_ring_idx *idx, int count)
+{
+	if (idx->full && (idx->head == idx->tail))
+		return count;
+	else if (idx->tail >= idx->head)
+		return idx->tail - idx->head;
+	else
+		return (idx->tail + count) - idx->head;
+}
+
+static void idx_tail_push(struct ioq_ring_idx *idx, int count)
+{
+	u32 tail = modulo_inc(idx->tail, count);
+
+	if (idx->head == tail) {
+		rmb();
+
+		/*
+		 * Setting full here may look racy, but note that we haven't
+		 * flipped the owner bit yet.  So it is impossible for the
+		 * remote locale to move head in such a way that this operation
+		 * becomes invalid
+		 */
+		idx->full = 1;
+		wmb();
+	}
+
+	idx->tail = tail;
+}
+
+int ioq_iter_push(struct ioq_iterator *iter, int flags)
+{
+	struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+	struct ioq_ring_idx  *idx  = iter->idx;
+	int ret;
+
+	/*
+	 * It's only valid to push if we are currently pointed at the tail
+	 */
+	if (iter->pos != idx->tail || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	idx_tail_push(idx, iter->ioq->count);
+	if (iter->dualidx) {
+		idx_tail_push(&head_desc->idx[ioq_idxtype_inuse],
+			      iter->ioq->count);
+		if (head_desc->idx[ioq_idxtype_inuse].tail !=
+		    head_desc->idx[ioq_idxtype_valid].tail) {
+			SHM_SIGNAL_FAULT(iter->ioq->signal,
+					 "Tails not synchronized");
+			return -EINVAL;
+		}
+	}
+
+	wmb(); /* the index must be visible before the sown, or signal */
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_push);
+
+int ioq_iter_pop(struct ioq_iterator *iter,  int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	int ret;
+
+	/*
+	 * It's only valid to pop if we are currently pointed at the head
+	 */
+	if (iter->pos != idx->head || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	idx->head = modulo_inc(idx->head, iter->ioq->count);
+	wmb(); /* head must be visible before full */
+
+	if (idx->full) {
+		idx->full = 0;
+		wmb(); /* full must be visible before sown */
+	}
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_pop);
+
+static struct ioq_ring_idx *idxtype_to_idx(struct ioq *ioq,
+					   enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx;
+
+	switch (type) {
+	case ioq_idxtype_valid:
+	case ioq_idxtype_inuse:
+		idx = &ioq->head_desc->idx[type];
+		break;
+	default:
+		panic("IOQ: illegal index type: %d", type);
+		break;
+	}
+
+	return idx;
+}
+
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags)
+{
+	iter->ioq        = ioq;
+	iter->update     = (flags & IOQ_ITER_AUTOUPDATE);
+	iter->flipowner  = !(flags & IOQ_ITER_NOFLIPOWNER);
+	iter->pos        = -1;
+	iter->desc       = NULL;
+	iter->dualidx    = 0;
+
+	if (type == ioq_idxtype_both) {
+		/*
+		 * "both" is a special case, so we set the dualidx flag.
+		 *
+		 * However, we also just want to use the valid-index
+		 * for normal processing, so override that here
+		 */
+		type = ioq_idxtype_valid;
+		iter->dualidx = 1;
+	}
+
+	iter->idx = idxtype_to_idx(ioq, type);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_init);
+
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type)
+{
+	return ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+}
+EXPORT_SYMBOL_GPL(ioq_count);
+
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type)
+{
+	int count = ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+
+	return ioq->count - count;
+}
+EXPORT_SYMBOL_GPL(ioq_remain);
+
+int ioq_size(struct ioq *ioq)
+{
+	return ioq->count;
+}
+EXPORT_SYMBOL_GPL(ioq_size);
+
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx = idxtype_to_idx(ioq, type);
+
+	return idx_full(idx);
+}
+EXPORT_SYMBOL_GPL(ioq_full);
+
+static void ioq_shm_signal(struct shm_signal_notifier *notifier)
+{
+	struct ioq *ioq = container_of(notifier, struct ioq, shm_notifier);
+
+	wake_up(&ioq->wq);
+	if (ioq->notifier)
+		ioq->notifier->signal(ioq->notifier);
+}
+
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count)
+{
+	memset(ioq, 0, sizeof(*ioq));
+	kref_init(&ioq->kref);
+	init_waitqueue_head(&ioq->wq);
+
+	ioq->ops         = ops;
+	ioq->locale      = locale;
+	ioq->head_desc   = head;
+	ioq->ring        = &head->ring[0];
+	ioq->count       = count;
+	ioq->signal      = signal;
+
+	ioq->shm_notifier.signal = &ioq_shm_signal;
+	signal->notifier         = &ioq->shm_notifier;
+}
+EXPORT_SYMBOL_GPL(ioq_init);


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 3/7] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
  2009-08-03 17:17 ` [PATCH 1/7] shm-signal: shared-memory signals Gregory Haskins
  2009-08-03 17:17 ` [PATCH 2/7] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
@ 2009-08-03 17:17 ` Gregory Haskins
  2009-08-03 17:17 ` [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

This will generally be used for hypervisors to publish any host-side
virtual devices up to a guest.  The guest will have the opportunity
to consume any devices present on the vbus-proxy as if they were
platform devices, similar to existing buses like PCI.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig            |    2 +
 drivers/Makefile            |    1 
 drivers/vbus/Kconfig        |   14 ++++
 drivers/vbus/Makefile       |    3 +
 drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vbus_driver.h |   73 +++++++++++++++++++++
 6 files changed, 245 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/Kconfig
 create mode 100644 drivers/vbus/Makefile
 create mode 100644 drivers/vbus/bus-proxy.c
 create mode 100644 include/linux/vbus_driver.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 738bdc6..72ff902 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2037,6 +2037,8 @@ source "drivers/pcmcia/Kconfig"
 
 source "drivers/pci/hotplug/Kconfig"
 
+source "drivers/vbus/Kconfig"
+
 endmenu
 
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..d5bedb1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -110,3 +110,4 @@ obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
 obj-y				+= ieee802154/
+obj-y				+= vbus/
diff --git a/drivers/vbus/Kconfig b/drivers/vbus/Kconfig
new file mode 100644
index 0000000..e1939f5
--- /dev/null
+++ b/drivers/vbus/Kconfig
@@ -0,0 +1,14 @@
+#
+# Virtual-Bus (VBus) driver configuration
+#
+
+config VBUS_PROXY
+       tristate "Virtual-Bus support"
+       select SHM_SIGNAL
+       default n
+       help
+       Adds support for virtual-bus model drivers in a guest to connect
+	to host-side virtual-bus resources.  If you are using this kernel
+	in a virtualization solution which implements virtual-bus devices
+	on the backend, say Y.  If unsure, say N.
+
diff --git a/drivers/vbus/Makefile b/drivers/vbus/Makefile
new file mode 100644
index 0000000..a29a1e0
--- /dev/null
+++ b/drivers/vbus/Makefile
@@ -0,0 +1,3 @@
+
+vbus-proxy-objs += bus-proxy.o
+obj-$(CONFIG_VBUS_PROXY) += vbus-proxy.o
diff --git a/drivers/vbus/bus-proxy.c b/drivers/vbus/bus-proxy.c
new file mode 100644
index 0000000..3177f9f
--- /dev/null
+++ b/drivers/vbus/bus-proxy.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define VBUS_PROXY_NAME "vbus-proxy"
+
+static struct vbus_device_proxy *to_dev(struct device *_dev)
+{
+	return _dev ? container_of(_dev, struct vbus_device_proxy, dev) : NULL;
+}
+
+static struct vbus_driver *to_drv(struct device_driver *_drv)
+{
+	return container_of(_drv, struct vbus_driver, drv);
+}
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int vbus_dev_proxy_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_drv);
+
+	return !strcmp(dev->type, drv->type);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match.  The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int vbus_dev_proxy_probe(struct device *_dev)
+{
+	int ret = 0;
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_dev->driver);
+
+	if (drv->ops->probe)
+		ret = drv->ops->probe(dev);
+
+	return ret;
+}
+
+static struct bus_type vbus_proxy = {
+	.name   = VBUS_PROXY_NAME,
+	.match  = vbus_dev_proxy_match,
+};
+
+static struct device vbus_proxy_rootdev = {
+	.parent    = NULL,
+	.init_name = VBUS_PROXY_NAME,
+};
+
+static int __init vbus_init(void)
+{
+	int ret;
+
+	ret = bus_register(&vbus_proxy);
+	BUG_ON(ret < 0);
+
+	ret = device_register(&vbus_proxy_rootdev);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+postcore_initcall(vbus_init);
+
+static void device_release(struct device *dev)
+{
+	struct vbus_device_proxy *_dev;
+
+	_dev = container_of(dev, struct vbus_device_proxy, dev);
+
+	_dev->ops->release(_dev);
+}
+
+int vbus_device_proxy_register(struct vbus_device_proxy *new)
+{
+	new->dev.parent  = &vbus_proxy_rootdev;
+	new->dev.bus     = &vbus_proxy;
+	new->dev.release = &device_release;
+
+	return device_register(&new->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_register);
+
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev)
+{
+	device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_unregister);
+
+static int match_device_id(struct device *_dev, void *data)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	u64 id = *(u64 *)data;
+
+	return dev->id == id;
+}
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id)
+{
+	struct device *dev;
+
+	dev = bus_find_device(&vbus_proxy, NULL, &id, &match_device_id);
+
+	return to_dev(dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_find);
+
+int vbus_driver_register(struct vbus_driver *new)
+{
+	new->drv.bus   = &vbus_proxy;
+	new->drv.name  = new->type;
+	new->drv.owner = new->owner;
+	new->drv.probe = vbus_dev_proxy_probe;
+
+	return driver_register(&new->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_register);
+
+void vbus_driver_unregister(struct vbus_driver *drv)
+{
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
+	int (*close)(struct vbus_device_proxy *dev, int flags);
+	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
+		   void *ptr, size_t len,
+		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
+		   int flags);
+	int (*call)(struct vbus_device_proxy *dev, u32 func,
+		    void *data, size_t len, int flags);
+	void (*release)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_device_proxy {
+	char                          *type;
+	u64                            id;
+	void                          *priv; /* Used by drivers */
+	struct vbus_device_proxy_ops  *ops;
+	struct device                  dev;
+};
+
+int vbus_device_proxy_register(struct vbus_device_proxy *dev);
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev);
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id);
+
+struct vbus_driver_ops {
+	int (*probe)(struct vbus_device_proxy *dev);
+	int (*remove)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_driver {
+	char                          *type;
+	struct module                 *owner;
+	struct vbus_driver_ops        *ops;
+	struct device_driver           drv;
+};
+
+int vbus_driver_register(struct vbus_driver *drv);
+void vbus_driver_unregister(struct vbus_driver *drv);
+
+#endif /* _LINUX_VBUS_DRIVER_H */



* [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-08-03 17:17 ` [PATCH 3/7] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
@ 2009-08-03 17:17 ` Gregory Haskins
  2009-08-06 14:42   ` Arnd Bergmann
  2009-08-03 17:17 ` [PATCH 5/7] ioq: add driver-side vbus helpers Gregory Haskins
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

This patch adds a PCI-based driver to interface between a host VBUS
and the guest's vbus-proxy bus model.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/Kconfig      |   10 +
 drivers/vbus/Makefile     |    3 
 drivers/vbus/pci-bridge.c |  824 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild      |    1 
 include/linux/vbus_pci.h  |  127 +++++++
 5 files changed, 965 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/pci-bridge.c
 create mode 100644 include/linux/vbus_pci.h

diff --git a/drivers/vbus/Kconfig b/drivers/vbus/Kconfig
index e1939f5..87c545d 100644
--- a/drivers/vbus/Kconfig
+++ b/drivers/vbus/Kconfig
@@ -12,3 +12,13 @@ config VBUS_PROXY
 	in a virtualization solution which implements virtual-bus devices
 	on the backend, say Y.  If unsure, say N.
 
+config VBUS_PCIBRIDGE
+       tristate "PCI to Virtual-Bus bridge"
+       depends on PCI
+       depends on VBUS_PROXY
+       select IOQ
+       default n
+       help
+        Provides a way to bridge host-side vbus devices via a PCI-BRIDGE
+        object.  If you are running virtualization with vbus devices on the
+	host, and the vbus is exposed via PCI, say Y.  Otherwise, say N.
diff --git a/drivers/vbus/Makefile b/drivers/vbus/Makefile
index a29a1e0..944b7f1 100644
--- a/drivers/vbus/Makefile
+++ b/drivers/vbus/Makefile
@@ -1,3 +1,6 @@
 
 vbus-proxy-objs += bus-proxy.o
 obj-$(CONFIG_VBUS_PROXY) += vbus-proxy.o
+
+vbus-pcibridge-objs += pci-bridge.o
+obj-$(CONFIG_VBUS_PCIBRIDGE) += vbus-pcibridge.o
diff --git a/drivers/vbus/pci-bridge.c b/drivers/vbus/pci-bridge.c
new file mode 100644
index 0000000..b21a9a3
--- /dev/null
+++ b/drivers/vbus/pci-bridge.c
@@ -0,0 +1,824 @@
+/*
+ * Copyright (C) 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/workqueue.h>
+#include <linux/ioq.h>
+#include <linux/interrupt.h>
+#include <linux/vbus_driver.h>
+#include <linux/vbus_pci.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+#define VBUS_PCI_NAME "pci-to-vbus-bridge"
+
+struct vbus_pci {
+	spinlock_t                lock;
+	struct pci_dev           *dev;
+	struct ioq                eventq;
+	struct vbus_pci_event    *ring;
+	struct vbus_pci_regs     *regs;
+	void                     *piosignal;
+	int                       irq;
+	unsigned int              enabled:1;
+};
+
+static struct vbus_pci vbus_pci;
+
+struct vbus_pci_device {
+	char                     type[VBUS_MAX_DEVTYPE_LEN];
+	u64                      handle;
+	struct list_head         shms;
+	struct vbus_device_proxy vdev;
+	struct work_struct       add;
+	struct work_struct       drop;
+};
+
+/*
+ * -------------------
+ * common routines
+ * -------------------
+ */
+
+static int
+vbus_pci_hypercall(unsigned long nr, void *data, unsigned long len)
+{
+	struct vbus_pci_hypercall params = {
+		.vector = nr,
+		.len    = len,
+		.datap  = __pa(data),
+	};
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&vbus_pci.lock, flags);
+
+	memcpy_toio(&vbus_pci.regs->hypercall.data, &params, sizeof(params));
+	ret = ioread32(&vbus_pci.regs->hypercall.result);
+
+	spin_unlock_irqrestore(&vbus_pci.lock, flags);
+
+	return ret;
+}
+
+static struct vbus_pci_device *
+to_dev(struct vbus_device_proxy *vdev)
+{
+	return container_of(vdev, struct vbus_pci_device, vdev);
+}
+
+static void
+_signal_init(struct shm_signal *signal, struct shm_signal_desc *desc,
+	     struct shm_signal_ops *ops)
+{
+	desc->magic = SHM_SIGNAL_MAGIC;
+	desc->ver   = SHM_SIGNAL_VER;
+
+	shm_signal_init(signal, shm_locality_north, ops, desc);
+}
+
+/*
+ * -------------------
+ * _signal
+ * -------------------
+ */
+
+struct _signal {
+	struct vbus_pci   *pcivbus;
+	struct shm_signal  signal;
+	u32                handle;
+	struct rb_node     node;
+	struct list_head   list;
+};
+
+static struct _signal *
+to_signal(struct shm_signal *signal)
+{
+	return container_of(signal, struct _signal, signal);
+}
+
+static int
+_signal_inject(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	iowrite32(_signal->handle, vbus_pci.piosignal);
+
+	return 0;
+}
+
+static void
+_signal_release(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	kfree(_signal);
+}
+
+static struct shm_signal_ops _signal_ops = {
+	.inject  = _signal_inject,
+	.release = _signal_release,
+};
+
+/*
+ * -------------------
+ * vbus_device_proxy routines
+ * -------------------
+ */
+
+static int
+vbus_pci_device_open(struct vbus_device_proxy *vdev, int version, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct vbus_pci_deviceopen params;
+	int ret;
+
+	if (dev->handle)
+		return -EINVAL;
+
+	params.devid   = vdev->id;
+	params.version = version;
+
+	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVOPEN,
+				 &params, sizeof(params));
+	if (ret < 0)
+		return ret;
+
+	dev->handle = params.handle;
+
+	return 0;
+}
+
+static int
+vbus_pci_device_close(struct vbus_device_proxy *vdev, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	spin_lock_irqsave(&vbus_pci.lock, iflags);
+
+	while (!list_empty(&dev->shms)) {
+		struct _signal *_signal;
+
+		_signal = list_first_entry(&dev->shms, struct _signal, list);
+
+		list_del(&_signal->list);
+
+		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+		shm_signal_put(&_signal->signal);
+		spin_lock_irqsave(&vbus_pci.lock, iflags);
+	}
+
+	spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+
+	/*
+	 * The DEVICECLOSE will implicitly close all of the shm on the
+	 * host-side, so there is no need to do an explicit per-shm
+	 * hypercall
+	 */
+	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVCLOSE,
+				 &dev->handle, sizeof(dev->handle));
+
+	if (ret < 0)
+		printk(KERN_ERR "VBUS-PCI: Error closing device %s/%lld: %d\n",
+		       vdev->type, vdev->id, ret);
+
+	dev->handle = 0;
+
+	return 0;
+}
+
+static int
+vbus_pci_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
+		    void *ptr, size_t len,
+		    struct shm_signal_desc *sdesc, struct shm_signal **signal,
+		    int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct _signal *_signal = NULL;
+	struct vbus_pci_deviceshm params;
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	params.devh   = dev->handle;
+	params.id     = id;
+	params.flags  = flags;
+	params.datap  = (u64)__pa(ptr);
+	params.len    = len;
+
+	if (signal) {
+		/*
+		 * The signal descriptor must be embedded within the
+		 * provided ptr
+		 */
+		if (!sdesc
+		    || (len < sizeof(*sdesc))
+		    || ((void *)sdesc < ptr)
+		    || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
+			return -EINVAL;
+
+		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+		if (!_signal)
+			return -ENOMEM;
+
+		_signal_init(&_signal->signal, sdesc, &_signal_ops);
+
+		/*
+		 * take another reference for the host.  This is dropped
+		 * by a SHMCLOSE event
+		 */
+		shm_signal_get(&_signal->signal);
+
+		params.signal.offset = (unsigned long)sdesc - (unsigned long)ptr;
+		params.signal.prio   = prio;
+		params.signal.cookie = (unsigned long)_signal;
+
+	} else {
+		params.signal.offset = -1; /* yes, this is a u32, but it's ok */
+	}
+
+	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVSHM,
+				 &params, sizeof(params));
+	if (ret < 0) {
+		if (_signal) {
+			/*
+			 * We held two references above, so we need to drop
+			 * both of them
+			 */
+			shm_signal_put(&_signal->signal);
+			shm_signal_put(&_signal->signal);
+		}
+
+		return ret;
+	}
+
+	if (signal) {
+		_signal->handle = ret;
+
+		spin_lock_irqsave(&vbus_pci.lock, iflags);
+
+		list_add_tail(&_signal->list, &dev->shms);
+
+		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+
+		shm_signal_get(&_signal->signal);
+		*signal = &_signal->signal;
+	}
+
+	return 0;
+}
+
+static int
+vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
+		     size_t len, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct vbus_pci_devicecall params = {
+		.devh  = dev->handle,
+		.func  = func,
+		.datap = (u64)__pa(data),
+		.len   = len,
+		.flags = flags,
+	};
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	return vbus_pci_hypercall(VBUS_PCI_HC_DEVCALL, &params, sizeof(params));
+}
+
+static void
+vbus_pci_device_release(struct vbus_device_proxy *vdev)
+{
+	struct vbus_pci_device *_dev = to_dev(vdev);
+
+	vbus_pci_device_close(vdev, 0);
+
+	kfree(_dev);
+}
+
+static struct vbus_device_proxy_ops vbus_pci_device_ops = {
+	.open    = vbus_pci_device_open,
+	.close   = vbus_pci_device_close,
+	.shm     = vbus_pci_device_shm,
+	.call    = vbus_pci_device_call,
+	.release = vbus_pci_device_release,
+};
+
+/*
+ * -------------------
+ * vbus events
+ * -------------------
+ */
+
+static void
+deferred_devadd(struct work_struct *work)
+{
+	struct vbus_pci_device *new;
+	int ret;
+
+	new = container_of(work, struct vbus_pci_device, add);
+
+	ret = vbus_device_proxy_register(&new->vdev);
+	if (ret < 0)
+		panic("failed to register device %lld(%s): %d\n",
+		      new->vdev.id, new->type, ret);
+}
+
+static void
+deferred_devdrop(struct work_struct *work)
+{
+	struct vbus_pci_device *dev;
+
+	dev = container_of(work, struct vbus_pci_device, drop);
+	vbus_device_proxy_unregister(&dev->vdev);
+}
+
+static void
+event_devadd(struct vbus_pci_add_event *event)
+{
+	struct vbus_pci_device *new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_ERR "VBUS_PCI: Out of memory on add_event\n");
+		return;
+	}
+
+	INIT_LIST_HEAD(&new->shms);
+
+	memcpy(new->type, event->type, VBUS_MAX_DEVTYPE_LEN);
+	new->vdev.type        = new->type;
+	new->vdev.id          = event->id;
+	new->vdev.ops         = &vbus_pci_device_ops;
+
+	dev_set_name(&new->vdev.dev, "%lld", event->id);
+
+	INIT_WORK(&new->add, deferred_devadd);
+	INIT_WORK(&new->drop, deferred_devdrop);
+
+	schedule_work(&new->add);
+}
+
+static void
+event_devdrop(struct vbus_pci_handle_event *event)
+{
+	struct vbus_device_proxy *dev = vbus_device_proxy_find(event->handle);
+
+	if (!dev) {
+		printk(KERN_WARNING "VBUS-PCI: devdrop failed: %lld\n",
+		       event->handle);
+		return;
+	}
+
+	schedule_work(&to_dev(dev)->drop);
+}
+
+static void
+event_shmsignal(struct vbus_pci_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)(unsigned long)event->handle;
+
+	_shm_signal_wakeup(&_signal->signal);
+}
+
+static void
+event_shmclose(struct vbus_pci_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)(unsigned long)event->handle;
+
+	/*
+	 * This reference was taken during the DEVICESHM call
+	 */
+	shm_signal_put(&_signal->signal);
+}
+
+/*
+ * -------------------
+ * eventq routines
+ * -------------------
+ */
+
+static struct ioq_notifier eventq_notifier;
+
+static int __init
+eventq_init(int qlen)
+{
+	struct ioq_iterator iter;
+	int ret;
+	int i;
+
+	vbus_pci.ring = kzalloc(sizeof(struct vbus_pci_event) * qlen,
+				GFP_KERNEL);
+	if (!vbus_pci.ring)
+		return -ENOMEM;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty vbus_event and mark it
+	 * valid
+	 */
+	for (i = 0; i < qlen; i++) {
+		struct vbus_pci_event *event = &vbus_pci.ring[i];
+		size_t                 len   = sizeof(*event);
+		struct ioq_ring_desc  *desc  = iter.desc;
+
+		BUG_ON(iter.desc->valid);
+
+		desc->cookie = (u64)event;
+		desc->ptr    = (u64)__pa(event);
+		desc->len    = len; /* total length  */
+		desc->valid  = 1;
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-tail index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	vbus_pci.eventq.notifier = &eventq_notifier;
+
+	/*
+	 * And finally, ensure that we can receive notification
+	 */
+	ioq_notify_enable(&vbus_pci.eventq, 0);
+
+	return 0;
+}
+
+/* Invoked whenever the hypervisor ioq_signal()s our eventq */
+static void
+eventq_wakeup(struct ioq_notifier *notifier)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side.
+	 *
+	 * FIXME: This in theory could run indefinitely if the host keeps
+	 * feeding us events since there is nothing like a NAPI budget.  We
+	 * might need to address that
+	 */
+	while (!iter.desc->sown) {
+		struct ioq_ring_desc *desc  = iter.desc;
+		struct vbus_pci_event *event;
+
+		event = (struct vbus_pci_event *)desc->cookie;
+
+		switch (event->eventid) {
+		case VBUS_PCI_EVENT_DEVADD:
+			event_devadd(&event->data.add);
+			break;
+		case VBUS_PCI_EVENT_DEVDROP:
+			event_devdrop(&event->data.handle);
+			break;
+		case VBUS_PCI_EVENT_SHMSIGNAL:
+			event_shmsignal(&event->data.handle);
+			break;
+		case VBUS_PCI_EVENT_SHMCLOSE:
+			event_shmclose(&event->data.handle);
+			break;
+		default:
+			printk(KERN_WARNING "VBUS_PCI: Unexpected event %d\n",
+			       event->eventid);
+			break;
+		}
+
+		memset(event, 0, sizeof(*event));
+
+		/* Advance the in-use head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/* And let the south side know that we changed the queue */
+	ioq_signal(&vbus_pci.eventq, 0);
+}
+
+static struct ioq_notifier eventq_notifier = {
+	.signal = &eventq_wakeup,
+};
+
+/* Invoked whenever the host injects an eventq interrupt via ioq_signal() */
+irqreturn_t
+eventq_intr(int irq, void *dev)
+{
+	_shm_signal_wakeup(vbus_pci.eventq.signal);
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * -------------------
+ */
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+	/* The eventq uses the special-case handle=0 */
+	iowrite32(0, vbus_pci.piosignal);
+
+	return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+	kfree(signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+	.inject  = eventq_signal_inject,
+	.release = eventq_signal_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+eventq_ioq_release(struct ioq *ioq)
+{
+	/* released as part of the vbus_pci object */
+}
+
+static struct ioq_ops eventq_ioq_ops = {
+	.release = eventq_ioq_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+vbus_pci_release(void)
+{
+	if (vbus_pci.irq > 0)
+		free_irq(vbus_pci.irq, NULL);
+
+	if (vbus_pci.piosignal)
+		pci_iounmap(vbus_pci.dev, (void __iomem *)vbus_pci.piosignal);
+
+	if (vbus_pci.regs)
+		pci_iounmap(vbus_pci.dev, (void __iomem *)vbus_pci.regs);
+
+	pci_release_regions(vbus_pci.dev);
+	pci_disable_device(vbus_pci.dev);
+
+	kfree(vbus_pci.eventq.head_desc);
+	kfree(vbus_pci.ring);
+
+	vbus_pci.enabled = false;
+}
+
+static int __init
+vbus_pci_open(void)
+{
+	struct vbus_pci_negotiate params = {
+		.magic        = VBUS_PCI_ABI_MAGIC,
+		.version      = VBUS_PCI_HC_VERSION,
+		.capabilities = 0,
+	};
+
+	return vbus_pci_hypercall(VBUS_PCI_HC_NEGOTIATE,
+				  &params, sizeof(params));
+}
+
+#define QLEN 1024
+
+static int __init
+vbus_pci_register(void)
+{
+	struct vbus_pci_busreg params = {
+		.count = 1,
+		.eventq = {
+			{
+				.count = QLEN,
+				.ring  = (u64)__pa(vbus_pci.eventq.head_desc),
+				.data  = (u64)__pa(vbus_pci.ring),
+			},
+		},
+	};
+
+	return vbus_pci_hypercall(VBUS_PCI_HC_BUSREG, &params, sizeof(params));
+}
+
+static int __init
+_ioq_init(size_t ringsize, struct ioq *ioq, struct ioq_ops *ops)
+{
+	struct shm_signal    *signal = NULL;
+	struct ioq_ring_head *head = NULL;
+	size_t                len  = IOQ_HEAD_DESC_SIZE(ringsize);
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		return -ENOMEM;
+
+	signal = kzalloc(sizeof(*signal), GFP_KERNEL);
+	if (!signal) {
+		kfree(head);
+		return -ENOMEM;
+	}
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = ringsize;
+
+	_signal_init(signal, &head->signal, &eventq_signal_ops);
+
+	ioq_init(ioq, ops, ioq_locality_north, head, signal, ringsize);
+
+	return 0;
+}
+
+static int __devinit
+vbus_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	int ret;
+
+	if (vbus_pci.enabled)
+		return -EEXIST; /* we only support one bridge per kernel */
+
+	if (pdev->revision != VBUS_PCI_ABI_VERSION) {
+		printk(KERN_DEBUG "VBUS_PCI: expected ABI version %d, got %d\n",
+		       VBUS_PCI_ABI_VERSION,
+		       pdev->revision);
+		return -ENODEV;
+	}
+
+	vbus_pci.dev = pdev;
+
+	ret = pci_enable_device(pdev);
+	if (ret < 0)
+		return ret;
+
+	ret = pci_request_regions(pdev, VBUS_PCI_NAME);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not init BARs: %d\n", ret);
+		goto out_fail;
+	}
+
+	vbus_pci.regs = pci_iomap(pdev, 0, sizeof(struct vbus_pci_regs));
+	if (!vbus_pci.regs) {
+		printk(KERN_ERR "VBUS_PCI: Could not map BARs\n");
+		goto out_fail;
+	}
+
+	vbus_pci.piosignal = pci_iomap(pdev, 1, sizeof(u32));
+	if (!vbus_pci.piosignal) {
+		printk(KERN_ERR "VBUS_PCI: Could not map BARs\n");
+		goto out_fail;
+	}
+
+	ret = vbus_pci_open();
+	if (ret < 0) {
+		printk(KERN_DEBUG "VBUS_PCI: Could not negotiate with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	/*
+	 * Allocate an IOQ to use for host-2-guest event notification
+	 */
+	ret = _ioq_init(QLEN, &vbus_pci.eventq, &eventq_ioq_ops);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not init eventq: %d\n", ret);
+		goto out_fail;
+	}
+
+	ret = eventq_init(QLEN);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not set up ring: %d\n", ret);
+		goto out_fail;
+	}
+
+	ret = pci_enable_msi(pdev);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not enable MSI: %d\n", ret);
+		goto out_fail;
+	}
+
+	vbus_pci.irq = pdev->irq;
+
+	ret = request_irq(pdev->irq, eventq_intr, 0, "vbus", NULL);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Failed to register IRQ %d: %d\n",
+		       pdev->irq, ret);
+		goto out_fail;
+	}
+
+	/*
+	 * Finally register our queue on the host to start receiving events
+	 */
+	ret = vbus_pci_register();
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not register with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	vbus_pci.enabled = true;
+
+	printk(KERN_INFO "Virtual-Bus: Copyright (c) 2009, " \
+	       "Gregory Haskins <ghaskins@novell.com>\n");
+
+	return 0;
+
+ out_fail:
+	vbus_pci_release();
+
+	return ret;
+}
+
+static void __devexit
+vbus_pci_remove(struct pci_dev *pdev)
+{
+	vbus_pci_release();
+}
+
+static DEFINE_PCI_DEVICE_TABLE(vbus_pci_tbl) = {
+	{ PCI_DEVICE(0x11da, 0x2000) },
+	{ 0 },
+};
+
+MODULE_DEVICE_TABLE(pci, vbus_pci_tbl);
+
+static struct pci_driver vbus_pci_driver = {
+	.name     = VBUS_PCI_NAME,
+	.id_table = vbus_pci_tbl,
+	.probe    = vbus_pci_probe,
+	.remove   = vbus_pci_remove,
+};
+
+int __init
+vbus_pci_init(void)
+{
+	memset(&vbus_pci, 0, sizeof(vbus_pci));
+	spin_lock_init(&vbus_pci.lock);
+
+	return pci_register_driver(&vbus_pci_driver);
+}
+
+static void __exit
+vbus_pci_exit(void)
+{
+	pci_unregister_driver(&vbus_pci_driver);
+}
+
+module_init(vbus_pci_init);
+module_exit(vbus_pci_exit);
+
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 32b3eb8..fa15bbf 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -358,6 +358,7 @@ unifdef-y += uio.h
 unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
+unifdef-y += vbus_pci.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/vbus_pci.h b/include/linux/vbus_pci.h
new file mode 100644
index 0000000..e18ff59
--- /dev/null
+++ b/include/linux/vbus_pci.h
@@ -0,0 +1,127 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * PCI to Virtual-Bus Bridge
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_PCI_H
+#define _LINUX_VBUS_PCI_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define VBUS_PCI_ABI_MAGIC 0xbf53eef5
+#define VBUS_PCI_ABI_VERSION 1
+#define VBUS_PCI_HC_VERSION 1
+
+enum {
+	VBUS_PCI_HC_NEGOTIATE,
+	VBUS_PCI_HC_BUSREG,
+	VBUS_PCI_HC_DEVOPEN,
+	VBUS_PCI_HC_DEVCLOSE,
+	VBUS_PCI_HC_DEVCALL,
+	VBUS_PCI_HC_DEVSHM,
+
+	VBUS_PCI_HC_MAX,      /* must be last */
+};
+
+struct vbus_pci_negotiate {
+	__u32 magic;
+	__u32 version;
+	__u64 capabilities;
+};
+
+struct vbus_pci_deviceopen {
+	__u32 devid;
+	__u32 version; /* device ABI version */
+	__u64 handle; /* return value for devh */
+};
+
+struct vbus_pci_devicecall {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 func;
+	__u32 len;
+	__u32 flags;
+	__u64 datap;
+};
+
+struct vbus_pci_deviceshm {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 id;
+	__u32 len;
+	__u32 flags;
+	struct {
+		__u32 offset;
+		__u32 prio;
+		__u64 cookie; /* token to pass back when signaling client */
+	} signal;
+	__u64 datap;
+};
+
+struct vbus_pci_hypercall {
+	__u32 vector;
+	__u32 len;
+	__u64 datap;
+};
+
+struct vbus_pci_regs {
+	struct {
+		struct vbus_pci_hypercall data;
+		__u32                     result;
+	} hypercall;
+};
+
+struct vbus_pci_eventqreg {
+	__u32 count;
+	__u64 ring;
+	__u64 data;
+};
+
+struct vbus_pci_busreg {
+	__u32 count;  /* supporting multiple queues allows for prio, etc */
+	struct vbus_pci_eventqreg eventq[1];
+};
+
+enum vbus_pci_eventid {
+	VBUS_PCI_EVENT_DEVADD,
+	VBUS_PCI_EVENT_DEVDROP,
+	VBUS_PCI_EVENT_SHMSIGNAL,
+	VBUS_PCI_EVENT_SHMCLOSE,
+};
+
+#define VBUS_MAX_DEVTYPE_LEN 128
+
+struct vbus_pci_add_event {
+	__u64 id;
+	char  type[VBUS_MAX_DEVTYPE_LEN];
+};
+
+struct vbus_pci_handle_event {
+	__u64 handle;
+};
+
+struct vbus_pci_event {
+	__u32 eventid;
+	union {
+		struct vbus_pci_add_event    add;
+		struct vbus_pci_handle_event handle;
+	} data;
+};
+
+#endif /* _LINUX_VBUS_PCI_H */


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 5/7] ioq: add driver-side vbus helpers
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
                   ` (3 preceding siblings ...)
  2009-08-03 17:17 ` [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
@ 2009-08-03 17:17 ` Gregory Haskins
  2009-08-03 17:18 ` [PATCH 6/7] net: Add vbus_enet driver Gregory Haskins
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

Mapping an IOQ over the VBUS shared-memory interfaces will be a common
pattern, so we provide a helper function that generalizes the allocation
and registration of an IOQ to keep this use case simple.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/bus-proxy.c    |   64 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vbus_driver.h |    7 +++++
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/drivers/vbus/bus-proxy.c b/drivers/vbus/bus-proxy.c
index 3177f9f..88cd904 100644
--- a/drivers/vbus/bus-proxy.c
+++ b/drivers/vbus/bus-proxy.c
@@ -150,3 +150,67 @@ void vbus_driver_unregister(struct vbus_driver *drv)
 }
 EXPORT_SYMBOL_GPL(vbus_driver_unregister);
 
+/*
+ *---------------------------------
+ * driver-side IOQ helper
+ *---------------------------------
+ */
+static void
+vbus_driver_ioq_release(struct ioq *ioq)
+{
+	kfree(ioq->head_desc);
+	kfree(ioq);
+}
+
+static struct ioq_ops vbus_driver_ioq_ops = {
+	.release = vbus_driver_ioq_release,
+};
+
+
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t count, struct ioq **ioq)
+{
+	struct ioq           *_ioq;
+	struct ioq_ring_head *head = NULL;
+	struct shm_signal    *signal = NULL;
+	size_t                len = IOQ_HEAD_DESC_SIZE(count);
+	int                   ret = -ENOMEM;
+
+	_ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+	if (!_ioq)
+		goto error;
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		goto error;
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = count;
+
+	ret = dev->ops->shm(dev, id, prio, head, len,
+			    &head->signal, &signal, 0);
+	if (ret < 0)
+		goto error;
+
+	ioq_init(_ioq,
+		 &vbus_driver_ioq_ops,
+		 ioq_locality_north,
+		 head,
+		 signal,
+		 count);
+
+	*ioq = _ioq;
+
+	return 0;
+
+ error:
+	kfree(_ioq);
+	kfree(head);
+
+	if (signal)
+		shm_signal_put(signal);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_driver_ioq_alloc);
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
index c53e13f..9cfbf60 100644
--- a/include/linux/vbus_driver.h
+++ b/include/linux/vbus_driver.h
@@ -26,6 +26,7 @@
 
 #include <linux/device.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 
 struct vbus_device_proxy;
 struct vbus_driver;
@@ -70,4 +71,10 @@ struct vbus_driver {
 int vbus_driver_register(struct vbus_driver *drv);
 void vbus_driver_unregister(struct vbus_driver *drv);
 
+/*
+ * driver-side IOQ helper - allocates device-shm and maps an IOQ on it
+ */
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t ringsize, struct ioq **ioq);
+
 #endif /* _LINUX_VBUS_DRIVER_H */


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
                   ` (4 preceding siblings ...)
  2009-08-03 17:17 ` [PATCH 5/7] ioq: add driver-side vbus helpers Gregory Haskins
@ 2009-08-03 17:18 ` Gregory Haskins
  2009-08-03 18:30   ` Stephen Hemminger
  2009-08-04  1:14   ` [PATCH v2] " Gregory Haskins
  2009-08-03 17:18 ` [PATCH 7/7] venet: add scatter-gather/GSO support Gregory Haskins
  2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
  7 siblings, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

A virtualized 802.x network device based on the VBUS interface. It can be
used with any hypervisor/kernel that supports the virtual-ethernet/vbus
protocol.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/Kconfig     |   14 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c |  672 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild    |    1 
 include/linux/venet.h   |   49 +++
 5 files changed, 737 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 include/linux/venet.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..974213e 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3209,4 +3209,18 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "VBUS Ethernet Driver"
+	default n
+	select VBUS_PROXY
+	help
+	   A virtualized 802.x network device based on the VBUS
+	   "virtual-ethernet" interface.  It can be used with any
+	   hypervisor/kernel that supports the vbus+venet protocol.
+
+config VBUS_ENET_DEBUG
+        bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..2a3c7a9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -277,6 +277,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..8fcc2d6
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,672 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("virtual-ethernet");
+MODULE_VERSION("1");
+
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+
+#define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct vbus_enet_queue     rxq;
+	struct vbus_enet_queue     txq;
+	struct tasklet_struct      txtask;
+};
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct net_device *dev, struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = netdev_alloc_skb(dev, len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, NET_IP_ALIGN); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0); /* will never fail unless seriously broken */
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-tail index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		dev_warn(&dev->dev, "Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__napi_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as 0 length
+	 * packets which we can simply discard and replace with new_mtu
+	 * buffers for the future.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->dev->stats.rx_packets++;
+			priv->dev->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when its
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG(priv->dev, "%d packets received\n", npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		napi_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "sending %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		dev_err(&priv->dev->dev, "BUG: tx invoked while queue is full\n");
+		return 1;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	/*
+	 * We simply put the skb right onto the ring.  We will get an interrupt
+	 * later when the data has been consumed and we can reap the pointers
+	 * at that time
+	 */
+	iter.desc->cookie = (u64)skb;
+	iter.desc->len = (u64)skb->len;
+	iter.desc->ptr = (u64)__pa(skb->data);
+	iter.desc->valid  = 1;
+
+	priv->dev->stats.tx_packets++;
+	priv->dev->stats.tx_bytes += skb->len;
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG(priv->dev, "backpressure tx queue\n");
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && (!iter.desc->sown || force)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		PDEBUG(priv->dev, "completed sending %d bytes\n", skb->len);
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		dev_kfree_skb(skb);
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		PDEBUG(priv->dev, "re-enabling tx queue\n");
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "Transmit timeout\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "deferred_tx_isr\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->txq.queue, 0);
+}
+
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
+
+	PDEBUG(priv->dev, "tx_isr\n");
+
+	ioq_notify_disable(priv->txq.queue, 0);
+	tasklet_schedule(&priv->txtask);
+}
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open          = vbus_enet_open,
+	.ndo_stop          = vbus_enet_stop,
+	.ndo_set_config    = vbus_enet_config,
+	.ndo_start_xmit    = vbus_enet_tx_start,
+	.ndo_change_mtu	   = vbus_enet_change_mtu,
+	.ndo_tx_timeout    = vbus_enet_timeout,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
+	ioq_notify_enable(priv->txq.queue, 0);
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, 128);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	vbus_enet_tx_reap(priv, 1);
+
+	ioq_put(priv->rxq.queue);
+	ioq_put(priv->txq.queue);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index fa15bbf..911f7ef 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
 unifdef-y += vbus_pci.h
+unifdef-y += venet.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..586be40
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#include <linux/types.h>
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+/* CAPABILITIES-GROUP 0 */
+/* #define VENET_CAP_FOO    0   (No capabilities defined yet, for now) */
+
+#define VENET_FUNC_LINKUP   0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP   3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX  4
+
+#endif /* _LINUX_VENET_H */


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 7/7] venet: add scatter-gather/GSO support
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
                   ` (5 preceding siblings ...)
  2009-08-03 17:18 ` [PATCH 6/7] net: Add vbus_enet driver Gregory Haskins
@ 2009-08-03 17:18 ` Gregory Haskins
  2009-08-03 18:32   ` Stephen Hemminger
  2009-08-03 18:33   ` Stephen Hemminger
  2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
  7 siblings, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 17:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev

SG/GSO significantly enhance the performance of network traffic under
certain circumstances.  We implement this feature as a separate patch
to avoid initially complicating the baseline venet driver.  This will
presumably make the review process slightly easier, since we can
focus on the basic interface first.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/vbus-enet.c |  249 +++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/venet.h   |   39 +++++++
 2 files changed, 275 insertions(+), 13 deletions(-)

diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
index 8fcc2d6..5aa56ff 100644
--- a/drivers/net/vbus-enet.c
+++ b/drivers/net/vbus-enet.c
@@ -42,6 +42,8 @@ static int rx_ringlen = 256;
 module_param(rx_ringlen, int, 0444);
 static int tx_ringlen = 256;
 module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
 
 #define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
 
@@ -58,8 +60,17 @@ struct vbus_enet_priv {
 	struct vbus_enet_queue     rxq;
 	struct vbus_enet_queue     txq;
 	struct tasklet_struct      txtask;
+	struct {
+		int                sg:1;
+		int                tso:1;
+		int                ufo:1;
+		int                tso6:1;
+		int                ecn:1;
+	} flags;
 };
 
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
 static struct vbus_enet_priv *
 napi_to_priv(struct napi_struct *napi)
 {
@@ -193,6 +204,93 @@ rx_teardown(struct vbus_enet_priv *priv)
 	}
 }
 
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->flags.sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < tx_ringlen; i++) {
+		struct venet_sg *vsg;
+		size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+		size_t len = sizeof(*vsg) + iovlen;
+
+		vsg = kzalloc(len, GFP_KERNEL);
+		if (!vsg)
+			return -ENOMEM;
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+		iter.desc->ptr    = (u64)__pa(vsg);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	vbus_enet_tx_reap(priv, 1);
+
+	if (!priv->flags.sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
 /*
  * Open and close
  */
@@ -396,14 +494,67 @@ vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
 	BUG_ON(ret < 0);
 	BUG_ON(iter.desc->sown);
 
-	/*
-	 * We simply put the skb right onto the ring.  We will get an interrupt
-	 * later when the data has been consumed and we can reap the pointers
-	 * at that time
-	 */
-	iter.desc->cookie = (u64)skb;
-	iter.desc->len = (u64)skb->len;
-	iter.desc->ptr = (u64)__pa(skb->data);
+	if (priv->flags.sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_transport_header(skb) - skb->data;
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type " \
+				      "0x%x\n", sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
 	iter.desc->valid  = 1;
 
 	priv->dev->stats.tx_packets++;
@@ -459,7 +610,17 @@ vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
 	 * owned by the south-side
 	 */
 	while (iter.desc->valid && (!iter.desc->sown || force)) {
-		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+		struct sk_buff *skb;
+
+		if (priv->flags.sg) {
+			struct venet_sg *vsg;
+
+			vsg = (struct venet_sg *)iter.desc->cookie;
+			skb = (struct sk_buff *)vsg->cookie;
+
+		} else {
+			skb = (struct sk_buff *)iter.desc->cookie;
+		}
 
 		PDEBUG(priv->dev, "completed sending %d bytes\n", skb->len);
 
@@ -538,6 +699,47 @@ tx_isr(struct ioq_notifier *notifier)
        tasklet_schedule(&priv->txtask);
 }
 
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	int ret;
+	struct venet_capabilities caps;
+
+	memset(&caps, 0, sizeof(caps));
+
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN);
+		/* note: exclude UFO for now due to stack bug */
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->flags.sg = true;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			priv->flags.tso = true;
+		if (caps.bits & VENET_CAP_TSO6)
+			priv->flags.tso6 = true;
+		if (caps.bits & VENET_CAP_UFO)
+			priv->flags.ufo = true;
+		if (caps.bits & VENET_CAP_ECN)
+			priv->flags.ecn = true;
+
+		dev_info(&priv->dev->dev, "Detected GSO features %s%s%s%s\n",
+			 priv->flags.tso  ? "t" : "-",
+			 priv->flags.tso6 ? "T" : "-",
+			 priv->flags.ufo  ? "u" : "-",
+			 priv->flags.ecn  ? "e" : "-");
+	}
+
+	return 0;
+}
+
 static const struct net_device_ops vbus_enet_netdev_ops = {
 	.ndo_open          = vbus_enet_open,
 	.ndo_stop          = vbus_enet_stop,
@@ -574,12 +776,21 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
 	priv->dev  = dev;
 	priv->vdev = vdev;
 
+	ret = vbus_enet_negcap(priv);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
 	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
 
 	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
 	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
 
 	rx_setup(priv);
+	tx_setup(priv);
 
 	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
 	ioq_notify_enable(priv->txq.queue, 0);
@@ -599,6 +810,22 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
 
 	dev->features |= NETIF_F_HIGHDMA;
 
+	if (priv->flags.sg) {
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (priv->flags.tso)
+			dev->features |= NETIF_F_TSO;
+
+		if (priv->flags.ufo)
+			dev->features |= NETIF_F_UFO;
+
+		if (priv->flags.tso6)
+			dev->features |= NETIF_F_TSO6;
+
+		if (priv->flags.ecn)
+			dev->features |= NETIF_F_TSO_ECN;
+	}
+
 	ret = register_netdev(dev);
 	if (ret < 0) {
 		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
@@ -626,9 +853,9 @@ vbus_enet_remove(struct vbus_device_proxy *vdev)
 	napi_disable(&priv->napi);
 
 	rx_teardown(priv);
-	vbus_enet_tx_reap(priv, 1);
-
 	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
 	ioq_put(priv->txq.queue);
 
 	dev->ops->close(dev, 0);
diff --git a/include/linux/venet.h b/include/linux/venet.h
index 586be40..47ed37d 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -37,8 +37,43 @@ struct venet_capabilities {
 	__u32 bits;
 };
 
-/* CAPABILITIES-GROUP 0 */
-/* #define VENET_CAP_FOO    0   (No capabilities defined yet, for now) */
+#define VENET_CAP_GROUP_SG 0
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
+	struct venet_iov iov[1];
+};
 
 #define VENET_FUNC_LINKUP   0
 #define VENET_FUNC_LINKDOWN 1



* Re: [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 17:18 ` [PATCH 6/7] net: Add vbus_enet driver Gregory Haskins
@ 2009-08-03 18:30   ` Stephen Hemminger
  2009-08-03 20:10     ` Gregory Haskins
  2009-08-04  1:14   ` [PATCH v2] " Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: Stephen Hemminger @ 2009-08-03 18:30 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Mon, 03 Aug 2009 13:18:02 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> +
> +static const struct net_device_ops vbus_enet_netdev_ops = {
> +	.ndo_open          = vbus_enet_open,
> +	.ndo_stop          = vbus_enet_stop,
> +	.ndo_set_config    = vbus_enet_config,
> +	.ndo_start_xmit    = vbus_enet_tx_start,
> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
> +	.ndo_tx_timeout    = vbus_enet_timeout,
> +};


Missing 
	.ndo_set_mac_address = eth_mac_addr,
	.ndo_validate_addr   = eth_validate_addr,

Also, should have change_mtu.

Suggest adding ethtool to report link and settings.

For performance this device should do scatter/gather, tso, gso, etc.


* Re: [PATCH 7/7] venet: add scatter-gather/GSO support
  2009-08-03 17:18 ` [PATCH 7/7] venet: add scatter-gather/GSO support Gregory Haskins
@ 2009-08-03 18:32   ` Stephen Hemminger
  2009-08-03 19:30     ` Gregory Haskins
  2009-08-03 18:33   ` Stephen Hemminger
  1 sibling, 1 reply; 62+ messages in thread
From: Stephen Hemminger @ 2009-08-03 18:32 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Mon, 03 Aug 2009 13:18:07 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> SG/GSO significantly enhance the performance of network traffic under
> certain circumstances.  We implement this feature as a separate patch
> to avoid initially complicating the baseline venet driver.  This will
> presumably make the review process slightly easier, since we can
> focus on the basic interface first.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>

Never mind my previous comment about adding TSO.  But it still would
be good to have an ethtool interface for this.

-- 


* Re: [PATCH 7/7] venet: add scatter-gather/GSO support
  2009-08-03 17:18 ` [PATCH 7/7] venet: add scatter-gather/GSO support Gregory Haskins
  2009-08-03 18:32   ` Stephen Hemminger
@ 2009-08-03 18:33   ` Stephen Hemminger
  2009-08-03 19:57     ` Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: Stephen Hemminger @ 2009-08-03 18:33 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Mon, 03 Aug 2009 13:18:07 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> +	struct {
> +		int                sg:1;
> +		int                tso:1;
> +		int                ufo:1;
> +		int                tso6:1;
> +		int                ecn:1;
> +	} flags;

Why do you have to shadow flags that are already available in net_device?
It is bad design to replicate state in a device driver. The problem
with replicated state is that it has to be updated in both places.

-- 


* Re: [PATCH 7/7] venet: add scatter-gather/GSO support
  2009-08-03 18:32   ` Stephen Hemminger
@ 2009-08-03 19:30     ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 19:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Gregory Haskins, linux-kernel, alacrityvm-devel, netdev


Stephen Hemminger wrote:
> On Mon, 03 Aug 2009 13:18:07 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
> 
>> SG/GSO significantly enhance the performance of network traffic under
>> certain circumstances.  We implement this feature as a separate patch
>> to avoid initially complicating the baseline venet driver.  This will
>> presumably make the review process slightly easier, since we can
>> focus on the basic interface first.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> Never mind my previous comment about adding TSO.


I should probably fold 6+7 together anyway, so this was my fault.  I
think I was originally keeping them separate because the venet was
serving as the canonical example for the vbus interface.  But I'll just
come up with something simpler to demonstrate the interface and merge
these two patches.

> But it still would be good to have an ethtool interface for this.

Ack.  Will do.

Thanks Stephen,
-Greg




* Re: [PATCH 7/7] venet: add scatter-gather/GSO support
  2009-08-03 18:33   ` Stephen Hemminger
@ 2009-08-03 19:57     ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 19:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: linux-kernel, alacrityvm-devel, netdev


Stephen Hemminger wrote:
> On Mon, 03 Aug 2009 13:18:07 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
> 
>> +	struct {
>> +		int                sg:1;
>> +		int                tso:1;
>> +		int                ufo:1;
>> +		int                tso6:1;
>> +		int                ecn:1;
>> +	} flags;
> 
> Why do you have to shadow flags that are already available in net_device?
> It is bad design to replicate state in a device driver. The problem
> with replicated state is that it has to be updated in both places.
> 

Ya, you are right.  I think the rationale was that the flags were "hw"
state, whereas dev->features was software state.  But thinking about it
after your comments, I don't think it makes much difference either way.

I will just have the negcap() function set the features directly in v2.

Kind Regards,
-Greg




* Re: [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 18:30   ` Stephen Hemminger
@ 2009-08-03 20:10     ` Gregory Haskins
  2009-08-03 20:19       ` Stephen Hemminger
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 20:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: linux-kernel, alacrityvm-devel, netdev


Stephen Hemminger wrote:
> On Mon, 03 Aug 2009 13:18:02 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
> 
>> +
>> +static const struct net_device_ops vbus_enet_netdev_ops = {
>> +	.ndo_open          = vbus_enet_open,
>> +	.ndo_stop          = vbus_enet_stop,
>> +	.ndo_set_config    = vbus_enet_config,
>> +	.ndo_start_xmit    = vbus_enet_tx_start,
>> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
>> +	.ndo_tx_timeout    = vbus_enet_timeout,
>> +};
> 
> 
> Missing 
> 	.ndo_set_mac_address = eth_mac_addr,
> 	.ndo_validate_addr   = eth_validate_addr,
> 

Ack.

> Also, should have change_mtu.

note that I do have .ndo_change_mtu.  I assume this is what you are
referring to and just missed it.  If there is something else I need
there, let me know.

Kind Regards,
-Greg




* Re: [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 20:10     ` Gregory Haskins
@ 2009-08-03 20:19       ` Stephen Hemminger
  2009-08-03 20:24         ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Stephen Hemminger @ 2009-08-03 20:19 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Mon, 03 Aug 2009 16:10:37 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> Stephen Hemminger wrote:
> > On Mon, 03 Aug 2009 13:18:02 -0400
> > Gregory Haskins <ghaskins@novell.com> wrote:
> > 
> >> +
> >> +static const struct net_device_ops vbus_enet_netdev_ops = {
> >> +	.ndo_open          = vbus_enet_open,
> >> +	.ndo_stop          = vbus_enet_stop,
> >> +	.ndo_set_config    = vbus_enet_config,
> >> +	.ndo_start_xmit    = vbus_enet_tx_start,
> >> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
> >> +	.ndo_tx_timeout    = vbus_enet_timeout,
> >> +};
> > 
> > 
> > Missing 
> > 	.ndo_set_mac_address = eth_mac_addr,
> > 	.ndo_validate_addr   = eth_validate_addr,
> > 
> 
> Ack.
> 
> > Also, should have change_mtu.
> 
> note that I do have .ndo_change_mtu.  I assume this is what you are
> referring to and just missed it.  If there is something else I need
> there, let me know.

If you don't have a change_mtu, then MTU is unlimited. Can the device
handle 64K or larger transfers?


* Re: [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 20:19       ` Stephen Hemminger
@ 2009-08-03 20:24         ` Gregory Haskins
  2009-08-03 20:29           ` Stephen Hemminger
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-03 20:24 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Gregory Haskins, linux-kernel, alacrityvm-devel, netdev


Stephen Hemminger wrote:
> On Mon, 03 Aug 2009 16:10:37 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
> 
>> Stephen Hemminger wrote:
>>> On Mon, 03 Aug 2009 13:18:02 -0400
>>> Gregory Haskins <ghaskins@novell.com> wrote:
>>>
>>>> +
>>>> +static const struct net_device_ops vbus_enet_netdev_ops = {
>>>> +	.ndo_open          = vbus_enet_open,
>>>> +	.ndo_stop          = vbus_enet_stop,
>>>> +	.ndo_set_config    = vbus_enet_config,
>>>> +	.ndo_start_xmit    = vbus_enet_tx_start,
>>>> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
>>>> +	.ndo_tx_timeout    = vbus_enet_timeout,
>>>> +};
>>>
>>> Missing 
>>> 	.ndo_set_mac_address = eth_mac_addr,
>>> 	.ndo_validate_addr   = eth_validate_addr,
>>>
>> Ack.
>>
>>> Also, should have change_mtu.
>> note that I do have .ndo_change_mtu.  I assume this is what you are
>> referring to and just missed it.  If there is something else I need
>> there, let me know.
> 
> If you don't have a change_mtu, then MTU is unlimited.

Is "change_mtu" different from ".ndo_change_mtu" on the ndo struct?
That's what's confusing me, as I have the .ndo one already.  Is there
something else I need in addition, or should I be ok as is?

> Can the device handle 64K or larger transfers?

Well, it's been tested with 64K GSO packets at least.  I think it could
handle arbitrarily large packets as long as they are paged, but I have
never tried beyond 64K due to the 64K L4 length limit.

Thanks Stephen,
-Greg




* Re: [PATCH 6/7] net: Add vbus_enet driver
  2009-08-03 20:24         ` Gregory Haskins
@ 2009-08-03 20:29           ` Stephen Hemminger
  0 siblings, 0 replies; 62+ messages in thread
From: Stephen Hemminger @ 2009-08-03 20:29 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Gregory Haskins, linux-kernel, alacrityvm-devel, netdev

On Mon, 03 Aug 2009 16:24:10 -0400
Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Stephen Hemminger wrote:
> > On Mon, 03 Aug 2009 16:10:37 -0400
> > Gregory Haskins <ghaskins@novell.com> wrote:
> > 
> >> Stephen Hemminger wrote:
> >>> On Mon, 03 Aug 2009 13:18:02 -0400
> >>> Gregory Haskins <ghaskins@novell.com> wrote:
> >>>
> >>>> +
> >>>> +static const struct net_device_ops vbus_enet_netdev_ops = {
> >>>> +	.ndo_open          = vbus_enet_open,
> >>>> +	.ndo_stop          = vbus_enet_stop,
> >>>> +	.ndo_set_config    = vbus_enet_config,
> >>>> +	.ndo_start_xmit    = vbus_enet_tx_start,
> >>>> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
> >>>> +	.ndo_tx_timeout    = vbus_enet_timeout,
> >>>> +};
> >>>
> >>> Missing 
> >>> 	.ndo_set_mac_address = eth_mac_addr,
> >>> 	.ndo_validate_addr   = eth_validate_addr,
> >>>
> >> Ack.
> >>
> >>> Also, should have change_mtu.
> >> note that I do have .ndo_change_mtu.  I assume this is what you are
> >> referring to and just missed it.  If there is something else I need
> >> there, let me know.
> > 
> > If you don't have a change_mtu, then MTU is unlimited.
> 
> Is "change_mtu" different from ".ndo_change_mtu" on the ndo struct?
> That's what's confusing me, as I have the .ndo one already.  Is there
> something else I need in addition, or should I be ok as is?

Never mind, it is the same as .ndo_change_mtu


* [PATCH v2] net: Add vbus_enet driver
  2009-08-03 17:18 ` [PATCH 6/7] net: Add vbus_enet driver Gregory Haskins
  2009-08-03 18:30   ` Stephen Hemminger
@ 2009-08-04  1:14   ` Gregory Haskins
  2009-08-04  2:38     ` David Miller
  2009-10-02 15:33     ` [PATCH v3] " Gregory Haskins
  1 sibling, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-04  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: alacrityvm-devel, netdev, shemminger

This is version-2 of the vbus_enet driver, originally posted here:

http://lkml.org/lkml/2009/8/3/278

It addresses (I believe) all of Stephen Hemminger's feedback.

[ Changelog:

	v2:
	  *) folded patches 6/7 and 7/7 together
	  *) get rid of shadow flags
	  *) add missing baseline .ndo callbacks
	  *) add support for ethtool
]

Regards,
-Greg

-----------------------------

net: Add vbus_enet driver

A virtualized 802.x network device based on the VBUS interface. It can be
used with any hypervisor/kernel that supports the virtual-ethernet/vbus
protocol.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/Kconfig     |   14 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c |  895 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild    |    1 
 include/linux/venet.h   |   84 ++++
 5 files changed, 995 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 include/linux/venet.h

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..974213e 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3209,4 +3209,18 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "VBUS Ethernet Driver"
+	default n
+	select VBUS_PROXY
+	help
+	   A virtualized 802.x network device based on the VBUS
+	   "virtual-ethernet" interface.  It can be used with any
+	   hypervisor/kernel that supports the vbus+venet protocol.
+
+config VBUS_ENET_DEBUG
+        bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..2a3c7a9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -277,6 +277,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..91c47a9
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,895 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("virtual-ethernet");
+MODULE_VERSION("1");
+
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
+
+#define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct vbus_enet_queue     rxq;
+	struct vbus_enet_queue     txq;
+	struct tasklet_struct      txtask;
+	bool                       sg;
+};
+
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct net_device *dev, struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = netdev_alloc_skb(dev, len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, NET_IP_ALIGN); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0); /* will never fail unless seriously broken */
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-head index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
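The populate-until-valid pattern in rx_setup() above (seek to the valid tail, fill each empty descriptor, push) can be sketched as a standalone userspace model. The types and helpers below are simplified, hypothetical stand-ins for the real ioq API, not kernel code; they only model the ownership/wrap logic:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_LEN 8

struct mini_desc {
	int   valid;   /* set once the slot is published to the "host" */
	void *cookie;  /* buffer pointer, analogous to desc->cookie */
};

struct mini_ring {
	struct mini_desc desc[RING_LEN];
	size_t head;   /* next slot to populate (the "valid tail") */
};

/*
 * Populate every empty slot, as rx_setup() does: stop when the
 * iterator lands on a descriptor that is already valid, i.e. the
 * ring has wrapped back around to the first slot we filled.
 */
static size_t mini_rx_setup(struct mini_ring *r, void *bufs[], size_t n)
{
	size_t filled = 0;

	while (!r->desc[r->head].valid && filled < n) {
		r->desc[r->head].cookie = bufs[filled++];
		r->desc[r->head].valid  = 1;          /* publish the slot */
		r->head = (r->head + 1) % RING_LEN;   /* advance, like ioq_iter_push() */
	}
	return filled;
}
```

A second call on the same ring fills nothing, since every slot is still owned by the host side — the same reason the real loop terminates.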
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < tx_ringlen; i++) {
+		struct venet_sg *vsg;
+		size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+		size_t len = sizeof(*vsg) + iovlen;
+
+		vsg = kzalloc(len, GFP_KERNEL);
+		if (!vsg)
+			return -ENOMEM;
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+		iter.desc->ptr    = (u64)__pa(vsg);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	vbus_enet_tx_reap(priv, 1);
+
+	if (!priv->sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		dev_warn(&dev->dev, "Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__napi_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as zero-length
+	 * packets, which we can simply discard and replace with
+	 * buffers sized for the new MTU.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->dev->stats.rx_packets++;
+			priv->dev->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when it is
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG(priv->dev, "%d packets received\n", npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		napi_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "sending %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		dev_err(&priv->dev->dev, "tx on full queue bug\n");
+		return 1;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	if (priv->sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_transport_header(skb) - skb->data;
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type " \
+				      "0x%x\n", sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
+	iter.desc->valid  = 1;
+
+	priv->dev->stats.tx_packets++;
+	priv->dev->stats.tx_bytes += skb->len;
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG(priv->dev, "backpressure tx queue\n");
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
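The if/else ladder in the sg path of vbus_enet_tx_start() translates the skb's gso_type bitmask into the single venet wire enum. That translation can be checked in isolation with the sketch below; the MY_GSO_* bits are illustrative local defines standing in for the kernel's SKB_GSO_* values, while the VENET_GSO_TYPE_* values mirror include/linux/venet.h:

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's SKB_GSO_* bits */
#define MY_GSO_TCPV4  (1 << 0)
#define MY_GSO_UDP    (1 << 1)
#define MY_GSO_TCPV6  (1 << 4)

/* Wire values from include/linux/venet.h */
#define VENET_GSO_TYPE_TCPV4 0
#define VENET_GSO_TYPE_UDP   1
#define VENET_GSO_TYPE_TCPV6 2

/*
 * Mirror of the ladder in vbus_enet_tx_start(): map a gso_type
 * bitmask to one venet wire type; -1 for an unknown type (the
 * driver panics in that case).
 */
static int venet_gso_type(unsigned int gso_type)
{
	if (gso_type & MY_GSO_TCPV4)
		return VENET_GSO_TYPE_TCPV4;
	if (gso_type & MY_GSO_TCPV6)
		return VENET_GSO_TYPE_TCPV6;
	if (gso_type & MY_GSO_UDP)
		return VENET_GSO_TYPE_UDP;
	return -1;
}
```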
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && (!iter.desc->sown || force)) {
+		struct sk_buff *skb;
+
+		if (priv->sg) {
+			struct venet_sg *vsg;
+
+			vsg = (struct venet_sg *)iter.desc->cookie;
+			skb = (struct sk_buff *)vsg->cookie;
+
+		} else {
+			skb = (struct sk_buff *)iter.desc->cookie;
+		}
+
+		PDEBUG(priv->dev, "completed sending %d bytes\n", skb->len);
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		dev_kfree_skb(skb);
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		PDEBUG(priv->dev, "re-enabling tx queue\n");
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "Transmit timeout\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "deferred_tx_isr\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->txq.queue, 0);
+}
+
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
+
+	PDEBUG(priv->dev, "tx_isr\n");
+
+	ioq_notify_disable(priv->txq.queue, 0);
+	tasklet_schedule(&priv->txtask);
+
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	struct net_device *dev = priv->dev;
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN);
+		/* note: exclude UFO for now due to stack bug */
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->sg = true;
+
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			dev->features |= NETIF_F_TSO;
+		if (caps.bits & VENET_CAP_UFO)
+			dev->features |= NETIF_F_UFO;
+		if (caps.bits & VENET_CAP_TSO6)
+			dev->features |= NETIF_F_TSO6;
+		if (caps.bits & VENET_CAP_ECN)
+			dev->features |= NETIF_F_TSO_ECN;
+	}
+
+	return 0;
+}
+
+static int vbus_enet_set_tx_csum(struct net_device *dev, u32 data)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+
+	if (data && !priv->sg)
+		return -ENOSYS;
+
+	return ethtool_op_set_tx_hw_csum(dev, data);
+}
+
+static struct ethtool_ops vbus_enet_ethtool_ops = {
+	.set_tx_csum = vbus_enet_set_tx_csum,
+	.set_sg      = ethtool_op_set_sg,
+	.set_tso     = ethtool_op_set_tso,
+	.get_link    = ethtool_op_get_link,
+};
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open            = vbus_enet_open,
+	.ndo_stop            = vbus_enet_stop,
+	.ndo_set_config      = vbus_enet_config,
+	.ndo_start_xmit      = vbus_enet_tx_start,
+	.ndo_change_mtu	     = vbus_enet_change_mtu,
+	.ndo_tx_timeout      = vbus_enet_timeout,
+	.ndo_set_mac_address = eth_mac_addr,
+	.ndo_validate_addr   = eth_validate_addr,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	ret = vbus_enet_negcap(priv);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+	tx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
+	ioq_notify_enable(priv->txq.queue, 0);
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+	SET_ETHTOOL_OPS(dev, &vbus_enet_ethtool_ops);
+	SET_NETDEV_DEV(dev, &vdev->dev);
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, 128);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
+	ioq_put(priv->txq.queue);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index fa15bbf..911f7ef 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
 unifdef-y += vbus_pci.h
+unifdef-y += venet.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..47ed37d
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,84 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#include <linux/types.h>
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+#define VENET_CAP_GROUP_SG 0
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
+	struct venet_iov iov[1];
+};
+
+#define VENET_FUNC_LINKUP   0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP   3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX  4
+
+#endif /* _LINUX_VENET_H */
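struct venet_sg above ends in the old-style one-element trailing array (`iov[1]`), which is why tx_setup() allocates `sizeof(*vsg) + sizeof(struct venet_iov) * (MAX_SKB_FRAGS - 1)`: one iov slot already lives inside the struct. The sizing arithmetic can be demonstrated standalone; the structs below are simplified stand-ins for the venet types (the csum/gso substructs are omitted), not the real ABI layout:

```c
#include <assert.h>
#include <stddef.h>

struct my_iov {                    /* stand-in for struct venet_iov */
	unsigned int       len;
	unsigned long long ptr;
};

struct my_sg {                     /* stand-in for struct venet_sg */
	unsigned long long cookie;
	unsigned int       flags;
	unsigned int       len;
	unsigned int       count;
	struct my_iov      iov[1]; /* trailing array: one slot inline */
};

/*
 * An n-entry descriptor needs the struct itself plus n-1 extra iov
 * slots -- the same arithmetic tx_setup() performs with
 * MAX_SKB_FRAGS.
 */
static size_t my_sg_size(size_t nr_iovs)
{
	return sizeof(struct my_sg) + sizeof(struct my_iov) * (nr_iovs - 1);
}
```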


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v2] net: Add vbus_enet driver
  2009-08-04  1:14   ` [PATCH v2] " Gregory Haskins
@ 2009-08-04  2:38     ` David Miller
  2009-08-04 13:57       ` [Alacrityvm-devel] " Gregory Haskins
  2009-10-02 15:33     ` [PATCH v3] " Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: David Miller @ 2009-08-04  2:38 UTC (permalink / raw)
  To: ghaskins; +Cc: linux-kernel, alacrityvm-devel, netdev, shemminger

From: Gregory Haskins <ghaskins@novell.com>
Date: Mon, 03 Aug 2009 21:14:04 -0400

> net: Add vbus_enet driver
> 
> A virtualized 802.x network device based on the VBUS interface. It can be
> used with any hypervisor/kernel that supports the virtual-ethernet/vbus
> protocol.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>

I'm fine with this, it depends upon the vbus infrastructure so
whoever pulls that in can pull this in as well:

Acked-by: David S. Miller <davem@davemloft.net>


* Re: [Alacrityvm-devel] [PATCH v2] net: Add vbus_enet driver
  2009-08-04  2:38     ` David Miller
@ 2009-08-04 13:57       ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-04 13:57 UTC (permalink / raw)
  To: David Miller
  Cc: ghaskins, netdev, shemminger, linux-kernel, alacrityvm-devel,
	Stephen Rothwell

David Miller wrote:
> From: Gregory Haskins <ghaskins@novell.com>
> Date: Mon, 03 Aug 2009 21:14:04 -0400
> 
>> net: Add vbus_enet driver
>>
>> A virtualized 802.x network device based on the VBUS interface. It can be
>> used with any hypervisor/kernel that supports the virtual-ethernet/vbus
>> protocol.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> I'm fine with this, it depends upon the vbus infrastructure so
> whoever pulls that in can pull this in as well:
> 
> Acked-by: David S. Miller <davem@davemloft.net>

Thanks, David.

If there are no other comments/objections, I would propose to pull this
into linux-next.  I have prepared a branch specifically for this
purpose, which can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git
linux-next

I have also folded the following update to MAINTAINERS into the series:

diff --git a/MAINTAINERS b/MAINTAINERS
index d6befb2..7cbcf5d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2707,6 +2707,12 @@ L:       linux-mips@linux-mips.org
 S:     Maintained
 F:     drivers/serial/ioc3_serial.c

+IOQ LIBRARY
+M:     Gregory Haskins <ghaskins@novell.com>
+S:     Maintained
+F:     include/linux/ioq.h
+F:     lib/ioq.c
+
 IP MASQUERADING
 M:     Juanjo Ciarlante <jjciarla@raiz.uncu.edu.ar>
 S:     Maintained
@@ -4555,6 +4561,12 @@ F:       drivers/serial/serial_lh7a40x.c
 F:     drivers/usb/gadget/lh7a40*
 F:     drivers/usb/host/ohci-lh7a40*

+SHM-SIGNAL LIBRARY
+M:     Gregory Haskins <ghaskins@novell.com>
+S:     Maintained
+F:     include/linux/shm_signal.h
+F:     lib/shm_signal.c
+
 SHPC HOTPLUG DRIVER
 M:     Kristen Carlson Accardi <kristen.c.accardi@intel.com>
 L:     linux-pci@vger.kernel.org
@@ -5423,6 +5435,19 @@ S:       Maintained
 F:     Documentation/fb/uvesafb.txt
 F:     drivers/video/uvesafb.*

+VBUS
+M:     Gregory Haskins <ghaskins@novell.com>
+S:     Maintained
+F:     include/linux/vbus*
+F:     drivers/vbus/*
+
+VBUS ETHERNET DRIVER
+M:     Gregory Haskins <ghaskins@novell.com>
+S:     Maintained
+W:     http://developer.novell.com/wiki/index.php/AlacrityVM
+F:     include/linux/venet.h
+F:     drivers/net/vbus-enet.c
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:     OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:     Maintained






* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
                   ` (6 preceding siblings ...)
  2009-08-03 17:18 ` [PATCH 7/7] venet: add scatter-gather/GSO support Gregory Haskins
@ 2009-08-06  8:19 ` Michael S. Tsirkin
  2009-08-06 10:17   ` Michael S. Tsirkin
                     ` (2 more replies)
  7 siblings, 3 replies; 62+ messages in thread
From: Michael S. Tsirkin @ 2009-08-06  8:19 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev, kvm

On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
> (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)

These are guest drivers, right? Merging the guest first means relying on
kernel interface from an out of tree driver, which well might change
before it goes in.  Would it make more sense to start merging with the
host side of the project?

> This series implements the guest-side drivers for accelerated IO
> when running on top of the AlacrityVM hypervisor, the details of
> which you can find here:
> 
> http://developer.novell.com/wiki/index.php/AlacrityVM

Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.

> This series includes the basic plumbing, as well as the driver for
> accelerated 802.x (ethernet) networking.

The graphs comparing virtio with vbus look interesting.
However, they do not compare apples to apples, do they?
These compare userspace virtio with kernel vbus, where for
apples to apples comparison one would need to compare
kernel virtio with kernel vbus. Right?


> Regards,
> -Greg
> 
> ---
> 
> Gregory Haskins (7):
>       venet: add scatter-gather/GSO support
>       net: Add vbus_enet driver
>       ioq: add driver-side vbus helpers
>       vbus-proxy: add a pci-to-vbus bridge
>       vbus: add a "vbus-proxy" bus model for vbus_driver objects
>       ioq: Add basic definitions for a shared-memory, lockless queue
>       shm-signal: shared-memory signals
> 
> 
>  arch/x86/Kconfig            |    2 
>  drivers/Makefile            |    1 
>  drivers/net/Kconfig         |   14 +
>  drivers/net/Makefile        |    1 
>  drivers/net/vbus-enet.c     |  899 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/vbus/Kconfig        |   24 +
>  drivers/vbus/Makefile       |    6 
>  drivers/vbus/bus-proxy.c    |  216 ++++++++++
>  drivers/vbus/pci-bridge.c   |  824 +++++++++++++++++++++++++++++++++++++++
>  include/linux/Kbuild        |    4 
>  include/linux/ioq.h         |  415 ++++++++++++++++++++
>  include/linux/shm_signal.h  |  189 +++++++++
>  include/linux/vbus_driver.h |   80 ++++
>  include/linux/vbus_pci.h    |  127 ++++++
>  include/linux/venet.h       |   84 ++++
>  lib/Kconfig                 |   21 +
>  lib/Makefile                |    2 
>  lib/ioq.c                   |  294 ++++++++++++++
>  lib/shm_signal.c            |  192 +++++++++
>  19 files changed, 3395 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/vbus-enet.c
>  create mode 100644 drivers/vbus/Kconfig
>  create mode 100644 drivers/vbus/Makefile
>  create mode 100644 drivers/vbus/bus-proxy.c
>  create mode 100644 drivers/vbus/pci-bridge.c
>  create mode 100644 include/linux/ioq.h
>  create mode 100644 include/linux/shm_signal.h
>  create mode 100644 include/linux/vbus_driver.h
>  create mode 100644 include/linux/vbus_pci.h
>  create mode 100644 include/linux/venet.h
>  create mode 100644 lib/ioq.c
>  create mode 100644 lib/shm_signal.c
> 
> -- 
> Signature
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
@ 2009-08-06 10:17   ` Michael S. Tsirkin
  2009-08-06 12:09     ` Gregory Haskins
  2009-08-06 12:08   ` Gregory Haskins
  2009-08-07 14:19   ` Anthony Liguori
  2 siblings, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2009-08-06 10:17 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev, kvm

On Thu, Aug 06, 2009 at 11:19:56AM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
> > (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)
> 
> These are guest drivers, right? Merging the guest first means relying on
> kernel interface from an out of tree driver, which well might change
> before it goes in.  Would it make more sense to start merging with the
> host side of the project?
> 
> > This series implements the guest-side drivers for accelerated IO
> > when running on top of the AlacrityVM hypervisor, the details of
> > which you can find here:
> > 
> > http://developer.novell.com/wiki/index.php/AlacrityVM
> 
> Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.
> 
> > This series includes the basic plumbing, as well as the driver for
> > accelerated 802.x (ethernet) networking.
> 
> The graphs comparing virtio with vbus look interesting.
> However, they do not compare apples to apples, do they?
> These compare userspace virtio with kernel vbus, where for
> apples to apples comparison one would need to compare
> kernel virtio with kernel vbus. Right?

Or userspace virtio with userspace vbus.

> > Regards,
> > -Greg
> > 
> > ---
> > 
> > Gregory Haskins (7):
> >       venet: add scatter-gather/GSO support
> >       net: Add vbus_enet driver
> >       ioq: add driver-side vbus helpers
> >       vbus-proxy: add a pci-to-vbus bridge
> >       vbus: add a "vbus-proxy" bus model for vbus_driver objects
> >       ioq: Add basic definitions for a shared-memory, lockless queue
> >       shm-signal: shared-memory signals
> > 
> > 
> >  arch/x86/Kconfig            |    2 
> >  drivers/Makefile            |    1 
> >  drivers/net/Kconfig         |   14 +
> >  drivers/net/Makefile        |    1 
> >  drivers/net/vbus-enet.c     |  899 +++++++++++++++++++++++++++++++++++++++++++
> >  drivers/vbus/Kconfig        |   24 +
> >  drivers/vbus/Makefile       |    6 
> >  drivers/vbus/bus-proxy.c    |  216 ++++++++++
> >  drivers/vbus/pci-bridge.c   |  824 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/Kbuild        |    4 
> >  include/linux/ioq.h         |  415 ++++++++++++++++++++
> >  include/linux/shm_signal.h  |  189 +++++++++
> >  include/linux/vbus_driver.h |   80 ++++
> >  include/linux/vbus_pci.h    |  127 ++++++
> >  include/linux/venet.h       |   84 ++++
> >  lib/Kconfig                 |   21 +
> >  lib/Makefile                |    2 
> >  lib/ioq.c                   |  294 ++++++++++++++
> >  lib/shm_signal.c            |  192 +++++++++
> >  19 files changed, 3395 insertions(+), 0 deletions(-)
> >  create mode 100644 drivers/net/vbus-enet.c
> >  create mode 100644 drivers/vbus/Kconfig
> >  create mode 100644 drivers/vbus/Makefile
> >  create mode 100644 drivers/vbus/bus-proxy.c
> >  create mode 100644 drivers/vbus/pci-bridge.c
> >  create mode 100644 include/linux/ioq.h
> >  create mode 100644 include/linux/shm_signal.h
> >  create mode 100644 include/linux/vbus_driver.h
> >  create mode 100644 include/linux/vbus_pci.h
> >  create mode 100644 include/linux/venet.h
> >  create mode 100644 lib/ioq.c
> >  create mode 100644 lib/shm_signal.c
> > 
> > -- 
> > Signature
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
  2009-08-06 10:17   ` Michael S. Tsirkin
@ 2009-08-06 12:08   ` Gregory Haskins
  2009-08-06 12:24     ` Michael S. Tsirkin
  2009-08-06 12:54     ` Avi Kivity
  2009-08-07 14:19   ` Anthony Liguori
  2 siblings, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 12:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: alacrityvm-devel, kvm, linux-kernel, netdev

Hi Michael,

>>> On 8/6/2009 at  4:19 AM, in message <20090806081955.GA9752@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com> wrote: 
> On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
>> (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)
> 
> These are guest drivers, right?

Yep.

> Merging the guest first means relying on
> kernel interface from an out of tree driver, which well might change
> before it goes in.

ABI compatibility is already addressed/handled, so even if that is true it's not a problem.

> Would it make more sense to start merging with the host side of the project?

Not necessarily, no.  These are drivers for a "device", so it's no different from merging any other driver, really.  This is especially true since the hypervisor is also already published and freely available today, so anyone can start using it.

> 
>> This series implements the guest-side drivers for accelerated IO
>> when running on top of the AlacrityVM hypervisor, the details of
>> which you can find here:
>> 
>> http://developer.novell.com/wiki/index.php/AlacrityVM
> 
> Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.

I *can* do that, but there is nothing in these drivers that is KVM specific (it's all pure PCI and VBUS).  I've already made the general announcement about the project/ml cross posted to KVM for anyone that might be interested, but I figure I will spare the general KVM list the details unless something specifically pertains to, or affects, KVM.  For instance, when I get to pushing the hypervisor side, I still need to work on getting that 'xinterface' patch to you guys.  I would certainly be CC'ing kvm@vger when that happens since it modifies the KVM code.

So instead, I would just encourage anyone interested (such as yourself) to join the alacrity list so I don't bother the KVM community unless absolutely necessary.

> 
>> This series includes the basic plumbing, as well as the driver for
>> accelerated 802.x (ethernet) networking.
> 
> The graphs comparing virtio with vbus look interesting.
> However, they do not compare apples to apples, do they?

Yes, I believe they do.  They represent the best that KVM has to offer (to my knowledge) vs the best that alacrityvm has to offer.

> These compare userspace virtio with kernel vbus,

vbus is a device model (akin to QEMU's device model).  Technically, it was a comparison of userspace virtio-net (via QEMU) to kernel venet (via vbus),
which I again stress is the state of the art for both to my knowledge.

As I have explained before in earlier threads on kvm@vger, virtio is not mutually exclusive here.  You can run the virtio protocol over the vbus model if someone were so inclined.  In fact, I proposed this very idea to you a month or two ago but I believe you decided to go your own way and reinvent some other in-kernel model instead for your own reasons.

> where for apples to apples comparison one would need to compare
> kernel virtio with kernel vbus. Right?

Again, it already *is* apples to apples as far as I am concerned.  

At the time I ran those numbers, there was certainly no in-kernel virtio model to play with.  And to my knowledge, there isn't one now (I was never CC'd on the patches, and a cursory search of the KVM list isn't revealing one that was posted recently).

To reiterate: kernel virtio-net (using ??) to kernel venet (vbus based) to kernel virtio-net (vbus, but doesn't exist yet) would be a fun bakeoff.  If you have something for the kernel virtio-net, point me at it and I will try to include it in the comparison next time.

Kind Regards,
-Greg

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 10:17   ` Michael S. Tsirkin
@ 2009-08-06 12:09     ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 12:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: alacrityvm-devel, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  6:17 AM, in message <20090806101702.GA10605@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com> wrote: 
> On Thu, Aug 06, 2009 at 11:19:56AM +0300, Michael S. Tsirkin wrote:
>> On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
>> > (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)
>> 
>> These are guest drivers, right? Merging the guest first means relying on
>> kernel interface from an out of tree driver, which well might change
>> before it goes in.  Would it make more sense to start merging with the
>> host side of the project?
>> 
>> > This series implements the guest-side drivers for accelerated IO
>> > when running on top of the AlacrityVM hypervisor, the details of
>> > which you can find here:
>> > 
>> > http://developer.novell.com/wiki/index.php/AlacrityVM
>> 
>> Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.
>> 
>> > This series includes the basic plumbing, as well as the driver for
>> > accelerated 802.x (ethernet) networking.
>> 
>> The graphs comparing virtio with vbus look interesting.
>> However, they do not compare apples to apples, do they?
>> These compare userspace virtio with kernel vbus, where for
>> apples to apples comparison one would need to compare
>> kernel virtio with kernel vbus. Right?
> 
> Or userspace virtio with userspace vbus.

Note: That would be pointless.

-Greg



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 12:08   ` Gregory Haskins
@ 2009-08-06 12:24     ` Michael S. Tsirkin
  2009-08-06 13:00       ` Gregory Haskins
  2009-08-06 12:54     ` Avi Kivity
  1 sibling, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2009-08-06 12:24 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: alacrityvm-devel, kvm, linux-kernel, netdev

On Thu, Aug 06, 2009 at 06:08:27AM -0600, Gregory Haskins wrote:
> Hi Michael,
> 
> >>> On 8/6/2009 at  4:19 AM, in message <20090806081955.GA9752@redhat.com>,
> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
> > On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
> >> (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)
> > 
> > These are guest drivers, right?
> 
> Yep.
> 
> > Merging the guest first means relying on
> > kernel interface from an out of tree driver, which well might change
> > before it goes in.
> 
> ABI compatibility is already addressed/handled, so even if that is true its not a problem.

It is? With versioning? Presumably this:

+       params.devid   = vdev->id;
+       params.version = version;
+
+       ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVOPEN,
+                                &params, sizeof(params));
+       if (ret < 0)
+               return ret;

Even assuming host even knows how to decode this structure (e.g.  some
other host module doesn't use VBUS_PCI_HC_DEVOPEN), checks the version
and denies older guests, this might help guest not to crash, but guest
still won't work.

> > Would it make more sense to start merging with the host side of the project?
> 
> Not necessarily, no.  These are drivers for a "device", so its no
> different than merging any other driver really.  This is especially
> true since the hypervisor is also already published and freely
> available today, so anyone can start using it.

The difference is clear to me: devices do not get to set kernel/userspace
interfaces. This "device" depends on a specific interface between
kernel and (guest) userspace.

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 12:08   ` Gregory Haskins
  2009-08-06 12:24     ` Michael S. Tsirkin
@ 2009-08-06 12:54     ` Avi Kivity
  2009-08-06 13:03       ` Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2009-08-06 12:54 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, alacrityvm-devel, kvm, linux-kernel, netdev

On 08/06/2009 03:08 PM, Gregory Haskins wrote:
>> Merging the guest first means relying on
>> kernel interface from an out of tree driver, which well might change
>> before it goes in.
>>      
>
> ABI compatibility is already addressed/handled, so even if that is true its not a problem.
>
>    

Really the correct way to address the ABI is to publish a spec and write 
both host and guest drivers to that.  Unfortunately we didn't do this 
with virtio.

It becomes more important when you have multiple implementations (e.g. 
Windows drivers).

>>> This series implements the guest-side drivers for accelerated IO
>>> when running on top of the AlacrityVM hypervisor, the details of
>>> which you can find here:
>>>
>>> http://developer.novell.com/wiki/index.php/AlacrityVM
>>>        
>> Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.
>>      
>
> I *can* do that, but there is nothing in these drivers that is KVM specific (its all pure PCI and VBUS).  I've already made the general announcement about the project/ml cross posted to KVM for anyone that might be interested, but I figure I will spare the general KVM list the details unless something specifically pertains to, or affects, KVM.  For instance, when I get to pushing the hypervisor side, I still need to work on getting that 'xinterface' patch to you guys.  I would certainly be CC'ing kvm@vger when that happens since it modifies the KVM code.
>
> So instead, I would just encourage anyone interested (such as yourself) to join the alacrity list so I don't bother the KVM community unless absolutely necessary.
>    

It's true that vbus is a separate project (in fact even virtio is 
completely separate from kvm).  Still I think it would be of interest to 
many kvm@ readers.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 12:24     ` Michael S. Tsirkin
@ 2009-08-06 13:00       ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 13:00 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: alacrityvm-devel, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  8:24 AM, in message <20090806122449.GC11038@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com> wrote: 
> On Thu, Aug 06, 2009 at 06:08:27AM -0600, Gregory Haskins wrote:
>> Hi Michael,
>> 
>> >>> On 8/6/2009 at  4:19 AM, in message <20090806081955.GA9752@redhat.com>,
>> "Michael S. Tsirkin" <mst@redhat.com> wrote: 
>> > On Mon, Aug 03, 2009 at 01:17:30PM -0400, Gregory Haskins wrote:
>> >> (Applies to v2.6.31-rc5, proposed for linux-next after review is complete)
>> > 
>> > These are guest drivers, right?
>> 
>> Yep.
>> 
>> > Merging the guest first means relying on
>> > kernel interface from an out of tree driver, which well might change
>> > before it goes in.
>> 
>> ABI compatibility is already addressed/handled, so even if that is true its 
> not a problem.
> 
> It is? With versioning? Presumably this:
> 
> +       params.devid   = vdev->id;
> +       params.version = version;
> +
> +       ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVOPEN,
> +                                &params, sizeof(params));
> +       if (ret < 0)
> +               return ret;

This is part of it.  There are various ABI version components (which, by the way, are only expected to change while the code is experimental/alpha).  The other component is capability functions (such as NEGCAP in the venet driver).

> 
> Even assuming host even knows how to decode this structure (e.g.  some
> other host module doesn't use VBUS_PCI_HC_DEVOPEN),

This argument demonstrates a fundamental lack of understanding on how AlacrityVM works.  Please study the code more closely and you will see that your concern is illogical.  If it's still not clear, let me know and I will walk it through for you.

> checks the version
> and denies older guests, this might help guest not to crash, but guest
> still won't work.

That's OK.  As I said above, the version number is just there for gross ABI protection and generally will never be changed once a driver is "official" (if at all).  We use things like capability-bit negotiation to allow backward compatibility.

For an example, see drivers/net/vbus-enet.c, line 703:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/net/vbus-enet.c;h=7220f43723adc5b0bece1bc37974fae1b034cd9e;hb=b3b2339efbd4e754b1c85f8bc8f85f21a1a1f509#l703

venet exposes a verb "NEGCAP" (negotiate capabilities), which is used to extend the ABI.  The version number you quote above (on the device open) is really just a check to make sure the NEGCAP ABI is compatible.  The rest of the ABI is negotiated at runtime with capability feature bits.

FWIW, I decided not to build a per-device capability mechanism into the low-level vbus protocol (e.g. there is no VBUS_PCI_HC_NEGCAP) because I felt the individual devices could better express their own capability mechanisms than a generalized one could.  Therefore it is up to each device to define its own mechanism, presumably using a verb from its own private call() namespace (as venet has done).

> 
>> > Would it make more sense to start merging with the host side of the 
> project?
>> 
>> Not necessarily, no.  These are drivers for a "device", so its no
>> different than merging any other driver really.  This is especially
>> true since the hypervisor is also already published and freely
>> available today, so anyone can start using it.
> 
> The difference is clear to me: devices do not get to set kernel/userspace
> interfaces. This "device" depends on a specific interface between
> kernel and (guest) userspace.

This doesn't really parse for me, but I think the gist of it is based on an incorrect assumption.

Can you elaborate?

Kind Regards,
-Greg




^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 12:54     ` Avi Kivity
@ 2009-08-06 13:03       ` Gregory Haskins
  2009-08-06 13:44         ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 13:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  8:54 AM, in message <4A7AD29E.50800@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/06/2009 03:08 PM, Gregory Haskins wrote:
>>> Merging the guest first means relying on
>>> kernel interface from an out of tree driver, which well might change
>>> before it goes in.
>>>      
>>
>> ABI compatibility is already addressed/handled, so even if that is true its 
> not a problem.
>>
>>    
> 
> Really the correct way to address the ABI is to publish a spec and write 
> both host and guest drivers to that.  Unfortunately we didn't do this 
> with virtio.
> 
> It becomes more important when you have multiple implementations (e.g. 
> Windows drivers).
> 
>>>> This series implements the guest-side drivers for accelerated IO
>>>> when running on top of the AlacrityVM hypervisor, the details of
>>>> which you can find here:
>>>>
>>>> http://developer.novell.com/wiki/index.php/AlacrityVM
>>>>        
>>> Since AlacrityVM is kvm based, Cc kvm@vger.kernel.org.
>>>      
>>
>> I *can* do that, but there is nothing in these drivers that is KVM specific 
> (its all pure PCI and VBUS).  I've already made the general announcement 
> about the project/ml cross posted to KVM for anyone that might be interested, 
> but I figure I will spare the general KVM list the details unless something 
> specifically pertains to, or affects, KVM.  For instance, when I get to 
> pushing the hypervisor side, I still need to work on getting that 
> 'xinterface' patch to you guys.  I would certainly be CC'ing kvm@vger when 
> that happens since it modifies the KVM code.
>>
>> So instead, I would just encourage anyone interested (such as yourself) to 
> join the alacrity list so I don't bother the KVM community unless absolutely 
> necessary.
>>    
> 
> It's true that vbus is a separate project (in fact even virtio is 
> completely separate from kvm).  Still I think it would be of interest to 
> many kvm@ readers.

Well, my goal was to not annoy KVM readers. ;)  So if you feel as though there is benefit to having all of KVM CC'd and I won't be annoying everyone, I see no problem in cross posting.

Would you like to see all conversations, or just ones related to code (and, of course, KVM relevant items)?

Regards,
-Greg




^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:03       ` Gregory Haskins
@ 2009-08-06 13:44         ` Avi Kivity
  2009-08-06 13:45           ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2009-08-06 13:44 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

On 08/06/2009 04:03 PM, Gregory Haskins wrote:
>> It's true that vbus is a separate project (in fact even virtio is
>> completely separate from kvm).  Still I think it would be of interest to
>> many kvm@ readers.
>>      
>
> Well, my goal was to not annoy KVM readers. ;)  So if you feel as though there is benefit to having all of KVM CC'd and I won't be annoying everyone, I see no problem in cross posting.
>    

I can only speak for myself, I'm interested in this project (though 
still rooting for virtio).

> Would you like to see all conversations, or just ones related to code (and, of course, KVM relevant items)

I guess internal vbus changes won't be too interesting for most readers, 
but new releases, benchmarks, and kvm-related stuff will be welcome on 
the kvm list.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:44         ` Avi Kivity
@ 2009-08-06 13:45           ` Gregory Haskins
  2009-08-06 13:57             ` Avi Kivity
  2009-08-06 13:59             ` Michael S. Tsirkin
  0 siblings, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 13:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  9:44 AM, in message <4A7ADE23.5010208@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/06/2009 04:03 PM, Gregory Haskins wrote:
>>> It's true that vbus is a separate project (in fact even virtio is
>>> completely separate from kvm).  Still I think it would be of interest to
>>> many kvm@ readers.
>>>      
>>
>> Well, my goal was to not annoy KVM readers. ;)  So if you feel as though 
> there is benefit to having all of KVM CC'd and I won't be annoying everyone, 
> I see no problem in cross posting.
>>    
> 
> I can only speak for myself, I'm interested in this project

In that case, the best solution is probably to have you (and anyone else interested) sign up, then:

https://lists.sourceforge.net/lists/listinfo/alacrityvm-devel
https://lists.sourceforge.net/lists/listinfo/alacrityvm-users


> (though still rooting for virtio).

Heh...not to belabor the point to death, but virtio is orthogonal (you keep forgetting that ;).

It's really the vbus device-model vs the QEMU device-model (and possibly vs the "in-kernel pci emulation" model that I believe Michael is working on).

You can run virtio on any of those three.

> 
>> Would you like to see all conversations, or just ones related to code (and, 
> of course, KVM relevant items)
> 
> I guess internal vbus changes won't be too interesting for most readers, 
> but new releases, benchmarks, and kvm-related stuff will be welcome on 
> the kvm list.

Ok, I was planning on that anyway.

Regards,
-Greg





^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/7] shm-signal: shared-memory signals
  2009-08-03 17:17 ` [PATCH 1/7] shm-signal: shared-memory signals Gregory Haskins
@ 2009-08-06 13:56   ` Arnd Bergmann
  2009-08-06 15:11     ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-06 13:56 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Monday 03 August 2009, Gregory Haskins wrote:
> shm-signal provides a generic shared-memory based bidirectional
> signaling mechanism.  It is used in conjunction with an existing
> signal transport (such as posix-signals, interrupts, pipes, etc) to
> increase the efficiency of the transport since the state information
> is directly accessible to both sides of the link.  The shared-memory
> design provides very cheap access to features such as event-masking
> and spurious delivery mitigation, and is useful for implementing higher
> level shared-memory constructs such as rings.

Looks like a very useful feature in general.

> +struct shm_signal_irq {
> +       __u8                  enabled;
> +       __u8                  pending;
> +       __u8                  dirty;
> +};

Won't this layout cause cache line ping pong? Other schemes I have
seen try to separate the bits so that each cache line is written to
by only one side. This gets much more interesting if the two sides
are on remote ends of an I/O link, e.g. using a nontransparent
PCI bridge, where you only want to send stores over the wire, but
never fetches or even read-modify-write cycles.

Your code is probably optimal if you only communicate between host
and guest code on the same CPU, but not so good if it crosses NUMA
nodes or worse.

> +struct shm_signal_desc {
> +       __u32                 magic;
> +       __u32                 ver;
> +       struct shm_signal_irq irq[2];
> +};

This data structure has implicit padding of two bytes at the end.
How about adding another '__u16 reserved' to make it explicit?

> +	/*
> +	 * We always mark the remote side as dirty regardless of whether
> +	 * they need to be notified.
> +	 */
> +	irq->dirty = 1;
> +	wmb();   /* dirty must be visible before we test the pending state */
> +
> +	if (irq->enabled && !irq->pending) {
> +		rmb();
> +
> +		/*
> +		 * If the remote side has enabled notifications, and we do
> +		 * not see a notification pending, we must inject a new one.
> +		 */
> +		irq->pending = 1;
> +		wmb(); /* make it visible before we do the injection */
> +
> +		s->ops->inject(s);
> +	}

Barriers always confuse me, but the rmb() looks slightly wrong. AFAIU
it only prevents reads after the barrier from being done before the
barrier, but you don't do any reads after it.

The (irq->enabled && !irq->pending) check could be done before the
irq->dirty = 1 arrives at the bus, but that does not seem to hurt, it
would at most cause a duplicate ->inject().

Regarding the scope of the barrier, did you intentionally use the
global versions (rmb()/wmb()) and not the lighter single-system
(smp_rmb()/smp_wmb()) versions? Your version should cope with remote
links over PCI but looks otherwise optimized for local use, as I
wrote above.

	Arnd <><

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:45           ` Gregory Haskins
@ 2009-08-06 13:57             ` Avi Kivity
  2009-08-06 14:06               ` Gregory Haskins
  2009-08-06 13:59             ` Michael S. Tsirkin
  1 sibling, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2009-08-06 13:57 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

On 08/06/2009 04:45 PM, Gregory Haskins wrote:
>
>> (though still rooting for virtio).
>>      
>
> Heh...not to belabor the point to death, but virtio is orthogonal (you keep forgetting that ;).
>
> Its really the vbus device-model vs the qemu device-model (and possibly vs the "in-kernel pci emulation" model that I believe Michael is working on).
>
> You can run virtio on any of those three.
>    

It's not orthogonal.  virtio is one set of ABI+guest drivers+host 
support to get networking on kvm guests.  AlacrityVM's vbus-based 
drivers are another set of ABI+guest drivers+host support to get 
networking on kvm guests.  That makes them competitors (two different 
ways to do one thing), not orthogonal.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:45           ` Gregory Haskins
  2009-08-06 13:57             ` Avi Kivity
@ 2009-08-06 13:59             ` Michael S. Tsirkin
  2009-08-06 14:07               ` Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2009-08-06 13:59 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Avi Kivity, alacrityvm-devel, kvm, linux-kernel, netdev

On Thu, Aug 06, 2009 at 07:45:30AM -0600, Gregory Haskins wrote:
> > (though still rooting for virtio).
> 
> Heh...not to belabor the point to death, but virtio is orthogonal (you keep forgetting that ;).

venet and virtio aren't orthogonal, are they?

-- 
MST

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:57             ` Avi Kivity
@ 2009-08-06 14:06               ` Gregory Haskins
  2009-08-06 15:40                 ` Arnd Bergmann
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 14:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  9:57 AM, in message <4A7AE150.7040009@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/06/2009 04:45 PM, Gregory Haskins wrote:
>>
>>> (though still rooting for virtio).
>>>      
>>
>> Heh...not to belabor the point to death, but virtio is orthogonal (you keep 
> forgetting that ;).
>>
>> Its really the vbus device-model vs the qemu device-model (and possibly vs the 
> "in-kernel pci emulation" model that I believe Michael is working on).
>>
>> You can run virtio on any of those three.
>>    
> 
> It's not orthogonal.  virtio is one set of ABI+guest drivers+host 
> support to get networking on kvm guests.  AlacrityVM's vbus-based 
> drivers are another set of ABI+guest drivers+host support to get 
> networking on kvm guests.  That makes them competitors (two different 
> ways to do one thing), not orthogonal.

That's not accurate, though.

The virtio stack is modular.  For instance, with virtio-net, you have

  (guest-side)
|--------------------------
| virtio-net
|--------------------------
| virtio-ring
|--------------------------
| virtio-bus
|--------------------------
| virtio-pci
|--------------------------
                      |
                   (pci)
                      |
|--------------------------
| kvm.ko
|--------------------------
| qemu
|--------------------------
| tun-tap
|--------------------------
| netif
|--------------------------
     (host-side)

We can exchange out the "virtio-pci" module like this:

  (guest-side)
|--------------------------
| virtio-net
|--------------------------
| virtio-ring
|--------------------------
| virtio-bus
|--------------------------
| virtio-vbus
|--------------------------
| vbus-proxy
|--------------------------
| vbus-connector
|--------------------------
                      |
                   (vbus)
                      |
|--------------------------
| kvm.ko
|--------------------------
| vbus-connector
|--------------------------
| vbus
|--------------------------
| virtio-net-tap (vbus model)
|--------------------------
| netif
|--------------------------
     (host-side)


So virtio-net runs unmodified.  What is "competing" here is "virtio-pci" vs "virtio-vbus".  Also, venet vs virtio-net are technically competing.  But to say "virtio vs vbus" is inaccurate, IMO.

HTH
-Greg


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 13:59             ` Michael S. Tsirkin
@ 2009-08-06 14:07               ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 14:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: alacrityvm-devel, Avi Kivity, kvm, linux-kernel, netdev

>>> On 8/6/2009 at  9:59 AM, in message <20090806135903.GA11530@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com> wrote: 
> On Thu, Aug 06, 2009 at 07:45:30AM -0600, Gregory Haskins wrote:
>> > (though still rooting for virtio).
>> 
>> Heh...not to belabor the point to death, but virtio is orthogonal (you keep 
> forgetting that ;).
> 
> venet and virtio aren't orthogonal, are they?

See my last reply to Avi.

Regards,
-Greg



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-03 17:17 ` [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
@ 2009-08-06 14:42   ` Arnd Bergmann
  2009-08-06 15:59     ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-06 14:42 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: linux-kernel, alacrityvm-devel, netdev

On Monday 03 August 2009, Gregory Haskins wrote:
> This patch adds a PCI-based driver to interface between a host VBUS
> and the guest's vbus-proxy bus model.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>

This seems to be duplicating parts of virtio-pci that could be kept
common by extending the virtio code. Layering on top of virtio
would also make it possible to use the same features you add
on top of other transports (e.g. the s390 virtio code) without
adding yet another backend for each of them.

> +static int
> +vbus_pci_hypercall(unsigned long nr, void *data, unsigned long len)
> +{
> +	struct vbus_pci_hypercall params = {
> +		.vector = nr,
> +		.len    = len,
> +		.datap  = __pa(data),
> +	};
> +	unsigned long flags;
> +	int ret;
> +
> +	spin_lock_irqsave(&vbus_pci.lock, flags);
> +
> +	memcpy_toio(&vbus_pci.regs->hypercall.data, &params, sizeof(params));
> +	ret = ioread32(&vbus_pci.regs->hypercall.result);
> +
> +	spin_unlock_irqrestore(&vbus_pci.lock, flags);
> +
> +	return ret;
> +}

The functionality looks reasonable but please don't call this a hypercall.
A hypercall would be hypervisor specific by definition while this one
is device specific if I understand it correctly. How about "command queue",
"mailbox", "message queue", "devcall" or something else that we have in
existing PCI devices?

> +
> +static int
> +vbus_pci_device_open(struct vbus_device_proxy *vdev, int version, int flags)
> +{
> +	struct vbus_pci_device *dev = to_dev(vdev);
> +	struct vbus_pci_deviceopen params;
> +	int ret;
> +
> +	if (dev->handle)
> +		return -EINVAL;
> +
> +	params.devid   = vdev->id;
> +	params.version = version;
> +
> +	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVOPEN,
> +				 &params, sizeof(params));
> +	if (ret < 0)
> +		return ret;
> +
> +	dev->handle = params.handle;
> +
> +	return 0;
> +}

This seems to add an artificial abstraction that does not make sense
if you stick to the PCI abstraction. The two sensible and common models
for virtual devices that I've seen are:

* The hypervisor knows what virtual resources exist and provides them
  to the guest. The guest owns them as soon as they show up in the
  bus (e.g. PCI) probe. The 'handle' is preexisting.

* The guest starts without any devices and asks for resources it wants
  to access. There is no probing of resources but the guest issues
  a hypercall to get a handle to a newly created virtual device
  (or -ENODEV).

What is your reasoning for requiring both a probe and an allocation?

> +static int
> +vbus_pci_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
> +		    void *ptr, size_t len,
> +		    struct shm_signal_desc *sdesc, struct shm_signal **signal,
> +		    int flags)
> +{
> +	struct vbus_pci_device *dev = to_dev(vdev);
> +	struct _signal *_signal = NULL;
> +	struct vbus_pci_deviceshm params;
> +	unsigned long iflags;
> +	int ret;
> +
> +	if (!dev->handle)
> +		return -EINVAL;
> +
> +	params.devh   = dev->handle;
> +	params.id     = id;
> +	params.flags  = flags;
> +	params.datap  = (u64)__pa(ptr);
> +	params.len    = len;
> +
> +	if (signal) {
> +		/*
> +		 * The signal descriptor must be embedded within the
> +		 * provided ptr
> +		 */
> +		if (!sdesc
> +		    || (len < sizeof(*sdesc))
> +		    || ((void *)sdesc < ptr)
> +		    || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
> +			return -EINVAL;
> +
> +		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
> +		if (!_signal)
> +			return -ENOMEM;
> +
> +		_signal_init(&_signal->signal, sdesc, &_signal_ops);
> +
> +		/*
> +		 * take another reference for the host.  This is dropped
> +		 * by a SHMCLOSE event
> +		 */
> +		shm_signal_get(&_signal->signal);
> +
> +		params.signal.offset = (u64)sdesc - (u64)ptr;
> +		params.signal.prio   = prio;
> +		params.signal.cookie = (u64)_signal;
> +
> +	} else
> +		params.signal.offset = -1; /* yes, this is a u32, but its ok */
> +
> +	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVSHM,
> +				 &params, sizeof(params));
> +	if (ret < 0) {
> +		if (_signal) {
> +			/*
> +			 * We held two references above, so we need to drop
> +			 * both of them
> +			 */
> +			shm_signal_put(&_signal->signal);
> +			shm_signal_put(&_signal->signal);
> +		}
> +
> +		return ret;
> +	}
> +
> +	if (signal) {
> +		_signal->handle = ret;
> +
> +		spin_lock_irqsave(&vbus_pci.lock, iflags);
> +
> +		list_add_tail(&_signal->list, &dev->shms);
> +
> +		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
> +
> +		shm_signal_get(&_signal->signal);
> +		*signal = &_signal->signal;
> +	}
> +
> +	return 0;
> +}

This could be implemented by virtio devices as well, right?

> +static int
> +vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
> +		     size_t len, int flags)
> +{
> +	struct vbus_pci_device *dev = to_dev(vdev);
> +	struct vbus_pci_devicecall params = {
> +		.devh  = dev->handle,
> +		.func  = func,
> +		.datap = (u64)__pa(data),
> +		.len   = len,
> +		.flags = flags,
> +	};
> +
> +	if (!dev->handle)
> +		return -EINVAL;
> +
> +	return vbus_pci_hypercall(VBUS_PCI_HC_DEVCALL, &params, sizeof(params));
> +}

Why the indirection? It seems to me that you could do the simpler

static int
vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
		     size_t len, int flags)
{
	struct vbus_pci_device *dev = to_dev(vdev);
	struct vbus_pci_hypercall params = {
		.vector = func,
		.len    = len,
		.datap  = __pa(data),
	};
	unsigned long irqflags;
	int ret;

	spin_lock_irqsave(&dev->lock, irqflags);
	memcpy_toio(&dev->regs->hypercall.data, &params, sizeof(params));
	ret = ioread32(&dev->regs->hypercall.result);
	spin_unlock_irqrestore(&dev->lock, irqflags);

	return ret;
}

This gets rid of your 'handle' and the unwinding through an extra pointer
indirection. You just need to make sure that the device specific call numbers
don't conflict with any global ones.

> +
> +static struct ioq_notifier eventq_notifier;

> ...

> +/* Invoked whenever the hypervisor ioq_signal()s our eventq */
> +static void
> +eventq_wakeup(struct ioq_notifier *notifier)
> +{
> +	struct ioq_iterator iter;
> +	int ret;
> +
> +	/* We want to iterate on the head of the in-use index */
> +	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_inuse, 0);
> +	BUG_ON(ret < 0);
> +
> +	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
> +	BUG_ON(ret < 0);
> +
> +	/*
> +	 * The EOM is indicated by finding a packet that is still owned by
> +	 * the south side.
> +	 *
> +	 * FIXME: This in theory could run indefinitely if the host keeps
> +	 * feeding us events since there is nothing like a NAPI budget.  We
> +	 * might need to address that
> +	 */
> +	while (!iter.desc->sown) {
> +		struct ioq_ring_desc *desc  = iter.desc;
> +		struct vbus_pci_event *event;
> +
> +		event = (struct vbus_pci_event *)desc->cookie;
> +
> +		switch (event->eventid) {
> +		case VBUS_PCI_EVENT_DEVADD:
> +			event_devadd(&event->data.add);
> +			break;
> +		case VBUS_PCI_EVENT_DEVDROP:
> +			event_devdrop(&event->data.handle);
> +			break;
> +		case VBUS_PCI_EVENT_SHMSIGNAL:
> +			event_shmsignal(&event->data.handle);
> +			break;
> +		case VBUS_PCI_EVENT_SHMCLOSE:
> +			event_shmclose(&event->data.handle);
> +			break;
> +		default:
> +			printk(KERN_WARNING "VBUS_PCI: Unexpected event %d\n",
> +			       event->eventid);
> +			break;
> +		};
> +
> +		memset(event, 0, sizeof(*event));
> +
> +		/* Advance the in-use head */
> +		ret = ioq_iter_pop(&iter, 0);
> +		BUG_ON(ret < 0);
> +	}
> +
> +	/* And let the south side know that we changed the queue */
> +	ioq_signal(&vbus_pci.eventq, 0);
> +}

Ah, so you have a global event queue and your own device hotplug mechanism.
But why would you then still use PCI to back it? We already have PCI hotplug
to add and remove devices and you have defined per device notifier queues 
that you can use for waking up the device, right?

	Arnd <><

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/7] shm-signal: shared-memory signals
  2009-08-06 13:56   ` Arnd Bergmann
@ 2009-08-06 15:11     ` Gregory Haskins
  2009-08-06 20:51       ` Ira W. Snyder
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 15:11 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: paulmck, alacrityvm-devel, linux-kernel, netdev

Hi Arnd,

>>> On 8/6/2009 at  9:56 AM, in message <200908061556.55390.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Monday 03 August 2009, Gregory Haskins wrote:
>> shm-signal provides a generic shared-memory based bidirectional
>> signaling mechanism.  It is used in conjunction with an existing
>> signal transport (such as posix-signals, interrupts, pipes, etc) to
>> increase the efficiency of the transport since the state information
>> is directly accessible to both sides of the link.  The shared-memory
>> design provides very cheap access to features such as event-masking
>> and spurious delivery mitigation, and is useful for implementing higher
>> level shared-memory constructs such as rings.
> 
> Looks like a very useful feature in general.

Thanks, I was hoping that would be the case.

> 
>> +struct shm_signal_irq {
>> +       __u8                  enabled;
>> +       __u8                  pending;
>> +       __u8                  dirty;
>> +};
> 
> Won't this layout cause cache line ping pong? Other schemes I have
> seen try to separate the bits so that each cache line is written to
> by only one side.

It could possibly use some optimization in that regard.  I generally consider myself an expert at concurrent programming, but this lockless stuff is, um, hard ;)  I was going for correctness first.

Long story short, any suggestions on ways to split this up are welcome (particularly now, before the ABI is sealed ;)

> This gets much more interesting if the two sides
> are on remote ends of an I/O link, e.g. using a nontransparent
> PCI bridge, where you only want to send stores over the wire, but
> never fetches or even read-modify-write cycles.

/me head explodes ;)

> 
> Your code is probably optimal if you only communicate between host
> and guest code on the same CPU, but not so good if it crosses NUMA
> nodes or worse.

Yeah, I won't lie and say it wasn't designed primarily with the former case in mind (since it was my particular itch).  I would certainly appreciate any insight on ways to make it more generally applicable for things like the nontransparent bridge model, and/or NUMA, though.

> 
>> +struct shm_signal_desc {
>> +       __u32                 magic;
>> +       __u32                 ver;
>> +       struct shm_signal_irq irq[2];
>> +};
> 
> This data structure has implicit padding of two bytes at the end.
> How about adding another '__u16 reserved' to make it explicit?

Good idea.  Will fix.

> 
>> +	/*
>> +	 * We always mark the remote side as dirty regardless of whether
>> +	 * they need to be notified.
>> +	 */
>> +	irq->dirty = 1;
>> +	wmb();   /* dirty must be visible before we test the pending state */
>> +
>> +	if (irq->enabled && !irq->pending) {
>> +		rmb();
>> +
>> +		/*
>> +		 * If the remote side has enabled notifications, and we do
>> +		 * not see a notification pending, we must inject a new one.
>> +		 */
>> +		irq->pending = 1;
>> +		wmb(); /* make it visible before we do the injection */
>> +
>> +		s->ops->inject(s);
>> +	}
> 
> Barriers always confuse me, but the rmb() looks slightly wrong. AFAIU
> it only prevents reads after the barrier from being done before the
> barrier, but you don't do any reads after it.

Its probably overzealous barrier'ing on my part.  I had a conversation with Paul McKenney (CC'd) where I was wondering if a conditional was an implicit barrier.  His response was something to the effect of "on most arches, yes, but not all".  And we concluded that, to be conservative, there should be a rmb() after the if().

That said, tbh I am not sure if it's actually needed.  Paul?

> 
> The (irq->enabled && !irq->pending) check could be done before the
> irq->dirty = 1 arrives at the bus, but that does not seem to hurt, it
> would at most cause a duplicate ->inject().
> 
> Regarding the scope of the barrier, did you intentionally use the
> global versions (rmb()/wmb()) and not the lighter single-system
> (smp_rmb()/smp_wmb()) versions? Your version should cope with remote
> links over PCI but looks otherwise optimized for local use, as I
> wrote above.

Yes, it was intentional.  Both for the remote case, as you point out.  Also for the case where local might be mismatched (for instance, a guest compiled as UP).

Thanks Arnd,
-Greg



* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 14:06               ` Gregory Haskins
@ 2009-08-06 15:40                 ` Arnd Bergmann
  2009-08-06 15:45                   ` Michael S. Tsirkin
                                     ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-06 15:40 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, alacrityvm-devel, Michael S. Tsirkin, kvm,
	linux-kernel, netdev

On Thursday 06 August 2009, Gregory Haskins wrote:
> We can exchange out the "virtio-pci" module like this:
> 
>   (guest-side)
> |--------------------------
> | virtio-net
> |--------------------------
> | virtio-ring
> |--------------------------
> | virtio-bus
> |--------------------------
> | virtio-vbus
> |--------------------------
> | vbus-proxy
> |--------------------------
> | vbus-connector
> |--------------------------
>                       |
>                    (vbus)
>                       |
> |--------------------------
> | kvm.ko
> |--------------------------
> | vbus-connector
> |--------------------------
> | vbus
> |--------------------------
> | virtio-net-tap (vbus model)
> |--------------------------
> | netif
> |--------------------------
>      (host-side)
> 
> 
> So virtio-net runs unmodified.  What is "competing" here is "virtio-pci" vs "virtio-vbus".
> Also, venet vs virtio-net are technically competing.  But to say "virtio vs vbus" is inaccurate, IMO.


I think what's confusing everyone is that you are competing on multiple
issues:

1. Implementation of bus probing: both vbus and virtio are backed by
PCI devices and can be backed by something else (e.g. virtio by lguest
or even by vbus).

2. Exchange of metadata: virtio uses a config space, vbus uses devcall
to do the same.

3. User data transport: virtio has virtqueues, vbus has shm/ioq.

I think these three are the main differences, and the venet vs. virtio-net
question comes down to which interface the drivers use for each aspect. Do
you agree with this interpretation?

Now to draw conclusions from each of these is of course highly subjective,
but this is how I view it:

1. The bus probing is roughly equivalent, they both work and the
virtio method seems to need a little less code but that could be fixed
by slimming down the vbus code as I mentioned in my comments on the
pci-to-vbus bridge code. However, I would much prefer not to have both
of them, and virtio came first.

2. the two methods (devcall/config space) are more or less equivalent
and you should be able to implement each one through the other one. The
virtio design was driven by making it look similar to PCI, the vbus
design was driven by making it easy to implement in a host kernel. I
don't care too much about these, as they can probably coexist without
causing any trouble. For a (hypothetical) vbus-in-virtio device,
a devcall can be a config-set/config-get pair, for a virtio-in-vbus,
you can do a config-get and a config-set devcall and be happy. Each
could be done in a trivial helper library.

3. The ioq method seems to be the real core of your work that makes
venet perform better than virtio-net with its virtqueues. I don't see
any reason to doubt that your claim is correct. My conclusion from
this would be to add support for ioq to virtio devices, alongside
virtqueues, but to leave out the extra bus_type and probing method.

	Arnd <><


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 15:40                 ` Arnd Bergmann
@ 2009-08-06 15:45                   ` Michael S. Tsirkin
  2009-08-06 16:28                     ` Pantelis Koukousoulas
  2009-08-06 15:50                   ` Avi Kivity
  2009-08-06 16:29                   ` Gregory Haskins
  2 siblings, 1 reply; 62+ messages in thread
From: Michael S. Tsirkin @ 2009-08-06 15:45 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Avi Kivity, alacrityvm-devel, kvm, linux-kernel, netdev

On Thu, Aug 06, 2009 at 05:40:04PM +0200, Arnd Bergmann wrote:
> 3. The ioq method seems to be the real core of your work that makes
> venet perform better than virtio-net with its virtqueues. I don't see
> any reason to doubt that your claim is correct. My conclusion from
> this would be to add support for ioq to virtio devices, alongside
> virtqueues, but to leave out the extra bus_type and probing method.
> 
> 	Arnd <><

The fact that it's in kernel also likely contributes.

-- 
MST


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 15:40                 ` Arnd Bergmann
  2009-08-06 15:45                   ` Michael S. Tsirkin
@ 2009-08-06 15:50                   ` Avi Kivity
  2009-08-06 16:55                     ` Gregory Haskins
  2009-08-06 16:29                   ` Gregory Haskins
  2 siblings, 1 reply; 62+ messages in thread
From: Avi Kivity @ 2009-08-06 15:50 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, alacrityvm-devel, Michael S. Tsirkin, kvm,
	linux-kernel, netdev

On 08/06/2009 06:40 PM, Arnd Bergmann wrote:
> 3. The ioq method seems to be the real core of your work that makes
> venet perform better than virtio-net with its virtqueues. I don't see
> any reason to doubt that your claim is correct. My conclusion from
> this would be to add support for ioq to virtio devices, alongside
> virtqueues, but to leave out the extra bus_type and probing method.
>    

The current conjecture is that ioq outperforms virtio because the host 
side of ioq is implemented in the host kernel, while the host side of 
virtio is implemented in userspace.  AFAIK, no one pointed out 
differences in the protocol which explain the differences in performance.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-06 14:42   ` Arnd Bergmann
@ 2009-08-06 15:59     ` Gregory Haskins
  2009-08-06 17:03       ` Arnd Bergmann
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 15:59 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: alacrityvm-devel, linux-kernel, netdev

>>> On 8/6/2009 at 10:42 AM, in message <200908061642.40614.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Monday 03 August 2009, Gregory Haskins wrote:
>> This patch adds a pci-based driver to interface between a host VBUS
>> and the guest's vbus-proxy bus model.
>> 
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> This seems to be duplicating parts of virtio-pci that could be kept
> common by extending the virtio code. Layering on top of virtio
> would also make it possible to use the same features you add
> on top of other transports (e.g. the s390 virtio code) without
> adding yet another backend for each of them.

This doesn't make sense to me, but I suspect we are both looking at what this code does differently.  I am under the impression that you may believe that there is one of these objects per vbus device.  Note that this is just a bridge to vbus, so there is only one of these per system with potentially many vbus devices behind it.

In essence, this driver's job is to populate the "vbus-proxy" LDM bus with objects that it finds across the PCI-OTHER bridge.  This would actually sit below the virtio components in the stack, so it doesn't make sense (to me) to turn around and build this on top of virtio.  But perhaps I am missing something you are seeing.

Can you elaborate?

> 
>> +static int
>> +vbus_pci_hypercall(unsigned long nr, void *data, unsigned long len)
>> +{
>> +	struct vbus_pci_hypercall params = {
>> +		.vector = nr,
>> +		.len    = len,
>> +		.datap  = __pa(data),
>> +	};
>> +	unsigned long flags;
>> +	int ret;
>> +
>> +	spin_lock_irqsave(&vbus_pci.lock, flags);
>> +
>> +	memcpy_toio(&vbus_pci.regs->hypercall.data, &params, sizeof(params));
>> +	ret = ioread32(&vbus_pci.regs->hypercall.result);
>> +
>> +	spin_unlock_irqrestore(&vbus_pci.lock, flags);
>> +
>> +	return ret;
>> +}
> 
> The functionality looks reasonable but please don't call this a hypercall.

Heh, I guess it's just semantics.  The reason why it's called a hypercall is twofold:

1) In previous versions (vbus-v3 and earlier), it actually *was* literally a KVM-hypercall.
2) In this current version, it is purely a PCI device doing PIO, but it still acts exactly like a hypercall on the backend (via the ioeventfd mechanism in KVM).

That said, I am not married to the name, so I can come up with something more appropriate/descriptive.

> A hypercall would be hypervisor specific by definition while this one
> is device specific if I understand it correctly. How about "command queue",
> "mailbox", "message queue", "devcall" or something else that we have in
> existing PCI devices?
> 
>> +
>> +static int
>> +vbus_pci_device_open(struct vbus_device_proxy *vdev, int version, int 
> flags)
>> +{
>> +	struct vbus_pci_device *dev = to_dev(vdev);
>> +	struct vbus_pci_deviceopen params;
>> +	int ret;
>> +
>> +	if (dev->handle)
>> +		return -EINVAL;
>> +
>> +	params.devid   = vdev->id;
>> +	params.version = version;
>> +
>> +	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVOPEN,
>> +				 &params, sizeof(params));
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	dev->handle = params.handle;
>> +
>> +	return 0;
>> +}
> 
> This seems to add an artificial abstraction that does not make sense
> if you stick to the PCI abstraction.

I think there may be confusion about what is going on here.  The "device-open" pertains to a vbus device *beyond* the bridge, not the PCI device (the bridge) itself.  Nor is the vbus device a PCI device.

Whats happening here is somewhat analogous to a PCI config-cycle.  Its a way to open a channel to a device beyond the bridge in _response_ to a probe.

We have a way to enumerate devices present beyond the bridge (this yields a "device-id")  but in order to actually talk to the device, you must first call DEVOPEN(id).  When a device-id is enumerated, it generates a probe() event on vbus-proxy.  The responding driver in question would then turn around and issue the handle = dev->open(VERSION) to see if it is compatible with the device, and to establish a context for further communication.

The reason why DEVOPEN returns a unique handle is to help ensure that the driver has established proper context before allowing other calls.

> The two sensible and common models
> for virtual devices that I've seen are:
> 
> * The hypervisor knows what virtual resources exist and provides them
>   to the guest. The guest owns them as soon as they show up in the
>   bus (e.g. PCI) probe. The 'handle' is preexisting.
> 
> * The guest starts without any devices and asks for resources it wants
>   to access. There is no probing of resources but the guest issues
>   a hypercall to get a handle to a newly created virtual device
>   (or -ENODEV).
> 
> What is your reasoning for requiring both a probe and an allocation?

Answered above, I think

> 
>> +static int
>> +vbus_pci_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
>> +		    void *ptr, size_t len,
>> +		    struct shm_signal_desc *sdesc, struct shm_signal **signal,
>> +		    int flags)
>> +{
>> +	struct vbus_pci_device *dev = to_dev(vdev);
>> +	struct _signal *_signal = NULL;
>> +	struct vbus_pci_deviceshm params;
>> +	unsigned long iflags;
>> +	int ret;
>> +
>> +	if (!dev->handle)
>> +		return -EINVAL;
>> +
>> +	params.devh   = dev->handle;
>> +	params.id     = id;
>> +	params.flags  = flags;
>> +	params.datap  = (u64)__pa(ptr);
>> +	params.len    = len;
>> +
>> +	if (signal) {
>> +		/*
>> +		 * The signal descriptor must be embedded within the
>> +		 * provided ptr
>> +		 */
>> +		if (!sdesc
>> +		    || (len < sizeof(*sdesc))
>> +		    || ((void *)sdesc < ptr)
>> +		    || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
>> +			return -EINVAL;
>> +
>> +		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
>> +		if (!_signal)
>> +			return -ENOMEM;
>> +
>> +		_signal_init(&_signal->signal, sdesc, &_signal_ops);
>> +
>> +		/*
>> +		 * take another reference for the host.  This is dropped
>> +		 * by a SHMCLOSE event
>> +		 */
>> +		shm_signal_get(&_signal->signal);
>> +
>> +		params.signal.offset = (u64)sdesc - (u64)ptr;
>> +		params.signal.prio   = prio;
>> +		params.signal.cookie = (u64)_signal;
>> +
>> +	} else
>> +		params.signal.offset = -1; /* yes, this is a u32, but its ok */
>> +
>> +	ret = vbus_pci_hypercall(VBUS_PCI_HC_DEVSHM,
>> +				 &params, sizeof(params));
>> +	if (ret < 0) {
>> +		if (_signal) {
>> +			/*
>> +			 * We held two references above, so we need to drop
>> +			 * both of them
>> +			 */
>> +			shm_signal_put(&_signal->signal);
>> +			shm_signal_put(&_signal->signal);
>> +		}
>> +
>> +		return ret;
>> +	}
>> +
>> +	if (signal) {
>> +		_signal->handle = ret;
>> +
>> +		spin_lock_irqsave(&vbus_pci.lock, iflags);
>> +
>> +		list_add_tail(&_signal->list, &dev->shms);
>> +
>> +		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
>> +
>> +		shm_signal_get(&_signal->signal);
>> +		*signal = &_signal->signal;
>> +	}
>> +
>> +	return 0;
>> +}
> 
> This could be implemented by virtio devices as well, right?

The big difference with dev->shm() is that it is not bound to a particular ABI within the shared memory (as opposed to virtio, which assumes a virtio ABI).  This just creates an empty shared-memory region (with a bidirectional signaling path) over which you can overlay a variety of structures (virtio included).  You can of course also use non-ring based structures, such as, say, an array of idempotent state.

The point is that, once this is done, you have a shared-memory region and a way (via the shm-signal) to bidirectionally signal changes to that memory region.  You can then build bigger things with it, like virtqueues.

> 
>> +static int
>> +vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
>> +		     size_t len, int flags)
>> +{
>> +	struct vbus_pci_device *dev = to_dev(vdev);
>> +	struct vbus_pci_devicecall params = {
>> +		.devh  = dev->handle,
>> +		.func  = func,
>> +		.datap = (u64)__pa(data),
>> +		.len   = len,
>> +		.flags = flags,
>> +	};
>> +
>> +	if (!dev->handle)
>> +		return -EINVAL;
>> +
>> +	return vbus_pci_hypercall(VBUS_PCI_HC_DEVCALL, &params, sizeof(params));
>> +}
> 
> Why the indirection? It seems to me that you could do the simpler

What indirection?

/me looks below and thinks he sees the confusion..

> 
> static int
> vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
> 		     size_t len, int flags)
> {
> 	struct vbus_pci_device *dev = to_dev(vdev);
> 	struct vbus_pci_hypercall params = {
> 		.vector = func,
> 		.len    = len,
> 		.datap  = __pa(data),
> 	};
> 	unsigned long irqflags;
> 	int ret;
> 
> 	spin_lock_irqsave(&dev->lock, irqflags);
> 	memcpy_toio(&dev->regs->hypercall.data, &params, sizeof(params));
> 	ret = ioread32(&dev->regs->hypercall.result);
> 	spin_unlock_irqrestore(&dev->lock, irqflags);
> 
> 	return ret;
> }
> 
> This gets rid of your 'handle' and the unwinding through an extra pointer
> indirection. You just need to make sure that the device specific call 
> numbers
> don't conflict with any global ones.

Ah, now I see the confusion...

DEVCALL is sending a synchronous call to a specific device beyond the bridge.  The MMIO going on here against dev.regs->hypercall.data is sending a synchronous call to the bridge itself.  They are distinctly different ;)

> 
>> +
>> +static struct ioq_notifier eventq_notifier;
> 
>> ...
> 
>> +/* Invoked whenever the hypervisor ioq_signal()s our eventq */
>> +static void
>> +eventq_wakeup(struct ioq_notifier *notifier)
>> +{
>> +	struct ioq_iterator iter;
>> +	int ret;
>> +
>> +	/* We want to iterate on the head of the in-use index */
>> +	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_inuse, 0);
>> +	BUG_ON(ret < 0);
>> +
>> +	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
>> +	BUG_ON(ret < 0);
>> +
>> +	/*
>> +	 * The EOM is indicated by finding a packet that is still owned by
>> +	 * the south side.
>> +	 *
>> +	 * FIXME: This in theory could run indefinitely if the host keeps
>> +	 * feeding us events since there is nothing like a NAPI budget.  We
>> +	 * might need to address that
>> +	 */
>> +	while (!iter.desc->sown) {
>> +		struct ioq_ring_desc *desc  = iter.desc;
>> +		struct vbus_pci_event *event;
>> +
>> +		event = (struct vbus_pci_event *)desc->cookie;
>> +
>> +		switch (event->eventid) {
>> +		case VBUS_PCI_EVENT_DEVADD:
>> +			event_devadd(&event->data.add);
>> +			break;
>> +		case VBUS_PCI_EVENT_DEVDROP:
>> +			event_devdrop(&event->data.handle);
>> +			break;
>> +		case VBUS_PCI_EVENT_SHMSIGNAL:
>> +			event_shmsignal(&event->data.handle);
>> +			break;
>> +		case VBUS_PCI_EVENT_SHMCLOSE:
>> +			event_shmclose(&event->data.handle);
>> +			break;
>> +		default:
>> +			printk(KERN_WARNING "VBUS_PCI: Unexpected event %d\n",
>> +			       event->eventid);
>> +			break;
>> +		};
>> +
>> +		memset(event, 0, sizeof(*event));
>> +
>> +		/* Advance the in-use head */
>> +		ret = ioq_iter_pop(&iter, 0);
>> +		BUG_ON(ret < 0);
>> +	}
>> +
>> +	/* And let the south side know that we changed the queue */
>> +	ioq_signal(&vbus_pci.eventq, 0);
>> +}
> 
> Ah, so you have a global event queue and your own device hotplug mechanism.
> But why would you then still use PCI to back it?

PCI is only used as a PCI-to-vbus bridge.  Beyond that, the bridge is populating the vbus devices it sees beyond the bridge into the vbus-proxy LDM bus.

That said, the event queue is sending me events such as "device added" and "shm-signal", which are then reflected up into the general model.

> We already have PCI hotplug to add and remove devices

Yep, and I do actually use that...to get a probe for the bridge itself ;)

>and you have defined per device notifier queues that you can use for waking up the device, right?

Only per bridge.  vbus drivers themselves allocate additional dynamic shm-signal channels that are tunneled through the bridge's eventq(s).

Kind Regards,
-Greg


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 15:45                   ` Michael S. Tsirkin
@ 2009-08-06 16:28                     ` Pantelis Koukousoulas
  2009-08-07 12:14                       ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Pantelis Koukousoulas @ 2009-08-06 16:28 UTC (permalink / raw)
  To: kvm

How hard would it be to implement virtio over vbus and perhaps the
virtio-net backend?

This would leave only one variable in the comparison, clear misconceptions and
make evaluation easier by judging each of vbus, venet etc separately on its own
merits.

The way things are now, it is unclear exactly where those performance
improvements are coming from (or how much each component contributes)
because there are too many variables.

Replacing virtio-net by venet would be a hard proposition if only because
virtio-net has (closed source) Windows drivers available. It would have to be
shown that venet by itself does something significantly better that
virtio-net can't be modified to match.

Having venet in addition to virtio-net is also difficult, given that having only
one set of paravirtual drivers in the kernel was the whole point behind virtio.

Just a user's 0.02,
Pantelis


* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 15:40                 ` Arnd Bergmann
  2009-08-06 15:45                   ` Michael S. Tsirkin
  2009-08-06 15:50                   ` Avi Kivity
@ 2009-08-06 16:29                   ` Gregory Haskins
  2009-08-06 23:23                     ` Ira W. Snyder
  2 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 16:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: alacrityvm-devel, Avi Kivity, Michael S. Tsirkin, kvm,
	linux-kernel, netdev

>>> On 8/6/2009 at 11:40 AM, in message <200908061740.04276.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Thursday 06 August 2009, Gregory Haskins wrote:
>> We can exchange out the "virtio-pci" module like this:
>> 
>>   (guest-side)
>> |--------------------------
>> | virtio-net
>> |--------------------------
>> | virtio-ring
>> |--------------------------
>> | virtio-bus
>> |--------------------------
>> | virtio-vbus
>> |--------------------------
>> | vbus-proxy
>> |--------------------------
>> | vbus-connector
>> |--------------------------
>>                       |
>>                    (vbus)
>>                       |
>> |--------------------------
>> | kvm.ko
>> |--------------------------
>> | vbus-connector
>> |--------------------------
>> | vbus
>> |--------------------------
>> | virtio-net-tap (vbus model)
>> |--------------------------
>> | netif
>> |--------------------------
>>      (host-side)
>> 
>> 
>> So virtio-net runs unmodified.  What is "competing" here is "virtio-pci" vs 
> "virtio-vbus".
>> Also, venet vs virtio-net are technically competing.  But to say "virtio vs 
> vbus" is inaccurate, IMO.
> 
> 
> I think what's confusing everyone is that you are competing on multiple
> issues:
> 
> 1. Implementation of bus probing: both vbus and virtio are backed by
> PCI devices and can be backed by something else (e.g. virtio by lguest
> or even by vbus).

More specifically, vbus-proxy and virtio-bus can be backed by modular adapters.

vbus-proxy can be backed by vbus-pcibridge (as it is in AlacrityVM).  It was backed by KVM-hypercalls in previous releases, but we have deprecated/dropped that connector.  Other types of connectors are possible...

virtio-bus can be backed by virtio-pci, virtio-lguest, virtio-s390, and virtio-vbus (which is backed by vbus-proxy, et al.)

"vbus" itself is actually the host-side container technology which vbus-proxy connects to.  This is an important distinction.

> 
> 2. Exchange of metadata: virtio uses a config space, vbus uses devcall
> to do the same.

Sort of.  You can use devcall() to implement something like config-space (and in fact, we do use it like this for some operations).  But it can also be a fast path (for when you need synchronous behavior).

This has various uses, such as when you need synchronous updates from non-preemptible guest code (cpupri, for instance, for -rt).

> 
> 3. User data transport: virtio has virtqueues, vbus has shm/ioq.

Not quite:  vbus has shm + shm-signal.  You can then overlay shared-memory protocols over that, such as virtqueues, ioq, or even non-ring constructs.

I also consider the synchronous call() method to be part of the transport (though more for niche devices, like -rt).

> 
> I think these three are the main differences, and the venet vs. virtio-net
> question comes down to which interface the drivers use for each aspect. Do
> you agree with this interpretation?
> 
> Now to draw conclusions from each of these is of course highly subjective,
> but this is how I view it:
> 
> 1. The bus probing is roughly equivalent, they both work and the
> virtio method seems to need a little less code but that could be fixed
> by slimming down the vbus code as I mentioned in my comments on the
> pci-to-vbus bridge code. However, I would much prefer not to have both
> of them, and virtio came first.
> 
> 2. the two methods (devcall/config space) are more or less equivalent
> and you should be able to implement each one through the other one. The
> virtio design was driven by making it look similar to PCI, the vbus
> design was driven by making it easy to implement in a host kernel. I
> don't care too much about these, as they can probably coexist without
> causing any trouble. For a (hypothetical) vbus-in-virtio device,
> a devcall can be a config-set/config-get pair, for a virtio-in-vbus,
> you can do a config-get and a config-set devcall and be happy. Each
> could be done in a trivial helper library.

Yep, in fact I published something close to what I think you are talking about back in April:

http://lkml.org/lkml/2009/4/21/427

> 
> 3. The ioq method seems to be the real core of your work that makes
> venet perform better than virtio-net with its virtqueues. I don't see
> any reason to doubt that your claim is correct. My conclusion from
> this would be to add support for ioq to virtio devices, alongside
> virtqueues, but to leave out the extra bus_type and probing method.

While I appreciate the sentiment, I doubt that is actually what's helping here.

There are a variety of factors that I poured into venet/vbus that I think contribute to its superior performance.  However, the difference in the ring design I do not think is one of them.  In fact, in many ways I think Rusty's design might turn out to be faster if put side by side, because he was much more careful with cacheline alignment than I was.  Also note that I was careful to not pick one ring vs the other ;)  They both should work.

IMO, we are only looking at the tip of the iceberg when looking at this purely as the difference between virtio-pci vs virtio-vbus, or venet vs virtio-net.

Really, the big thing I am working on here is the host side device-model.  The idea here was to design a bus model that was conducive to high performance, software to software IO that would work in a variety of environments (that may or may not have PCI).  KVM is one such environment, but I also have people looking at building other types of containers, and even physical systems (host+blade kind of setups).

The idea is that the "connector" is modular, and then something like virtio-net or venet "just works": in kvm, in the userspace container, on the blade system.

It provides a management infrastructure that (hopefully) makes sense for these different types of containers, regardless of whether they have PCI, QEMU, etc (e.g. things that are inherent to KVM, but not others).

I hope this helps to clarify the project :)

Kind Regards,
-Greg

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 15:50                   ` Avi Kivity
@ 2009-08-06 16:55                     ` Gregory Haskins
  2009-08-09  7:48                       ` Avi Kivity
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 16:55 UTC (permalink / raw)
  To: Arnd Bergmann, Avi Kivity
  Cc: alacrityvm-devel, Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/6/2009 at 11:50 AM, in message <4A7AFBE3.5080200@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/06/2009 06:40 PM, Arnd Bergmann wrote:
>> 3. The ioq method seems to be the real core of your work that makes
>> venet perform better than virtio-net with its virtqueues. I don't see
>> any reason to doubt that your claim is correct. My conclusion from
>> this would be to add support for ioq to virtio devices, alongside
>> virtqueues, but to leave out the extra bus_type and probing method.
>>    
> 
> The current conjecture is that ioq outperforms virtio because the host 
> side of ioq is implemented in the host kernel, while the host side of 
> virtio is implemented in userspace.  AFAIK, no one pointed out 
> differences in the protocol which explain the differences in performance.

There *are* protocol differences that matter, though I think they are slowly being addressed.

For example:  Earlier versions of virtio-pci had a single interrupt for all ring events, and you had to do an extra MMIO cycle to learn the proper context.  That will hurt...a _lot_, especially for latency.  I think recent versions of KVM switched to MSI-X per queue, which fixed this particular ugliness.

However, generally I think Avi is right.  The main reason why it outperforms virtio-pci by such a large margin has more to do with all the various inefficiencies in the backend (such as requiring multiple hops U->K, K->U per packet), coarse locking, lack of parallel processing, etc.  I went through and streamlined all the bottlenecks (such as putting the code in the kernel, reducing locking/context switches, etc).

I have every reason to believe that someone with skills/time equal to mine could develop a virtio-based backend that does not use vbus and achieve similar numbers.  However, as stated in my last reply, I am interested in this backend supporting more than KVM, and I designed vbus to fill that role.  Therefore, it does not interest me to endeavor such an effort if it doesn't involve a backend that is independent of KVM.

Based on this, I will continue my efforts surrounding the use of vbus, including its use to accelerate KVM for AlacrityVM.  If I can find a way to do this in such a way that KVM upstream finds acceptable, I would be very happy and will work towards whatever that compromise might be.   OTOH, if the KVM community is set against the concept of a generalized/shared backend, and thus wants to use some other approach that does not involve vbus, that is fine too.  Choice is one of the great assets of open source, eh?   :)

Kind Regards,
-Greg





^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-06 15:59     ` Gregory Haskins
@ 2009-08-06 17:03       ` Arnd Bergmann
  2009-08-06 21:04         ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-06 17:03 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: alacrityvm-devel, linux-kernel, netdev

On Thursday 06 August 2009, you wrote:
> >>> On 8/6/2009 at 10:42 AM, in message <200908061642.40614.arnd@arndb.de>, Arnd
> Bergmann <arnd@arndb.de> wrote: 
> > On Monday 03 August 2009, Gregory Haskins wrote:
> >> This patch adds a pci-based driver to interface between a host VBUS
> >> and the guest's vbus-proxy bus model.
> >> 
> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> > 
> > This seems to be duplicating parts of virtio-pci that could be kept
> > common by extending the virtio code. Layering on top of virtio
> > would also make it possible to use the same features you add
> > on top of other transports (e.g. the s390 virtio code) without
> > adding yet another backend for each of them.
> 
> This doesn't make sense to me, but I suspect we are both looking at what this
> code does differently.  I am under the impression that you may believe that
> there is one of these objects per vbus device.  Note that this is just a bridge
> to vbus, so there is only one of these per system with potentially many vbus
> devices behind it.

Right, this did not become clear from the posting. For virtio, we discussed
a model like this in the beginning and then rejected it in favour of a
"one PCI device per virtio device" model, which I now think is a better
approach than your pci-to-vbus bridge.
 
> In essence, this driver's job is to populate the "vbus-proxy" LDM bus with
> objects that it finds across the PCI-OTHER bridge.  This would actually sit
> below the virtio components in the stack, so it doesnt make sense (to me) to
> turn around and build this on top of virtio.  But perhaps I am missing
> something you are seeing.
> 
> Can you elaborate?

Your PCI device does not serve any real purpose as far as I can tell, you
could just as well have a root device as a parent for all the vbus devices
if you do your device probing like this.

However, assuming that you do the IMHO right thing to do probing like
virtio with a PCI device for each slave, the code will be almost the same
as virtio-pci and the two can be the same.

> > This seems to add an artificial abstraction that does not make sense
> > if you stick to the PCI abstraction.
> 
> I think there may be confusion about what is going on here.  The "device-open"
> pertains to a vbus device *beyond* the bridge, not the PCI device (the bridge)
> itself.  Nor is the vbus device a PCI device.
> 
> Whats happening here is somewhat analogous to a PCI config-cycle.  Its
> a way to open a channel to a device beyond the bridge in _response_ to
> a probe.
> 
> We have a way to enumerate devices present beyond the bridge (this yields
> a "device-id")  but in order to actually talk to the device, you must first
> call DEVOPEN(id).  When a device-id is enumerated, it generates a probe()
> event on vbus-proxy.  The responding driver in question would then turn
> around and issue the handle = dev->open(VERSION) to see if it is compatible
> with the device, and to establish a context for further communication.
> 
> The reason why DEVOPEN returns a unique handle is to help ensure that the
> driver has established proper context before allowing other calls.

So assuming this kind of bus is the right idea (which I think it's not),
why can't the host assume they are open to start with and you go and
enumerate the devices on the bridge, creating a vbus_device for each
one as you go. Then you just need to match the vbus drivers with the
devices by some string or vendor/device ID tuple.

> > 
> > This could be implemented by virtio devices as well, right?
> 
> The big difference with dev->shm() is that it is not bound to
> a particular ABI within the shared-memory (as opposed to
> virtio, which assumes a virtio ABI).  This just creates an
> empty shared-memory region (with a bidirectional signaling
> path) which you can overlay a variety of structures (virtio
> included).  You can of course also use non-ring based
> structures, such as, say, an array of idempotent state.
>
> The point is that, once this is done, you have a shared-memory
> region and a way (via the shm-signal) to bidirectionally signal
> changes to that memory region.  You can then build bigger
> things with it, like virtqueues.

Let me try to rephrase my point: I believe you can implement
the shm/ioq data transport on top of the virtio bus level, by
adding shm and signal functions to struct virtio_config_ops
alongside find_vqs() so that a virtio_device can have either
any combination of virtqueues, shm and ioq.

> > static int
> > vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
> > 		     size_t len, int flags)
> > {
> > 	struct vbus_pci_device *dev = to_dev(vdev);
> > 	struct vbus_pci_hypercall params = {
> > 		.vector = func,
> > 		.len    = len,
> > 		.datap  = __pa(data),
> > 	};
> > 	spin_lock_irqsave(&dev.lock, flags);
> > 	memcpy_toio(&dev.regs->hypercall.data, &params, sizeof(params));
> > 	ret = ioread32(&dev.regs->hypercall.result);
> > 	spin_unlock_irqrestore(&dev.lock, flags);
> > 
> > 	return ret;
> > }
> > 
> > This gets rid of your 'handle' and the unwinding through an extra pointer
> > indirection. You just need to make sure that the device specific call 
> > numbers
> > don't conflict with any global ones.
> 
> Ah, now I see the confusion...
> 
> DEVCALL is sending a synchronous call to a specific device beyond the bridge.  The MMIO going on here against dev.regs->hypercall.data is sending a synchronous call to the bridge itself.  They are distinctly different ;)

well, my point earlier was that they probably should not be different  ;-)

	Arnd <><

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/7] shm-signal: shared-memory signals
  2009-08-06 15:11     ` Gregory Haskins
@ 2009-08-06 20:51       ` Ira W. Snyder
  0 siblings, 0 replies; 62+ messages in thread
From: Ira W. Snyder @ 2009-08-06 20:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Arnd Bergmann, paulmck, alacrityvm-devel, linux-kernel, netdev

On Thu, Aug 06, 2009 at 09:11:15AM -0600, Gregory Haskins wrote:
> Hi Arnd,
> 
> >>> On 8/6/2009 at  9:56 AM, in message <200908061556.55390.arnd@arndb.de>, Arnd
> Bergmann <arnd@arndb.de> wrote: 
> > On Monday 03 August 2009, Gregory Haskins wrote:
> >> shm-signal provides a generic shared-memory based bidirectional
> >> signaling mechanism.  It is used in conjunction with an existing
> >> signal transport (such as posix-signals, interrupts, pipes, etc) to
> >> increase the efficiency of the transport since the state information
> >> is directly accessible to both sides of the link.  The shared-memory
> >> design provides very cheap access to features such as event-masking
> >> and spurious delivery mitigation, and is useful for implementing higher
> >> level shared-memory constructs such as rings.
> > 
> > Looks like a very useful feature in general.
> 
> Thanks, I was hoping that would be the case.
> 
> > 
> >> +struct shm_signal_irq {
> >> +       __u8                  enabled;
> >> +       __u8                  pending;
> >> +       __u8                  dirty;
> >> +};
> > 
> > Won't this layout cause cache line ping pong? Other schemes I have
> > seen try to separate the bits so that each cache line is written to
> > by only one side.
> 
> It could possibly use some optimization in that regard.  I generally consider myself an expert at concurrent programming, but this lockless stuff is, um, hard ;)  I was going for correctness first.
> 
> Long story short, any suggestions on ways to split this up are welcome (particularly now, before the ABI is sealed ;)
> 
> > This gets much more interesting if the two sides
> > are on remote ends of an I/O link, e.g. using a nontransparent
> > PCI bridge, where you only want to send stores over the wire, but
> > never fetches or even read-modify-write cycles.
> 
> /me head explodes ;)
> 

I've actually implemented this idea for virtio. Read the virtio-over-PCI
patches I posted, and you'll see that the entire virtqueue
implementation NEVER uses reads across the PCI bus, only writes. The
slowpath configuration space uses reads, but the virtqueues themselves
are write-only.

Some trivial benchmarking against an earlier driver that did
writes+reads across the PCI bus showed that the write-only driver was
about 2x as fast. (Throughput increased from ~30MB/sec to ~65MB/sec).

I'm sure the write-only design was not the only change responsible for
the speedup, but it was definitely a contributing factor.

Ira

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-06 17:03       ` Arnd Bergmann
@ 2009-08-06 21:04         ` Gregory Haskins
  2009-08-06 22:57           ` Arnd Bergmann
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-06 21:04 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: alacrityvm-devel, linux-kernel, netdev

>>> On 8/6/2009 at  1:03 PM, in message <200908061903.05083.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Thursday 06 August 2009, you wrote:
>> >>> On 8/6/2009 at 10:42 AM, in message <200908061642.40614.arnd@arndb.de>, Arnd
>> Bergmann <arnd@arndb.de> wrote: 
>> > On Monday 03 August 2009, Gregory Haskins wrote:
>> >> This patch adds a pci-based driver to interface between a host VBUS
>> >> and the guest's vbus-proxy bus model.
>> >> 
>> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> > 
>> > This seems to be duplicating parts of virtio-pci that could be kept
>> > common by extending the virtio code. Layering on top of virtio
>> > would also make it possible to use the same features you add
>> > on top of other transports (e.g. the s390 virtio code) without
>> > adding yet another backend for each of them.
>> 
>> This doesn't make sense to me, but I suspect we are both looking at what 
> this
>> code does differently.  I am under the impression that you may believe that
>> there is one of these objects per vbus device.  Note that this is just a 
> bridge
>> to vbus, so there is only one of these per system with potentially many vbus
>> devices behind it.
> 
> Right, this did not become clear from the posting. For virtio, we discussed
> a model like this in the beginning and then rejected it in favour of a
> "one PCI device per virtio device" model, which I now think is a better
> approach than your pci-to-vbus bridge.

I agree that the 1:1 model may have made sense for QEMU based virtio.  I think you will find it has diminished value here, however.

Here are some of my arguments against it:

1) there is an ample PCI model that is easy to work with when you are in QEMU and using its device model (and you get it for free).  It's the path of least resistance.  For something in kernel, it is more awkward to try to coordinate the in-kernel state with the PCI state.  AFAICT, you either need to have it live partially in both places, or you need some PCI emulation in the kernel.

2) The signal model for the 1:1 design is not very flexible IMO.
    2a) I want to be able to allocate dynamic signal paths, not pre-allocate msi-x vectors at dev-add.
    2b) I also want to collapse multiple interrupts together so as to minimize the context switch rate (inject + EIO overhead).  My design effectively has "NAPI" for interrupt handling.  This helps when the system needs it the most: heavy IO.

3) The 1:1 model is not buying us much in terms of hotplug.  We don't really "use" PCI very much even in virtio.  It's a thin shim of uniform dev-ids to resurface to the virtio-bus as something else.  With LDM, hotplug is ridiculously easy anyway, so who cares.  I already need an event channel for (2b), so the devadd/devdrop events are trivial to handle.

4) communicating with something efficiently in-kernel requires more finesse than basic PIO/MMIO.  There are tricks you can do to get around this, but with 1:1 you would have to do this trick repeatedly for each device.  Even with a library solution to help, you still have per-cpu .data overhead and cpu hotplug overhead to get maximum performance.  With my "bridge" model, I do it once, which I believe is ideal.

5) 1:1 is going to quickly populate the available MMIO/PIO and IDT slots for any kind of medium to large configuration.  The bridge model scales better in this regard.

So based on that, I think the bridge model works better for vbus.  Perhaps you can convince me otherwise ;)

>  
>> In essence, this driver's job is to populate the "vbus-proxy" LDM bus with
>> objects that it finds across the PCI-OTHER bridge.  This would actually sit
>> below the virtio components in the stack, so it doesnt make sense (to me) to
>> turn around and build this on top of virtio.  But perhaps I am missing
>> something you are seeing.
>> 
>> Can you elaborate?
> 
> Your PCI device does not serve any real purpose as far as I can tell

That is certainly debatable.  Its purpose is as follows:

1) Allows a guest to discover the vbus feature (fwiw: I used to do this with cpuid)
2) Allows the guest to establish proper context to communicate with the feature (mmio, pio, and msi) (fwiw: I used to use hypercalls)
3) Access the virtual-devices that have been configured for the feature

Correct me if I am wrong:  Isn't this more or less the exact intent of something like an LDM bus (vbus-proxy) and a PCI-BRIDGE?  Other than the possibility that there might be some mergeable overlap (still debatable), I don't think it's fair to say that this does not serve a purpose.

>, you could just as well have a root device as a parent for all the vbus devices
> if you do your device probing like this.

Yes, I suppose the "bridge" could have been advertised as a virtio-based root device.  In this way, the virtio probe() would replace my pci probe() for feature discovery, and a virtqueue could replace my msi+ioq for the eventq channel.

I see a few issues with that, however:

1) The virtqueue library, while a perfectly nice ring design at the metadata level, does not have an API that is friendly to kernel-to-kernel communication.  It was designed more for frontend use to some remote backend.  The IOQ library, on the other hand, was specifically designed to support use as kernel-to-kernel (see north/south designations).  So this made life easier for me.  To do what you propose, the eventq channel would need to terminate in kernel, and I would thus be forced to deal with the potential API problems.

2) I would need to have Avi et al. allocate a virtio vector to use from their namespace, which I am sure they won't be willing to do until they accept my design.  Today, I have a nice conflict-free PCI ID to use as I see fit.

I'm sure neither of these hurdles is insurmountable, but I am left scratching my head as to why it's worth the effort.  It seems to me it's a "six of one, half-dozen of the other" kind of scenario.  Either I write a qemu PCI device and pci-bridge driver, or I write a qemu virtio-device and virtio root driver.

In short: What does this buy us, or did you mean something else?  

> 
> However, assuming that you do the IMHO right thing to do probing like
> virtio with a PCI device for each slave, the code will be almost the same
> as virtio-pci and the two can be the same.

Can you elaborate?

> 
>> > This seems to add an artificial abstraction that does not make sense
>> > if you stick to the PCI abstraction.
>> 
>> I think there may be confusion about what is going on here.  The 
> "device-open"
>> pertains to a vbus device *beyond* the bridge, not the PCI device (the 
> bridge)
>> itself.  Nor is the vbus device a PCI device.
>> 
>> Whats happening here is somewhat analogous to a PCI config-cycle.  Its
>> a way to open a channel to a device beyond the bridge in _response_ to
>> a probe.
>> 
>> We have a way to enumerate devices present beyond the bridge (this yields
>> a "device-id")  but in order to actually talk to the device, you must first
>> call DEVOPEN(id).  When a device-id is enumerated, it generates a probe()
>> event on vbus-proxy.  The responding driver in question would then turn
>> around and issue the handle = dev->open(VERSION) to see if it is compatible
>> with the device, and to establish a context for further communication.
>> 
>> The reason why DEVOPEN returns a unique handle is to help ensure that the
>> driver has established proper context before allowing other calls.
> 
> So assuming this kind of bus is the right idea (which I think it's not),
> why can't the host assume they are open to start with

Read on..

>and you go and enumerate the devices on the bridge, creating a vbus_device for each
> one as you go.

That's exactly what it does.

> Then you just need to match the vbus drivers with the
> devices by some string or vendor/device ID tuple.
> 

Yep, that's right too.  Then, when the driver gets a ->probe(), it does a dev->open() to check various state:

a) can the device be opened?  If it has a max-open policy (most will have a max-open = 1 policy) and something else already has the device open, it will fail (this will not be common).
b) is the driver ABI revision compatible with the device ABI revision?  This is like checking the PCI config-space revision number.

For an example, see drivers/net/vbus-enet.c, line 764:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/net/vbus-enet.c;h=7220f43723adc5b0bece1bc37974fae1b034cd9e;hb=b3b2339efbd4e754b1c85f8bc8f85f21a1a1f509#l764

It's simply a check to see if the driver and device are compatible, and therefore whether the probe should succeed.  Nothing more.  I think what I have done is similar to how most buses (like PCI) work today (a la revision number checks with a config-cycle).

Regarding the id->handle indirection:

Internally, the DEVOPEN call translates an "id" to a "handle".  The handle is just a token to help ensure that the caller actually opened the device successfully.  Note that the "id" namespace is 0 based.  Therefore, something like an errant DEVCALL(0) would be indistinguishable from a legit request.  Using the handle abstraction gives me a slightly more robust mechanism to ensure the caller actually meant to call the host, and was in the proper context to do so.  For one thing, if the device had never been opened, this would have failed before it ever reached the model.  It's one more check I can do at the infrastructure level, and one less thing each model has to look out for.

Is the id->handle translation critical?  No, I'm sure we could live without it, but I also don't think it hurts anything.  It allows the overall code to be slightly more robust, and the individual model code to be slightly less complicated.  Therefore, I don't see a problem.

>> > 
>> > This could be implemented by virtio devices as well, right?
>> 
>> The big difference with dev->shm() is that it is not bound to
>> a particular ABI within the shared-memory (as opposed to
>> virtio, which assumes a virtio ABI).  This just creates an
>> empty shared-memory region (with a bidirectional signaling
>> path) which you can overlay a variety of structures (virtio
>> included).  You can of course also use non-ring based
>> structures, such as, say, an array of idempotent state.
>>
>> The point is that, once this is done, you have a shared-memory
>> region and a way (via the shm-signal) to bidirectionally signal
>> changes to that memory region.  You can then build bigger
>> things with it, like virtqueues.
> 
> Let me try to rephrase my point: I believe you can implement
> the shm/ioq data transport on top of the virtio bus level, by
> adding shm and signal functions to struct virtio_config_ops
> alongside find_vqs() so that a virtio_device can have either
> any combination of virtqueues, shm and ioq.

Yes, I believe this might be doable, but I don't know virtio well enough to say for sure.

> 
>> > static int
>> > vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
>> > 		     size_t len, int flags)
>> > {
>> > 	struct vbus_pci_device *dev = to_dev(vdev);
>> > 	struct vbus_pci_hypercall params = {
>> > 		.vector = func,
>> > 		.len    = len,
>> > 		.datap  = __pa(data),
>> > 	};
>> > 	spin_lock_irqsave(&dev.lock, flags);
>> > 	memcpy_toio(&dev.regs->hypercall.data, &params, sizeof(params));
>> > 	ret = ioread32(&dev.regs->hypercall.result);
>> > 	spin_unlock_irqrestore(&dev.lock, flags);
>> > 
>> > 	return ret;
>> > }
>> > 
>> > This gets rid of your 'handle' and the unwinding through an extra pointer
>> > indirection. You just need to make sure that the device specific call 
>> > numbers
>> > don't conflict with any global ones.
>> 
>> Ah, now I see the confusion...
>> 
>> DEVCALL is sending a synchronous call to a specific device beyond the 
> bridge.  The MMIO going on here against dev.regs->hypercall.data is sending a 
> synchronous call to the bridge itself.  They are distinctly different ;)
> 
> well, my point earlier was that they probably should not be different  ;-)

Ok :)

I still do not see how they could be merged in a way that is both a) worth the effort, and b) doesn't compromise my design.  But I will keep an open mind if you want to continue the conversation.

Thanks for all the feedback.  I do appreciate it.

Kind Regards,
-Greg


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-06 21:04         ` Gregory Haskins
@ 2009-08-06 22:57           ` Arnd Bergmann
  2009-08-07  4:42             ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-06 22:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: alacrityvm-devel, linux-kernel, netdev, Ira W. Snyder

On Thursday 06 August 2009, Gregory Haskins wrote:
> >>> On 8/6/2009 at  1:03 PM, in message <200908061903.05083.arnd@arndb.de>, Arnd Bergmann <arnd@arndb.de> wrote: 
> Here are some of my arguments against it:
> 
> 1) there is an ample PCI model that is easy to work with when you are in QEMU and using its device model (and you get it for free).  Its the path of least resistance.  For something in kernel, it is more awkward to try to coordinate the in-kernel state with the PCI state.  Afaict, you either need to have it live partially in both places, or you need some PCI emulation in the kernel.

True, if the whole hypervisor is in the host kernel, then doing full PCI emulation would be
insane. I was assuming that all of the setup code still lived in host user space.
What is the reason why it cannot? Do you want to use something other than qemu,
do you think this will impact performance, or something else?

> 2) The signal model for the 1:1 design is not very flexible IMO.
>     2a) I want to be able to allocate dynamic signal paths, not pre-allocate msi-x vectors at dev-add.

I believe msi-x implies that the interrupt vectors get added by the device driver
at run time, unlike legacy interrupts or msi. It's been a while since I dealt with
that though.

>     2b) I also want to collapse multiple interrupts together so as to minimize the context switch rate (inject + EIO overhead).  My design effectively has "NAPI" for interrupt handling.  This helps when the system needs it the most: heavy IO.

That sounds like a very useful concept in general, but this seems to be a
detail of the interrupt controller implementation. If the IO-APIC cannot
do what you want here, maybe we just need a paravirtual IRQ controller
driver, like e.g. the PS3 has.

> 3) The 1:1 model is not buying us much in terms of hotplug.  We don't really "use" PCI very much even in virtio.  Its a thin-shim of uniform dev-ids to resurface to the virtio-bus as something else.  With LDM, hotplug is ridiculously easy anyway, so who cares.  I already need an event channel anyway for (2b) anyway, so the devadd/devdrop events are trivial to handle.

I agree for Linux guests, but when you want to run other guest operating systems,
PCI hotplug is probably the most common interface for this. AFAIK, the windows
virtio-net driver does not at all have a concept of a virtio layer but is simply
a network driver for a PCI card. The same could be applied any other device,
possibly with some library code doing all the queue handling in a common way.

> 4) communicating with something efficiently in-kernel requires more finesse than basic PIO/MMIO.  There are tricks you can do to get around this, but with 1:1 you would have to do this trick repeatedly for each device.  Even with a library solution to help, you still have per-cpu .data overhead and cpu hotplug overhead to get maximum performance.  With my "bridge" model, I do it once, which I believe is ideal.
>
> 5) 1:1 is going to quickly populate the available MMIO/PIO and IDT slots for any kind of medium to large configuration.  The bridge model scales better in this regard.

We don't need to rely on PIO, it's just the common interface that all hypervisors
can easily support. We could have different underlying methods for the communication
if space or performance becomes a bottleneck because of this.

> So based on that, I think the bridge model works better for vbus.  Perhaps you can convince me otherwise ;)

Being able to define all of it in the host kernel seems to be the major
advantage of your approach, the other points you mentioned are less
important IMHO. The question is whether that is indeed a worthy goal,
or if it should just live in user space as with the qemu PCI code.

> >> In essence, this driver's job is to populate the "vbus-proxy" LDM bus with
> >> objects that it finds across the PCI-OTHER bridge.  This would actually sit
> >> below the virtio components in the stack, so it doesn't make sense (to me) to
> >> turn around and build this on top of virtio.  But perhaps I am missing
> >> something you are seeing.
> >> 
> >> Can you elaborate?
> > 
> > Your PCI device does not serve any real purpose as far as I can tell
> 
> That is certainly debatable.  Its purpose is as follows:
> 
> 1) Allows a guest to discover the vbus feature (fwiw: I used to do this with cpuid)

true, I missed that.

> 2) Allows the guest to establish proper context to communicate with the feature (mmio, pio, and msi) (fwiw: i used to use hypercalls)
> 3) Access the virtual-devices that have been configured for the feature
> 
> Correct me if I am wrong:  Isn't this more or less the exact intent of something like an LDM bus (vbus-proxy) and a PCI-BRIDGE?  Other than the possibility that there might be some mergeable overlap (still debatable), I don't think it's fair to say that this does not serve a purpose.

I guess you are right on that. An interesting variation would be to make its
child devices virtio devices again, though: instead of the PCI emulation code
in the host kernel, you could define a simpler interface to the same effect.
The root device would then be a virtio-pci device, below which you can have
virtio-virtio devices.

> >, you could just as well have a root device as a parent for all the vbus devices
> > if you do your device probing like this.
> 
> Yes, I suppose the "bridge" could have been advertised as a virtio-based root device.  In this way, the virtio probe() would replace my pci probe() for feature discovery, and a virtqueue could replace my msi+ioq for the eventq channel.
>
> I see a few issues with that, however:
> 
> 1) The virtqueue library, while a perfectly nice ring design at the metadata level, does not have an API that is friendly to kernel-to-kernel communication.  It was designed more for frontend use to some remote backend.  The IOQ library, on the other hand, was specifically designed to support use as kernel-to-kernel (see north/south designations).  So this made life easier for me.  To do what you propose, the eventq channel would need to terminate in kernel, and I would thus be forced to deal with the potential API problems.

Well, virtqueues are not that bad for kernel-to-kernel communication, as Ira mentioned
referring to his virtio-over-PCI driver. You can have virtqueues on both sides, having
the host kernel create a pair of virtqueues (one in user aka guest space, one in the host
kernel), with the host virtqueue_ops doing copy_{to,from}_user to move data between them.

If you have that, you can actually use the same virtio_net driver in both guest and
host kernel, just communicating over different virtio implementations. Interestingly,
that would mean that you no longer need a separation between guest and host device
drivers (vbus and vbus-proxy in your case) but could use the same device abstraction
with just different transports to back the shm-signal or virtqueue.
 
> 2) I would need to have Avi et. al. allocate a virtio vector to use from their namespace, which I am sure they wont be willing to do until they accept my design.  Today, I have a nice conflict free PCI ID to use as I see fit.

My impression is the opposite: as long as you try to reinvent everything at once,
you face opposition, but if you just improve parts of the existing design one
by one (like eventfd), I think you will find lots of support.

> I'm sure both of these hurdles are not insurmountable, but I am left scratching my head as to why it's worth the effort.  It seems to me it's a "six of one, half-dozen of the other" kind of scenario.  Either I write a qemu PCI device and pci-bridge driver, or I write a qemu virtio-device and virtio root driver.
> 
> In short: What does this buy us, or did you mean something else?  

In my last reply, I was thinking of a root device that can not be probed like a PCI device.

> > However, assuming that you do the IMHO right thing to do probing like
> > virtio with a PCI device for each slave, the code will be almost the same
> > as virtio-pci and the two can be the same.
> 
> Can you elaborate?

Well, let me revise based on the discussion:

The main point that remains is that I think a vbus-proxy should be the same as a
virtio device. This could be done by having (as in my earlier mails) a PCI device
per vbus-proxy, with devcall implemented in PIO or config-space plus additional
shm/shm-signal, or it could be a single virtio device from virtio-pci or one
of the other existing providers that connects you with a new virtio provider
sitting in the host kernel. This provider has child devices for any endpoint
(virtio-net, venet, ...) that is implemented in the host kernel.

> >and you go and enumerate the devices on the bridge, creating a vbus_device for each
> > one as you go.
> 
> Thats exactly what it does.
> 
> > Then you just need to match the vbus drivers with the
> > devices by some string or vendor/device ID tuple.
> > 
> 
> Yep, that's right too.  Then, when the driver gets a ->probe(), it does a dev->open() to check various state:
> 
> a) can the device be opened?  if it has a max-open policy (most will have a max-open = 1 policy) and something else already has the device open, it will fail (this will not be common).
> b) is the driver ABI revision compatible with the device ABI revision?  This is like checking the pci config-space revision number.
> 
> For an example, see drivers/net/vbus-enet.c, line 764:
> 
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/net/vbus-enet.c;h=7220f43723adc5b0bece1bc37974fae1b034cd9e;hb=b3b2339efbd4e754b1c85f8bc8f85f21a1a1f509#l764
> 
> It's simply a check to see if the driver and device are compatible, and therefore the probe should succeed.  Nothing more.  I think what I have done is similar to how most buses (like PCI) work today (a la revision-number checks with a config cycle).

ok. 
 
> Regarding the id->handle indirection:
> 
> Internally, the DEVOPEN call translates an "id" to a "handle".  The handle is just a token to help ensure that the caller actually opened the device successfully.  Note that the "id" namespace is 0 based.  Therefore, something like an errant DEVCALL(0) would be indistinguishable from a legit request.  Using the handle abstraction gives me a slightly more robust mechanism to ensure the caller actually meant to call the host, and was in the proper context to do so.  For one thing, if the device had never been opened, this would have failed before it ever reached the model.  Its one more check I can do at the infrastructure level, and one less thing each model has to look out for.
> 
> Is the id->handle translation critical?  No, i'm sure we could live without it, but I also don't think it hurts anything.  It allows the overall code to be slightly more robust, and the individual model code to be slightly less complicated.  Therefore, I don't see a problem.

Right, assuming your model with all vbus devices behind a single PCI device, your
handle does not hurt, it's the equivalent of a bus/dev/fn number or an MMIO address.

	Arnd <><

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 16:29                   ` Gregory Haskins
@ 2009-08-06 23:23                     ` Ira W. Snyder
  0 siblings, 0 replies; 62+ messages in thread
From: Ira W. Snyder @ 2009-08-06 23:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Arnd Bergmann, alacrityvm-devel, Avi Kivity, Michael S. Tsirkin,
	kvm, linux-kernel, netdev

On Thu, Aug 06, 2009 at 10:29:08AM -0600, Gregory Haskins wrote:
> >>> On 8/6/2009 at 11:40 AM, in message <200908061740.04276.arnd@arndb.de>, Arnd
> Bergmann <arnd@arndb.de> wrote: 
> > On Thursday 06 August 2009, Gregory Haskins wrote:

[ big snip ]

> > 
> > 3. The ioq method seems to be the real core of your work that makes
> > venet perform better than virtio-net with its virtqueues. I don't see
> > any reason to doubt that your claim is correct. My conclusion from
> > this would be to add support for ioq to virtio devices, alongside
> > virtqueues, but to leave out the extra bus_type and probing method.
> 
> While I appreciate the sentiment, I doubt that is actually whats helping here.
> 
> There are a variety of factors that I poured into venet/vbus that I think contribute to its superior performance.  However, the difference in the ring design I do not think is one of them.  In fact, in many ways I think Rusty's design might turn out to be faster if put side by side because he was much more careful with cacheline alignment than I was.  Also note that I was careful to not pick one ring vs the other ;)  They both should work.

IMO, the virtio vring design is very well thought out. I found it
relatively easy to port to a host+blade setup, and run virtio-net over a
physical PCI bus, connecting two physical CPUs.

> 
> IMO, we are only looking at the tip of the iceberg when looking at this purely as the difference between virtio-pci vs virtio-vbus, or venet vs virtio-net.
> 
> Really, the big thing I am working on here is the host side device-model.  The idea here was to design a bus model that was conducive to high performance, software to software IO that would work in a variety of environments (that may or may not have PCI).  KVM is one such environment, but I also have people looking at building other types of containers, and even physical systems (host+blade kind of setups).
> 
> The idea is that the "connector" is modular, and then something like virtio-net or venet "just work": in kvm, in the userspace container, on the blade system. 
> 
> It provides a management infrastructure that (hopefully) makes sense for these different types of containers, regardless of whether they have PCI, QEMU, etc (e.g. things that are inherent to KVM, but not others).
> 
> I hope this helps to clarify the project :)
> 

I think this is the major benefit of vbus. I've only started studying
the vbus code, so I don't have lots to say yet. The overview of the
management interface makes it look pretty good.

Getting two virtio-net drivers hooked together in my virtio-over-PCI
patches was nasty. If you read the thread that followed, you'll see
the lack of a management interface as a concern of mine. It was
basically decided that it could come "later". The configfs interface
vbus provides is pretty nice, IMO.

Just my two cents,
Ira

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-06 22:57           ` Arnd Bergmann
@ 2009-08-07  4:42             ` Gregory Haskins
  2009-08-07 14:57               ` Arnd Bergmann
  2009-08-07 15:55               ` Ira W. Snyder
  0 siblings, 2 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07  4:42 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: alacrityvm-devel, Ira W. Snyder, linux-kernel, netdev

>>> On 8/6/2009 at  6:57 PM, in message <200908070057.54795.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Thursday 06 August 2009, Gregory Haskins wrote:
>> >>> On 8/6/2009 at  1:03 PM, in message <200908061903.05083.arnd@arndb.de>, Arnd
>> Bergmann <arnd@arndb.de> wrote:
>> Here are some of my arguments against it:
>> 
>> 1) there is an ample PCI model that is easy to work with when you are in
>> QEMU and using its device model (and you get it for free).  It's the path of
>> least resistance.  For something in kernel, it is more awkward to try to
>> coordinate the in-kernel state with the PCI state.  Afaict, you either need to
>> have it live partially in both places, or you need some PCI emulation in the
>> kernel.
> 
> True, if the whole hypervisor is in the host kernel, then doing full PCI
> emulation would be insane.

In this case, the entire bus is more or less self contained and in-kernel.  We technically *do* still have qemu present, however, so you are right there.

> I was assuming that all of the setup code still lived in host user 
> space.

Very little.  Just enough to register the PCI device, handle the MMIO/PIO/MSI configuration, etc.  All the bus management uses the standard vbus management interface (configfs/sysfs)

> What is the reason why it cannot?

It's not that it "can't" per se.  It's just awkward to have it live in two places, and I would need to coordinate in-kernel changes with userspace, etc.  Today I do not need to do this: i.e. the model in userspace is very simple.

> Do you want to use something other than qemu,

Well, only in the sense that vbus has its own management interface and bus model, and I want them to be used.

> do you think this will impact performance, or something else?

Performance is not a concern for this aspect of operation.

> 
>> 2) The signal model for the 1:1 design is not very flexible IMO.
>>     2a) I want to be able to allocate dynamic signal paths, not pre-allocate
>> msi-x vectors at dev-add.
> 
> I believe msi-x implies that the interrupt vectors get added by the device
> driver at run time, unlike legacy interrupts or msi. It's been a while since
> I dealt with that though.

Yeah, it's been a while for me too.  I would have to look at the spec again.

My understanding was that it's just a slight variation of msi, with some of the constraints revised (no <= 32 vector limit, etc).  Perhaps it is fancier than that and 2a is unfounded.  TBD.

> 
>>     2b) I also want to collapse multiple interrupts together so as to
>> minimize the context switch rate (inject + EIO overhead).  My design
>> effectively has "NAPI" for interrupt handling.  This helps when the system
>> needs it the most: heavy IO.
> 
> That sounds like a very useful concept in general, but this seems to be a
> detail of the interrupt controller implementation. If the IO-APIC cannot
> do what you want here, maybe we just need a paravirtual IRQ controller
> driver, like e.g. the PS3 has.

Yeah, I agree this could be a function of the APIC code.  Do note that I mentioned this in passing to Avi a few months ago but FWIW he indicated at that time that he is not interested in making the APIC PV.

Also, I almost forgot an important one.  Add:

   2c) Interrupt prioritization.  I want to be able to assign priority to interrupts and handle them in priority order.

> 
>> 3) The 1:1 model is not buying us much in terms of hotplug.  We don't really
>> "use" PCI very much even in virtio.  It's a thin shim of uniform dev-ids to
>> resurface to the virtio-bus as something else.  With LDM, hotplug is
>> ridiculously easy anyway, so who cares.  I already need an event channel
>> for (2b) anyway, so the devadd/devdrop events are trivial to handle.
> 
> I agree for Linux guests, but when you want to run other guest operating
> systems, PCI hotplug is probably the most common interface for this. AFAIK,
> the windows virtio-net driver does not have a concept of a virtio layer at
> all but is simply a network driver for a PCI card. The same could be applied
> to any other device, possibly with some library code doing all the queue
> handling in a common way.

I was told it also has a layering like Linux, but I haven't actually seen the code myself, so I do not know if this is true.

> 
>> 4) communicating with something efficiently in-kernel requires more finesse
>> than basic PIO/MMIO.  There are tricks you can do to get around this, but
>> with 1:1 you would have to do this trick repeatedly for each device.  Even
>> with a library solution to help, you still have per-cpu .data overhead and
>> cpu hotplug overhead to get maximum performance.  With my "bridge" model, I
>> do it once, which I believe is ideal.
>>
>> 5) 1:1 is going to quickly populate the available MMIO/PIO and IDT slots for
>> any kind of medium to large configuration.  The bridge model scales better
>> in this regard.
> 
> We don't need to rely on PIO, it's just the common interface that all
> hypervisors can easily support. We could have different underlying methods
> for the communication if space or performance becomes a bottleneck because
> of this.

Heh...I already proposed an alternative, which incidentally was shot down:

http://lkml.org/lkml/2009/5/5/132

(in the end, I think we agreed that a technique of tunneling PIO/MMIO over hypercalls would be better than introducing a new namespace.  But we also decided that the difference between PIO and PIOoHC was too small to care, and we don't care about non-x86)

But in any case, my comment still stands: 1:1 puts load on the PIO emulation (even if you use PIOoHC).  I am not sure this can be easily worked around.

> 
>> So based on that, I think the bridge model works better for vbus.  Perhaps
>> you can convince me otherwise ;)
> 
> Being able to define all of it in the host kernel seems to be the major
> advantage of your approach, the other points you mentioned are less
> important IMHO. The question is whether that is indeed a worthy goal,
> or if it should just live in user space as with the qemu PCI code.

I don't think we can gloss over these so easily.  They are all important to me, particularly 2b and 2c.

> 
>> >> In essence, this driver's job is to populate the "vbus-proxy" LDM bus with
>> >> objects that it finds across the PCI-OTHER bridge.  This would actually sit
>> >> below the virtio components in the stack, so it doesn't make sense (to me) to
>> >> turn around and build this on top of virtio.  But perhaps I am missing
>> >> something you are seeing.
>> >> 
>> >> Can you elaborate?
>> > 
>> > Your PCI device does not serve any real purpose as far as I can tell
>> 
>> That is certainly debatable.  Its purpose is as follows:
>> 
>> 1) Allows a guest to discover the vbus feature (fwiw: I used to do this with
>> cpuid)
> 
> true, I missed that.
> 
>> 2) Allows the guest to establish proper context to communicate with the
>> feature (mmio, pio, and msi) (fwiw: I used to use hypercalls)
>> 3) Access the virtual-devices that have been configured for the feature
>> 
>> Correct me if I am wrong:  Isn't this more or less the exact intent of
>> something like an LDM bus (vbus-proxy) and a PCI-BRIDGE?  Other than the
>> possibility that there might be some mergeable overlap (still debatable), I
>> don't think it's fair to say that this does not serve a purpose.
> 
> I guess you are right on that. An interesting variation would be to make its
> child devices virtio devices again, though: instead of the PCI emulation code
> in the host kernel, you could define a simpler interface to the same effect.
> The root device would then be a virtio-pci device, below which you can have
> virtio-virtio devices.

Interesting....but note I think that is effectively what I do today (with virtio-vbus) except you wouldn't have the explicit vbus-proxy model underneath.  Also, if 1:1 via PCI is important for windows, that solution would have the same problem that the virtio-vbus model does.


> 
>> >, you could just as well have a root device as a parent for all the vbus
>> > devices if you do your device probing like this.
>> 
>> Yes, I suppose the "bridge" could have been advertised as a virtio-based root
>> device.  In this way, the virtio probe() would replace my pci probe() for
>> feature discovery, and a virtqueue could replace my msi+ioq for the eventq
>> channel.
>>
>> I see a few issues with that, however:
>> 
>> 1) The virtqueue library, while a perfectly nice ring design at the metadata
>> level, does not have an API that is friendly to kernel-to-kernel
>> communication.  It was designed more for frontend use to some remote backend.
>> The IOQ library, on the other hand, was specifically designed to support use
>> as kernel-to-kernel (see north/south designations).  So this made life easier
>> for me.  To do what you propose, the eventq channel would need to terminate
>> in kernel, and I would thus be forced to deal with the potential API
>> problems.
> 
> Well, virtqueues are not that bad for kernel-to-kernel communication, as Ira
> mentioned referring to his virtio-over-PCI driver. You can have virtqueues on
> both sides, having the host kernel create a pair of virtqueues (one in user
> aka guest space, one in the host kernel), with the host virtqueue_ops doing
> copy_{to,from}_user to move data between them.

Its been a while since I looked, so perhaps I am wrong here.  I will look again.

> 
> If you have that, you can actually use the same virtio_net driver in both
> guest and host kernel, just communicating over different virtio
> implementations. Interestingly, that would mean that you no longer need a
> separation between guest and host device drivers (vbus and vbus-proxy in
> your case) but could use the same device abstraction with just different
> transports to back the shm-signal or virtqueue.

Actually, I think there are some problems with that model (such as management of the interface).  virtio-net really wants to connect to a virtio-net-backend (such as the one in qemu or vbus).  It wasn't designed to connect back to back like that.  I think you will quickly run into problems similar to what Ira faced with virtio-over-PCI with that model.

>  
>> 2) I would need to have Avi et al. allocate a virtio vector to use from
>> their namespace, which I am sure they won't be willing to do until they
>> accept my design.  Today, I have a nice conflict-free PCI ID to use as I
>> see fit.
> 
> My impression is the opposite: as long as you try to reinvent everything at
> once, you face opposition, but if you just improve parts of the existing
> design one by one (like eventfd), I think you will find lots of support.
> 
>> I'm sure both of these hurdles are not insurmountable, but I am left
>> scratching my head as to why it's worth the effort.  It seems to me it's a
>> "six of one, half-dozen of the other" kind of scenario.  Either I write a
>> qemu PCI device and pci-bridge driver, or I write a qemu virtio-device and
>> virtio root driver.
>> 
>> In short: What does this buy us, or did you mean something else?  
> 
> In my last reply, I was thinking of a root device that can not be probed 
> like a PCI device.

IIUC, because you missed the "feature discovery" function of the bridge, you thought this was possible but now see it is problematic?  Or are you saying that this concept is still valid and should be considered?  I think its the former, but wanted to be sure we were on the same page.

> 
>> > However, assuming that you do the IMHO right thing to do probing like
>> > virtio with a PCI device for each slave, the code will be almost the same
>> > as virtio-pci and the two can be the same.
>> 
>> Can you elaborate?
> 
> Well, let me revise based on the discussion:
> 
> The main point that remains is that I think a vbus-proxy should be the same
> as a virtio device. This could be done by having (as in my earlier mails) a
> PCI device per vbus-proxy, with devcall implemented in PIO or config-space
> and additional shm/shm-signal,

So the problem with this model is the points I made earlier (such as 2b, 2c).

I do agree with you that the *lack* of this model may be problematic for Windows, depending on the answer w.r.t. what the windows drivers look like.

> or it could be a single virtio device from virtio-pci or one
> of the other existing providers that connects you with a new virtio provider
> sitting in the host kernel. This provider has child devices for any endpoint
> (virtio-net, venet, ...) that is implemented in the host kernel.

This is an interesting idea, but I think it also has problems.

What we do get from having the explicit vbus-proxy exposed in the stack (aside from being able to support "vbus native" drivers, like venet) is a neat way to map vbus-isms into virtio-isms.  For instance, vbus uses a string-based device id, while virtio uses a PCI-ID.  With vbus-proxy as an intermediate layer, a device whose vbus-id is "virtio" tells us to issue dev->call(GETID) to obtain the PCI-ID value of the virtio device, and to publish that result to virtio-bus.

Without this intermediate layer, the vbus identity scheme would have to be compatible with virtio PCI-ID based scheme, and I think this is suboptimal for the overall design of vbus.

[snip]

>  
>> Regarding the id->handle indirection:
>> 
>> Internally, the DEVOPEN call translates an "id" to a "handle".  The handle
>> is just a token to help ensure that the caller actually opened the device
>> successfully.  Note that the "id" namespace is 0 based.  Therefore, something
>> like an errant DEVCALL(0) would be indistinguishable from a legit request.
>> Using the handle abstraction gives me a slightly more robust mechanism to
>> ensure the caller actually meant to call the host, and was in the proper
>> context to do so.  For one thing, if the device had never been opened, this
>> would have failed before it ever reached the model.  It's one more check I
>> can do at the infrastructure level, and one less thing each model has to
>> look out for.
>> 
>> Is the id->handle translation critical?  No, I'm sure we could live without
>> it, but I also don't think it hurts anything.  It allows the overall code to
>> be slightly more robust, and the individual model code to be slightly less
>> complicated.  Therefore, I don't see a problem.
> 
> Right, assuming your model with all vbus devices behind a single PCI device,
> your handle does not hurt, it's the equivalent of a bus/dev/fn number or an
> MMIO address.

Agreed

Thanks Arnd,
-Greg


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 16:28                     ` Pantelis Koukousoulas
@ 2009-08-07 12:14                       ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07 12:14 UTC (permalink / raw)
  To: Pantelis Koukousoulas; +Cc: kvm, alacrityvm-devel

[-- Attachment #1: Type: text/plain, Size: 2502 bytes --]

[not sure if it was intentional, but you dropped the CC list.
Therefore, I didn't see this until I caught up on my kvm@vger reading]

Pantelis Koukousoulas wrote:
> How hard would it be to implement virtio over vbus and perhaps the
> virtio-net backend?

It should be relatively trivial.  I have already written the transport
(called virtio-vbus) that would allow the existing front-end
(virtio-net) to work without modification.

http://lkml.org/lkml/2009/4/21/427

All that is needed is to take venet-tap as an example and port it to
something virtio compatible (via that patch I posted) on the backend.  I
have proposed this as an alternative to venet, but so far I have not had
any takers to help with this effort.  Likewise, I am too busy with the
infrastructure to take this on myself.


> 
> This would leave only one variable in the comparison, clear misconceptions and
> make evaluation easier by judging each of vbus, venet etc separately on its own
> merits.
> 
> The way things are now, it is unclear exactly where those performance
> improvements are coming from (or how much each component contributes)
> because there are too many variables.
> 
> Replacing virtio-net by venet would be a hard proposition if only because
> virtio-net has (closed source) windows drivers available. It would have to be
> shown that venet by itself does something significantly better that
> virtio-net can't be modified to do comparably well.

I am not proposing anyone replace virtio-net.  It will continue to work
fine despite the existence of an alternative, and KVM can continue to
standardize on it if that is what KVM wants to do.

> 
> Having venet in addition to virtio-net is also difficult, given that having only
> one set of paravirtual drivers in the kernel was the whole point behind virtio.

As it stands right now, virtio-net fails to meet my performance goals,
and venet meets them (or at least, gets much closer, but I will not
rest..).  So, at least for AlacrityVM, I will continue to use and
promote it when performance matters.  If at some time in the future I
can get virtio-net to work in my environment in a comparable and
satisfactory way, I will consider migrating to it and deprecating venet.

Until then, having two drivers is ok, and no-one has to use the one they
don't like.  I certainly do not think having more than one driver that
speaks 802.x ethernet in the kernel tree is without precedent. ;)

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
  2009-08-06 10:17   ` Michael S. Tsirkin
  2009-08-06 12:08   ` Gregory Haskins
@ 2009-08-07 14:19   ` Anthony Liguori
  2009-08-07 15:05     ` [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
  2 siblings, 1 reply; 62+ messages in thread
From: Anthony Liguori @ 2009-08-07 14:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, linux-kernel, alacrityvm-devel, netdev, kvm

Michael S. Tsirkin wrote:
>
>> This series includes the basic plumbing, as well as the driver for
>> accelerated 802.x (ethernet) networking.
>>     
>
> The graphs comparing virtio with vbus look interesting.
>   

1gbit throughput on a 10gbit link?  I have a hard time believing that.

I've seen much higher myself.  Can you describe your test setup in more 
detail?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-07  4:42             ` Gregory Haskins
@ 2009-08-07 14:57               ` Arnd Bergmann
  2009-08-07 15:44                   ` Gregory Haskins
  2009-08-07 15:55               ` Ira W. Snyder
  1 sibling, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2009-08-07 14:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: alacrityvm-devel, Ira W. Snyder, linux-kernel, netdev

On Friday 07 August 2009, Gregory Haskins wrote:
> >>> Arnd Bergmann <arnd@arndb.de> wrote: 
> > On Thursday 06 August 2009, Gregory Haskins wrote:
> > 
> > >     2b) I also want to collapse multiple interrupts together so as to 
> > > minimize the context switch rate (inject + EIO overhead).  My design 
> > > effectively has "NAPI" for interrupt handling.  This helps when the system 
> > > needs it the most: heavy IO.
> > 
> > That sounds like a very useful concept in general, but this seems to be a
> > detail of the interrupt controller implementation. If the IO-APIC cannot
> > do what you want here, maybe we just need a paravirtual IRQ controller
> > driver, like e.g. the PS3 has.
> 
> Yeah, I agree this could be a function of the APIC code.  Do note that I
> mentioned this in passing to Avi a few months ago but FWIW he indicated
> at that time that he is not interested in making the APIC PV.
> 
> Also, I almost forgot an important one.  Add:
> 
>    2c) Interrupt prioritization.  I want to be able to assign priority
>    to interrupts and handle them in priority order.

I think this part of the interface has developed into the wrong direction
because you confused two questions:

1. should you build an advanced interrupt mechanism for virtual drivers?
2. how should you build an advanced interrupt mechanism for virtual drivers?

My guess is that when Avi said he did not want a paravirtual IO-APIC,
he implied that the existing one is good enough (maybe Avi can clarify that
point himself) answering question 1, while you took that as an indication
that the code should live elsewhere instead, answering question 2.

What you built with the shm-signal code is essentially a paravirtual nested
interrupt controller by another name, and deeply integrated into a new
bigger subsystem. I believe that this has significant disadvantages
over the approach of making it a standard interrupt controller driver:

* It completely avoids the infrastructure that we have built into Linux
  to deal with interrupts, e.g. /proc/interrupts statistics, IRQ
  balancing and CPU affinity.

* It makes it impossible to quantify the value of the feature to start with,
  which could be used to answer question 1 above.

* Less importantly, it does not work with any other drivers that might
  also benefit from a new interrupt controller -- if it is indeed better
  than the one we already have.

	Arnd <><

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-07 14:19   ` Anthony Liguori
@ 2009-08-07 15:05     ` Gregory Haskins
  2009-08-07 15:46       ` Anthony Liguori
  0 siblings, 1 reply; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07 15:05 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, Gregory Haskins, linux-kernel,
	alacrityvm-devel, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3158 bytes --]

Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>>
>>> This series includes the basic plumbing, as well as the driver for
>>> accelerated 802.x (ethernet) networking.
>>>     
>>
>> The graphs comparing virtio with vbus look interesting.
>>   
> 
> 1gbit throughput on a 10gbit link?  I have a hard time believing that.
> 
> I've seen much higher myself.  Can you describe your test setup in more
> detail?

Sure,

For those graphs: two 8-core x86_64 boxes with Chelsio T3 10GE NICs,
connected back-to-back via crossover cable with a 1500 MTU.  The kernel
version was as posted.  The qemu version was generally something very
close to qemu-kvm.git HEAD at the time the data was gathered, but
unfortunately I apparently didn't log this info.

For KVM, we take one of those boxes and run a bridge+tap configuration
on top of it.  We always run the server on the bare-metal machine on
the remote side of the link, regardless of whether we run the client in
a VM or on bare metal.

For guests, virtio-net and venet connect to the same Linux bridge
instance; I just "ifdown eth0 / ifup eth1" (or vice versa) and repeat
the same test.  I do this multiple times (usually about 10) and average
the results.  I use several different programs, such as netperf, rsync,
and ping, to take measurements.

That said, note that the graphs were from earlier kernel runs (2.6.28,
29-rc8).  The most recent data I can find that I published is for
2.6.29, announced with the vbus-v3 release back in April:

http://lkml.org/lkml/2009/4/21/408

In it, the virtio-net throughput numbers are substantially higher and
possibly more in line with your expectations (4.5gb/s) (though notably
still lagging venet, which weighed in at 5.6gb/s).

Generally, I find that the virtio-net exhibits non-deterministic results
from release to release.  I suspect (as we have discussed) the
tx-mitigation scheme.  Some releases buffer the daylights out of the
stream, and virtio gets close(r) throughput (e.g. 4.5g vs 5.8g), but
absolutely terrible latency (4000us vs 65us).  Other releases it seems
to operate with more of a compromise (1.3gb/s vs 3.8gb/s, but 350us vs
85us).

I do not understand what causes the virtio performance fluctuation, as I
use the same kernel config across builds, and I do not typically change
the qemu userspace.  Note that some general fluctuation is evident
across the board just from kernel to kernel.  I am referring more to
the disparity between throughput and latency than to the ultimate
numbers, as all targets seem to scale max throughput about the same per
kernel.

That said, I know I need to redo the graphs against HEAD (31-rc5, and
perhaps 30, and kvm.git).  I've been heads-down with the eventfd
interfaces since vbus-v3, so I haven't been as active in generating
results.  I did confirm that vbus-v4 (alacrityvm-v0.1) still produces a
similar graph, but I didn't gather the data rigorously enough to feel
comfortable publishing it.  This is on the TODO list.

If there is another patch-series/tree I should be using for comparison,
please point me at it.

HTH

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-07 14:57               ` Arnd Bergmann
@ 2009-08-07 15:44                   ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07 15:44 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: alacrityvm-devel, Ira W. Snyder, linux-kernel, netdev

>>> On 8/7/2009 at 10:57 AM, in message <200908071657.32858.arnd@arndb.de>, Arnd
Bergmann <arnd@arndb.de> wrote: 
> On Friday 07 August 2009, Gregory Haskins wrote:
>> >>> Arnd Bergmann <arnd@arndb.de> wrote: 
>> > On Thursday 06 August 2009, Gregory Haskins wrote:
>> > 
>> > >     2b) I also want to collapse multiple interrupts together so as to 
>> > > minimize the context switch rate (inject + EOI overhead).  My design 
>> > > effectively has "NAPI" for interrupt handling.  This helps when the system 
>> > > needs it the most: heavy IO.
>> > 
>> > That sounds like a very useful concept in general, but this seems to be a
>> > detail of the interrupt controller implementation. If the IO-APIC cannot
>> > do what you want here, maybe we just need a paravirtual IRQ controller
>> > driver, like e.g. the PS3 has.
>> 
>> Yeah, I agree this could be a function of the APIC code.  Do note that I
>> mentioned this in passing to Avi a few months ago but FWIW he indicated
>> at that time that he is not interested in making the APIC PV.
>> 
>> Also, I almost forgot an important one.  Add:
>> 
>>    2c) Interrupt prioritization.  I want to be able to assign priority
>>    to interrupts and handle them in priority order.
> 
> I think this part of the interface has developed into the wrong direction
> because you confused two questions:
> 
> 1. should you build an advanced interrupt mechanism for virtual drivers?
> 2. how should you build an advanced interrupt mechanism for virtual drivers?
> 
> My guess is that when Avi said he did not want a paravirtual IO-APIC,
> he implied that the existing one is good enough (maybe Avi can clarify that
> point himself) answering question 1, while you took that as an indication
> that the code should live elsewhere instead, answering question 2.
> 
> What you built with the shm-signal code is essentially a paravirtual nested
> interrupt controller by another name, and deeply integrated into a new
> bigger subsystem. I believe that this has significant disadvantages
> over the approach of making it a standard interrupt controller driver:
> 
> * It completely avoids the infrastructure that we have built into Linux
>   to deal with interrupts, e.g. /proc/interrupts statistics, IRQ
>   balancing and CPU affinity.
> 
> * It makes it impossible to quantify the value of the feature to start with,
>   which could be used to answer question 1 above.
> 
> * Less importantly, it does not work with any other drivers that might
>   also benefit from a new interrupt controller -- if it is indeed better
>   than the one we already have.
> 
> 	Arnd <><

Hi Arnd,

I don't strongly disagree with anything you said (except perhaps that I confused the question).  I agree that the PCI bridge effectively implements something akin to an interrupt controller.  I agree that this interrupt controller, if indeed superior (I believe it is), can only benefit devices behind the bridge rather than all of KVM.  That in and of itself doesn't bother me, as I plan to base all of my performance work on that bus, but I digress; note also that this is not dissimilar to how other bridge+bus combinations (think USB, SCSI) operate.  And I agree that a potentially more ideal solution would be a proper generic PV interrupt controller that exhibits the traits I describe (priority, reduced inject+EOI overhead, etc.) so that all of KVM benefits.

The issue wasn't that I didn't know these things.  The issue is that I have no control over whether such an intrusive change to KVM (and the guest arch code) is accepted, and at least one relevant maintainer expressed dissatisfaction (*) with the idea when it was proposed.  Conversely, I am the maintainer of AlacrityVM, so I do have control over the bridge design. ;)  Also note that this particular design decision is completely encapsulated within AlacrityVM's components.  IOW, I am not foisting my ideas on the entire kernel tree: if someone doesn't like what I have done, they can choose not to use alacrity, and it's as if my ideas never existed.  The important distinction is that I am not changing how core Linux or core KVM works in the process, only the one little piece of the world that particularly interests me.

That said, if attitudes about some of these ideas have changed, I may be able to break that piece out and start submitting it to kvm@ as some kind of PV interrupt controller.  I would only be interested in doing so if Avi et al. express an openness to the idea; i.e., I don't want to waste my time, or anyone else's.

Kind Regards,
-Greg

(*) I think Avi said something to the effect of "you are falling into the 'lets PV the world' trap"


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-07 15:05     ` [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
@ 2009-08-07 15:46       ` Anthony Liguori
  2009-08-07 18:04         ` Gregory Haskins
  0 siblings, 1 reply; 62+ messages in thread
From: Anthony Liguori @ 2009-08-07 15:46 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, Gregory Haskins, linux-kernel,
	alacrityvm-devel, netdev, kvm

Gregory Haskins wrote:
> That said, note that the graphs were from earlier kernel runs (2.6.28,
> 29-rc8).  The most recent data I can find that I published is for
> 2.6.29, announced with the vbus-v3 release back in April:
>
> http://lkml.org/lkml/2009/4/21/408
>
> In it, the virtio-net throughput numbers are substantially higher and
> possibly more in line with your expectations (4.5gb/s) (though notably
> still lagging venet, which weighed in at 5.6gb/s).
>   

Okay, that makes more sense.  Would be nice to update the graphs as they 
make virtio look really, really bad :-)

> Generally, I find that the virtio-net exhibits non-deterministic results
> from release to release.  I suspect (as we have discussed) the
> tx-mitigation scheme.  Some releases buffer the daylights out of the
> stream, and virtio gets close(r) throughput (e.g. 4.5g vs 5.8g), but
> absolutely terrible latency (4000us vs 65us).  Other releases it seems
> to operate with more of a compromise (1.3gb/s vs 3.8gb/s, but 350us vs
> 85us).
>   

Are you using kvm modules or a new kernel?  There were some timer 
infrastructure changes around 28/29, and it's possible that the system 
you're on is now detecting an HPET, which will result in a better time 
source.  That could have an effect on mitigation.

> If there is another patch-series/tree I should be using for comparison,
> please point me at it.
>   

No, I think it's fair to look at upstream Linux.  Looking at the latest 
bits would be nice though because there are some virtio friendly changes 
recently like MSI-x and GRO.

Since you're using the latest vbus bits, it makes sense to compare 
against the latest virtio bits.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-07  4:42             ` Gregory Haskins
  2009-08-07 14:57               ` Arnd Bergmann
@ 2009-08-07 15:55               ` Ira W. Snyder
  2009-08-07 18:25                 ` Gregory Haskins
  1 sibling, 1 reply; 62+ messages in thread
From: Ira W. Snyder @ 2009-08-07 15:55 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: Arnd Bergmann, alacrityvm-devel, linux-kernel, netdev

On Thu, Aug 06, 2009 at 10:42:00PM -0600, Gregory Haskins wrote:
> >>> On 8/6/2009 at  6:57 PM, in message <200908070057.54795.arnd@arndb.de>, Arnd
> Bergmann <arnd@arndb.de> wrote: 

[ big snip ]

> >> I see a few issues with that, however:
> >> 
> >> 1) The virtqueue library, while a perfectly nice ring design at the metadata 
> > level, does not have an API that is friendly to kernel-to-kernel communication. 
> >  It was designed more for frontend use to some remote backend.  The IOQ 
> > library on the other hand, was specifically designed to support use as 
> > kernel-to-kernel (see north/south designations).  So this made life easier 
> > for me.  To do what you propose, the eventq channel would need to terminate 
> > in kernel, and I would thus be forced to deal with the potential API 
> > problems.
> > 
> > Well, virtqueues are not that bad for kernel-to-kernel communication, as Ira 
> > mentioned
> > referring to his virtio-over-PCI driver. You can have virtqueues on both 
> > sides, having
> > the host kernel create a pair of virtqueues (one in user aka guest space, 
> > one in the host
> > kernel), with the host virtqueue_ops doing copy_{to,from}_user to move data 
> > between them.
> 
> Its been a while since I looked, so perhaps I am wrong here.  I will look again.
> 
> > 
> > If you have that, you can actually use the same virtio_net driver in both 
> > guest and
> > host kernel, just communicating over different virtio implementations. 
> > Interestingly,
> > that would mean that you no longer need a separation between guest and host 
> > device
> > drivers (vbus and vbus-proxy in your case) but could use the same device 
> > abstraction
> > with just different transports to back the shm-signal or virtqueue.
> 
> Actually, I think there are some problems with that model (such as management of the interface).  virtio-net really wants to connect to a virtio-net backend (such as the one in qemu or vbus).  It wasn't designed to connect back-to-back like that.  I think you will quickly run into problems similar to what Ira faced with virtio-over-PCI under that model.
> 

Getting the virtio-net devices talking to each other over PCI was not
terribly difficult. However, the capabilities negotiation works in a
VERY awkward way. The capabilities negotiation was really designed with
a virtio-net-backend in mind. Unless I'm missing something obvious, it
is essentially broken for the case where two virtio-net's are talking to
each other.

For example, imagine the situation where you'd like the guest to get a
specific MAC address, but you do not care what MAC address the host
receives.

Normally, you'd set struct virtio_net_config's mac[] field, and set the
VIRTIO_NET_F_MAC feature. However, when you have two virtio-net's
communicating directly, this doesn't work.

Let me explain with a quick diagram. The results described are the
values RETURNED from virtio_config_ops->get() and
virtio_config_ops->get_features() when called by the system in question.

Guest System
1) struct virtio_net_config->mac[]: 00:11:22:33:44:55
2) features: VIRTIO_NET_F_MAC

Host System
1) struct virtio_net_config->mac[]: unset
2) features: VIRTIO_NET_F_MAC unset

In this case, the feature negotiation code will not accept the
VIRTIO_NET_F_MAC feature, and both systems will generate random mac
addresses. Not the behavior we want at all.

I "fixed" the problem by ALWAYS setting a random MAC address, and ALWAYS
setting the VIRTIO_NET_F_MAC feature. By doing this, both sides always
negotiate the VIRTIO_NET_F_MAC feature.

In conclusion, the feature negotiation works fine for driver features,
such as VIRTIO_NET_MRG_RXBUF or VIRTIO_NET_F_GSO, but completely breaks
down for user-controlled features, like VIRTIO_NET_F_MAC.

I think the vbus configfs interface works great for this situation,
because it has an obvious and separate backend. It is obvious where the
configuration information is coming from.

With my powerpc hardware, it should be easily possible to have at least
6 devices, each with two virtqueues, one for tx and one for rx. (The
limit is set by the number of distinct kick() events I can generate.)
This could allow many combinations of devices, such as:

* 1 virtio-net, 1 virtio-console
* 3 virtio-net, 2 virtio-console
* 6 virtio-net
* etc.

In all honesty, I don't really care so much about the management
interface for my purposes. A static configuration of devices works for
me. However, I doubt that would ever be accepted into the upstream
kernel, which is what I'm really concerned about. I hate seeing drivers
live out-of-tree.

Getting two virtio-net's talking to each other had one other major
problem: the fields in struct virtio_net_hdr are not defined with a
constant endianness. When connecting two virtio-net's running on
different machines, they may have different endianness, as in my case
between a big-endian powerpc guest and a little-endian x86 host. I'm not
confident that qemu-system-ppc, running on x86, using virtio for the
network interface, even works at all. (I have not tested it.)

Sorry that this got somewhat long-winded and went a little off-topic.
Ira

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-07 15:46       ` Anthony Liguori
@ 2009-08-07 18:04         ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07 18:04 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, Gregory Haskins, linux-kernel,
	alacrityvm-devel, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2218 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> That said, note that the graphs were from earlier kernel runs (2.6.28,
>> 29-rc8).  The most recent data I can find that I published is for
>> 2.6.29, announced with the vbus-v3 release back in April:
>>
>> http://lkml.org/lkml/2009/4/21/408
>>
>> In it, the virtio-net throughput numbers are substantially higher and
>> possibly more in line with your expectations (4.5gb/s) (though notably
>> still lagging venet, which weighed in at 5.6gb/s).
>>   
> 
> Okay, that makes more sense.  Would be nice to update the graphs as they
> make virtio look really, really bad :-)

Yeah, they are certainly ripe for an update.  (Note that I was
unilaterally stale on venet numbers, too) ;)

> 
>> Generally, I find that the virtio-net exhibits non-deterministic results
>> from release to release.  I suspect (as we have discussed) the
>> tx-mitigation scheme.  Some releases buffer the daylights out of the
> stream, and virtio gets close(r) throughput (e.g. 4.5g vs 5.8g), but
> absolutely terrible latency (4000us vs 65us).  Other releases it seems
>> to operate with more of a compromise (1.3gb/s vs 3.8gb/s, but 350us vs
>> 85us).
>>   
> 
> Are you using kvm modules or a new kernel?

I just build the entire kernel from git.

> There were some timer
> infrastructure changes around 28/29, and it's possible that the system
> you're on is now detecting an HPET, which will result in a better time
> source.  That could have an effect on mitigation.

Yeah, I suspect you are right.  I always kept the .config and machine
constant, but I *do* bounce around kernel versions so perhaps I got
hosed by a make-oldconfig cycle somewhere along the way.

> 
>> If there is another patch-series/tree I should be using for comparison,
>> please point me at it.
>>   
> 
> No, I think it's fair to look at upstream Linux.  Looking at the latest
> bits would be nice though because there are some virtio friendly changes
> recently like MSI-x and GRO.

Yeah, I will definitely include kvm.git in addition to whatever is
current from Linus.  I already have adopted using the latest
qemu-kvm.git into my workflow.

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge
  2009-08-07 15:55               ` Ira W. Snyder
@ 2009-08-07 18:25                 ` Gregory Haskins
  0 siblings, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-08-07 18:25 UTC (permalink / raw)
  To: Ira W. Snyder; +Cc: Arnd Bergmann, alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1137 bytes --]

Ira W. Snyder wrote:

<big snip>

> With my powerpc hardware, it should be easily possible to have at least
> 6 devices, each with two virtqueues, one for tx and one for rx. (The
> limit is caused by the amount of distinct kick() events I can generate.)
> This could allow many combinations of devices, such as:
> 
> * 1 virtio-net, 1 virtio-console
> * 3 virtio-net, 2 virtio-console
> * 6 virtio-net
> * etc.

Note that the vbus "connector" design abstracts away details such as
kick events, so you are not limited by the number of physical interrupts
on your hardware, per se.

You can actually have an arbitrarily wide namespace for your virtqueues
running over a single interrupt, if you were so inclined.  (Note that
you can also have a 1:1 mapping with interrupts if you like.)

If you look at the design of the vbus-pcibridge connector, it actually
aggregates the entire namespace into between 1 and 8 interrupts
(depending on availability, and separated by priority level).  If
multiple vectors are not available, the entire protocol can fall back
to running over a single vector when necessary.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/7] AlacrityVM guest drivers
  2009-08-06 16:55                     ` Gregory Haskins
@ 2009-08-09  7:48                       ` Avi Kivity
  0 siblings, 0 replies; 62+ messages in thread
From: Avi Kivity @ 2009-08-09  7:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Arnd Bergmann, alacrityvm-devel, Michael S. Tsirkin, kvm,
	linux-kernel, netdev

On 08/06/2009 07:55 PM, Gregory Haskins wrote:
> Based on this, I will continue my efforts surrounding the use of vbus, including its use to accelerate KVM for AlacrityVM.  If I can find a way to do this in such a way that KVM upstream finds acceptable, I would be very happy and will work towards whatever that compromise might be.   OTOH, if the KVM community is set against the concept of a generalized/shared backend, and thus wants to use some other approach that does not involve vbus, that is fine too.  Choice is one of the great assets of open source, eh?   :)
>    

KVM upstream (me) doesn't have much say regarding vbus.  I am not a 
networking expert and I'm not the virtio or networking stack maintainer, 
so I'm not qualified to accept or reject the code.  What I am able to do 
is make sure that kvm can efficiently work with any driver/device stack; 
this is why ioeventfd/irqfd were merged.

I still think vbus is a duplication of effort; I understand vbus has 
larger scope than virtio, but I still think these problems could have 
been solved within the existing virtio stack.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v3] net: Add vbus_enet driver
  2009-08-04  1:14   ` [PATCH v2] " Gregory Haskins
  2009-08-04  2:38     ` David Miller
@ 2009-10-02 15:33     ` Gregory Haskins
  1 sibling, 0 replies; 62+ messages in thread
From: Gregory Haskins @ 2009-10-02 15:33 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, alacrityvm-devel

A virtualized 802.x network device based on the VBUS interface. It can be
used with any hypervisor/kernel that supports the virtual-ethernet/vbus
protocol.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: David S. Miller <davem@davemloft.net>

[ added several new features since last review:
        pre-mapped transmit descriptors,
        event queue,
        link-state event,
        tx-complete event
]

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS             |    7 
 drivers/net/Kconfig     |   14 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c | 1203 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild    |    1 
 include/linux/venet.h   |  123 +++++
 6 files changed, 1349 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 include/linux/venet.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b484756..ade37b5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5456,6 +5456,13 @@ S:	Maintained
 F:	include/linux/vbus*
 F:	drivers/vbus/*
 
+VBUS ETHERNET DRIVER
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+W:	http://developer.novell.com/wiki/index.php/AlacrityVM
+F:	include/linux/venet.h
+F:	drivers/net/vbus-enet.c
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5ce7cba..722f892 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3211,4 +3211,18 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "VBUS Ethernet Driver"
+	default n
+	select VBUS_PROXY
+	help
+	   A virtualized 802.x network device based on the VBUS
+	   "virtual-ethernet" interface.  It can be used with any
+	   hypervisor/kernel that supports the vbus+venet protocol.
+
+config VBUS_ENET_DEBUG
+	bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..2a3c7a9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -277,6 +277,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..e8a0553
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,1203 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("virtual-ethernet");
+MODULE_VERSION("1");
+
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
+
+#define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+	unsigned long            count;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct vbus_enet_queue     rxq;
+	struct {
+		struct vbus_enet_queue veq;
+		struct tasklet_struct  task;
+		struct sk_buff_head    outstanding;
+	} tx;
+	bool                       sg;
+	struct {
+		bool               enabled;
+		char              *pool;
+	} pmtd; /* pre-mapped transmit descriptors */
+	struct {
+		bool                   enabled;
+		bool                   linkstate;
+		bool                   txc;
+		unsigned long          evsize;
+		struct vbus_enet_queue veq;
+		struct tasklet_struct  task;
+		char                  *pool;
+	} evq;
+};
+
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv);
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	q->count = ringsize;
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct net_device *dev, struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = netdev_alloc_skb(dev, len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, NET_IP_ALIGN); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0); /* will never fail unless seriously broken */
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-head index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq    = priv->tx.veq.queue;
+	size_t      iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+	size_t      len    = sizeof(struct venet_sg) + iovlen;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	/* pre-allocate our descriptor pool if pmtd is enabled */
+	if (priv->pmtd.enabled) {
+		struct vbus_device_proxy *dev = priv->vdev;
+		size_t poollen = len * priv->tx.veq.count;
+		char *pool;
+		int shmid;
+
+		/* pmtdquery will return the shm-id to use for the pool */
+		ret = devcall(priv, VENET_FUNC_PMTDQUERY, NULL, 0);
+		BUG_ON(ret < 0);
+
+		shmid = ret;
+
+		pool = kzalloc(poollen, GFP_KERNEL | GFP_DMA);
+		if (!pool)
+			return -ENOMEM;
+
+		priv->pmtd.pool = pool;
+
+		ret = dev->ops->shm(dev, shmid, 0, pool, poollen, 0, NULL, 0);
+		BUG_ON(ret < 0);
+	}
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < priv->tx.veq.count; i++) {
+		struct venet_sg *vsg;
+
+		if (priv->pmtd.enabled) {
+			size_t offset = (i * len);
+
+			vsg = (struct venet_sg *)&priv->pmtd.pool[offset];
+			iter.desc->ptr = (u64)offset;
+		} else {
+			vsg = kzalloc(len, GFP_KERNEL);
+			if (!vsg)
+				return -ENOMEM;
+
+			iter.desc->ptr = (u64)__pa(vsg);
+		}
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->tx.veq.queue;
+	struct ioq_iterator iter;
+	struct sk_buff *skb;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	while ((skb = __skb_dequeue(&priv->tx.outstanding)))
+		dev_kfree_skb(skb);
+
+	if (!priv->sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	if (priv->pmtd.enabled) {
+		/*
+		 * PMTD mode means we only need to free the pool
+		 */
+		kfree(priv->pmtd.pool);
+		return;
+	}
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
+static void
+evq_teardown(struct vbus_enet_priv *priv)
+{
+	if (!priv->evq.enabled)
+		return;
+
+	ioq_put(priv->evq.veq.queue);
+	kfree(priv->evq.pool);
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		dev_warn(&dev->dev, "Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__napi_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as 0 length
+	 * packets which we can simply discard and replace with new_mtu
+	 * buffers for the future.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->dev->stats.rx_packets++;
+			priv->dev->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when it is
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG(priv->dev, "%d packets received\n", npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		napi_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "sending %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		dev_err(&priv->dev->dev, "tx on full queue bug\n");
+		return 1;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->tx.veq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	if (priv->sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_headlen(skb);
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type 0x%x\n",
+				      sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+		iter.desc->len = (u64)VSG_DESC_SIZE(vsg->count);
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
+	iter.desc->valid  = 1;
+
+	priv->dev->stats.tx_packets++;
+	priv->dev->stats.tx_bytes += skb->len;
+
+	__skb_queue_tail(&priv->tx.outstanding, skb);
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG(priv->dev, "backpressure tx queue\n");
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+/* assumes priv->lock held */
+static void
+vbus_enet_skb_complete(struct vbus_enet_priv *priv, struct sk_buff *skb)
+{
+	PDEBUG(priv->dev, "completed sending %d bytes\n",
+	       skb->len);
+
+	__skb_unlink(skb, &priv->tx.outstanding);
+	dev_kfree_skb(skb);
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->tx.veq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && !iter.desc->sown) {
+
+		if (!priv->evq.txc) {
+			struct sk_buff *skb;
+
+			if (priv->sg) {
+				struct venet_sg *vsg;
+
+				vsg = (struct venet_sg *)iter.desc->cookie;
+				skb = (struct sk_buff *)vsg->cookie;
+			} else
+				skb = (struct sk_buff *)iter.desc->cookie;
+
+			/*
+			 * If TXC is not enabled, we are required to free
+			 * the buffer resources now
+			 */
+			vbus_enet_skb_complete(priv, skb);
+		}
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->tx.veq.queue, ioq_idxtype_valid)) {
+		PDEBUG(priv->dev, "re-enabling tx queue\n");
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "Transmit timeout\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "deferred_tx_isr\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->tx.veq.queue, 0);
+}
+
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, tx.veq.notifier);
+
+	PDEBUG(priv->dev, "tx_isr\n");
+
+	ioq_notify_disable(priv->tx.veq.queue, 0);
+	tasklet_schedule(&priv->tx.task);
+}
+
+static void
+evq_linkstate_event(struct vbus_enet_priv *priv,
+		    struct venet_event_header *header)
+{
+	struct venet_event_linkstate *event =
+		(struct venet_event_linkstate *)header;
+
+	switch (event->state) {
+	case 0:
+		netif_carrier_off(priv->dev);
+		break;
+	case 1:
+		netif_carrier_on(priv->dev);
+		break;
+	default:
+		break;
+	}
+}
+
+static void
+evq_txc_event(struct vbus_enet_priv *priv,
+	      struct venet_event_header *header)
+{
+	struct venet_event_txc *event =
+		(struct venet_event_txc *)header;
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	vbus_enet_tx_reap(priv);
+	vbus_enet_skb_complete(priv, (struct sk_buff *)event->cookie);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+deferred_evq_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	int nevents = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "evq: polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->evq.veq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while (!iter.desc->sown) {
+		struct venet_event_header *header;
+
+		header = (struct venet_event_header *)iter.desc->cookie;
+
+		switch (header->id) {
+		case VENET_EVENT_LINKSTATE:
+			evq_linkstate_event(priv, header);
+			break;
+		case VENET_EVENT_TXC:
+			evq_txc_event(priv, header);
+			break;
+		default:
+			panic("venet: unexpected event id:%d of size %d\n",
+			      header->id, header->size);
+			break;
+		}
+
+		memset((void *)iter.desc->cookie, 0, priv->evq.evsize);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		nevents++;
+	}
+
+	PDEBUG(priv->dev, "%d events received\n", nevents);
+
+	ioq_notify_enable(priv->evq.veq.queue, 0);
+}
+
+static void
+evq_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, evq.veq.notifier);
+
+	PDEBUG(priv->dev, "evq_isr\n");
+
+	ioq_notify_disable(priv->evq.veq.queue, 0);
+	tasklet_schedule(&priv->evq.task);
+}
+
+static int
+vbus_enet_sg_negcap(struct vbus_enet_priv *priv)
+{
+	struct net_device *dev = priv->dev;
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN|VENET_CAP_PMTD);
+		/* note: exclude UFO for now due to stack bug */
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->sg = true;
+
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			dev->features |= NETIF_F_TSO;
+		if (caps.bits & VENET_CAP_UFO)
+			dev->features |= NETIF_F_UFO;
+		if (caps.bits & VENET_CAP_TSO6)
+			dev->features |= NETIF_F_TSO6;
+		if (caps.bits & VENET_CAP_ECN)
+			dev->features |= NETIF_F_TSO_ECN;
+
+		if (caps.bits & VENET_CAP_PMTD)
+			priv->pmtd.enabled = true;
+	}
+
+	return 0;
+}
+
+static int
+vbus_enet_evq_negcap(struct vbus_enet_priv *priv, unsigned long count)
+{
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	caps.gid = VENET_CAP_GROUP_EVENTQ;
+	caps.bits |= VENET_CAP_EVQ_LINKSTATE;
+	caps.bits |= VENET_CAP_EVQ_TXC;
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits) {
+		struct vbus_device_proxy *dev = priv->vdev;
+		struct venet_eventq_query query;
+		size_t                    poollen;
+		struct ioq_iterator       iter;
+		char                     *pool;
+		int                       i;
+
+		priv->evq.enabled = true;
+
+		if (caps.bits & VENET_CAP_EVQ_LINKSTATE) {
+			/*
+			 * We will assume there is no carrier until we get
+			 * an event telling us otherwise
+			 */
+			netif_carrier_off(priv->dev);
+			priv->evq.linkstate = true;
+		}
+
+		if (caps.bits & VENET_CAP_EVQ_TXC)
+			priv->evq.txc = true;
+
+		memset(&query, 0, sizeof(query));
+
+		ret = devcall(priv, VENET_FUNC_EVQQUERY, &query, sizeof(query));
+		if (ret < 0)
+			return ret;
+
+		priv->evq.evsize = query.evsize;
+		poollen = query.evsize * count;
+
+		pool = kzalloc(poollen, GFP_KERNEL | GFP_DMA);
+		if (!pool)
+			return -ENOMEM;
+
+		priv->evq.pool = pool;
+
+		ret = dev->ops->shm(dev, query.dpid, 0,
+				    pool, poollen, 0, NULL, 0);
+		if (ret < 0)
+			return ret;
+
+		queue_init(priv, &priv->evq.veq, query.qid, count, evq_isr);
+
+		ret = ioq_iter_init(priv->evq.veq.queue,
+				    &iter, ioq_idxtype_valid, 0);
+		BUG_ON(ret < 0);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+		BUG_ON(ret < 0);
+
+		/* Now populate each descriptor with an empty event */
+		for (i = 0; i < count; i++) {
+			size_t offset = (i * query.evsize);
+			void *addr = &priv->evq.pool[offset];
+
+			iter.desc->ptr    = (u64)offset;
+			iter.desc->cookie = (u64)addr;
+			iter.desc->len    = query.evsize;
+
+			ret = ioq_iter_push(&iter, 0);
+			BUG_ON(ret < 0);
+		}
+
+		/* Finally, enable interrupts */
+		tasklet_init(&priv->evq.task, deferred_evq_isr,
+			     (unsigned long)priv);
+		ioq_notify_enable(priv->evq.veq.queue, 0);
+	}
+
+	return 0;
+}
+
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	int ret;
+
+	ret = vbus_enet_sg_negcap(priv);
+	if (ret < 0)
+		return ret;
+
+	return vbus_enet_evq_negcap(priv, tx_ringlen);
+}
+
+static int vbus_enet_set_tx_csum(struct net_device *dev, u32 data)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+
+	if (data && !priv->sg)
+		return -ENOSYS;
+
+	return ethtool_op_set_tx_hw_csum(dev, data);
+}
+
+static struct ethtool_ops vbus_enet_ethtool_ops = {
+	.set_tx_csum = vbus_enet_set_tx_csum,
+	.set_sg      = ethtool_op_set_sg,
+	.set_tso     = ethtool_op_set_tso,
+	.get_link    = ethtool_op_get_link,
+};
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open            = vbus_enet_open,
+	.ndo_stop            = vbus_enet_stop,
+	.ndo_set_config      = vbus_enet_config,
+	.ndo_start_xmit      = vbus_enet_tx_start,
+	.ndo_change_mtu	     = vbus_enet_change_mtu,
+	.ndo_tx_timeout      = vbus_enet_timeout,
+	.ndo_set_mac_address = eth_mac_addr,
+	.ndo_validate_addr   = eth_validate_addr,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	ret = vbus_enet_negcap(priv);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for %lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	skb_queue_head_init(&priv->tx.outstanding);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->tx.veq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+	tx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable rx interrupts */
+
+	if (!priv->evq.txc) {
+		/*
+		 * If the TXC feature is present, we will receive our
+		 * tx-complete notification via the event-channel.  Therefore,
+		 * we only enable txq interrupts if the TXC feature is not
+		 * present.
+		 */
+		tasklet_init(&priv->tx.task, deferred_tx_isr,
+			     (unsigned long)priv);
+		ioq_notify_enable(priv->tx.veq.queue, 0);
+	}
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+	SET_ETHTOOL_OPS(dev, &vbus_enet_ethtool_ops);
+	SET_NETDEV_DEV(dev, &vdev->dev);
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, 128);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for %lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
+	ioq_put(priv->tx.veq.queue);
+
+	if (priv->evq.enabled)
+		evq_teardown(priv);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index fa15bbf..911f7ef 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
 unifdef-y += vbus_pci.h
+unifdef-y += venet.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..b6bfd91
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,123 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#include <linux/types.h>
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+#define VENET_CAP_GROUP_SG     0
+#define VENET_CAP_GROUP_EVENTQ 1
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+#define VENET_CAP_PMTD   (1 << 5) /* pre-mapped tx desc */
+
+/* CAPABILITIES-GROUP EVENTQ */
+#define VENET_CAP_EVQ_LINKSTATE  (1 << 0)
+#define VENET_CAP_EVQ_TXC        (1 << 1) /* tx-complete */
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
+	struct venet_iov iov[1];
+};
+
+struct venet_eventq_query {
+	__u32 flags;
+	__u32 evsize;  /* size of each event */
+	__u32 dpid;    /* descriptor pool-id */
+	__u32 qid;
+	__u8  pad[16];
+};
+
+#define VENET_EVENT_LINKSTATE 0
+#define VENET_EVENT_TXC       1
+
+struct venet_event_header {
+	__u32 flags;
+	__u32 size;
+	__u32 id;
+};
+
+struct venet_event_linkstate {
+	struct venet_event_header header;
+	__u8                      state; /* 0 = down, 1 = up */
+};
+
+struct venet_event_txc {
+	struct venet_event_header header;
+	__u32                     txqid;
+	__u64                     cookie;
+};
+
+#define VSG_DESC_SIZE(count) (sizeof(struct venet_sg) + \
+			      sizeof(struct venet_iov) * ((count) - 1))
+
+#define VENET_FUNC_LINKUP    0
+#define VENET_FUNC_LINKDOWN  1
+#define VENET_FUNC_MACQUERY  2
+#define VENET_FUNC_NEGCAP    3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX   4
+#define VENET_FUNC_PMTDQUERY 5
+#define VENET_FUNC_EVQQUERY  6
+
+#endif /* _LINUX_VENET_H */


Thread overview: 62+ messages
2009-08-03 17:17 [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
2009-08-03 17:17 ` [PATCH 1/7] shm-signal: shared-memory signals Gregory Haskins
2009-08-06 13:56   ` Arnd Bergmann
2009-08-06 15:11     ` Gregory Haskins
2009-08-06 20:51       ` Ira W. Snyder
2009-08-03 17:17 ` [PATCH 2/7] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
2009-08-03 17:17 ` [PATCH 3/7] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
2009-08-03 17:17 ` [PATCH 4/7] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
2009-08-06 14:42   ` Arnd Bergmann
2009-08-06 15:59     ` Gregory Haskins
2009-08-06 17:03       ` Arnd Bergmann
2009-08-06 21:04         ` Gregory Haskins
2009-08-06 22:57           ` Arnd Bergmann
2009-08-07  4:42             ` Gregory Haskins
2009-08-07 14:57               ` Arnd Bergmann
2009-08-07 15:44                 ` Gregory Haskins
2009-08-07 15:44                   ` Gregory Haskins
2009-08-07 15:55               ` Ira W. Snyder
2009-08-07 18:25                 ` Gregory Haskins
2009-08-03 17:17 ` [PATCH 5/7] ioq: add driver-side vbus helpers Gregory Haskins
2009-08-03 17:18 ` [PATCH 6/7] net: Add vbus_enet driver Gregory Haskins
2009-08-03 18:30   ` Stephen Hemminger
2009-08-03 20:10     ` Gregory Haskins
2009-08-03 20:19       ` Stephen Hemminger
2009-08-03 20:24         ` Gregory Haskins
2009-08-03 20:29           ` Stephen Hemminger
2009-08-04  1:14   ` [PATCH v2] " Gregory Haskins
2009-08-04  2:38     ` David Miller
2009-08-04 13:57       ` [Alacrityvm-devel] " Gregory Haskins
2009-10-02 15:33     ` [PATCH v3] " Gregory Haskins
2009-08-03 17:18 ` [PATCH 7/7] venet: add scatter-gather/GSO support Gregory Haskins
2009-08-03 18:32   ` Stephen Hemminger
2009-08-03 19:30     ` Gregory Haskins
2009-08-03 18:33   ` Stephen Hemminger
2009-08-03 19:57     ` Gregory Haskins
2009-08-06  8:19 ` [PATCH 0/7] AlacrityVM guest drivers Michael S. Tsirkin
2009-08-06 10:17   ` Michael S. Tsirkin
2009-08-06 12:09     ` Gregory Haskins
2009-08-06 12:08   ` Gregory Haskins
2009-08-06 12:24     ` Michael S. Tsirkin
2009-08-06 13:00       ` Gregory Haskins
2009-08-06 12:54     ` Avi Kivity
2009-08-06 13:03       ` Gregory Haskins
2009-08-06 13:44         ` Avi Kivity
2009-08-06 13:45           ` Gregory Haskins
2009-08-06 13:57             ` Avi Kivity
2009-08-06 14:06               ` Gregory Haskins
2009-08-06 15:40                 ` Arnd Bergmann
2009-08-06 15:45                   ` Michael S. Tsirkin
2009-08-06 16:28                     ` Pantelis Koukousoulas
2009-08-07 12:14                       ` Gregory Haskins
2009-08-06 15:50                   ` Avi Kivity
2009-08-06 16:55                     ` Gregory Haskins
2009-08-09  7:48                       ` Avi Kivity
2009-08-06 16:29                   ` Gregory Haskins
2009-08-06 23:23                     ` Ira W. Snyder
2009-08-06 13:59             ` Michael S. Tsirkin
2009-08-06 14:07               ` Gregory Haskins
2009-08-07 14:19   ` Anthony Liguori
2009-08-07 15:05     ` [PATCH 0/7] AlacrityVM guest drivers Gregory Haskins
2009-08-07 15:46       ` Anthony Liguori
2009-08-07 18:04         ` Gregory Haskins
