* [PATCH v3 0/6] AlacrityVM guest drivers
@ 2009-08-14 15:42 Gregory Haskins
  2009-08-14 15:42 ` [PATCH v3 1/6] shm-signal: shared-memory signals Gregory Haskins
                   ` (5 more replies)
  0 siblings, 6 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:42 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

(Applies to v2.6.31-rc6)

This series implements the guest-side drivers for accelerated IO
when running on top of the AlacrityVM hypervisor, the details of
which you can find here:

http://developer.novell.com/wiki/index.php/AlacrityVM

This series includes the basic plumbing, as well as the driver for
accelerated 802.x (ethernet) networking.

[ Changelog:

  v3:
     *) pci-bridge changes:
          *) updated ABI to support FASTCALL
          *) got rid of confusing "hypercall" nomenclature

  v2:
     *) venet changes: updated the venet driver based on Stephen Hemminger's
        feedback:
          *) folded patches 6/7 and 7/7 together
          *) got rid of shadow flags
          *) added missing baseline .ndo callbacks
          *) added support for ethtool

  v1:
     *) initial release
]

Regards,
-Greg

---

Gregory Haskins (6):
      net: Add vbus_enet driver
      ioq: add driver-side vbus helpers
      vbus-proxy: add a pci-to-vbus bridge
      vbus: add a "vbus-proxy" bus model for vbus_driver objects
      ioq: Add basic definitions for a shared-memory, lockless queue
      shm-signal: shared-memory signals


 MAINTAINERS                 |   25 +
 arch/x86/Kconfig            |    2 
 drivers/Makefile            |    1 
 drivers/net/Kconfig         |   14 +
 drivers/net/Makefile        |    1 
 drivers/net/vbus-enet.c     |  895 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vbus/Kconfig        |   24 +
 drivers/vbus/Makefile       |    6 
 drivers/vbus/bus-proxy.c    |  216 ++++++++++
 drivers/vbus/pci-bridge.c   |  877 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild        |    4 
 include/linux/ioq.h         |  415 ++++++++++++++++++++
 include/linux/shm_signal.h  |  189 +++++++++
 include/linux/vbus_driver.h |   80 ++++
 include/linux/vbus_pci.h    |  145 +++++++
 include/linux/venet.h       |   84 ++++
 lib/Kconfig                 |   21 +
 lib/Makefile                |    2 
 lib/ioq.c                   |  294 ++++++++++++++
 lib/shm_signal.c            |  192 +++++++++
 20 files changed, 3487 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 drivers/vbus/Kconfig
 create mode 100644 drivers/vbus/Makefile
 create mode 100644 drivers/vbus/bus-proxy.c
 create mode 100644 drivers/vbus/pci-bridge.c
 create mode 100644 include/linux/ioq.h
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 include/linux/vbus_pci.h
 create mode 100644 include/linux/venet.h
 create mode 100644 lib/ioq.c
 create mode 100644 lib/shm_signal.c

-- 
Signature

^ permalink raw reply	[flat|nested] 132+ messages in thread

* [PATCH v3 1/6] shm-signal: shared-memory signals
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
@ 2009-08-14 15:42 ` Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 2/6] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:42 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

shm-signal provides a generic shared-memory based bidirectional
signaling mechanism.  It is used in conjunction with an existing
signal transport (such as posix-signals, interrupts, pipes, etc) to
increase the efficiency of the transport since the state information
is directly accessible to both sides of the link.  The shared-memory
design provides very cheap access to features such as event-masking
and spurious-delivery mitigation, and is useful for implementing
higher-level shared-memory constructs such as rings.

We will use this mechanism as the basis for a shared-memory interface
later in the series.
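
To aid review, here is a minimal usage sketch against the interface
added below.  The my_* names are hypothetical, and the inject() body is
a stand-in for whatever transport (hypercall, interrupt, pipe, etc)
actually crosses the locale boundary:

#include <linux/shm_signal.h>

static void my_notify(struct shm_signal_notifier *notifier)
{
        /* the remote locale has touched our shared memory */
}

static int my_inject(struct shm_signal *s)
{
        /* transport specific: e.g. hypercall or interrupt injection */
        return 0;
}

static void my_release(struct shm_signal *s)
{
        /* free any state associated with 's' */
}

static struct shm_signal_ops my_ops = {
        .inject  = my_inject,
        .release = my_release,
};

static struct shm_signal_notifier my_notifier = {
        .signal = my_notify,
};

/* 'desc' points into memory that is shared with the remote locale */
static void my_setup(struct shm_signal *s, struct shm_signal_desc *desc)
{
        desc->magic = SHM_SIGNAL_MAGIC;
        desc->ver   = SHM_SIGNAL_VER;

        shm_signal_init(s, shm_locality_north, &my_ops, desc);
        s->notifier = &my_notifier;

        shm_signal_enable(s, 0);  /* unmask local notifications */
        shm_signal_inject(s, 0);  /* tell the remote side shm is dirty */
}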

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS                |    6 +
 include/linux/Kbuild       |    1 
 include/linux/shm_signal.h |  189 +++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |    9 ++
 lib/Makefile               |    1 
 lib/shm_signal.c           |  192 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 398 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 lib/shm_signal.c

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..3e736fe 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4555,6 +4555,12 @@ F:	drivers/serial/serial_lh7a40x.c
 F:	drivers/usb/gadget/lh7a40*
 F:	drivers/usb/host/ohci-lh7a40*
 
+SHM-SIGNAL LIBRARY
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+F:	include/linux/shm_signal.h
+F:	lib/shm_signal.c
+
 SHPC HOTPLUG DRIVER
 M:	Kristen Carlson Accardi <kristen.c.accardi@intel.com>
 L:	linux-pci@vger.kernel.org
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..01d67b6 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -331,6 +331,7 @@ unifdef-y += serial_core.h
 unifdef-y += serial.h
 unifdef-y += serio.h
 unifdef-y += shm.h
+unifdef-y += shm_signal.h
 unifdef-y += signal.h
 unifdef-y += smb_fs.h
 unifdef-y += smb.h
diff --git a/include/linux/shm_signal.h b/include/linux/shm_signal.h
new file mode 100644
index 0000000..21cf750
--- /dev/null
+++ b/include/linux/shm_signal.h
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SHM_SIGNAL_H
+#define _LINUX_SHM_SIGNAL_H
+
+#include <linux/types.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+
+#define SHM_SIGNAL_MAGIC 0x58fa39df
+#define SHM_SIGNAL_VER   1
+
+struct shm_signal_irq {
+	__u8                  enabled;
+	__u8                  pending;
+	__u8                  dirty;
+};
+
+enum shm_signal_locality {
+	shm_locality_north,
+	shm_locality_south,
+};
+
+struct shm_signal_desc {
+	__u32                 magic;
+	__u32                 ver;
+	struct shm_signal_irq irq[2];
+};
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/kref.h>
+#include <linux/interrupt.h>
+
+struct shm_signal_notifier {
+	void (*signal)(struct shm_signal_notifier *);
+};
+
+struct shm_signal;
+
+struct shm_signal_ops {
+	int      (*inject)(struct shm_signal *s);
+	void     (*fault)(struct shm_signal *s, const char *fmt, ...);
+	void     (*release)(struct shm_signal *s);
+};
+
+enum {
+	shm_signal_in_wakeup,
+};
+
+struct shm_signal {
+	struct kref                 kref;
+	spinlock_t                  lock;
+	enum shm_signal_locality    locale;
+	unsigned long               flags;
+	struct shm_signal_ops      *ops;
+	struct shm_signal_desc     *desc;
+	struct shm_signal_notifier *notifier;
+	struct tasklet_struct       deferred_notify;
+};
+
+#define SHM_SIGNAL_FAULT(s, fmt, args...)  \
+  ((s)->ops->fault ? (s)->ops->fault((s), fmt, ## args) : panic(fmt, ## args))
+
+ /*
+  * These functions should only be used internally
+  */
+void _shm_signal_release(struct kref *kref);
+void _shm_signal_wakeup(struct shm_signal *s);
+
+/**
+ * shm_signal_init() - initialize an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ *
+ * Initializes SHM_SIGNAL context before first use
+ *
+ **/
+void shm_signal_init(struct shm_signal *s, enum shm_signal_locality locale,
+		     struct shm_signal_ops *ops, struct shm_signal_desc *desc);
+
+/**
+ * shm_signal_get() - acquire an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline struct shm_signal *shm_signal_get(struct shm_signal *s)
+{
+	kref_get(&s->kref);
+
+	return s;
+}
+
+/**
+ * shm_signal_put() - release an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline void shm_signal_put(struct shm_signal *s)
+{
+	kref_put(&s->kref, _shm_signal_release);
+}
+
+/**
+ * shm_signal_enable() - enables local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered notifier (if applicable) to receive wakeups
+ * whenever the remote side performs an shm_signal() operation. A notification
+ * will be dispatched immediately if any pending signals have already been
+ * issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_enable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_disable() - disable local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered shm_signal_notifier (if applicable) from
+ * receiving any further notifications.  Any subsequent calls to shm_signal()
+ * by the remote side will update the shm as dirty, but will not traverse the
+ * locale boundary and will not invoke the notifier callback.  Signals
+ * delivered while masked will be deferred until shm_signal_enable() is
+ * invoked.
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_disable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_inject() - notify the remote side about shm changes
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the shm state as "dirty" and, if enabled, will traverse
+ * a locale boundary to inject a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the shm_signal_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the shm_signal_ops->inject() interface and provided by a particular
+ * implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_inject(struct shm_signal *s, int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SHM_SIGNAL_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index bb1326d..136da19 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -200,4 +200,13 @@ config NLATTR
 config GENERIC_ATOMIC64
        bool
 
+config SHM_SIGNAL
+	tristate "SHM Signal - Generic shared-memory signaling mechanism"
+	default n
+	help
+	 Provides a shared-memory based signaling mechanism to indicate
+         memory-dirty notifications between two end-points.
+
+	 If unsure, say N
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 2e78277..503bf7b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
+obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/shm_signal.c b/lib/shm_signal.c
new file mode 100644
index 0000000..fbba74f
--- /dev/null
+++ b/lib/shm_signal.c
@@ -0,0 +1,192 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * See include/linux/shm_signal.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+
+int shm_signal_enable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	unsigned long iflags;
+
+	spin_lock_irqsave(&s->lock, iflags);
+
+	irq->enabled = 1;
+	wmb();
+
+	if ((irq->dirty || irq->pending)
+	    && !test_bit(shm_signal_in_wakeup, &s->flags)) {
+		rmb();
+		tasklet_schedule(&s->deferred_notify);
+	}
+
+	spin_unlock_irqrestore(&s->lock, iflags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_enable);
+
+int shm_signal_disable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+
+	irq->enabled = 0;
+	wmb();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_disable);
+
+/*
+ * signaling protocol:
+ *
+ * each side of the shm_signal has an "irq" structure with the following
+ * fields:
+ *
+ *    - enabled: controlled by shm_signal_enable/disable() to mask/unmask
+ *               the notification locally
+ *    - dirty:   indicates if the shared-memory is dirty or clean.  This
+ *               is updated regardless of the enabled/pending state so that
+ *               the state is always accurately tracked.
+ *    - pending: indicates if a signal is pending to the remote locale.
+ *               This allows us to determine if a remote-notification is
+ *               already in flight to optimize spurious notifications away.
+ */
+int shm_signal_inject(struct shm_signal *s, int flags)
+{
+	/* Load the irq structure from the other locale */
+	struct shm_signal_irq *irq = &s->desc->irq[!s->locale];
+
+	/*
+	 * We always mark the remote side as dirty regardless of whether
+	 * they need to be notified.
+	 */
+	irq->dirty = 1;
+	wmb();   /* dirty must be visible before we test the pending state */
+
+	if (irq->enabled && !irq->pending) {
+		rmb();
+
+		/*
+		 * If the remote side has enabled notifications, and we do
+		 * not see a notification pending, we must inject a new one.
+		 */
+		irq->pending = 1;
+		wmb(); /* make it visible before we do the injection */
+
+		s->ops->inject(s);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_inject);
+
+void _shm_signal_wakeup(struct shm_signal *s)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	int dirty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s->lock, flags);
+
+	__set_bit(shm_signal_in_wakeup, &s->flags);
+
+	/*
+	 * The outer loop protects against race conditions between
+	 * irq->dirty and irq->pending updates
+	 */
+	while (irq->enabled && (irq->dirty || irq->pending)) {
+
+		/*
+		 * Run until we completely exhaust irq->dirty (it may
+		 * be re-dirtied by the remote side while we are in the
+		 * callback).  We let "pending" remain untouched until we have
+		 * processed them all so that the remote side knows we do not
+		 * need a new notification (yet).
+		 */
+		do {
+			irq->dirty = 0;
+			/* the unlock is an implicit wmb() for dirty = 0 */
+			spin_unlock_irqrestore(&s->lock, flags);
+
+			if (s->notifier)
+				s->notifier->signal(s->notifier);
+
+			spin_lock_irqsave(&s->lock, flags);
+			dirty = irq->dirty;
+			rmb();
+
+		} while (irq->enabled && dirty);
+
+		barrier();
+
+		/*
+		 * We can finally acknowledge the notification by clearing
+		 * "pending" after all of the dirty memory has been processed
+		 * Races against this clearing are handled by the outer loop.
+		 * Subsequent iterations of this loop will execute with
+		 * pending=0 potentially leading to future spurious
+		 * notifications, but this is an acceptable tradeoff as this
+		 * will be rare and harmless.
+		 */
+		irq->pending = 0;
+		wmb();
+
+	}
+
+	__clear_bit(shm_signal_in_wakeup, &s->flags);
+	spin_unlock_irqrestore(&s->lock, flags);
+
+}
+EXPORT_SYMBOL_GPL(_shm_signal_wakeup);
+
+void _shm_signal_release(struct kref *kref)
+{
+	struct shm_signal *s = container_of(kref, struct shm_signal, kref);
+
+	s->ops->release(s);
+}
+EXPORT_SYMBOL_GPL(_shm_signal_release);
+
+static void
+deferred_notify(unsigned long data)
+{
+	struct shm_signal *s = (struct shm_signal *)data;
+
+	_shm_signal_wakeup(s);
+}
+
+void shm_signal_init(struct shm_signal *s, enum shm_signal_locality locale,
+		     struct shm_signal_ops *ops, struct shm_signal_desc *desc)
+{
+	memset(s, 0, sizeof(*s));
+	kref_init(&s->kref);
+	spin_lock_init(&s->lock);
+	tasklet_init(&s->deferred_notify,
+		     deferred_notify,
+		     (unsigned long)s);
+	s->locale   = locale;
+	s->ops      = ops;
+	s->desc     = desc;
+}
+EXPORT_SYMBOL_GPL(shm_signal_init);


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v3 2/6] ioq: Add basic definitions for a shared-memory, lockless queue
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
  2009-08-14 15:42 ` [PATCH v3 1/6] shm-signal: shared-memory signals Gregory Haskins
@ 2009-08-14 15:43 ` Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:43 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

IOQ allows asynchronous communication between two end-points via a common
shared-memory region.  Memory is synchronized using pure barriers (i.e.
lockless), and updates are communicated via an embedded shm-signal.  The
design of the interface allows one code base to universally provide both
sides of a given channel.

We will use this mechanism later in the series to efficiently move data
in and out of a guest kernel from various sources, including both
infrastructure level and application level transports.
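
To aid review, here is a hypothetical north-side producer sketch using
the iterator idiom from the header below.  Error handling is abbreviated
and the buffer handling is a placeholder for what a real driver would do:

#include <linux/mm.h>
#include <linux/ioq.h>

/* append one buffer to the tail of the "valid" index and kick the host */
static int my_enqueue(struct ioq *ioq, void *buf, size_t len)
{
        struct ioq_iterator iter;
        int ret;

        ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
        if (ret < 0)
                return ret;

        /* start at the tail of the index */
        ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
        if (ret < 0)
                return ret;

        iter.desc->cookie = (u64)(unsigned long)buf;
        iter.desc->ptr    = (u64)__pa(buf);
        iter.desc->len    = (u64)len;
        iter.desc->valid  = 1;

        /* advances the tail and flips ownership to the south side */
        ret = ioq_iter_push(&iter, 0);
        if (ret < 0)
                return ret;

        /* notify the remote side via the embedded shm-signal */
        return ioq_signal(ioq, 0);
}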

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS          |    6 +
 include/linux/Kbuild |    1 
 include/linux/ioq.h  |  415 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig          |   12 +
 lib/Makefile         |    1 
 lib/ioq.c            |  294 +++++++++++++++++++++++++++++++++++
 6 files changed, 729 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ioq.h
 create mode 100644 lib/ioq.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3e736fe..d0ea25c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2707,6 +2707,12 @@ L:	linux-mips@linux-mips.org
 S:	Maintained
 F:	drivers/serial/ioc3_serial.c
 
+IOQ LIBRARY
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+F:	include/linux/ioq.h
+F:	lib/ioq.c
+
 IP MASQUERADING
 M:	Juanjo Ciarlante <jjciarla@raiz.uncu.edu.ar>
 S:	Maintained
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 01d67b6..32b3eb8 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -247,6 +247,7 @@ unifdef-y += in.h
 unifdef-y += in6.h
 unifdef-y += inotify.h
 unifdef-y += input.h
+unifdef-y += ioq.h
 unifdef-y += ip.h
 unifdef-y += ipc.h
 unifdef-y += ipmi.h
diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..f77e316
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,415 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory, lockless queue mechanism. It can be used
+ * in a variety of ways, though its intended purpose is to become the
+ * asynchronous communication path for virtual-bus drivers.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared-memory is always allocated on explicitly one side of the
+ *    link.  This typically would be the guest side in a VM/VMM scenario.
+ * #) Each IOQ has the concept of "north" and "south" locales, where
+ *    north denotes the memory-owner side (e.g. guest).
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) Provides a bi-directional signaling/notification infrastructure on
+ *    a per-queue basis, which includes an event mitigation strategy
+ *    to reduce boundary switching.
+ * #) The signaling path is abstracted so that various technologies and
+ *    topologies can define their own specific implementation while sharing
+ *    the basic structures and code.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_IOQ_H
+#define _LINUX_IOQ_H
+
+#include <linux/types.h>
+#include <linux/shm_signal.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+struct ioq_ring_desc {
+	__u64                 cookie; /* for arbitrary use by north-side */
+	__u64                 ptr;
+	__u64                 len;
+	__u8                  valid;
+	__u8                  sown; /* South owned = 1, North owned = 0 */
+};
+
+#define IOQ_RING_MAGIC 0x47fa2fe4
+#define IOQ_RING_VER   4
+
+struct ioq_ring_idx {
+	__u32                 head;    /* 0 based index to head of ptr array */
+	__u32                 tail;    /* 0 based index to tail of ptr array */
+	__u8                  full;
+};
+
+enum ioq_locality {
+	ioq_locality_north,
+	ioq_locality_south,
+};
+
+struct ioq_ring_head {
+	__u32                  magic;
+	__u32                  ver;
+	struct shm_signal_desc signal;
+	struct ioq_ring_idx    idx[2];
+	__u32                  count;
+	struct ioq_ring_desc   ring[1]; /* "count" elements will be allocated */
+};
+
+#define IOQ_HEAD_DESC_SIZE(count) \
+    (sizeof(struct ioq_ring_head) + sizeof(struct ioq_ring_desc) * (count - 1))
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+#include <linux/kref.h>
+
+enum ioq_idx_type {
+	ioq_idxtype_valid,
+	ioq_idxtype_inuse,
+	ioq_idxtype_both,
+	ioq_idxtype_invalid,
+};
+
+enum ioq_seek_type {
+	ioq_seek_tail,
+	ioq_seek_next,
+	ioq_seek_head,
+	ioq_seek_set
+};
+
+struct ioq_iterator {
+	struct ioq            *ioq;
+	struct ioq_ring_idx   *idx;
+	u32                    pos;
+	struct ioq_ring_desc  *desc;
+	int                    update:1;
+	int                    dualidx:1;
+	int                    flipowner:1;
+};
+
+struct ioq_notifier {
+	void (*signal)(struct ioq_notifier *);
+};
+
+struct ioq_ops {
+	void     (*release)(struct ioq *ioq);
+};
+
+struct ioq {
+	struct ioq_ops *ops;
+
+	struct kref            kref;
+	enum ioq_locality      locale;
+	struct ioq_ring_head  *head_desc;
+	struct ioq_ring_desc  *ring;
+	struct shm_signal     *signal;
+	wait_queue_head_t      wq;
+	struct ioq_notifier   *notifier;
+	size_t                 count;
+	struct shm_signal_notifier shm_notifier;
+};
+
+#define IOQ_ITER_AUTOUPDATE  (1 << 0)
+#define IOQ_ITER_NOFLIPOWNER (1 << 1)
+
+/**
+ * ioq_init() - initialize an IOQ
+ * @ioq:        IOQ context
+ *
+ * Initializes IOQ context before first use
+ *
+ **/
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count);
+
+/**
+ * ioq_get() - acquire an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline struct ioq *ioq_get(struct ioq *ioq)
+{
+	kref_get(&ioq->kref);
+
+	return ioq;
+}
+
+static inline void _ioq_kref_release(struct kref *kref)
+{
+	struct ioq *ioq = container_of(kref, struct ioq, kref);
+
+	shm_signal_put(ioq->signal);
+	ioq->ops->release(ioq);
+}
+
+/**
+ * ioq_put() - release an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline void ioq_put(struct ioq *ioq)
+{
+	kref_put(&ioq->kref, _ioq_kref_release);
+}
+
+/**
+ * ioq_notify_enable() - enables local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered ioq_notifier (if applicable) and waitq to
+ * receive wakeups whenever the remote side performs an ioq_signal() operation.
+ * A notification will be dispatched immediately if any pending signals have
+ * already been issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_enable(struct ioq *ioq, int flags)
+{
+	return shm_signal_enable(ioq->signal, 0);
+}
+
+/**
+ * ioq_notify_disable() - disable local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered ioq_notifier (if applicable) and waitq
+ * from receiving any further notifications.  Any subsequent calls to
+ * ioq_signal() by the remote side will update the ring as dirty, but
+ * will not traverse the locale boundary and will not invoke the notifier
+ * callback or wakeup the waitq.  Signals delivered while masked will
+ * be deferred until ioq_notify_enable() is invoked
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_disable(struct ioq *ioq, int flags)
+{
+	return shm_signal_disable(ioq->signal, 0);
+}
+
+/**
+ * ioq_signal() - notify the remote side about ring changes
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the ring state as "dirty" and, if enabled, will traverse
+ * a locale boundary to invoke a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the ioq_notify_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted by the
+ * underlying shm_signal_ops->inject() interface and provided by a particular
+ * implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_signal(struct ioq *ioq, int flags)
+{
+	return shm_signal_inject(ioq->signal, 0);
+}
+
+/**
+ * ioq_count() - counts the number of outstanding descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors outstanding in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_remain() - counts the number of remaining descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * This is the converse of ioq_count().  This function returns the number
+ * of "free" descriptors left in a particular index
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors remaining in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_size() - counts the maximum number of descriptors in a ring
+ * @ioq:        IOQ context
+ *
+ * This function returns the maximum number of descriptors supported in
+ * a ring, regardless of their current state (free or inuse).
+ *
+ * Returns:
+ *  (*) >=0: total # of descriptors in the ring
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_size(struct ioq *ioq);
+
+/**
+ * ioq_full() - determines if a specific index is "full"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not full
+ *  (*) 1: index is full
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_empty() - determines if a specific index is "empty"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not empty
+ *  (*) 1: index is empty
+ *  (*) <0 = ERRNO
+ *
+ **/
+static inline int ioq_empty(struct ioq *ioq, enum ioq_idx_type type)
+{
+    return !ioq_count(ioq, type);
+}
+
+/**
+ * ioq_iter_init() - initialize an iterator for IOQ descriptor traversal
+ * @ioq:        IOQ context to iterate on
+ * @iter:	Iterator context to init (usually from stack)
+ * @type:	Specifies the index type to iterate against
+ *                 (*) valid: iterate against the "valid" index
+ *                 (*) inuse: iterate against the "inuse" index
+ *                 (*) both: iterate against both indexes simultaneously
+ * @flags:      Bitfield with 0 or more bits set to alter behavior
+ *                 (*) autoupdate: automatically signal the remote side
+ *                     whenever the iterator pushes/pops to a new desc
+ *                 (*) noflipowner: do not flip the ownership bit during
+ *                     a push/pop operation
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags);
+
+/**
+ * ioq_iter_seek() - seek to a specific location in the IOQ ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @type:	Specifies the type of seek operation
+ *                 (*) tail: seek to the absolute tail, offset is ignored
+ *                 (*) next: seek to the relative next, offset is ignored
+ *                 (*) head: seek to the absolute head, offset is ignored
+ *                 (*) set: seek to the absolute offset
+ * @offset:     Offset for ioq_seek_set operations
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		   long offset, int flags);
+
+/**
+ * ioq_iter_push() - push the tail pointer forward
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the tail ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pushes" a new pointer
+ * onto the tail of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_push(struct ioq_iterator *iter, int flags);
+
+/**
+ * ioq_iter_pop() - pop the head pointer from the ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the head ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pops" a pointer
+ * from the head of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_pop(struct ioq_iterator *iter,  int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_IOQ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 136da19..255778d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -209,4 +209,16 @@ config SHM_SIGNAL
 
 	 If unsure, say N
 
+config IOQ
+	tristate "IO-Queue library - Generic shared-memory queue"
+	select SHM_SIGNAL
+	default n
+	help
+	 IOQ is a generic shared-memory-queue mechanism that happens to be
+	 friendly to virtualization boundaries. It can be used in a variety
+	 of ways, though its intended purpose is to become a low-level
+	 communication path for paravirtualized drivers.
+
+	 If unsure, say N
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 503bf7b..215f0c9 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -77,6 +77,7 @@ obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
+obj-$(CONFIG_IOQ) += ioq.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/ioq.c b/lib/ioq.c
new file mode 100644
index 0000000..af3090f
--- /dev/null
+++ b/lib/ioq.c
@@ -0,0 +1,294 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/ioq.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+
+#ifndef NULL
+#define NULL 0
+#endif
+
+static int ioq_iter_setpos(struct ioq_iterator *iter, u32 pos)
+{
+	struct ioq *ioq = iter->ioq;
+
+	BUG_ON(pos >= ioq->count);
+
+	iter->pos  = pos;
+	iter->desc = &ioq->ring[pos];
+
+	return 0;
+}
+
+static inline u32 modulo_inc(u32 val, u32 mod)
+{
+	BUG_ON(val >= mod);
+
+	if (val == (mod - 1))
+		return 0;
+
+	return val + 1;
+}
+
+static inline int idx_full(struct ioq_ring_idx *idx)
+{
+	return idx->full && (idx->head == idx->tail);
+}
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		  long offset, int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	u32 pos;
+
+	switch (type) {
+	case ioq_seek_next:
+		pos = modulo_inc(iter->pos, iter->ioq->count);
+		break;
+	case ioq_seek_tail:
+		pos = idx->tail;
+		break;
+	case ioq_seek_head:
+		pos = idx->head;
+		break;
+	case ioq_seek_set:
+		if (offset >= iter->ioq->count)
+			return -EINVAL;
+		pos = offset;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ioq_iter_setpos(iter, pos);
+}
+EXPORT_SYMBOL_GPL(ioq_iter_seek);
+
+static int ioq_ring_count(struct ioq_ring_idx *idx, int count)
+{
+	if (idx->full && (idx->head == idx->tail))
+		return count;
+	else if (idx->tail >= idx->head)
+		return idx->tail - idx->head;
+	else
+		return (idx->tail + count) - idx->head;
+}
+
+static void idx_tail_push(struct ioq_ring_idx *idx, int count)
+{
+	u32 tail = modulo_inc(idx->tail, count);
+
+	if (idx->head == tail) {
+		rmb();
+
+		/*
+		 * Setting full here may look racy, but note that we haven't
+		 * flipped the owner bit yet.  So it is impossible for the
+		 * remote locale to move head in such a way that this operation
+		 * becomes invalid
+		 */
+		idx->full = 1;
+		wmb();
+	}
+
+	idx->tail = tail;
+}
+
+int ioq_iter_push(struct ioq_iterator *iter, int flags)
+{
+	struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+	struct ioq_ring_idx  *idx  = iter->idx;
+	int ret;
+
+	/*
+	 * It's only valid to push if we are currently pointed at the tail
+	 */
+	if (iter->pos != idx->tail || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	idx_tail_push(idx, iter->ioq->count);
+	if (iter->dualidx) {
+		idx_tail_push(&head_desc->idx[ioq_idxtype_inuse],
+			      iter->ioq->count);
+		if (head_desc->idx[ioq_idxtype_inuse].tail !=
+		    head_desc->idx[ioq_idxtype_valid].tail) {
+			SHM_SIGNAL_FAULT(iter->ioq->signal,
+					 "Tails not synchronized");
+			return -EINVAL;
+		}
+	}
+
+	wmb(); /* the index must be visible before the sown, or signal */
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_push);
+
+int ioq_iter_pop(struct ioq_iterator *iter,  int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	int ret;
+
+	/*
+	 * It's only valid to pop if we are currently pointed at the head
+	 */
+	if (iter->pos != idx->head || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	idx->head = modulo_inc(idx->head, iter->ioq->count);
+	wmb(); /* head must be visible before full */
+
+	if (idx->full) {
+		idx->full = 0;
+		wmb(); /* full must be visible before sown */
+	}
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_pop);
+
+static struct ioq_ring_idx *idxtype_to_idx(struct ioq *ioq,
+					   enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx;
+
+	switch (type) {
+	case ioq_idxtype_valid:
+	case ioq_idxtype_inuse:
+		idx = &ioq->head_desc->idx[type];
+		break;
+	default:
+		panic("IOQ: illegal index type: %d", type);
+		break;
+	}
+
+	return idx;
+}
+
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags)
+{
+	iter->ioq        = ioq;
+	iter->update     = (flags & IOQ_ITER_AUTOUPDATE);
+	iter->flipowner  = !(flags & IOQ_ITER_NOFLIPOWNER);
+	iter->pos        = -1;
+	iter->desc       = NULL;
+	iter->dualidx    = 0;
+
+	if (type == ioq_idxtype_both) {
+		/*
+		 * "both" is a special case, so we set the dualidx flag.
+		 *
+		 * However, we also just want to use the valid-index
+		 * for normal processing, so override that here
+		 */
+		type = ioq_idxtype_valid;
+		iter->dualidx = 1;
+	}
+
+	iter->idx = idxtype_to_idx(ioq, type);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_init);
+
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type)
+{
+	return ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+}
+EXPORT_SYMBOL_GPL(ioq_count);
+
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type)
+{
+	int count = ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+
+	return ioq->count - count;
+}
+EXPORT_SYMBOL_GPL(ioq_remain);
+
+int ioq_size(struct ioq *ioq)
+{
+	return ioq->count;
+}
+EXPORT_SYMBOL_GPL(ioq_size);
+
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx = idxtype_to_idx(ioq, type);
+
+	return idx_full(idx);
+}
+EXPORT_SYMBOL_GPL(ioq_full);
+
+static void ioq_shm_signal(struct shm_signal_notifier *notifier)
+{
+	struct ioq *ioq = container_of(notifier, struct ioq, shm_notifier);
+
+	wake_up(&ioq->wq);
+	if (ioq->notifier)
+		ioq->notifier->signal(ioq->notifier);
+}
+
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count)
+{
+	memset(ioq, 0, sizeof(*ioq));
+	kref_init(&ioq->kref);
+	init_waitqueue_head(&ioq->wq);
+
+	ioq->ops         = ops;
+	ioq->locale      = locale;
+	ioq->head_desc   = head;
+	ioq->ring        = &head->ring[0];
+	ioq->count       = count;
+	ioq->signal      = signal;
+
+	ioq->shm_notifier.signal = &ioq_shm_signal;
+	signal->notifier         = &ioq->shm_notifier;
+}
+EXPORT_SYMBOL_GPL(ioq_init);


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
  2009-08-14 15:42 ` [PATCH v3 1/6] shm-signal: shared-memory signals Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 2/6] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
@ 2009-08-14 15:43 ` Gregory Haskins
  2009-08-15 10:32   ` Ingo Molnar
  2009-08-14 15:43 ` [PATCH v3 4/6] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:43 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

This will generally be used by hypervisors to publish any host-side
virtual devices up to a guest.  The guest will have the opportunity
to consume any devices present on the vbus-proxy as if they were
platform devices, similar to existing buses like PCI.
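
To aid review, a minimal sketch of a guest-side driver registering
against this bus model.  "my-device-type" is a hypothetical type string;
it must match the type string published by the corresponding host-side
device for the bus to bind them:

#include <linux/module.h>
#include <linux/vbus_driver.h>

static int my_probe(struct vbus_device_proxy *dev)
{
        /* negotiate the device ABI and connect its resources */
        return dev->ops->open(dev, 1 /* version */, 0);
}

static int my_remove(struct vbus_device_proxy *dev)
{
        return dev->ops->close(dev, 0);
}

static struct vbus_driver_ops my_ops = {
        .probe  = my_probe,
        .remove = my_remove,
};

static struct vbus_driver my_driver = {
        .type  = "my-device-type",
        .owner = THIS_MODULE,
        .ops   = &my_ops,
};

static int __init my_init(void)
{
        return vbus_driver_register(&my_driver);
}

static void __exit my_exit(void)
{
        vbus_driver_unregister(&my_driver);
}

module_init(my_init);
module_exit(my_exit);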

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 MAINTAINERS                 |    6 ++
 arch/x86/Kconfig            |    2 +
 drivers/Makefile            |    1 
 drivers/vbus/Kconfig        |   14 ++++
 drivers/vbus/Makefile       |    3 +
 drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vbus_driver.h |   73 +++++++++++++++++++++
 7 files changed, 251 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/Kconfig
 create mode 100644 drivers/vbus/Makefile
 create mode 100644 drivers/vbus/bus-proxy.c
 create mode 100644 include/linux/vbus_driver.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d0ea25c..83624e7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5437,6 +5437,12 @@ S:	Maintained
 F:	Documentation/fb/uvesafb.txt
 F:	drivers/video/uvesafb.*
 
+VBUS
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+F:	include/linux/vbus*
+F:	drivers/vbus/*
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 13ffa5d..12f8fb3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2037,6 +2037,8 @@ source "drivers/pcmcia/Kconfig"
 
 source "drivers/pci/hotplug/Kconfig"
 
+source "drivers/vbus/Kconfig"
+
 endmenu
 
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..d5bedb1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -110,3 +110,4 @@ obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
 obj-y				+= ieee802154/
+obj-y				+= vbus/
diff --git a/drivers/vbus/Kconfig b/drivers/vbus/Kconfig
new file mode 100644
index 0000000..e1939f5
--- /dev/null
+++ b/drivers/vbus/Kconfig
@@ -0,0 +1,14 @@
+#
+# Virtual-Bus (VBus) driver configuration
+#
+
+config VBUS_PROXY
+       tristate "Virtual-Bus support"
+       select SHM_SIGNAL
+       default n
+       help
+       Adds support for virtual-bus model drivers in a guest to connect
+	to host side virtual-bus resources.  If you are using this kernel
+	in a virtualization solution which implements virtual-bus devices
+	on the backend, say Y.  If unsure, say N.
+
diff --git a/drivers/vbus/Makefile b/drivers/vbus/Makefile
new file mode 100644
index 0000000..a29a1e0
--- /dev/null
+++ b/drivers/vbus/Makefile
@@ -0,0 +1,3 @@
+
+vbus-proxy-objs += bus-proxy.o
+obj-$(CONFIG_VBUS_PROXY) += vbus-proxy.o
diff --git a/drivers/vbus/bus-proxy.c b/drivers/vbus/bus-proxy.c
new file mode 100644
index 0000000..3177f9f
--- /dev/null
+++ b/drivers/vbus/bus-proxy.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define VBUS_PROXY_NAME "vbus-proxy"
+
+static struct vbus_device_proxy *to_dev(struct device *_dev)
+{
+	return _dev ? container_of(_dev, struct vbus_device_proxy, dev) : NULL;
+}
+
+static struct vbus_driver *to_drv(struct device_driver *_drv)
+{
+	return container_of(_drv, struct vbus_driver, drv);
+}
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int vbus_dev_proxy_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_drv);
+
+	return !strcmp(dev->type, drv->type);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match.  The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int vbus_dev_proxy_probe(struct device *_dev)
+{
+	int ret = 0;
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_dev->driver);
+
+	if (drv->ops->probe)
+		ret = drv->ops->probe(dev);
+
+	return ret;
+}
+
+static struct bus_type vbus_proxy = {
+	.name   = VBUS_PROXY_NAME,
+	.match  = vbus_dev_proxy_match,
+};
+
+static struct device vbus_proxy_rootdev = {
+	.parent    = NULL,
+	.init_name = VBUS_PROXY_NAME,
+};
+
+static int __init vbus_init(void)
+{
+	int ret;
+
+	ret = bus_register(&vbus_proxy);
+	BUG_ON(ret < 0);
+
+	ret = device_register(&vbus_proxy_rootdev);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+postcore_initcall(vbus_init);
+
+static void device_release(struct device *dev)
+{
+	struct vbus_device_proxy *_dev;
+
+	_dev = container_of(dev, struct vbus_device_proxy, dev);
+
+	_dev->ops->release(_dev);
+}
+
+int vbus_device_proxy_register(struct vbus_device_proxy *new)
+{
+	new->dev.parent  = &vbus_proxy_rootdev;
+	new->dev.bus     = &vbus_proxy;
+	new->dev.release = &device_release;
+
+	return device_register(&new->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_register);
+
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev)
+{
+	device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_unregister);
+
+static int match_device_id(struct device *_dev, void *data)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	u64 id = *(u64 *)data;
+
+	return dev->id == id;
+}
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id)
+{
+	struct device *dev;
+
+	dev = bus_find_device(&vbus_proxy, NULL, &id, &match_device_id);
+
+	return to_dev(dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_find);
+
+int vbus_driver_register(struct vbus_driver *new)
+{
+	new->drv.bus   = &vbus_proxy;
+	new->drv.name  = new->type;
+	new->drv.owner = new->owner;
+	new->drv.probe = vbus_dev_proxy_probe;
+
+	return driver_register(&new->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_register);
+
+void vbus_driver_unregister(struct vbus_driver *drv)
+{
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
+	int (*close)(struct vbus_device_proxy *dev, int flags);
+	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
+		   void *ptr, size_t len,
+		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
+		   int flags);
+	int (*call)(struct vbus_device_proxy *dev, u32 func,
+		    void *data, size_t len, int flags);
+	void (*release)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_device_proxy {
+	char                          *type;
+	u64                            id;
+	void                          *priv; /* Used by drivers */
+	struct vbus_device_proxy_ops  *ops;
+	struct device                  dev;
+};
+
+int vbus_device_proxy_register(struct vbus_device_proxy *dev);
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev);
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id);
+
+struct vbus_driver_ops {
+	int (*probe)(struct vbus_device_proxy *dev);
+	int (*remove)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_driver {
+	char                          *type;
+	struct module                 *owner;
+	struct vbus_driver_ops        *ops;
+	struct device_driver           drv;
+};
+
+int vbus_driver_register(struct vbus_driver *drv);
+void vbus_driver_unregister(struct vbus_driver *drv);
+
+#endif /* _LINUX_VBUS_DRIVER_H */


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v3 4/6] vbus-proxy: add a pci-to-vbus bridge
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-08-14 15:43 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
@ 2009-08-14 15:43 ` Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 5/6] ioq: add driver-side vbus helpers Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 6/6] net: Add vbus_enet driver Gregory Haskins
  5 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:43 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

This patch adds a pci-based bridge driver to interface between a
host VBUS and the guest's vbus-proxy bus model.  It completes the
guest-side notion of a "vbus-connector", and requires a corresponding
host-side connector (in this case, the pci-bridge model) to complete
the connection.
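
To aid review, a hypothetical sketch of what a driver sitting on top of
this connector does with the ->shm() op: it hands the bridge a
guest-allocated region (here, an IOQ ring) plus the embedded
shm_signal_desc, and vbus_pci_device_shm() below is intended to register
it with the host and hand back a ready-to-use shm_signal.  The id/prio
values are placeholders:

#include <linux/slab.h>
#include <linux/ioq.h>
#include <linux/vbus_driver.h>

static int my_attach_ring(struct vbus_device_proxy *dev, size_t count)
{
        struct ioq_ring_head *head;
        struct shm_signal *signal;
        size_t len = IOQ_HEAD_DESC_SIZE(count);
        int ret;

        head = kzalloc(len, GFP_KERNEL);
        if (!head)
                return -ENOMEM;

        head->magic = IOQ_RING_MAGIC;
        head->ver   = IOQ_RING_VER;
        head->count = count;

        /* id=0/prio=0 are placeholder values for this sketch */
        ret = dev->ops->shm(dev, 0, 0, head, len,
                            &head->signal, &signal, 0);
        if (ret < 0) {
                kfree(head);
                return ret;
        }

        /* 'signal' and 'head' can now be handed to ioq_init() */
        return 0;
}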

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/Kconfig      |   10 +
 drivers/vbus/Makefile     |    3 
 drivers/vbus/pci-bridge.c |  877 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild      |    1 
 include/linux/vbus_pci.h  |  145 +++++++
 5 files changed, 1036 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/pci-bridge.c
 create mode 100644 include/linux/vbus_pci.h

diff --git a/drivers/vbus/Kconfig b/drivers/vbus/Kconfig
index e1939f5..87c545d 100644
--- a/drivers/vbus/Kconfig
+++ b/drivers/vbus/Kconfig
@@ -12,3 +12,13 @@ config VBUS_PROXY
 	in a virtualization solution which implements virtual-bus devices
 	on the backend, say Y.  If unsure, say N.
 
+config VBUS_PCIBRIDGE
+       tristate "PCI to Virtual-Bus bridge"
+       depends on PCI
+       depends on VBUS_PROXY
+       select IOQ
+       default n
+       help
+        Provides a way to bridge host side vbus devices via a PCI-BRIDGE
+        object.  If you are running virtualization with vbus devices on the
+	host, and the vbus is exposed via PCI, say Y.  Otherwise, say N.
diff --git a/drivers/vbus/Makefile b/drivers/vbus/Makefile
index a29a1e0..944b7f1 100644
--- a/drivers/vbus/Makefile
+++ b/drivers/vbus/Makefile
@@ -1,3 +1,6 @@
 
 vbus-proxy-objs += bus-proxy.o
 obj-$(CONFIG_VBUS_PROXY) += vbus-proxy.o
+
+vbus-pcibridge-objs += pci-bridge.o
+obj-$(CONFIG_VBUS_PCIBRIDGE) += vbus-pcibridge.o
diff --git a/drivers/vbus/pci-bridge.c b/drivers/vbus/pci-bridge.c
new file mode 100644
index 0000000..f0ed51a
--- /dev/null
+++ b/drivers/vbus/pci-bridge.c
@@ -0,0 +1,877 @@
+/*
+ * Copyright (C) 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/workqueue.h>
+#include <linux/ioq.h>
+#include <linux/interrupt.h>
+#include <linux/vbus_driver.h>
+#include <linux/vbus_pci.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+#define VBUS_PCI_NAME "pci-to-vbus-bridge"
+
+struct vbus_pci {
+	spinlock_t                lock;
+	struct pci_dev           *dev;
+	struct ioq                eventq;
+	struct vbus_pci_event    *ring;
+	struct vbus_pci_regs     *regs;
+	struct vbus_pci_signals  *signals;
+	int                       irq;
+	int                       enabled:1;
+};
+
+static struct vbus_pci vbus_pci;
+
+struct vbus_pci_device {
+	char                     type[VBUS_MAX_DEVTYPE_LEN];
+	u64                      handle;
+	struct list_head         shms;
+	struct vbus_device_proxy vdev;
+	struct work_struct       add;
+	struct work_struct       drop;
+};
+
+DEFINE_PER_CPU(struct vbus_pci_fastcall_desc, vbus_pci_percpu_fastcall)
+____cacheline_aligned;
+
+/*
+ * -------------------
+ * common routines
+ * -------------------
+ */
+
+static int
+vbus_pci_bridgecall(unsigned long nr, void *data, unsigned long len)
+{
+	struct vbus_pci_call_desc params = {
+		.vector = nr,
+		.len    = len,
+		.datap  = __pa(data),
+	};
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&vbus_pci.lock, flags);
+
+	memcpy_toio(&vbus_pci.regs->bridgecall, &params, sizeof(params));
+	ret = ioread32(&vbus_pci.regs->bridgecall);
+
+	spin_unlock_irqrestore(&vbus_pci.lock, flags);
+
+	return ret;
+}
+
+static int
+vbus_pci_buscall(unsigned long nr, void *data, unsigned long len)
+{
+	struct vbus_pci_fastcall_desc *params;
+	int ret;
+
+	preempt_disable();
+
+	params = &get_cpu_var(vbus_pci_percpu_fastcall);
+
+	params->call.vector = nr;
+	params->call.len    = len;
+	params->call.datap  = __pa(data);
+
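+	/*
+	 * Note: the MMIO write below is assumed to trap synchronously to
+	 * the host, which processes this cpu's call descriptor and fills
+	 * in params->result before the write returns to the guest.
+	 */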
+	iowrite32(smp_processor_id(), &vbus_pci.signals->fastcall);
+
+	ret = params->result;
+
+	preempt_enable();
+
+	return ret;
+}
+
+struct vbus_pci_device *
+to_dev(struct vbus_device_proxy *vdev)
+{
+	return container_of(vdev, struct vbus_pci_device, vdev);
+}
+
+static void
+_signal_init(struct shm_signal *signal, struct shm_signal_desc *desc,
+	     struct shm_signal_ops *ops)
+{
+	desc->magic = SHM_SIGNAL_MAGIC;
+	desc->ver   = SHM_SIGNAL_VER;
+
+	shm_signal_init(signal, shm_locality_north, ops, desc);
+}
+
+/*
+ * -------------------
+ * _signal
+ * -------------------
+ */
+
+struct _signal {
+	struct vbus_pci   *pcivbus;
+	struct shm_signal  signal;
+	u32                handle;
+	struct rb_node     node;
+	struct list_head   list;
+};
+
+static struct _signal *
+to_signal(struct shm_signal *signal)
+{
+	return container_of(signal, struct _signal, signal);
+}
+
+static int
+_signal_inject(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	iowrite32(_signal->handle, &vbus_pci.signals->shmsignal);
+
+	return 0;
+}
+
+static void
+_signal_release(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	kfree(_signal);
+}
+
+static struct shm_signal_ops _signal_ops = {
+	.inject  = _signal_inject,
+	.release = _signal_release,
+};
+
+/*
+ * -------------------
+ * vbus_device_proxy routines
+ * -------------------
+ */
+
+static int
+vbus_pci_device_open(struct vbus_device_proxy *vdev, int version, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct vbus_pci_deviceopen params;
+	int ret;
+
+	if (dev->handle)
+		return -EINVAL;
+
+	params.devid   = vdev->id;
+	params.version = version;
+
+	ret = vbus_pci_buscall(VBUS_PCI_HC_DEVOPEN,
+				 &params, sizeof(params));
+	if (ret < 0)
+		return ret;
+
+	dev->handle = params.handle;
+
+	return 0;
+}
+
+static int
+vbus_pci_device_close(struct vbus_device_proxy *vdev, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	spin_lock_irqsave(&vbus_pci.lock, iflags);
+
+	while (!list_empty(&dev->shms)) {
+		struct _signal *_signal;
+
+		_signal = list_first_entry(&dev->shms, struct _signal, list);
+
+		list_del(&_signal->list);
+
+		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+		shm_signal_put(&_signal->signal);
+		spin_lock_irqsave(&vbus_pci.lock, iflags);
+	}
+
+	spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+
+	/*
+	 * The DEVICECLOSE will implicitly close all of the shm on the
+	 * host-side, so there is no need to do an explicit per-shm
+	 * hypercall
+	 */
+	ret = vbus_pci_buscall(VBUS_PCI_HC_DEVCLOSE,
+				 &dev->handle, sizeof(dev->handle));
+
+	if (ret < 0)
+		printk(KERN_ERR "VBUS-PCI: Error closing device %s/%lld: %d\n",
+		       vdev->type, vdev->id, ret);
+
+	dev->handle = 0;
+
+	return 0;
+}
+
+static int
+vbus_pci_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
+		    void *ptr, size_t len,
+		    struct shm_signal_desc *sdesc, struct shm_signal **signal,
+		    int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct _signal *_signal = NULL;
+	struct vbus_pci_deviceshm params;
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	params.devh   = dev->handle;
+	params.id     = id;
+	params.flags  = flags;
+	params.datap  = (u64)__pa(ptr);
+	params.len    = len;
+
+	if (signal) {
+		/*
+		 * The signal descriptor must be embedded within the
+		 * provided ptr
+		 */
+		if (!sdesc
+		    || (len < sizeof(*sdesc))
+		    || ((void *)sdesc < ptr)
+		    || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
+			return -EINVAL;
+
+		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+		if (!_signal)
+			return -ENOMEM;
+
+		_signal_init(&_signal->signal, sdesc, &_signal_ops);
+
+		/*
+		 * take another reference for the host.  This is dropped
+		 * by a SHMCLOSE event
+		 */
+		shm_signal_get(&_signal->signal);
+
+		params.signal.offset = (u64)sdesc - (u64)ptr;
+		params.signal.prio   = prio;
+		params.signal.cookie = (u64)_signal;
+
+	} else
+		params.signal.offset = -1; /* yes, this is a u32, but it's ok */
+
+	ret = vbus_pci_buscall(VBUS_PCI_HC_DEVSHM,
+				 &params, sizeof(params));
+	if (ret < 0) {
+		if (_signal) {
+			/*
+			 * We held two references above, so we need to drop
+			 * both of them
+			 */
+			shm_signal_put(&_signal->signal);
+			shm_signal_put(&_signal->signal);
+		}
+
+		return ret;
+	}
+
+	if (signal) {
+		BUG_ON(ret < 0);
+
+		_signal->handle = ret;
+
+		spin_lock_irqsave(&vbus_pci.lock, iflags);
+
+		list_add_tail(&_signal->list, &dev->shms);
+
+		spin_unlock_irqrestore(&vbus_pci.lock, iflags);
+
+		shm_signal_get(&_signal->signal);
+		*signal = &_signal->signal;
+	}
+
+	return 0;
+}
+
+static int
+vbus_pci_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
+		     size_t len, int flags)
+{
+	struct vbus_pci_device *dev = to_dev(vdev);
+	struct vbus_pci_devicecall params = {
+		.devh  = dev->handle,
+		.func  = func,
+		.datap = (u64)__pa(data),
+		.len   = len,
+		.flags = flags,
+	};
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	return vbus_pci_buscall(VBUS_PCI_HC_DEVCALL, &params, sizeof(params));
+}
+
+static void
+vbus_pci_device_release(struct vbus_device_proxy *vdev)
+{
+	struct vbus_pci_device *_dev = to_dev(vdev);
+
+	vbus_pci_device_close(vdev, 0);
+
+	kfree(_dev);
+}
+
+struct vbus_device_proxy_ops vbus_pci_device_ops = {
+	.open    = vbus_pci_device_open,
+	.close   = vbus_pci_device_close,
+	.shm     = vbus_pci_device_shm,
+	.call    = vbus_pci_device_call,
+	.release = vbus_pci_device_release,
+};
+
+/*
+ * -------------------
+ * vbus events
+ * -------------------
+ */
+
+static void
+deferred_devadd(struct work_struct *work)
+{
+	struct vbus_pci_device *new;
+	int ret;
+
+	new = container_of(work, struct vbus_pci_device, add);
+
+	ret = vbus_device_proxy_register(&new->vdev);
+	if (ret < 0)
+		panic("failed to register device %lld(%s): %d\n",
+		      new->vdev.id, new->type, ret);
+}
+
+static void
+deferred_devdrop(struct work_struct *work)
+{
+	struct vbus_pci_device *dev;
+
+	dev = container_of(work, struct vbus_pci_device, drop);
+	vbus_device_proxy_unregister(&dev->vdev);
+}
+
+static void
+event_devadd(struct vbus_pci_add_event *event)
+{
+	struct vbus_pci_device *new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_ERR "VBUS_PCI: Out of memory on add_event\n");
+		return;
+	}
+
+	INIT_LIST_HEAD(&new->shms);
+
+	memcpy(new->type, event->type, VBUS_MAX_DEVTYPE_LEN);
+	new->vdev.type        = new->type;
+	new->vdev.id          = event->id;
+	new->vdev.ops         = &vbus_pci_device_ops;
+
+	dev_set_name(&new->vdev.dev, "%lld", event->id);
+
+	INIT_WORK(&new->add, deferred_devadd);
+	INIT_WORK(&new->drop, deferred_devdrop);
+
+	schedule_work(&new->add);
+}
+
+static void
+event_devdrop(struct vbus_pci_handle_event *event)
+{
+	struct vbus_device_proxy *dev = vbus_device_proxy_find(event->handle);
+
+	if (!dev) {
+		printk(KERN_WARNING "VBUS-PCI: devdrop failed: %lld\n",
+		       event->handle);
+		return;
+	}
+
+	schedule_work(&to_dev(dev)->drop);
+}
+
+static void
+event_shmsignal(struct vbus_pci_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)event->handle;
+
+	_shm_signal_wakeup(&_signal->signal);
+}
+
+static void
+event_shmclose(struct vbus_pci_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)event->handle;
+
+	/*
+	 * This reference was taken during the DEVICESHM call
+	 */
+	shm_signal_put(&_signal->signal);
+}
+
+/*
+ * -------------------
+ * eventq routines
+ * -------------------
+ */
+
+static struct ioq_notifier eventq_notifier;
+
+static int __init
+eventq_init(int qlen)
+{
+	struct ioq_iterator iter;
+	int ret;
+	int i;
+
+	vbus_pci.ring = kzalloc(sizeof(struct vbus_pci_event) * qlen,
+				GFP_KERNEL);
+	if (!vbus_pci.ring)
+		return -ENOMEM;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty vbus_event and mark it
+	 * valid
+	 */
+	for (i = 0; i < qlen; i++) {
+		struct vbus_pci_event *event = &vbus_pci.ring[i];
+		size_t                 len   = sizeof(*event);
+		struct ioq_ring_desc  *desc  = iter.desc;
+
+		BUG_ON(iter.desc->valid);
+
+		desc->cookie = (u64)event;
+		desc->ptr    = (u64)__pa(event);
+		desc->len    = len; /* total length  */
+		desc->valid  = 1;
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-tail index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	vbus_pci.eventq.notifier = &eventq_notifier;
+
+	/*
+	 * And finally, ensure that we can receive notification
+	 */
+	ioq_notify_enable(&vbus_pci.eventq, 0);
+
+	return 0;
+}
+
+/* Invoked whenever the hypervisor ioq_signal()s our eventq */
+static void
+eventq_wakeup(struct ioq_notifier *notifier)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(&vbus_pci.eventq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side.
+	 *
+	 * FIXME: This in theory could run indefinitely if the host keeps
+	 * feeding us events since there is nothing like a NAPI budget.  We
+	 * might need to address that
+	 */
+	while (!iter.desc->sown) {
+		struct ioq_ring_desc *desc  = iter.desc;
+		struct vbus_pci_event *event;
+
+		event = (struct vbus_pci_event *)desc->cookie;
+
+		switch (event->eventid) {
+		case VBUS_PCI_EVENT_DEVADD:
+			event_devadd(&event->data.add);
+			break;
+		case VBUS_PCI_EVENT_DEVDROP:
+			event_devdrop(&event->data.handle);
+			break;
+		case VBUS_PCI_EVENT_SHMSIGNAL:
+			event_shmsignal(&event->data.handle);
+			break;
+		case VBUS_PCI_EVENT_SHMCLOSE:
+			event_shmclose(&event->data.handle);
+			break;
+		default:
+			printk(KERN_WARNING "VBUS_PCI: Unexpected event %d\n",
+			       event->eventid);
+			break;
+		}
+
+		memset(event, 0, sizeof(*event));
+
+		/* Advance the in-use head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/* And let the south side know that we changed the queue */
+	ioq_signal(&vbus_pci.eventq, 0);
+}
+
+static struct ioq_notifier eventq_notifier = {
+	.signal = &eventq_wakeup,
+};
+
+/* Injected whenever the host issues an ioq_signal() on the eventq */
+irqreturn_t
+eventq_intr(int irq, void *dev)
+{
+	_shm_signal_wakeup(vbus_pci.eventq.signal);
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * -------------------
+ */
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+	/* The eventq uses the special-case handle=0 */
+	iowrite32(0, &vbus_pci.signals->eventq);
+
+	return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+	kfree(signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+	.inject  = eventq_signal_inject,
+	.release = eventq_signal_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+eventq_ioq_release(struct ioq *ioq)
+{
+	/* released as part of the vbus_pci object */
+}
+
+static struct ioq_ops eventq_ioq_ops = {
+	.release = eventq_ioq_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+vbus_pci_release(void)
+{
+	if (vbus_pci.irq > 0)
+		free_irq(vbus_pci.irq, NULL);
+
+	if (vbus_pci.signals)
+		pci_iounmap(vbus_pci.dev, (void *)vbus_pci.signals);
+
+	if (vbus_pci.regs)
+		pci_iounmap(vbus_pci.dev, (void *)vbus_pci.regs);
+
+	pci_release_regions(vbus_pci.dev);
+	pci_disable_device(vbus_pci.dev);
+
+	kfree(vbus_pci.eventq.head_desc);
+	kfree(vbus_pci.ring);
+
+	vbus_pci.enabled = false;
+}
+
+static int __init
+vbus_pci_open(void)
+{
+	struct vbus_pci_bridge_negotiate params = {
+		.magic        = VBUS_PCI_ABI_MAGIC,
+		.version      = VBUS_PCI_HC_VERSION,
+		.capabilities = 0,
+	};
+
+	return vbus_pci_bridgecall(VBUS_PCI_BRIDGE_NEGOTIATE,
+				  &params, sizeof(params));
+}
+
+#define QLEN 1024
+
+static int __init
+vbus_pci_eventq_register(void)
+{
+	struct vbus_pci_busreg params = {
+		.count = 1,
+		.eventq = {
+			{
+				.count = QLEN,
+				.ring  = (u64)__pa(vbus_pci.eventq.head_desc),
+				.data  = (u64)__pa(vbus_pci.ring),
+			},
+		},
+	};
+
+	return vbus_pci_bridgecall(VBUS_PCI_BRIDGE_QREG,
+				   &params, sizeof(params));
+}
+
+static int __init
+_ioq_init(size_t ringsize, struct ioq *ioq, struct ioq_ops *ops)
+{
+	struct shm_signal    *signal = NULL;
+	struct ioq_ring_head *head = NULL;
+	size_t                len  = IOQ_HEAD_DESC_SIZE(ringsize);
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		return -ENOMEM;
+
+	signal = kzalloc(sizeof(*signal), GFP_KERNEL);
+	if (!signal) {
+		kfree(head);
+		return -ENOMEM;
+	}
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = ringsize;
+
+	_signal_init(signal, &head->signal, &eventq_signal_ops);
+
+	ioq_init(ioq, ops, ioq_locality_north, head, signal, ringsize);
+
+	return 0;
+}
+
+static int __devinit
+vbus_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	int ret;
+	int cpu;
+
+	if (vbus_pci.enabled)
+		return -EEXIST; /* we only support one bridge per kernel */
+
+	if (pdev->revision != VBUS_PCI_ABI_VERSION) {
+		printk(KERN_DEBUG "VBUS_PCI: expected ABI version %d, got %d\n",
+		       VBUS_PCI_ABI_VERSION,
+		       pdev->revision);
+		return -ENODEV;
+	}
+
+	vbus_pci.dev = pdev;
+
+	ret = pci_enable_device(pdev);
+	if (ret < 0)
+		return ret;
+
+	ret = pci_request_regions(pdev, VBUS_PCI_NAME);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not init BARs: %d\n", ret);
+		goto out_fail;
+	}
+
+	vbus_pci.regs = pci_iomap(pdev, 0, sizeof(struct vbus_pci_regs));
+	if (!vbus_pci.regs) {
+		printk(KERN_ERR "VBUS_PCI: Could not map BAR 0\n");
+		goto out_fail;
+	}
+
+	vbus_pci.signals = pci_iomap(pdev, 1, sizeof(struct vbus_pci_signals));
+	if (!vbus_pci.signals) {
+		printk(KERN_ERR "VBUS_PCI: Could not map BAR 1\n");
+		goto out_fail;
+	}
+
+	ret = vbus_pci_open();
+	if (ret < 0) {
+		printk(KERN_DEBUG "VBUS_PCI: Could not register with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	/*
+	 * Allocate an IOQ to use for host-2-guest event notification
+	 */
+	ret = _ioq_init(QLEN, &vbus_pci.eventq, &eventq_ioq_ops);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not init eventq: %d\n", ret);
+		goto out_fail;
+	}
+
+	ret = eventq_init(QLEN);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not set up ring: %d\n", ret);
+		goto out_fail;
+	}
+
+	ret = pci_enable_msi(pdev);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not enable MSI: %d\n", ret);
+		goto out_fail;
+	}
+
+	vbus_pci.irq = pdev->irq;
+
+	ret = request_irq(pdev->irq, eventq_intr, 0, "vbus", NULL);
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Failed to register IRQ %d: %d\n",
+		       pdev->irq, ret);
+		goto out_fail;
+	}
+
+	/*
+	 * Add one fastcall vector per cpu so that we can do lockless
+	 * hypercalls
+	 */
+	for_each_possible_cpu(cpu) {
+		struct vbus_pci_fastcall_desc *desc =
+			&per_cpu(vbus_pci_percpu_fastcall, cpu);
+		struct vbus_pci_call_desc params = {
+			.vector = cpu,
+			.len    = sizeof(*desc),
+			.datap  = __pa(desc),
+		};
+
+		ret = vbus_pci_bridgecall(VBUS_PCI_BRIDGE_FASTCALL_ADD,
+					  &params, sizeof(params));
+		if (ret < 0) {
+			printk(KERN_ERR
+			       "VBUS_PCI: Failed to register cpu %d: %d\n",
+			       cpu, ret);
+			goto out_fail;
+		}
+	}
+
+	/*
+	 * Finally register our queue on the host to start receiving events
+	 */
+	ret = vbus_pci_eventq_register();
+	if (ret < 0) {
+		printk(KERN_ERR "VBUS_PCI: Could not register with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	vbus_pci.enabled = true;
+
+	printk(KERN_INFO "Virtual-Bus: Copyright (c) 2009, " \
+	       "Gregory Haskins <ghaskins@novell.com>\n");
+
+	return 0;
+
+ out_fail:
+	vbus_pci_release();
+
+	return ret;
+}
+
+static void __devexit
+vbus_pci_remove(struct pci_dev *pdev)
+{
+	vbus_pci_release();
+}
+
+static DEFINE_PCI_DEVICE_TABLE(vbus_pci_tbl) = {
+	{ PCI_DEVICE(0x11da, 0x2000) },
+	{ 0 },
+};
+
+MODULE_DEVICE_TABLE(pci, vbus_pci_tbl);
+
+static struct pci_driver vbus_pci_driver = {
+	.name     = VBUS_PCI_NAME,
+	.id_table = vbus_pci_tbl,
+	.probe    = vbus_pci_probe,
+	.remove   = vbus_pci_remove,
+};
+
+int __init
+vbus_pci_init(void)
+{
+	memset(&vbus_pci, 0, sizeof(vbus_pci));
+	spin_lock_init(&vbus_pci.lock);
+
+	return pci_register_driver(&vbus_pci_driver);
+}
+
+static void __exit
+vbus_pci_exit(void)
+{
+	pci_unregister_driver(&vbus_pci_driver);
+}
+
+module_init(vbus_pci_init);
+module_exit(vbus_pci_exit);
+
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 32b3eb8..fa15bbf 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -358,6 +358,7 @@ unifdef-y += uio.h
 unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
+unifdef-y += vbus_pci.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/vbus_pci.h b/include/linux/vbus_pci.h
new file mode 100644
index 0000000..fe33759
--- /dev/null
+++ b/include/linux/vbus_pci.h
@@ -0,0 +1,145 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * PCI to Virtual-Bus Bridge
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_PCI_H
+#define _LINUX_VBUS_PCI_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define VBUS_PCI_ABI_MAGIC 0xbf53eef5
+#define VBUS_PCI_ABI_VERSION 2
+#define VBUS_PCI_HC_VERSION 1
+
+enum {
+	VBUS_PCI_BRIDGE_NEGOTIATE,
+	VBUS_PCI_BRIDGE_QREG,
+	VBUS_PCI_BRIDGE_SLOWCALL,
+	VBUS_PCI_BRIDGE_FASTCALL_ADD,
+	VBUS_PCI_BRIDGE_FASTCALL_DROP,
+
+	VBUS_PCI_BRIDGE_MAX, /* must be last */
+};
+
+enum {
+	VBUS_PCI_HC_DEVOPEN,
+	VBUS_PCI_HC_DEVCLOSE,
+	VBUS_PCI_HC_DEVCALL,
+	VBUS_PCI_HC_DEVSHM,
+
+	VBUS_PCI_HC_MAX,      /* must be last */
+};
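+
+/*
+ * The BRIDGE_* vectors above are issued through the lock-protected
+ * bridgecall register, while the HC_* vectors are issued through the
+ * per-cpu FASTCALL descriptors, which avoid taking the bridge lock
+ * (see drivers/vbus/pci-bridge.c).
+ */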
+
+struct vbus_pci_bridge_negotiate {
+	__u32 magic;
+	__u32 version;
+	__u64 capabilities;
+};
+
+struct vbus_pci_deviceopen {
+	__u32 devid;
+	__u32 version; /* device ABI version */
+	__u64 handle; /* return value for devh */
+};
+
+struct vbus_pci_devicecall {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 func;
+	__u32 len;
+	__u32 flags;
+	__u64 datap;
+};
+
+struct vbus_pci_deviceshm {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 id;
+	__u32 len;
+	__u32 flags;
+	struct {
+		__u32 offset;
+		__u32 prio;
+		__u64 cookie; /* token to pass back when signaling client */
+	} signal;
+	__u64 datap;
+};
+
+struct vbus_pci_call_desc {
+	__u32 vector;
+	__u32 len;
+	__u64 datap;
+};
+
+struct vbus_pci_fastcall_desc {
+	struct vbus_pci_call_desc call;
+	__u32                     result;
+};
+
+struct vbus_pci_regs {
+	struct vbus_pci_call_desc bridgecall;
+	__u8                      pad[48];
+};
+
+struct vbus_pci_signals {
+	__u32 eventq;
+	__u32 fastcall;
+	__u32 shmsignal;
+	__u8  pad[20];
+};
+
+struct vbus_pci_eventqreg {
+	__u32 count;
+	__u64 ring;
+	__u64 data;
+};
+
+struct vbus_pci_busreg {
+	__u32 count;  /* supporting multiple queues allows for prio, etc */
+	struct vbus_pci_eventqreg eventq[1];
+};
+
+enum vbus_pci_eventid {
+	VBUS_PCI_EVENT_DEVADD,
+	VBUS_PCI_EVENT_DEVDROP,
+	VBUS_PCI_EVENT_SHMSIGNAL,
+	VBUS_PCI_EVENT_SHMCLOSE,
+};
+
+#define VBUS_MAX_DEVTYPE_LEN 128
+
+struct vbus_pci_add_event {
+	__u64 id;
+	char  type[VBUS_MAX_DEVTYPE_LEN];
+};
+
+struct vbus_pci_handle_event {
+	__u64 handle;
+};
+
+struct vbus_pci_event {
+	__u32 eventid;
+	union {
+		struct vbus_pci_add_event    add;
+		struct vbus_pci_handle_event handle;
+	} data;
+};
+
+#endif /* _LINUX_VBUS_PCI_H */


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v3 5/6] ioq: add driver-side vbus helpers
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
                   ` (3 preceding siblings ...)
  2009-08-14 15:43 ` [PATCH v3 4/6] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
@ 2009-08-14 15:43 ` Gregory Haskins
  2009-08-14 15:43 ` [PATCH v3 6/6] net: Add vbus_enet driver Gregory Haskins
  5 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:43 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

It will be a common pattern to map an IOQ over the VBUS shared-memory
interfaces.  Therefore, we provide a helper function to generalize
the allocation and registration of an IOQ to make this use case
simple and easy.
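
As an illustration only, here is a minimal sketch of how a guest-side
driver might use the helper (the function name, queue-id, and ring
length below are made up; the pattern mirrors what queue_init() in the
vbus-enet driver of patch 6/6 does):

	static int mydrv_queue_init(struct vbus_device_proxy *vdev,
				    struct ioq_notifier *notifier,
				    struct ioq **ioq)
	{
		int ret;

		/* shm-id 0, priority 0, 256-descriptor ring */
		ret = vbus_driver_ioq_alloc(vdev, 0, 0, 256, ioq);
		if (ret < 0)
			return ret;

		/* hook up a notifier and allow the host to signal us */
		(*ioq)->notifier = notifier;
		ioq_notify_enable(*ioq, 0);

		return 0;
	}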

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/bus-proxy.c    |   64 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vbus_driver.h |    7 +++++
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/drivers/vbus/bus-proxy.c b/drivers/vbus/bus-proxy.c
index 3177f9f..88cd904 100644
--- a/drivers/vbus/bus-proxy.c
+++ b/drivers/vbus/bus-proxy.c
@@ -150,3 +150,67 @@ void vbus_driver_unregister(struct vbus_driver *drv)
 }
 EXPORT_SYMBOL_GPL(vbus_driver_unregister);
 
+/*
+ *---------------------------------
+ * driver-side IOQ helper
+ *---------------------------------
+ */
+static void
+vbus_driver_ioq_release(struct ioq *ioq)
+{
+	kfree(ioq->head_desc);
+	kfree(ioq);
+}
+
+static struct ioq_ops vbus_driver_ioq_ops = {
+	.release = vbus_driver_ioq_release,
+};
+
+
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t count, struct ioq **ioq)
+{
+	struct ioq           *_ioq;
+	struct ioq_ring_head *head = NULL;
+	struct shm_signal    *signal = NULL;
+	size_t                len = IOQ_HEAD_DESC_SIZE(count);
+	int                   ret = -ENOMEM;
+
+	_ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+	if (!_ioq)
+		goto error;
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		goto error;
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = count;
+
+	ret = dev->ops->shm(dev, id, prio, head, len,
+			    &head->signal, &signal, 0);
+	if (ret < 0)
+		goto error;
+
+	ioq_init(_ioq,
+		 &vbus_driver_ioq_ops,
+		 ioq_locality_north,
+		 head,
+		 signal,
+		 count);
+
+	*ioq = _ioq;
+
+	return 0;
+
+ error:
+	kfree(_ioq);
+	kfree(head);
+
+	if (signal)
+		shm_signal_put(signal);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_driver_ioq_alloc);
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
index c53e13f..9cfbf60 100644
--- a/include/linux/vbus_driver.h
+++ b/include/linux/vbus_driver.h
@@ -26,6 +26,7 @@
 
 #include <linux/device.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 
 struct vbus_device_proxy;
 struct vbus_driver;
@@ -70,4 +71,10 @@ struct vbus_driver {
 int vbus_driver_register(struct vbus_driver *drv);
 void vbus_driver_unregister(struct vbus_driver *drv);
 
+/*
+ * driver-side IOQ helper - allocates device-shm and maps an IOQ on it
+ */
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t ringsize, struct ioq **ioq);
+
 #endif /* _LINUX_VBUS_DRIVER_H */


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v3 6/6] net: Add vbus_enet driver
  2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
                   ` (4 preceding siblings ...)
  2009-08-14 15:43 ` [PATCH v3 5/6] ioq: add driver-side vbus helpers Gregory Haskins
@ 2009-08-14 15:43 ` Gregory Haskins
  5 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-14 15:43 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: linux-kernel, netdev

A virtualized 802.x network device based on the VBUS interface. It can be
used with any hypervisor/kernel that supports the virtual-ethernet/vbus
protocol.
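
For reference only, a hypothetical .config fragment that enables the
whole guest-side stack from this series (assuming the PCI bridge
transport from patch 4/6; the built-in vs. module choice is arbitrary):

	CONFIG_VBUS_PROXY=y
	CONFIG_VBUS_PCIBRIDGE=y
	CONFIG_VBUS_ENET=m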

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: David S. Miller <davem@davemloft.net>
---

 MAINTAINERS             |    7 
 drivers/net/Kconfig     |   14 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c |  895 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/Kbuild    |    1 
 include/linux/venet.h   |   84 ++++
 6 files changed, 1002 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 include/linux/venet.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 83624e7..cf8ae03 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5443,6 +5443,13 @@ S:	Maintained
 F:	include/linux/vbus*
 F:	drivers/vbus/*
 
+VBUS ETHERNET DRIVER
+M:	Gregory Haskins <ghaskins@novell.com>
+S:	Maintained
+W:	http://developer.novell.com/wiki/index.php/AlacrityVM
+F:	include/linux/venet.h
+F:	drivers/net/vbus-enet.c
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..974213e 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3209,4 +3209,18 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "VBUS Ethernet Driver"
+	default n
+	select VBUS_PROXY
+	help
+	   A virtualized 802.x network device based on the VBUS
+	   "virtual-ethernet" interface.  It can be used with any
+	   hypervisor/kernel that supports the vbus+venet protocol.
+
+config VBUS_ENET_DEBUG
+	bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..2a3c7a9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -277,6 +277,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..91c47a9
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,895 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("virtual-ethernet");
+MODULE_VERSION("1");
+
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
+
+#define PDEBUG(_dev, fmt, args...) dev_dbg(&(_dev)->dev, fmt, ## args)
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct vbus_enet_queue     rxq;
+	struct vbus_enet_queue     txq;
+	struct tasklet_struct      txtask;
+	bool                       sg;
+};
+
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct net_device *dev, struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = netdev_alloc_skb(dev, len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, NET_IP_ALIGN); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0); /* will never fail unless seriously broken */
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-head index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < tx_ringlen; i++) {
+		struct venet_sg *vsg;
+		size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+		size_t len = sizeof(*vsg) + iovlen;
+
+		vsg = kzalloc(len, GFP_KERNEL);
+		if (!vsg)
+			return -ENOMEM;
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+		iter.desc->ptr    = (u64)__pa(vsg);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	vbus_enet_tx_reap(priv, 1);
+
+	if (!priv->sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		dev_warn(&dev->dev, "Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__napi_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as 0 length
+	 * packets which we can simply discard and replace with new_mtu
+	 * buffers for the future.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG(priv->dev, "polling...\n");
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->dev->stats.rx_packets++;
+			priv->dev->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when it's
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(priv->dev, iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG(priv->dev, "%d packets received\n", npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		napi_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "sending %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		dev_err(&priv->dev->dev, "tx on full queue bug\n");
+		return 1;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	if (priv->sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_transport_header(skb) - skb->data;
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type " \
+				      "0x%x\n", sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
+	iter.desc->valid  = 1;
+
+	priv->dev->stats.tx_packets++;
+	priv->dev->stats.tx_bytes += skb->len;
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG(priv->dev, "backpressure tx queue\n");
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && (!iter.desc->sown || force)) {
+		struct sk_buff *skb;
+
+		if (priv->sg) {
+			struct venet_sg *vsg;
+
+			vsg = (struct venet_sg *)iter.desc->cookie;
+			skb = (struct sk_buff *)vsg->cookie;
+
+		} else {
+			skb = (struct sk_buff *)iter.desc->cookie;
+		}
+
+		PDEBUG(priv->dev, "completed sending %d bytes\n", skb->len);
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		dev_kfree_skb(skb);
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		PDEBUG(priv->dev, "re-enabling tx queue\n");
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	dev_dbg(&dev->dev, "Transmit timeout\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG(priv->dev, "deferred_tx_isr\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->txq.queue, 0);
+}
+
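+/*
+ * tx interrupt: mask further tx notifications and defer the reaping of
+ * completed descriptors to the tasklet above, which re-enables
+ * notification when it is done.
+ */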
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
+
+	PDEBUG(priv->dev, "tx_isr\n");
+
+	ioq_notify_disable(priv->txq.queue, 0);
+	tasklet_schedule(&priv->txtask);
+}
+
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	struct net_device *dev = priv->dev;
+	struct venet_capabilities caps;
+	int ret;
+
+	memset(&caps, 0, sizeof(caps));
+
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN);
+		/* note: exclude UFO for now due to stack bug */
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->sg = true;
+
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			dev->features |= NETIF_F_TSO;
+		if (caps.bits & VENET_CAP_UFO)
+			dev->features |= NETIF_F_UFO;
+		if (caps.bits & VENET_CAP_TSO6)
+			dev->features |= NETIF_F_TSO6;
+		if (caps.bits & VENET_CAP_ECN)
+			dev->features |= NETIF_F_TSO_ECN;
+	}
+
+	return 0;
+}
+
+static int vbus_enet_set_tx_csum(struct net_device *dev, u32 data)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+
+	if (data && !priv->sg)
+		return -ENOSYS;
+
+	return ethtool_op_set_tx_hw_csum(dev, data);
+}
+
+static struct ethtool_ops vbus_enet_ethtool_ops = {
+	.set_tx_csum = vbus_enet_set_tx_csum,
+	.set_sg      = ethtool_op_set_sg,
+	.set_tso     = ethtool_op_set_tso,
+	.get_link    = ethtool_op_get_link,
+};
+
+static const struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open            = vbus_enet_open,
+	.ndo_stop            = vbus_enet_stop,
+	.ndo_set_config      = vbus_enet_config,
+	.ndo_start_xmit      = vbus_enet_tx_start,
+	.ndo_change_mtu	     = vbus_enet_change_mtu,
+	.ndo_tx_timeout      = vbus_enet_timeout,
+	.ndo_set_mac_address = eth_mac_addr,
+	.ndo_validate_addr   = eth_validate_addr,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	ret = vbus_enet_negcap(priv);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+	tx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
+	ioq_notify_enable(priv->txq.queue, 0);
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+	SET_ETHTOOL_OPS(dev, &vbus_enet_ethtool_ops);
+	SET_NETDEV_DEV(dev, &vdev->dev);
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, 128);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
+	ioq_put(priv->txq.queue);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index fa15bbf..911f7ef 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -359,6 +359,7 @@ unifdef-y += unistd.h
 unifdef-y += usbdevice_fs.h
 unifdef-y += utsname.h
 unifdef-y += vbus_pci.h
+unifdef-y += venet.h
 unifdef-y += videodev2.h
 unifdef-y += videodev.h
 unifdef-y += virtio_config.h
diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..47ed37d
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,84 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#include <linux/types.h>
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+#define VENET_CAP_GROUP_SG 0
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
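+/*
+ * Note: a venet_sg is allocated with additional venet_iov entries
+ * appended beyond the single iov[] element declared below, sized for
+ * the maximum number of fragments the sender expects (see tx_setup()
+ * in drivers/net/vbus-enet.c).
+ */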
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
+	struct venet_iov iov[1];
+};
+
+#define VENET_FUNC_LINKUP   0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP   3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX  4
+
+#endif /* _LINUX_VENET_H */


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-14 15:43 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
@ 2009-08-15 10:32   ` Ingo Molnar
  2009-08-15 19:15     ` Anthony Liguori
                       ` (2 more replies)
  0 siblings, 3 replies; 132+ messages in thread
From: Ingo Molnar @ 2009-08-15 10:32 UTC (permalink / raw)
  To: Gregory Haskins, kvm, Avi Kivity; +Cc: alacrityvm-devel, linux-kernel, netdev


* Gregory Haskins <ghaskins@novell.com> wrote:

> This will generally be used for hypervisors to publish any host-side
> virtual devices up to a guest.  The guest will have the opportunity
> to consume any devices present on the vbus-proxy as if they were
> platform devices, similar to existing buses like PCI.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  MAINTAINERS                 |    6 ++
>  arch/x86/Kconfig            |    2 +
>  drivers/Makefile            |    1 
>  drivers/vbus/Kconfig        |   14 ++++
>  drivers/vbus/Makefile       |    3 +
>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>  7 files changed, 251 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vbus/Kconfig
>  create mode 100644 drivers/vbus/Makefile
>  create mode 100644 drivers/vbus/bus-proxy.c
>  create mode 100644 include/linux/vbus_driver.h

Is there a consensus on this with the KVM folks? (i've added the KVM 
list to the Cc:)

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-15 10:32   ` Ingo Molnar
@ 2009-08-15 19:15     ` Anthony Liguori
  2009-08-16  7:16       ` Ingo Molnar
  2009-08-17 14:14       ` Gregory Haskins
  2009-08-16  8:30     ` Avi Kivity
  2009-08-17 13:02     ` Gregory Haskins
  2 siblings, 2 replies; 132+ messages in thread
From: Anthony Liguori @ 2009-08-15 19:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
>
>   
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>>  MAINTAINERS                 |    6 ++
>>  arch/x86/Kconfig            |    2 +
>>  drivers/Makefile            |    1 
>>  drivers/vbus/Kconfig        |   14 ++++
>>  drivers/vbus/Makefile       |    3 +
>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vbus/Kconfig
>>  create mode 100644 drivers/vbus/Makefile
>>  create mode 100644 drivers/vbus/bus-proxy.c
>>  create mode 100644 include/linux/vbus_driver.h
>>     
>
> Is there a consensus on this with the KVM folks? (i've added the KVM 
> list to the Cc:)
>   

I'll let Avi comment about it from a KVM perspective but from a QEMU 
perspective, I don't think we want to support two paravirtual IO 
frameworks.  I'd like to see them converge.  Since there's an install 
base of guests today with virtio drivers, there really ought to be a 
compelling reason to change the virtio ABI in a non-backwards compatible 
way.  This means convergence really ought to be adding features to virtio.

On paper, I don't think vbus really has any features over virtio.  vbus 
does things in different ways (paravirtual bus vs. pci for discovery) 
but I think we're happy with how virtio does things today.

I think the reason vbus gets better performance for networking today is 
that vbus' backends are in the kernel while virtio's backends are 
currently in userspace.  Since Michael has a functioning in-kernel 
backend for virtio-net now, I suspect we're weeks (maybe days) away from 
performance results.  My expectation is that vhost + virtio-net will be 
as good as venet + vbus.  If that's the case, then I don't see any 
reason to adopt vbus unless Greg thinks there are other compelling 
features over virtio.

Regards,

Anthony Liguori

> 	Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-15 19:15     ` Anthony Liguori
@ 2009-08-16  7:16       ` Ingo Molnar
  2009-08-17 13:54         ` Anthony Liguori
  2009-08-17 14:14       ` Gregory Haskins
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-16  7:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin


* Anthony Liguori <anthony@codemonkey.ws> wrote:

> Ingo Molnar wrote:
>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>
>>   
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>> ---
>>>
>>>  MAINTAINERS                 |    6 ++
>>>  arch/x86/Kconfig            |    2 +
>>>  drivers/Makefile            |    1 
>>>  drivers/vbus/Kconfig        |   14 ++++
>>>  drivers/vbus/Makefile       |    3 +
>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/vbus/Kconfig
>>>  create mode 100644 drivers/vbus/Makefile
>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>  create mode 100644 include/linux/vbus_driver.h
>>>     
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM  
>> list to the Cc:)
>   
> I'll let Avi comment about it from a KVM perspective but from a 
> QEMU perspective, I don't think we want to support two paravirtual 
> IO frameworks.  I'd like to see them converge.  Since there's an 
> install base of guests today with virtio drivers, there really 
> ought to be a compelling reason to change the virtio ABI in a 
> non-backwards compatible way.  This means convergence really ought 
> to be adding features to virtio.

I agree.

While different paravirt drivers are inevitable for things that are 
externally constrained (say, to support different hypervisors), doing 
different _Linux internal_ paravirt drivers looks plain stupid and 
counter-productive. It splits testing and development.

So either the vbus code replaces virtio (for technical merits such 
as performance and other details), or virtio is enhanced with the 
vbus performance enhancements.

> On paper, I don't think vbus really has any features over virtio.  
> vbus does things in different ways (paravirtual bus vs. pci for 
> discovery) but I think we're happy with how virtio does things 
> today.
>
> I think the reason vbus gets better performance for networking 
> today is that vbus' backends are in the kernel while virtio's 
> backends are currently in userspace.  Since Michael has a 
> functioning in-kernel backend for virtio-net now, I suspect we're 
> weeks (maybe days) away from performance results.  My expectation 
> is that vhost + virtio-net will be as good as venet + vbus.  If 
> that's the case, then I don't see any reason to adopt vbus unless 
> Greg thinks there are other compelling features over virtio.

Keeping virtio's backend in user-space was rather stupid IMHO. 

Having the _option_ to piggyback to user-space (for flexibility, 
extensibility, etc.) is OK, but not having kernel acceleration is 
bad.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-15 10:32   ` Ingo Molnar
  2009-08-15 19:15     ` Anthony Liguori
@ 2009-08-16  8:30     ` Avi Kivity
  2009-08-17 14:16       ` Gregory Haskins
  2009-08-17 13:02     ` Gregory Haskins
  2 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-16  8:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Gregory Haskins, kvm, alacrityvm-devel, linux-kernel, netdev

On 08/15/2009 01:32 PM, Ingo Molnar wrote:
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>>
>>      
> Is there a consensus on this with the KVM folks? (i've added the KVM
> list to the Cc:)
>    

My opinion is that this is a duplication of effort and we'd be better 
off if everyone contributed to enhancing virtio, which already has 
widely deployed guest drivers and non-Linux guest support.

It may have merit if it is proven that it is technically superior to 
virtio (and I don't mean some benchmark in some point in time; I mean 
design wise).  So far I haven't seen any indications that it is.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-15 10:32   ` Ingo Molnar
  2009-08-15 19:15     ` Anthony Liguori
  2009-08-16  8:30     ` Avi Kivity
@ 2009-08-17 13:02     ` Gregory Haskins
  2009-08-17 14:25       ` Ingo Molnar
  2 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 13:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1880 bytes --]

Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
> 
>> This will generally be used for hypervisors to publish any host-side
>> virtual devices up to a guest.  The guest will have the opportunity
>> to consume any devices present on the vbus-proxy as if they were
>> platform devices, similar to existing buses like PCI.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>>  MAINTAINERS                 |    6 ++
>>  arch/x86/Kconfig            |    2 +
>>  drivers/Makefile            |    1 
>>  drivers/vbus/Kconfig        |   14 ++++
>>  drivers/vbus/Makefile       |    3 +
>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/vbus/Kconfig
>>  create mode 100644 drivers/vbus/Makefile
>>  create mode 100644 drivers/vbus/bus-proxy.c
>>  create mode 100644 include/linux/vbus_driver.h
> 
> Is there a consensus on this with the KVM folks? (i've added the KVM 
> list to the Cc:)
> 
> 

Hi Ingo,

Avi can correct me if I am wrong, but the agreement that he and I came
to a few months ago was something to the effect of:

kvm will be neutral towards various external IO subsystems, and instead
provide various hooks (see irqfd, ioeventfd) to permit these IO
subsystems to interface with kvm.

AlacrityVM is one of the first projects to take advantage of that
interface.  AlacrityVM is kvm-core + vbus-core + vbus-kvm-connector +
vbus-enhanced qemu + guest drivers.  This thread is part of the
guest-drivers portion.  Note that it is specific to alacrityvm, not kvm,
which is why the kvm list was not included in the conversation (also an
agreement with Avi: http://lkml.org/lkml/2009/8/6/231).

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-16  7:16       ` Ingo Molnar
@ 2009-08-17 13:54         ` Anthony Liguori
  2009-08-17 14:23           ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Anthony Liguori @ 2009-08-17 13:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

Ingo Molnar wrote:
>> I think the reason vbus gets better performance for networking 
>> today is that vbus' backends are in the kernel while virtio's 
>> backends are currently in userspace.  Since Michael has a 
>> functioning in-kernel backend for virtio-net now, I suspect we're 
>> weeks (maybe days) away from performance results.  My expectation 
>> is that vhost + virtio-net will be as good as venet + vbus.  If 
>> that's the case, then I don't see any reason to adopt vbus unless 
>>> Greg thinks there are other compelling features over virtio.
>>     
>
> Keeping virtio's backend in user-space was rather stupid IMHO. 
>   

I don't think it's quite so clear.

There's nothing about vhost_net that would prevent a userspace 
application from using it as a higher performance replacement for tun/tap.

The fact that we can avoid userspace for most of the fast paths is nice 
but that's really an issue of vhost_net vs. tun/tap.

From the kernel's perspective, a KVM guest is just a userspace 
process.  Having new userspace interfaces that are only useful to KVM 
guests would be a bad thing.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-15 19:15     ` Anthony Liguori
  2009-08-16  7:16       ` Ingo Molnar
@ 2009-08-17 14:14       ` Gregory Haskins
  2009-08-17 14:58         ` Avi Kivity
                           ` (2 more replies)
  1 sibling, 3 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 14:14 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Ingo Molnar, Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 6816 bytes --]

Anthony Liguori wrote:
> Ingo Molnar wrote:
>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>
>>  
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>> ---
>>>
>>>  MAINTAINERS                 |    6 ++
>>>  arch/x86/Kconfig            |    2 +
>>>  drivers/Makefile            |    1 
>>>  drivers/vbus/Kconfig        |   14 ++++
>>>  drivers/vbus/Makefile       |    3 +
>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>  create mode 100644 drivers/vbus/Kconfig
>>>  create mode 100644 drivers/vbus/Makefile
>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>  create mode 100644 include/linux/vbus_driver.h
>>>     
>>
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>>   
> 
> I'll let Avi comment about it from a KVM perspective but from a QEMU
> perspective, I don't think we want to support two paravirtual IO
> frameworks.  I'd like to see them converge.  Since there's an install
> base of guests today with virtio drivers, there really ought to be a
> compelling reason to change the virtio ABI in a non-backwards compatible
> way.


Note: No one has ever proposed to change the virtio-ABI.  In fact, this
thread in question doesn't even touch virtio, and even the patches that
I have previously posted to add virtio-capability do it in a
backwards-compatible way.

Case in point: Take an upstream kernel and you can modprobe the
vbus-pcibridge in and virtio devices will work over that transport
unmodified.

See http://lkml.org/lkml/2009/8/6/244 for details.

Note that I have tentatively dropped the virtio-vbus patch from the
queue due to lack of interest, but I can resurrect it if need be.

>  This means convergence really ought to be adding features to virtio.

virtio is a device model. vbus is a bus model and a host backend
facility.  Adding features to virtio would be orthogonal to some kind of
convergence goal.  virtio can run unmodified or add new features within
its own namespace independent of vbus, as it pleases.  vbus will simply
transport those changes.

> 
> On paper, I don't think vbus really has any features over virtio.

Again, do not confuse vbus with virtio.  They are different layers of
the stack.

>  vbus
> does things in different ways (paravirtual bus vs. pci for discovery)
> but I think we're happy with how virtio does things today.
> 

That's fine.  KVM can stick with virtio-pci if it wants.  AlacrityVM will
support virtio-pci and vbus (with possible convergence with
virtio-vbus).  If at some point KVM thinks vbus is interesting, I will
gladly work with getting it integrated into upstream KVM as well.  Until
then, they can happily coexist without issue between the two projects.


> I think the reason vbus gets better performance for networking today is
> that vbus' backends are in the kernel while virtio's backends are
> currently in userspace.

Well, with all due respect, you also said initially when I announced
vbus that in-kernel doesn't matter, and tried to make virtio-net run as
fast as venet from userspace ;)  Given that we never saw those userspace
patches from you that in fact equaled my performance, I assume you were
wrong about that statement.  Perhaps you were wrong about other things too?


> Since Michael has a functioning in-kernel
> backend for virtio-net now, I suspect we're weeks (maybe days) away from
> performance results.  My expectation is that vhost + virtio-net will be
> as good as venet + vbus.

This is not entirely impossible, at least for certain simple benchmarks
like singleton throughput and latency.  But if you think that this
somehow invalidates vbus as a concept, you have missed the point entirely.

vbus is about creating flexible (e.g. cross hypervisor, and even
physical system or userspace application) in-kernel IO containers with
Linux.  The "guest" interface represents what I believe to be the ideal
interface for ease of use, yet maximum performance for
software-to-software interaction.  This means very low latency and
high-throughput for both synchronous and asynchronous IO, minimizing
enters/exits, reducing enter/exit cost, prioritization, parallel
computation, etc.  The things that we (the alacrityvm community) have
coming down the pipeline for high-performance virtualization require
that these issues be addressed.

venet was originally crafted just to validate the approach and test the
vbus interface.  It ended up being so much faster than virtio-net that
people in the vbus community started coding against its ABI.  Therefore,
I decided to support it formally and indefinitely.  If I can get
consensus on virtio-vbus going forward, it will probably be the last
vbus-specific driver for which there is overlap with virtio (e.g.
virtio-block, virtio-console, etc).  Instead, you will only see native
vbus devices for non-native virtio type things, like real-time and
advanced fabric support.

OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
and it's likewise constrained by various limitations of that decision
(such as its reliance on the PCI model, and the kvm memory scheme).  The
tradeoff is that his approach will work in all existing virtio-net kvm
guests, and is probably significantly less code since he can re-use the
qemu PCI bus model.

Conversely, I am not afraid of requiring a new driver to optimize the
general PV interface.  In the long term, this will reduce the amount of
reimplementing the same code over and over, reduce system overhead, and
it adds new features not previously available (for instance, coalescing
and prioritizing interrupts).


> If that's the case, then I don't see any
>> reason to adopt vbus unless Greg thinks there are other compelling 
> features over virtio.

Aside from the fact that this is another confusion of the vbus/virtio
relationship...yes, of course there are compelling features (IMHO) or I
wouldn't be expending effort ;)  They are at least compelling enough to
put in AlacrityVM.  If upstream KVM doesn't want them, that's KVMs
decision and I am fine with that.  Simply never apply my qemu patches to
qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.  I
do hope that I can convince the KVM community otherwise, however. :)

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-16  8:30     ` Avi Kivity
@ 2009-08-17 14:16       ` Gregory Haskins
  2009-08-17 14:59         ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 14:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

Avi Kivity wrote:
> On 08/15/2009 01:32 PM, Ingo Molnar wrote:
>>> This will generally be used for hypervisors to publish any host-side
>>> virtual devices up to a guest.  The guest will have the opportunity
>>> to consume any devices present on the vbus-proxy as if they were
>>> platform devices, similar to existing buses like PCI.
>>>
>>>      
>> Is there a consensus on this with the KVM folks? (i've added the KVM
>> list to the Cc:)
>>    
> 
> My opinion is that this is a duplication of effort and we'd be better
> off if everyone contributed to enhancing virtio, which already has
> widely deployed guest drivers and non-Linux guest support.
> 
> It may have merit if it is proven that it is technically superior to
> virtio (and I don't mean some benchmark in some point in time; I mean
> design wise).  So far I haven't seen any indications that it is.
> 


The design is very different, so hopefully I can start to convince you
why it might be interesting.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 13:54         ` Anthony Liguori
@ 2009-08-17 14:23           ` Ingo Molnar
  0 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2009-08-17 14:23 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin


* Anthony Liguori <anthony@codemonkey.ws> wrote:

> Ingo Molnar wrote:
>>> I think the reason vbus gets better performance for networking today 
>>> is that vbus' backends are in the kernel while virtio's backends are 
>>> currently in userspace.  Since Michael has a functioning in-kernel 
>>> backend for virtio-net now, I suspect we're weeks (maybe days) away 
>>> from performance results.  My expectation is that vhost + virtio-net 
>>> will be as good as venet + vbus.  If that's the case, then I don't 
>>> see any reason to adopt vbus unless Greg thinks there are other 
>>> compelling features over virtio.
>>>     
>>
>> Keeping virtio's backend in user-space was rather stupid IMHO.
>
> I don't think it's quite so clear.

in such a narrow quote it's not so clear indeed - that's why i 
qualified it with:

>> Having the _option_ to piggyback to user-space (for flexibility, 
>> extensibility, etc.) is OK, but not having kernel acceleration is 
>> bad.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 13:02     ` Gregory Haskins
@ 2009-08-17 14:25       ` Ingo Molnar
  2009-08-17 15:05         ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-17 14:25 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel, netdev


* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Ingo Molnar wrote:
> > * Gregory Haskins <ghaskins@novell.com> wrote:
> > 
> >> This will generally be used for hypervisors to publish any host-side
> >> virtual devices up to a guest.  The guest will have the opportunity
> >> to consume any devices present on the vbus-proxy as if they were
> >> platform devices, similar to existing buses like PCI.
> >>
> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >> ---
> >>
> >>  MAINTAINERS                 |    6 ++
> >>  arch/x86/Kconfig            |    2 +
> >>  drivers/Makefile            |    1 
> >>  drivers/vbus/Kconfig        |   14 ++++
> >>  drivers/vbus/Makefile       |    3 +
> >>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
> >>  7 files changed, 251 insertions(+), 0 deletions(-)
> >>  create mode 100644 drivers/vbus/Kconfig
> >>  create mode 100644 drivers/vbus/Makefile
> >>  create mode 100644 drivers/vbus/bus-proxy.c
> >>  create mode 100644 include/linux/vbus_driver.h
> > 
> > Is there a consensus on this with the KVM folks? (i've added the KVM 
> > list to the Cc:)
> > 
> > 
> 
> Hi Ingo,
> 
> Avi can correct me if I am wrong, but the agreement that he and I 
> came to a few months ago was something to the effect of:
> 
> kvm will be neutral towards various external IO subsystems, and 
> instead provide various hooks (see irqfd, ioeventfd) to permit 
> these IO subsystems to interface with kvm.
> 
> AlacrityVM is one of the first projects to take advantage of that 
> interface.  AlacrityVM is kvm-core + vbus-core + 
> vbus-kvm-connector + vbus-enhanced qemu + guest drivers.  This 
> thread is part of the guest-drivers portion.  Note that it is 
> specific to alacrityvm, not kvm, which is why the kvm list was not 
> included in the conversation (also an agreement with Avi: 
> http://lkml.org/lkml/2009/8/6/231).

Well my own opinion is that the fracturing of the Linux internal 
driver space into diverging pieces of duplicate functionality 
(absent compelling technical reasons) is harmful.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:14       ` Gregory Haskins
@ 2009-08-17 14:58         ` Avi Kivity
  2009-08-17 15:05           ` Ingo Molnar
  2009-08-17 17:41         ` Michael S. Tsirkin
  2009-08-18  1:08         ` Anthony Liguori
  2 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-17 14:58 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

On 08/17/2009 05:14 PM, Gregory Haskins wrote:
> Note: No one has ever proposed to change the virtio-ABI.  In fact, this
> thread in question doesn't even touch virtio, and even the patches that
> I have previously posted to add virtio-capability do it in a backwards
> compatible way
>    

Your patches include venet, which is a direct competitor to virtio-net, 
so it splits the development effort.

> Case in point: Take an upstream kernel and you can modprobe the
> vbus-pcibridge in and virtio devices will work over that transport
> unmodified.
>    

Older kernels don't support it, and Windows doesn't support it.

>>   vbus
>> does things in different ways (paravirtual bus vs. pci for discovery)
>> but I think we're happy with how virtio does things today.
>>
>>      
> Thats fine.  KVM can stick with virtio-pci if it wants.  AlacrityVM will
> support virtio-pci and vbus (with possible convergence with
> virtio-vbus).  If at some point KVM thinks vbus is interesting, I will
> gladly work with getting it integrated into upstream KVM as well.  Until
> then, they can happily coexist without issue between the two projects.
>    

If vbus is to go upstream, it must go via the same path other drivers 
go.  Virtio wasn't merged via the kvm tree and virtio-host won't be either.

I don't have any technical objections to vbus/venet (I had in the past 
re interrupts but I believe you've addressed them), and it appears to 
perform very well.  However I still think we should address virtio's 
shortcomings (as Michael is doing) rather than create a competitor.  We 
have enough external competition, we don't need in-tree competitors.

>> I think the reason vbus gets better performance for networking today is
>> that vbus' backends are in the kernel while virtio's backends are
>> currently in userspace.
>>      
> Well, with all due respect, you also said initially when I announced
> vbus that in-kernel doesn't matter, and tried to make virtio-net run as
> fast as venet from userspace ;)  Given that we never saw those userspace
> patches from you that in fact equaled my performance, I assume you were
> wrong about that statement.

I too thought that if we'd improved the userspace interfaces we'd get 
fast networking without pushing virtio details into the kernels, 
benefiting not just kvm but the Linux community at large.  This might 
still be correct but in fact no one turned up with the patches.  Maybe 
they're impossible to write, hard to write, or uninteresting to write 
for those who are capable of writing them.  As it is, we've given up and 
Michael wrote vhost.

> Perhaps you were wrong about other things too?
>    

I'm pretty sure Anthony doesn't possess a Diploma of Perpetual Omniscience.

>> Since Michael has a functioning in-kernel
>> backend for virtio-net now, I suspect we're weeks (maybe days) away from
>> performance results.  My expectation is that vhost + virtio-net will be
>> as good as venet + vbus.
>>      
> This is not entirely impossible, at least for certain simple benchmarks
> like singleton throughput and latency.

What about more complex benchmarks?  Do you think vbus+venet has an 
advantage there?

> But if you think that this
> somehow invalidates vbus as a concept, you have missed the point entirely.
>
> vbus is about creating flexible (e.g. cross hypervisor, and even
> physical system or userspace application) in-kernel IO containers with
> Linux.  The "guest" interface represents what I believe to be the ideal
> interface for ease of use, yet maximum performance for
> software-to-software interaction.

Maybe.  But layering venet or vblock on top of it makes it specific to 
hypervisors.  The venet/vblock ABIs are not very interesting for 
user-to-user (and anyway, they could use virtio just as well).

> venet was originally crafted just to validate the approach and test the
> vbus interface.  It ended up being so much faster than virtio-net that
> people in the vbus community started coding against its ABI.

It ended up being much faster than qemu's host implementation, not the 
virtio ABI.  When asked you've indicated that you don't see any 
deficiencies in the virtio protocol.

> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> and it's likewise constrained by various limitations of that decision
> (such as its reliance on the PCI model, and the kvm memory scheme).  The
> tradeoff is that his approach will work in all existing virtio-net kvm
> guests, and is probably significantly less code since he can re-use the
> qemu PCI bus model.
>    

virtio does not depend on PCI and virtio-host does not either.

> Conversely, I am not afraid of requiring a new driver to optimize the
> general PV interface.  In the long term, this will reduce the amount of
> reimplementing the same code over and over, reduce system overhead, and
> it adds new features not previously available (for instance, coalescing
> and prioritizing interrupts).
>    

If it were proven to me a new driver is needed I'd switch too.  So far 
no proof has materialized.

>> If that's the case, then I don't see any
>> reason to adopt vbus unless Greg thinks there are other compelling 
>> features over virtio.
>>      
> Aside from the fact that this is another confusion of the vbus/virtio
> relationship...yes, of course there are compelling features (IMHO) or I
> wouldn't be expending effort ;)  They are at least compelling enough to
> put in AlacrityVM.  If upstream KVM doesn't want them, that's KVMs
> decision and I am fine with that.  Simply never apply my qemu patches to
> qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.  I
> do hope that I can convince the KVM community otherwise, however. :)
>    

If the vbus patches make it into the kernel I see no reason not to 
support them in qemu.  qemu supports dozens if not hundreds of devices, 
one more wouldn't matter.

But there's a lot of work before that can happen; for example you must 
support save/restore/migrate for vbus to be mergable.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:16       ` Gregory Haskins
@ 2009-08-17 14:59         ` Avi Kivity
  2009-08-17 15:09           ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-17 14:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev

On 08/17/2009 05:16 PM, Gregory Haskins wrote:
>> My opinion is that this is a duplication of effort and we'd be better
>> off if everyone contributed to enhancing virtio, which already has
>> widely deployed guest drivers and non-Linux guest support.
>>
>> It may have merit if it is proven that it is technically superior to
>> virtio (and I don't mean some benchmark in some point in time; I mean
>> design wise).  So far I haven't seen any indications that it is.
>>
>>      
>
> The design is very different, so hopefully I can start to convince you
> why it might be interesting.
>    

We've been through this before I believe.  If you can point out specific 
differences that make venet outperform virtio-net I'll be glad to hear 
(and steal) them though.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:25       ` Ingo Molnar
@ 2009-08-17 15:05         ` Gregory Haskins
  2009-08-17 15:08           ` Ingo Molnar
  2009-08-17 15:13           ` Avi Kivity
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 15:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 5022 bytes --]

Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>> Ingo Molnar wrote:
>>> * Gregory Haskins <ghaskins@novell.com> wrote:
>>>
>>>> This will generally be used for hypervisors to publish any host-side
>>>> virtual devices up to a guest.  The guest will have the opportunity
>>>> to consume any devices present on the vbus-proxy as if they were
>>>> platform devices, similar to existing buses like PCI.
>>>>
>>>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>>> ---
>>>>
>>>>  MAINTAINERS                 |    6 ++
>>>>  arch/x86/Kconfig            |    2 +
>>>>  drivers/Makefile            |    1 
>>>>  drivers/vbus/Kconfig        |   14 ++++
>>>>  drivers/vbus/Makefile       |    3 +
>>>>  drivers/vbus/bus-proxy.c    |  152 +++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/vbus_driver.h |   73 +++++++++++++++++++++
>>>>  7 files changed, 251 insertions(+), 0 deletions(-)
>>>>  create mode 100644 drivers/vbus/Kconfig
>>>>  create mode 100644 drivers/vbus/Makefile
>>>>  create mode 100644 drivers/vbus/bus-proxy.c
>>>>  create mode 100644 include/linux/vbus_driver.h
>>> Is there a consensus on this with the KVM folks? (i've added the KVM 
>>> list to the Cc:)
>>>
>>>
>> Hi Ingo,
>>
>> Avi can correct me if I am wrong, but the agreement that he and I 
>> came to a few months ago was something to the effect of:
>>
>> kvm will be neutral towards various external IO subsystems, and 
>> instead provide various hooks (see irqfd, ioeventfd) to permit 
>> these IO subsystems to interface with kvm.
>>
>> AlacrityVM is one of the first projects to take advantage of that 
>> interface.  AlacrityVM is kvm-core + vbus-core + 
>> vbus-kvm-connector + vbus-enhanced qemu + guest drivers.  This 
>> thread is part of the guest-drivers portion.  Note that it is 
>> specific to alacrityvm, not kvm, which is why the kvm list was not 
>> included in the conversation (also an agreement with Avi: 
>> http://lkml.org/lkml/2009/8/6/231).
> 
> Well my own opinion is that the fracturing of the Linux internal 
> driver space into diverging pieces of duplicate functionality 
> (absent compelling technical reasons) is harmful.

[Adding Michael Tsirkin]

Hi Ingo,

1) First off, let me state that I have made every effort to propose this
as a solution to integrate with KVM, the most recent of which is April:

http://lkml.org/lkml/2009/4/21/408

If you read through the various vbus related threads on LKML/KVM posted
this year, I think you will see that I made numerous polite offerings to
work with people on finding a common solution here, including Michael.

In the end, Michael decided to go a different route using some of the
ideas proposed in vbus + venet-tap to create vhost-net.  This is fine,
and I respect his decision.  But do not try to pin "fracturing" on me,
because I tried everything to avoid it. :)

Since I still disagree with the fundamental approach of how KVM IO
works, I am continuing my effort in the downstream project "AlacrityVM"
which will hopefully serve to build a better understanding of what it is
I am doing with the vbus technology, and a point to maintain the subsystem.

2) There *are* technical reasons for this change (and IMHO, they are
compelling), many of which have already been previously discussed
(including my last reply to Anthony) so I wont rehash them here.

3) Even if there really is some duplication here, I disagree with you
that it is somehow harmful to the Linux community per se.  Case in
point, look at the graphs posted on the AlacrityVM wiki:

http://developer.novell.com/wiki/index.php/AlacrityVM

Prior to my effort, KVM was humming along at the status quo and I came
along with a closer eye and almost doubled the throughput and cut
latency by 78%.  Given an apparent disagreement with aspects of my
approach, Michael went off and created a counter example that was
motivated by my performance findings.

Therefore, even if Avi ultimately accepts Michaels vhost approach
instead of mine, Linux as a hypervisor platform has been significantly
_improved_ by a little friendly competition, not somehow damaged by it.

4) Lastly, these patches are almost entirely just stand-alone Linux
drivers that do not affect KVM if KVM doesn't wish to acknowledge them.
 It's just like any of the other numerous drivers that are accepted
upstream into Linux every day.  The only maintained subsystem that is
technically touched by this series is netdev, and David Miller already
approved of the relevant patch's inclusion:

http://lkml.org/lkml/2009/8/3/505

So with all due respect, where is the problem?  The patches are all
professionally developed according to the Linux coding standards, pass
checkpatch, are GPL'ed, and work with a freely available platform which
you can download today
(http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary)


Kind Regards,
-Greg





[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:58         ` Avi Kivity
@ 2009-08-17 15:05           ` Ingo Molnar
  0 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2009-08-17 15:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Anthony Liguori, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin


* Avi Kivity <avi@redhat.com> wrote:

> I don't have any technical objections to vbus/venet (I had in the 
> past re interrupts but I believe you've addressed them), and it 
> appears to perform very well.  However I still think we should 
> address virtio's shortcomings (as Michael is doing) rather than 
> create a competitor.  We have enough external competition, we 
> don't need in-tree competitors.

I do have strong technical objections: distributions really want to 
standardize on as few Linux internal virtualization APIs as 
possible, so splintering it just because /bin/cp is easy to do is 
bad.

If virtio pulls even with vbus's performance and vbus has no 
advantages over virtio, i do NAK vbus on that basis. Let's stop the 
silliness before it starts hurting users. Coming up with something 
better is good, but doing an incompatible, duplicative framework 
just for NIH reasons is stupid and should be resisted.

People dont get to add a new sys_read_v2() without strong technical 
arguments either - the same holds for our Linux internal driver 
abstractions, APIs and ABIs.

	ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:05         ` Gregory Haskins
@ 2009-08-17 15:08           ` Ingo Molnar
  2009-08-17 19:33             ` Gregory Haskins
  2009-08-17 15:13           ` Avi Kivity
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-17 15:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin


* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Hi Ingo,
> 
> 1) First off, let me state that I have made every effort to 
> propose this as a solution to integrate with KVM, the most recent 
> of which is April:
> 
>    http://lkml.org/lkml/2009/4/21/408
> 
> If you read through the various vbus related threads on LKML/KVM 
> posted this year, I think you will see that I made numerous polite 
> offerings to work with people on finding a common solution here, 
> including Michael.
> 
> In the end, Michael decided to go a different route using some 
> of the ideas proposed in vbus + venet-tap to create vhost-net.  
> This is fine, and I respect his decision.  But do not try to pin 
> "fracturing" on me, because I tried everything to avoid it. :)

That's good.

So if virtio is fixed to be as fast as vbus, and if there are no other 
technical advantages of vbus over virtio, you'll be glad to drop vbus 
and stand behind virtio?

Also, are you willing to help virtio to become faster? Or do you 
have arguments for why that is impossible and why the only 
possible solution is vbus? Avi says no such arguments were offered 
so far.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:59         ` Avi Kivity
@ 2009-08-17 15:09           ` Gregory Haskins
  2009-08-17 15:14             ` Ingo Molnar
  2009-08-17 15:18             ` Avi Kivity
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 15:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1007 bytes --]

Avi Kivity wrote:
> On 08/17/2009 05:16 PM, Gregory Haskins wrote:
>>> My opinion is that this is a duplication of effort and we'd be better
>>> off if everyone contributed to enhancing virtio, which already has
>>> widely deployed guest drivers and non-Linux guest support.
>>>
>>> It may have merit if it is proven that it is technically superior to
>>> virtio (and I don't mean some benchmark in some point in time; I mean
>>> design wise).  So far I haven't seen any indications that it is.
>>>
>>>      
>>
>> The design is very different, so hopefully I can start to convince you
>> why it might be interesting.
>>    
> 
> We've been through this before I believe.  If you can point out specific
> differences that make venet outperform virtio-net I'll be glad to hear
> (and steal) them though.
> 

You sure know how to convince someone to collaborate with you, eh?

Unfortunately, I've answered that question numerous times, but it
apparently falls on deaf ears.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:05         ` Gregory Haskins
  2009-08-17 15:08           ` Ingo Molnar
@ 2009-08-17 15:13           ` Avi Kivity
  1 sibling, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-17 15:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

On 08/17/2009 06:05 PM, Gregory Haskins wrote:
> Hi Ingo,
>
> 1) First off, let me state that I have made every effort to propose this
> as a solution to integrate with KVM, the most recent of which is April:
>
> http://lkml.org/lkml/2009/4/21/408
>
> If you read through the various vbus related threads on LKML/KVM posted
> this year, I think you will see that I made numerous polite offerings to
> work with people on finding a common solution here, including Michael.
>
> In the end, Michael decided that go a different route using some of the
> ideas proposed in vbus + venet-tap to create vhost-net.  This is fine,
> and I respect his decision.  But do not try to pin "fracturing" on me,
> because I tried everything to avoid it. :)
>    

Given your post, there are only three possible ways to continue kvm 
guest driver development:

- develop virtio/vhost, drop vbus/venet
- develop vbus/venet, drop virtio
- develop both

Developing both fractures the community.   Dropping virtio invalidates 
the installed base and Windows effort.  There were no strong technical 
reasons shown in favor of the remaining option.


> Since I still disagree with the fundamental approach of how KVM IO
> works,

What's that?

> Prior to my effort, KVM was humming along at the status quo and I came
> along with a closer eye and almost doubled the throughput and cut
> latency by 78%.  Given an apparent disagreement with aspects of my
> approach, Michael went off and created a counter example that was
> motivated by my performance findings.
>    

Oh, virtio-net performance was a thorn in our side for a long time.  I 
agree that venet was an additional spur.

> Therefore, even if Avi ultimately accepts Michaels vhost approach
> instead of mine, Linux as a hypervisor platform has been significantly
> _improved_ by a little friendly competition, not somehow damaged by it.
>    

Certainly, and irqfd/ioeventfd are a net win in any case.

> 4) Lastly, these patches are almost entirely just stand-alone Linux
> drivers that do not affect KVM if KVM doesn't wish to acknowledge them.
>   It's just like any of the other numerous drivers that are accepted
> upstream into Linux every day.  The only maintained subsystem that is
> technically touched by this series is netdev, and David Miller already
> approved of the relevant patch's inclusion:
>
> http://lkml.org/lkml/2009/8/3/505
>
> So with all due respect, where is the problem?  The patches are all
> professionally developed according to the Linux coding standards, pass
> checkpatch, are GPL'ed, and work with a freely available platform which
> you can download today
> (http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary)
>    

As I mentioned before, I have no technical objections to the patches; I 
just wish the effort could be concentrated in one direction.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:09           ` Gregory Haskins
@ 2009-08-17 15:14             ` Ingo Molnar
  2009-08-17 19:35               ` Gregory Haskins
  2009-08-17 15:18             ` Avi Kivity
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-17 15:14 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Gregory Haskins, kvm, alacrityvm-devel, linux-kernel, netdev


* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> Avi Kivity wrote:
> > On 08/17/2009 05:16 PM, Gregory Haskins wrote:
> >>> My opinion is that this is a duplication of effort and we'd be better
> >>> off if everyone contributed to enhancing virtio, which already has
> >>> widely deployed guest drivers and non-Linux guest support.
> >>>
> >>> It may have merit if it is proven that it is technically superior to
> >>> virtio (and I don't mean some benchmark in some point in time; I mean
> >>> design wise).  So far I haven't seen any indications that it is.
> >>>
> >>>      
> >>
> >> The design is very different, so hopefully I can start to convince you
> >> why it might be interesting.
> >>    
> > 
> > We've been through this before I believe.  If you can point out 
> > specific differences that make venet outperform virtio-net I'll 
> > be glad to hear (and steal) them though.
> 
> You sure know how to convince someone to collaborate with you, eh?
> 
> Unfortunately, I've answered that question numerous times, but it 
> apparently falls on deaf ears.

I'm trying to find the relevant discussion. The link you gave in the 
previous mail:

  http://lkml.org/lkml/2009/4/21/408

does not offer any design analysis of vbus versus virtio, and why 
the only fix to virtio is vbus. It offers a comparison and a blanket 
statement that vbus is superior but no arguments.

(If you've already explained in a past thread then please give me an 
URL to that reply if possible, or forward me that prior reply. 
Thanks!)

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:09           ` Gregory Haskins
  2009-08-17 15:14             ` Ingo Molnar
@ 2009-08-17 15:18             ` Avi Kivity
  1 sibling, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-17 15:18 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev

On 08/17/2009 06:09 PM, Gregory Haskins wrote:
>
>> We've been through this before I believe.  If you can point out specific
>> differences that make venet outperform virtio-net I'll be glad to hear
>> (and steal) them though.
>>
>>      
> You sure know how to convince someone to collaborate with you, eh?
>
>    

If I've offended you, I apologize.

> Unfortunately, I've answered that question numerous times, but it
> apparently falls on deaf ears.
>    

Well, I'm sorry, I truly don't think I've had that question answered 
with specificity.  I'm really interested in it (out of a selfish desire 
to improve virtio), but the only comment I recall from you was to the 
effect that the virtio rings were better than ioq in terms of cache 
placement.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:14       ` Gregory Haskins
  2009-08-17 14:58         ` Avi Kivity
@ 2009-08-17 17:41         ` Michael S. Tsirkin
  2009-08-17 20:17           ` Gregory Haskins
  2009-08-18  1:08         ` Anthony Liguori
  2 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-17 17:41 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm, Avi Kivity,
	alacrityvm-devel, linux-kernel, netdev

On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
> Case in point: Take an upstream kernel and you can modprobe the
> vbus-pcibridge in and virtio devices will work over that transport
> unmodified.
> 
> See http://lkml.org/lkml/2009/8/6/244 for details.

The modprobe you are talking about would need
to be done in the guest kernel, correct?

> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> and it's likewise constrained by various limitations of that decision
> (such as its reliance on the PCI model, and the kvm memory scheme).

vhost is actually not related to PCI in any way. It simply leaves all
setup for userspace to do.  And the memory scheme was intentionally
separated from kvm so that it can easily support e.g. lguest.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:08           ` Ingo Molnar
@ 2009-08-17 19:33             ` Gregory Haskins
  2009-08-18  8:33               ` Avi Kivity
  2009-08-18  9:53               ` Michael S. Tsirkin
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 19:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 11547 bytes --]

Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>> Hi Ingo,
>>
>> 1) First off, let me state that I have made every effort to 
>> propose this as a solution to integrate with KVM, the most recent 
>> of which is April:
>>
>>    http://lkml.org/lkml/2009/4/21/408
>>
>> If you read through the various vbus related threads on LKML/KVM 
>> posted this year, I think you will see that I made numerous polite 
>> offerings to work with people on finding a common solution here, 
>> including Michael.
>>
>> In the end, Michael decided to go a different route using some 
>> of the ideas proposed in vbus + venet-tap to create vhost-net.  
>> This is fine, and I respect his decision.  But do not try to pin 
>> "fracturing" on me, because I tried everything to avoid it. :)
> 
> That's good.
> 
> So if virtio is fixed to be as fast as vbus, and if there's no other 
> techical advantages of vbus over virtio you'll be glad to drop vbus 
> and stand behind virtio?

To reiterate: vbus and virtio are not mutually exclusive.  The virtio
device model rides happily on top of the vbus bus model.

This is primarily a question of the virtio-pci adapter, vs virtio-vbus.

For more details, see this post: http://lkml.org/lkml/2009/8/6/244

There is a secondary question of venet (a vbus native device) versus
virtio-net (a virtio native device that works with PCI or VBUS).  If
this contention is really around venet vs virtio-net, I may possibly
concede and retract its submission to mainline.  I've been pushing it to
date because people are using it and I don't see any reason that the
driver couldn't be upstream.

> 
> Also, are you willing to help virtio to become faster?

Yes, that is not a problem.  Note that virtio in general, and
virtio-net/venet in particular are not the primary goal here, however.
Improved 802.x and block IO are just positive side-effects of the
effort.  I started with 802.x networking just to demonstrate the IO
layer capabilities, and to test it.  It ended up being so good in
contrast to existing facilities that developers in the vbus community
started using it for production development.

Ultimately, I created vbus to address areas of performance that have not
yet been addressed in things like KVM.  Areas such as real-time guests,
or RDMA (host bypass) interfaces.  I also designed it in such a way that
we could, in theory, write one set of (linux-based) backends, and have
them work across a variety of environments (such as containers/VMs like
KVM, lguest, openvz, but also physical systems like blade enclosures and
clusters, or even applications running on the host).

> Or do you 
> have arguments why that is impossible to do so and why the only 
> possible solution is vbus? Avi says no such arguments were offered 
> so far.

Not for lack of trying.  I think my points have just been missed
every time I try to describe them. ;)  Basically I write a message very
similar to this one, and the next conversation starts back from square
one.  But I digress, let me try again..

Noting that this discussion is really about the layer *below* virtio,
not virtio itself (e.g. PCI vs vbus).  Let's start with a little background:

-- Background --

So on one level, we have the resource-container technology called
"vbus".  It lets you create a container on the host, fill it with
virtual devices, and assign that container to some context (such as a
KVM guest).  These "devices" are LKMs, and each device has a very simple
verb namespace consisting of a synchronous "call()" method, and a
"shm()" method for establishing async channels.

The async channels are just shared-memory with a signal path (e.g.
interrupts and hypercalls), which the device+driver can use to overlay
things like rings (virtqueues, IOQs), or other shared-memory based
constructs of their choosing (such as a shared table).  The signal path
is designed to minimize enter/exits and reduce spurious signals in a
unified way (see shm-signal patch).

call() can be used both for config-space-like details, as well as
fast-path messaging that requires synchronous behavior (such as guest
scheduler updates).

All of this is managed via sysfs/configfs.

On the guest, we have a "vbus-proxy" which is how the guest gets access
to devices assigned to its container.  (as an aside, "virtio" devices
can be populated in the container, and then surfaced up to the
virtio-bus via that virtio-vbus patch I mentioned).

There is a thing called a "vbus-connector" which is the guest specific
part.  Its job is to connect the vbus-proxy in the guest, to the vbus
container on the host.  How it does its job is specific to the connector
implementation, but its role is to transport messages between the guest
and the host (such as for call() and shm() invocations) and to handle
things like discovery and hotswap.

-- Issues --

Out of all this, I think the biggest contention point is the design of
the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
wrong and you object to other aspects as well).  I suspect that if I had
designed the vbus-connector to surface vbus devices as PCI devices via
QEMU, the patches would potentially have been pulled in a while ago.

There are, of course, reasons why vbus does *not* render as PCI, so this
is the meat of your question, I believe.

At a high level, PCI was designed for software-to-hardware interaction,
so it makes assumptions about that relationship that do not necessarily
apply to virtualization.

For instance:

A) hardware can only generate byte/word sized requests at a time because
that is all the pcb-etch and silicon support. So hardware is usually
expressed in terms of some number of "registers".

B) each access to one of these registers is relatively cheap

C) the target end-point has no visibility into the CPU machine state
other than the parameters passed in the bus-cycle (usually an address
and data tuple).

D) device-ids are in a fixed width register and centrally assigned from
an authority (e.g. PCI-SIG).

E) Interrupt/MSI routing is per-device oriented

F) Interrupts/MSI are assumed cheap to inject

G) Interrupts/MSI are non-prioritizable.

H) Interrupts/MSI are statically established

These assumptions and constraints may be completely different or simply
invalid in a virtualized guest. For instance, the hypervisor is just
software, and therefore it's not restricted to "etch" constraints. IO
requests can be arbitrarily large, just as if you are invoking a library
function-call or OS system-call. Likewise, each one of those requests is
a branch and a context switch, so it often has greater performance
implications than a simple register bus-cycle in hardware.  If you use
an MMIO variant, it has to run through the page-fault code to be decoded.

The result is typically decreased performance if you try to do the same
thing real hardware does. This is why hypervisor-specific drivers
(e.g. virtio-net, vmnet, etc.) are such a common feature.

_Some_ performance oriented items can technically be accomplished in
PCI, albeit in a much more awkward way.  For instance, you can set up a
really fast, low-latency "call()" mechanism using a PIO port on a
PCI-model and ioeventfd.  As a matter of fact, this is exactly what the
vbus pci-bridge does:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/vbus/pci-bridge.c;h=f0ed51af55b5737b3ae4239ed2adfe12c7859941;hb=ee557a5976921650b792b19e6a93cd03fcad304a#l102

(Also note that the enabling technology, ioeventfd, is something that
came out of my efforts on vbus).

The problem here is that this is incredibly awkward to set up.  You have
all that per-cpu goo and the registration of the memory on the guest.
And on the host side, you have all the vmapping of the registered
memory, and the file-descriptor to manage.  In short, it's really painful.

I would much prefer to do this *once*, and then let all my devices
simple re-use that infrastructure.  This is, in fact, what I do.  Here
is the device model that a guest sees:

struct vbus_device_proxy_ops {
        int (*open)(struct vbus_device_proxy *dev, int version, int flags);
        int (*close)(struct vbus_device_proxy *dev, int flags);
        int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
                   void *ptr, size_t len,
                   struct shm_signal_desc *sigdesc,
                   struct shm_signal **signal,
                   int flags);
        int (*call)(struct vbus_device_proxy *dev, u32 func,
                    void *data, size_t len, int flags);
        void (*release)(struct vbus_device_proxy *dev);
};

Now the client just calls dev->call() and it's lightning quick, and they
don't have to worry about all the details of making it quick, nor expend
additional per-cpu heap and address space to get it.
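
As a rough usage sketch (assuming the proxy exposes its ops via
dev->ops; the VENET_FUNC_LINKUP code and the flags payload are invented
here purely for illustration), a driver-side call boils down to:

#define VENET_FUNC_LINKUP 1     /* hypothetical function code */

/* Hypothetical example: only the call() signature comes from the ops
 * structure above; the function code and payload are illustrative. */
static int example_link_up(struct vbus_device_proxy *dev)
{
        u32 flags = 0;          /* illustrative payload */

        /* one synchronous round-trip into the host-side device model */
        return dev->ops->call(dev, VENET_FUNC_LINKUP,
                              &flags, sizeof(flags), 0);
}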

Moving on: _Other_ items cannot be replicated (at least, not without
hacking it into something that is no longer PCI).

Things like the pci-id namespace are just silly for software.  I would
rather have a namespace that does not require central management so
people are free to create vbus-backends at will.  This is akin to
registering a device MAJOR/MINOR, versus using the various dynamic
assignment mechanisms.  vbus uses a string identifier in place of a
pci-id.  This is superior IMHO, and not compatible with PCI.
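
Just to illustrate the contrast (the vbus side below is a simplified
stand-in, not the structure actually used in the patches):

#include <linux/pci.h>

/* PCI: drivers bind to fixed-width IDs handed out by a central registry */
static const struct pci_device_id example_pci_ids[] = {
        { PCI_DEVICE(0x1af4, 0x1000) },  /* Red Hat/Qumranet vendor, virtio-net */
        { 0 },
};

/* vbus (simplified): drivers bind to a free-form string chosen by the
 * backend author, so no central registry is required */
struct example_vbus_match {
        const char *type;
};

static const struct example_vbus_match example_venet_match = {
        .type = "virtual-ethernet",
};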

As another example, the connector design coalesces *all* shm-signals
into a single interrupt (by prio) that uses the same context-switch
mitigation techniques that help boost things like networking.  This
effectively means we can detect and optimize out ack/eoi cycles from the
APIC as the IO load increases (which is when you need it most).  PCI has
no such concept.
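
A rough sketch of the idea (the names and the queue helper here are
invented for illustration; the real logic lives in the connector and
pci-bridge code):

#include <linux/interrupt.h>

struct example_event {
        int prio;                        /* 0 = highest priority           */
        void (*handler)(void *priv);     /* per-device shm-signal callback */
        void *priv;
};

/* Hypothetical helper: dequeue the highest-priority pending event from
 * the shared-memory event queue; returns false when the queue is empty. */
bool example_pop_event(void *eventq, struct example_event *ev);

/* One interrupt (and thus one ack/eoi) drains an arbitrary number of
 * coalesced per-device signals, highest priority first. */
static irqreturn_t example_vbus_isr(int irq, void *eventq)
{
        struct example_event ev;

        while (example_pop_event(eventq, &ev))
                ev.handler(ev.priv);

        return IRQ_HANDLED;
}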

In addition, the signals and interrupts are priority aware, which is
useful for things like 802.1p networking where you may establish 8-tx
and 8-rx queues for your virtio-net device.  x86 APIC really has no
usable equivalent, so PCI is stuck here.

Also, the signals can be allocated on-demand for implementing things
like IPC channels in response to guest requests since there is no
assumption about device-to-interrupt mappings.  This is more flexible.

And through all of this, this design would work in any guest even if it
doesn't have PCI (e.g. lguest, UML, physical systems, etc).

-- Bottom Line --

The idea here is to generalize all the interesting parts that are common
(fast sync+async io, context-switch mitigation, back-end models, memory
abstractions, signal-path routing, etc) that a variety of linux based
technologies can use (kvm, lguest, openvz, uml, physical systems) and
only require the thin "connector" code to port the system around.  The
idea is to try to get this aspect of PV right once, and at some point in
the future, perhaps vbus will be as ubiquitous as PCI.  Well, perhaps
not *that* ubiquitous, but you get the idea ;)

Then device models like virtio can ride happily on top and we end up
with a really robust and high-performance Linux-based stack.  I don't
buy the argument that we already have PCI so let's use it.  I don't think
it's the best design and I am not afraid to make an investment in a
change here because I think it will pay off in the long run.

I hope this helps to clarify my motivation.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 15:14             ` Ingo Molnar
@ 2009-08-17 19:35               ` Gregory Haskins
  0 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 19:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Gregory Haskins, kvm, alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1818 bytes --]

Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>> Avi Kivity wrote:
>>> On 08/17/2009 05:16 PM, Gregory Haskins wrote:
>>>>> My opinion is that this is a duplication of effort and we'd be better
>>>>> off if everyone contributed to enhancing virtio, which already has
>>>>> widely deployed guest drivers and non-Linux guest support.
>>>>>
>>>>> It may have merit if it is proven that it is technically superior to
>>>>> virtio (and I don't mean some benchmark in some point in time; I mean
>>>>> design wise).  So far I haven't seen any indications that it is.
>>>>>
>>>>>      
>>>> The design is very different, so hopefully I can start to convince you
>>>> why it might be interesting.
>>>>    
>>> We've been through this before I believe.  If you can point out 
>>> specific differences that make venet outperform virtio-net I'll 
>>> be glad to hear (and steal) them though.
>> You sure know how to convince someone to collaborate with you, eh?
>>
>> Unforunately, i've answered that question numerous times, but it 
>> apparently falls on deaf ears.
> 
> I'm trying to find the relevant discussion. The link you gave in the 
> previous mail:
> 
>   http://lkml.org/lkml/2009/4/21/408
> 
> does not offer any design analysis of vbus versus virtio, and why 
> the only fix to virtio is vbus. It offers a comparison and a blanket 
> statement that vbus is superior but no arguments.
> 
> (If you've already explained in a past thread then please give me an 
> URL to that reply if possible, or forward me that prior reply. 
> Thanks!)


Sorry, it was a series of long threads from quite a while back.  I will
see if I can find some references, but it might be easier to just start
fresh (see the last reply I sent).

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 17:41         ` Michael S. Tsirkin
@ 2009-08-17 20:17           ` Gregory Haskins
  2009-08-18  8:46             ` Michael S. Tsirkin
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-17 20:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm, Avi Kivity,
	alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 3195 bytes --]

Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
>> Case in point: Take an upstream kernel and you can modprobe the
>> vbus-pcibridge in and virtio devices will work over that transport
>> unmodified.
>>
>> See http://lkml.org/lkml/2009/8/6/244 for details.
> 
> The modprobe you are talking about would need
> to be done in guest kernel, correct?

Yes, and your point is?  "unmodified" (pardon the pseudo pun) modifies
"virtio", not "guest".  It means you can take an off-the-shelf kernel
with off-the-shelf virtio (ala distro-kernel) and modprobe
vbus-pcibridge and get alacrityvm acceleration.  It is not a design goal
of mine to forbid the loading of a new driver, so I am ok with that
requirement.

> 
>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
>> and its likewise constrained by various limitations of that decision
>> (such as its reliance of the PCI model, and the kvm memory scheme).
> 
> vhost is actually not related to PCI in any way. It simply leaves all
> setup for userspace to do.  And the memory scheme was intentionally
> separated from kvm so that it can easily support e.g. lguest.
> 

I think you have missed my point. I mean that vhost requires a separate
bus-model (ala qemu-pci).  And no, your memory scheme is not separated,
at least, not very well.  It still assumes memory-regions and
copy_to_user(), which is very kvm-esque.  Vbus has people using things
like userspace containers (no regions), and physical hardware (dma
controllers, so no regions or copy_to_user) so your scheme quickly falls
apart once you get away from KVM.

Don't get me wrong:  That design may have its place.  Perhaps you only
care about fixing KVM, which is a perfectly acceptable strategy.  Its
just not a strategy that I think is the best approach.  Essentially you
are promoting the proliferation of competing backends, and I am trying
to unify them (which is ironic, given that this thread started with
concerns that I was fragmenting things ;).

The bottom line is, you have a simpler solution that is more finely
targeted at KVM and virtio-networking.  It fixes probably a lot of
problems with the existing implementation, but it still has limitations.

OTOH, what I am promoting is more complex, but more flexible.  That is
the tradeoff.  You can't have both ;) So do not for one second think
that what you implemented is equivalent, because they are not.

In fact, I believe I warned you about this potential problem when you
decided to implement your own version.  I think I said something to the
effect of "you will either have a subset of functionality, or you will
ultimately reinvent what I did".  Right now you are in the subset phase.
 Perhaps someday you will be in the complete-reinvent phase.  Why you
wanted to go that route when I had already worked though the issues is
something perhaps only you will ever know, but I'm sure you had your
reasons.  But do note you could have saved yourself grief by reusing my
already implemented and tested variant, as I politely offered to work
with you on making it meet your needs.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 14:14       ` Gregory Haskins
  2009-08-17 14:58         ` Avi Kivity
  2009-08-17 17:41         ` Michael S. Tsirkin
@ 2009-08-18  1:08         ` Anthony Liguori
  2009-08-18  7:38           ` Avi Kivity
                             ` (2 more replies)
  2 siblings, 3 replies; 132+ messages in thread
From: Anthony Liguori @ 2009-08-18  1:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

Gregory Haskins wrote:
> Note: No one has ever proposed to change the virtio-ABI.

virtio-pci is part of the virtio ABI.  You are proposing changing that.

You cannot add new kernel modules to guests and expect them to remain 
supported.  So there is value in reusing existing ABIs.

>> I think the reason vbus gets better performance for networking today is
>> that vbus' backends are in the kernel while virtio's backends are
>> currently in userspace.
>>     
>
> Well, with all due respect, you also said initially when I announced
> vbus that in-kernel doesn't matter, and tried to make virtio-net run as
> fast as venet from userspace ;)  Given that we never saw those userspace
> patches from you that in fact equaled my performance, I assume you were
> wrong about that statement.  Perhaps you were wrong about other things too?
>   

I'm wrong about a lot of things :-)  I haven't yet been convinced that 
I'm wrong here though.

One of the gray areas here is what constitutes an in-kernel backend.  
tun/tap is a sort of an in-kernel backend.  Userspace is still involved 
in all of the paths.  vhost seems to be an intermediate step between 
tun/tap and vbus.  The fast paths avoid userspace completely.  Many of 
the slow paths involve userspace still (like migration apparently).  
With vbus, userspace is avoided entirely.  In some ways, you could argue 
that slirp and vbus are opposite ends of the virtual I/O spectrum.

I believe strongly that we should avoid putting things in the kernel 
unless they absolutely have to be.  I'm definitely interested in playing 
with vhost to see if there are ways to put even less in the kernel.  In 
particular, I think it would be a big win to avoid knowledge of slots in 
the kernel by doing ring translation in userspace.  This implies a 
userspace transition in the fast path.  This may or may not be 
acceptable.  I think this is going to be a very interesting experiment 
and will ultimately determine whether my intuition about the cost of 
dropping to userspace is right or wrong.


> Conversely, I am not afraid of requiring a new driver to optimize the
> general PV interface.  In the long term, this will reduce the amount of
> reimplementing the same code over and over, reduce system overhead, and
> it adds new features not previously available (for instance, coalescing
> and prioritizing interrupts).
>   

I think you have a lot of ideas and I don't know that we've been able to 
really understand your vision.  Do you have any plans on writing a paper 
about vbus that goes into some of your thoughts in detail?

>> If that's the case, then I don't see any
>> reason to adopt vbus unless Greg things there are other compelling
>> features over virtio.
>>     
>
> Aside from the fact that this is another confusion of the vbus/virtio
> relationship...yes, of course there are compelling features (IMHO) or I
> wouldn't be expending effort ;)  They are at least compelling enough to
> put in AlacrityVM.

This whole AlacrityVM thing is really hitting this nail with a 
sledgehammer.  While the kernel needs to be very careful about what it 
pulls in, as long as you're willing to commit to ABI compatibility, we 
can pull code into QEMU to support vbus.  Then you can just offer vbus 
host and guest drivers instead of forking the kernel.

>   If upstream KVM doesn't want them, that's KVMs
> decision and I am fine with that.  Simply never apply my qemu patches to
> qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.

As I mentioned before, if you submit patches to upstream QEMU, we'll 
apply them (after appropriate review).  As I said previously, we want to 
avoid user confusion as much as possible.  Maybe this means limiting it 
to -device or a separate machine type.  I'm not sure, but that's 
something we can discuss on qemu-devel.

>   I
> do hope that I can convince the KVM community otherwise, however. :)
>   

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  1:08         ` Anthony Liguori
@ 2009-08-18  7:38           ` Avi Kivity
  2009-08-18  8:54           ` Michael S. Tsirkin
  2009-08-18 13:16           ` Gregory Haskins
  2 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-18  7:38 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

On 08/18/2009 04:08 AM, Anthony Liguori wrote:
> I believe strongly that we should avoid putting things in the kernel 
> unless they absolutely have to be.  I'm definitely interested in 
> playing with vhost to see if there are ways to put even less in the 
> kernel.  In particular, I think it would be a big win to avoid 
> knowledge of slots in the kernel by doing ring translation in 
> userspace.  This implies a userspace transition in the fast path.  
> This may or may not be acceptable.  I think this is going to be a very 
> interesting experiment and will ultimately determine whether my 
> intuition about the cost of dropping to userspace is right or wrong.

I believe with a perfectly scaling qemu this should be feasible.  
Currently qemu is far from scaling perfectly, but inefficient userspace 
is not a reason to put things into the kernel.

Having a translated ring is also a nice solution for migration - 
userspace can mark the pages dirty while translating the receive ring.

Still, in-kernel translation is simple enough that I think we should 
keep it.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 19:33             ` Gregory Haskins
@ 2009-08-18  8:33               ` Avi Kivity
  2009-08-18 14:46                 ` Gregory Haskins
  2009-08-18  9:53               ` Michael S. Tsirkin
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18  8:33 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

On 08/17/2009 10:33 PM, Gregory Haskins wrote:
>
> There is a secondary question of venet (a vbus native device) verses
> virtio-net (a virtio native device that works with PCI or VBUS).  If
> this contention is really around venet vs virtio-net, I may possibly
> conceed and retract its submission to mainline.  I've been pushing it to
> date because people are using it and I don't see any reason that the
> driver couldn't be upstream.
>    

That's probably the cause of much confusion.  The primary kvm pain point 
is now networking, so in any vbus discussion we're concentrating on that 
aspect.

>> Also, are you willing to help virtio to become faster?
>>      
> Yes, that is not a problem.  Note that virtio in general, and
> virtio-net/venet in particular are not the primary goal here, however.
> Improved 802.x and block IO are just positive side-effects of the
> effort.  I started with 802.x networking just to demonstrate the IO
> layer capabilities, and to test it.  It ended up being so good on
> contrast to existing facilities, that developers in the vbus community
> started using it for production development.
>
> Ultimately, I created vbus to address areas of performance that have not
> yet been addressed in things like KVM.  Areas such as real-time guests,
> or RDMA (host bypass) interfaces.

Can you explain how vbus achieves RDMA?

I also don't see the connection to real time guests.

> I also designed it in such a way that
> we could, in theory, write one set of (linux-based) backends, and have
> them work across a variety of environments (such as containers/VMs like
> KVM, lguest, openvz, but also physical systems like blade enclosures and
> clusters, or even applications running on the host).
>    

Sorry, I'm still confused.  Why would openvz need vbus?  It already has 
zero-copy networking since it's a shared kernel.  Shared memory should 
also work seamlessly, you just need to expose the shared memory object 
on a shared part of the namespace.  And of course, anything in the 
kernel is already shared.

>> Or do you
>> have arguments why that is impossible to do so and why the only
>> possible solution is vbus? Avi says no such arguments were offered
>> so far.
>>      
> Not for lack of trying.  I think my points have just been missed
> everytime I try to describe them. ;)  Basically I write a message very
> similar to this one, and the next conversation starts back from square
> one.  But I digress, let me try again..
>
> Noting that this discussion is really about the layer *below* virtio,
> not virtio itself (e.g. PCI vs vbus).  Lets start with a little background:
>
> -- Background --
>
> So on one level, we have the resource-container technology called
> "vbus".  It lets you create a container on the host, fill it with
> virtual devices, and assign that container to some context (such as a
> KVM guest).  These "devices" are LKMs, and each device has a very simple
> verb namespace consisting of a synchronous "call()" method, and a
> "shm()" method for establishing async channels.
>
> The async channels are just shared-memory with a signal path (e.g.
> interrupts and hypercalls), which the device+driver can use to overlay
> things like rings (virtqueues, IOQs), or other shared-memory based
> constructs of their choosing (such as a shared table).  The signal path
> is designed to minimize enter/exits and reduce spurious signals in a
> unified way (see shm-signal patch).
>
> call() can be used both for config-space like details, as well as
> fast-path messaging that require synchronous behavior (such as guest
> scheduler updates).
>
> All of this is managed via sysfs/configfs.
>    

One point of contention is that this is all managementy stuff and should 
be kept out of the host kernel.  Exposing shared memory, interrupts, and 
guest hypercalls can all be easily done from userspace (as virtio 
demonstrates).  True, some devices need kernel acceleration, but that's 
no reason to put everything into the host kernel.

> On the guest, we have a "vbus-proxy" which is how the guest gets access
> to devices assigned to its container.  (as an aside, "virtio" devices
> can be populated in the container, and then surfaced up to the
> virtio-bus via that virtio-vbus patch I mentioned).
>
> There is a thing called a "vbus-connector" which is the guest specific
> part.  Its job is to connect the vbus-proxy in the guest, to the vbus
> container on the host.  How it does its job is specific to the connector
> implementation, but its role is to transport messages between the guest
> and the host (such as for call() and shm() invocations) and to handle
> things like discovery and hotswap.
>    

virtio has an exact parallel here (virtio-pci and friends).

> Out of all this, I think the biggest contention point is the design of
> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
> wrong and you object to other aspects as well).  I suspect that if I had
> designed the vbus-connector to surface vbus devices as PCI devices via
> QEMU, the patches would potentially have been pulled in a while ago.
>    

Exposing devices as PCI is an important issue for me, as I have to 
consider non-Linux guests.
Another issue is the host kernel management code which I believe is 
superfluous.

But the biggest issue is compatibility.  virtio exists and has Windows 
and Linux drivers.  Without a fatal flaw in virtio we'll continue to 
support it.  Given that, why spread to a new model?

Of course, I understand you're interested in non-ethernet, non-block 
devices.  I can't comment on these until I see them.  Maybe they can fit 
the virtio model, and maybe they can't.

> There are, of course, reasons why vbus does *not* render as PCI, so this
> is the meat of of your question, I believe.
>
> At a high level, PCI was designed for software-to-hardware interaction,
> so it makes assumptions about that relationship that do not necessarily
> apply to virtualization.
>
> For instance:
>
> A) hardware can only generate byte/word sized requests at a time because
> that is all the pcb-etch and silicon support. So hardware is usually
> expressed in terms of some number of "registers".
>    

No, hardware happily DMAs to and fro main memory.  Some hardware of 
course uses mmio registers extensively, but not virtio hardware.  With 
the recent MSI support no registers are touched in the fast path.

> C) the target end-point has no visibility into the CPU machine state
> other than the parameters passed in the bus-cycle (usually an address
> and data tuple).
>    

That's not an issue.  Accessing memory is cheap.

> D) device-ids are in a fixed width register and centrally assigned from
> an authority (e.g. PCI-SIG).
>    

That's not an issue either.  Qumranet/Red Hat has donated a range of 
device IDs for use in virtio.  Device IDs are how devices are associated 
with drivers, so you'll need something similar for vbus.

> E) Interrupt/MSI routing is per-device oriented
>    

Please elaborate.  What is the issue?  How does vbus solve it?

> F) Interrupts/MSI are assumed cheap to inject
>    

Interrupts are not assumed cheap; that's why interrupt mitigation is 
used (on real and virtual hardware).

> G) Interrupts/MSI are non-priortizable.
>    

They are prioritizable; Linux ignores this though (Windows doesn't).  
Please elaborate on what the problem is and how vbus solves it.

> H) Interrupts/MSI are statically established
>    

Can you give an example of why this is a problem?

> These assumptions and constraints may be completely different or simply
> invalid in a virtualized guest. For instance, the hypervisor is just
> software, and therefore it's not restricted to "etch" constraints. IO
> requests can be arbitrarily large, just as if you are invoking a library
> function-call or OS system-call. Likewise, each one of those requests is
> a branch and a context switch, so it has often has greater performance
> implications than a simple register bus-cycle in hardware.  If you use
> an MMIO variant, it has to run through the page-fault code to be decoded.
>
> The result is typically decreased performance if you try to do the same
> thing real hardware does. This is why you usually see hypervisor
> specific drivers (e.g. virtio-net, vmnet, etc) a common feature.
>
> _Some_ performance oriented items can technically be accomplished in
> PCI, albeit in a much more awkward way.  For instance, you can set up a
> really fast, low-latency "call()" mechanism using a PIO port on a
> PCI-model and ioeventfd.  As a matter of fact, this is exactly what the
> vbus pci-bridge does:
>    

What performance oriented items have been left unaddressed?

virtio and vbus use three communications channels:  call from guest to 
host (implemented as pio and reasonably fast), call from host to guest 
(implemented as msi and reasonably fast) and shared memory (as fast as 
it can be).  Where does PCI limit you in any way?

> The problem here is that this is incredibly awkward to setup.  You have
> all that per-cpu goo and the registration of the memory on the guest.
> And on the host side, you have all the vmapping of the registered
> memory, and the file-descriptor to manage.  In short, its really painful.
>
> I would much prefer to do this *once*, and then let all my devices
> simple re-use that infrastructure.  This is, in fact, what I do.  Here
> is the device model that a guest sees:
>    

virtio also reuses the pci code, on both guest and host.

> Moving on: _Other_ items cannot be replicated (at least, not without
> hacking it into something that is no longer PCI.
>
> Things like the pci-id namespace are just silly for software.  I would
> rather have a namespace that does not require central management so
> people are free to create vbus-backends at will.  This is akin to
> registering a device MAJOR/MINOR, verses using the various dynamic
> assignment mechanisms.  vbus uses a string identifier in place of a
> pci-id.  This is superior IMHO, and not compatible with PCI.
>    

How do you handle conflicts?  Again you need a central authority to hand 
out names or prefixes.

> As another example, the connector design coalesces *all* shm-signals
> into a single interrupt (by prio) that uses the same context-switch
> mitigation techniques that help boost things like networking.  This
> effectively means we can detect and optimize out ack/eoi cycles from the
> APIC as the IO load increases (which is when you need it most).  PCI has
> no such concept.
>    

That's a bug, not a feature.  It means poor scaling as the number of 
vcpus increases and as the number of devices increases.

Note nothing prevents steering multiple MSIs into a single vector.  It's 
a bad idea though.

> In addition, the signals and interrupts are priority aware, which is
> useful for things like 802.1p networking where you may establish 8-tx
> and 8-rx queues for your virtio-net device.  x86 APIC really has no
> usable equivalent, so PCI is stuck here.
>    

x86 APIC is priority aware.

> Also, the signals can be allocated on-demand for implementing things
> like IPC channels in response to guest requests since there is no
> assumption about device-to-interrupt mappings.  This is more flexible.
>    

Yes.  However given that vectors are a scarce resource you're severely 
limited in that.  And if you're multiplexing everything on one vector, 
then you can just as well demultiplex your channels in the virtio driver 
code.

> And through all of this, this design would work in any guest even if it
> doesn't have PCI (e.g. lguest, UML, physical systems, etc).
>    

That is true for virtio which works on pci-less lguest and s390.

> -- Bottom Line --
>
> The idea here is to generalize all the interesting parts that are common
> (fast sync+async io, context-switch mitigation, back-end models, memory
> abstractions, signal-path routing, etc) that a variety of linux based
> technologies can use (kvm, lguest, openvz, uml, physical systems) and
> only require the thin "connector" code to port the system around.  The
> idea is to try to get this aspect of PV right once, and at some point in
> the future, perhaps vbus will be as ubiquitous as PCI.  Well, perhaps
> not *that* ubiquitous, but you get the idea ;)
>    

That is exactly the design goal of virtio (except it limits itself to 
virtualization).

> Then device models like virtio can ride happily on top and we end up
> with a really robust and high-performance Linux-based stack.  I don't
> buy the argument that we already have PCI so lets use it.  I don't think
> its the best design and I am not afraid to make an investment in a
> change here because I think it will pay off in the long run.
>    

Sorry, I don't think you've shown any quantifiable advantages.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 20:17           ` Gregory Haskins
@ 2009-08-18  8:46             ` Michael S. Tsirkin
  2009-08-18 15:19               ` Gregory Haskins
  2009-08-18 15:53               ` [Alacrityvm-devel] " Ira W. Snyder
  0 siblings, 2 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18  8:46 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm, Avi Kivity,
	alacrityvm-devel, linux-kernel, netdev

On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
> >> Case in point: Take an upstream kernel and you can modprobe the
> >> vbus-pcibridge in and virtio devices will work over that transport
> >> unmodified.
> >>
> >> See http://lkml.org/lkml/2009/8/6/244 for details.
> > 
> > The modprobe you are talking about would need
> > to be done in guest kernel, correct?
> 
> Yes, and your point is? "unmodified" (pardon the psuedo pun) modifies
> "virtio", not "guest".
>  It means you can take an off-the-shelf kernel
> with off-the-shelf virtio (ala distro-kernel) and modprobe
> vbus-pcibridge and get alacrityvm acceleration.

Heh, by that logic ksplice does not modify running kernel either :)

> It is not a design goal of mine to forbid the loading of a new driver,
> so I am ok with that requirement.
> 
> >> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> >> and its likewise constrained by various limitations of that decision
> >> (such as its reliance of the PCI model, and the kvm memory scheme).
> > 
> > vhost is actually not related to PCI in any way. It simply leaves all
> > setup for userspace to do.  And the memory scheme was intentionally
> > separated from kvm so that it can easily support e.g. lguest.
> > 
> 
> I think you have missed my point. I mean that vhost requires a separate
> bus-model (ala qemu-pci).

So? That can be in userspace, and can be anything including vbus.

> And no, your memory scheme is not separated,
> at least, not very well.  It still assumes memory-regions and
> copy_to_user(), which is very kvm-esque.

I don't think so: works for lguest, kvm, UML and containers

> Vbus has people using things
> like userspace containers (no regions),

vhost by default works without regions

> and physical hardware (dma
> controllers, so no regions or copy_to_user) so your scheme quickly falls
> apart once you get away from KVM.

Someone took a driver and is building hardware for it ... so what?

> Don't get me wrong:  That design may have its place.  Perhaps you only
> care about fixing KVM, which is a perfectly acceptable strategy.
> Its just not a strategy that I think is the best approach.  Essentially you
> are promoting the proliferation of competing backends, and I am trying
> to unify them (which is ironic that this thread started with concerns I
> was fragmenting things ;).

So, you don't see how venet fragments things? It's pretty obvious ...

> The bottom line is, you have a simpler solution that is more finely
> targeted at KVM and virtio-networking.  It fixes probably a lot of
> problems with the existing implementation, but it still has limitations.
> 
> OTOH, what I am promoting is more complex, but more flexible.  That is
> the tradeoff.  You can't have both ;)

We can: connect eventfds to hypercalls, and vhost will work with vbus.

> So do not for one second think
> that what you implemented is equivalent, because they are not.
> 
> In fact, I believe I warned you about this potential problem when you
> decided to implement your own version.  I think I said something to the
> effect of "you will either have a subset of functionality, or you will
> ultimately reinvent what I did".  Right now you are in the subset phase.

No. Unlike vbus, vhost supports unmodified guests and live migration.

> Perhaps someday you will be in the complete-reinvent phase.  Why you
> wanted to go that route when I had already worked though the issues is
> something perhaps only you will ever know, but I'm sure you had your
> reasons. But do note you could have saved yourself grief by reusing my
> already implemented and tested variant, as I politely offered to work
> with you on making it meet your needs.
> Kind Regards
> -Greg
> 

You have a midlayer.  I could not use it without pulling in all of it.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  1:08         ` Anthony Liguori
  2009-08-18  7:38           ` Avi Kivity
@ 2009-08-18  8:54           ` Michael S. Tsirkin
  2009-08-18 13:16           ` Gregory Haskins
  2 siblings, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18  8:54 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm, Avi Kivity,
	alacrityvm-devel, linux-kernel, netdev

On Mon, Aug 17, 2009 at 08:08:24PM -0500, Anthony Liguori wrote:
> In particular, I think it would be a big win to avoid knowledge of slots in  
> the kernel by doing ring translation in userspace.

vhost supports this BTW: just don't call the memory table ioctl.
In-kernel translation is simple, as well, though.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-17 19:33             ` Gregory Haskins
  2009-08-18  8:33               ` Avi Kivity
@ 2009-08-18  9:53               ` Michael S. Tsirkin
  2009-08-18 10:00                 ` Avi Kivity
  2009-08-18 15:39                 ` Gregory Haskins
  1 sibling, 2 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18  9:53 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel,
	linux-kernel, netdev

On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote:
> There is a secondary question of venet (a vbus native device) verses
> virtio-net (a virtio native device that works with PCI or VBUS).  If
> this contention is really around venet vs virtio-net, I may possibly
> conceed and retract its submission to mainline.

For me yes, venet+ioq competing with virtio+virtqueue.

> I've been pushing it to date because people are using it and I don't
> see any reason that the driver couldn't be upstream.

If virtio is just as fast, they can just use it without knowing it.
Clearly, that's better since we support virtio anyway ...

> -- Issues --
> 
> Out of all this, I think the biggest contention point is the design of
> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
> wrong and you object to other aspects as well).  I suspect that if I had
> designed the vbus-connector to surface vbus devices as PCI devices via
> QEMU, the patches would potentially have been pulled in a while ago.
> 
> There are, of course, reasons why vbus does *not* render as PCI, so this
> is the meat of of your question, I believe.
> 
> At a high level, PCI was designed for software-to-hardware interaction,
> so it makes assumptions about that relationship that do not necessarily
> apply to virtualization.

I'm not hung up on PCI, myself.  An idea that might help you get Avi
on-board: do setup in userspace, over PCI.  Negotiate hypercall support
(e.g.  with a PCI capability) and then switch to that for fastpath. Hmm?

> As another example, the connector design coalesces *all* shm-signals
> into a single interrupt (by prio) that uses the same context-switch
> mitigation techniques that help boost things like networking.  This
> effectively means we can detect and optimize out ack/eoi cycles from the
> APIC as the IO load increases (which is when you need it most).  PCI has
> no such concept.

Could you elaborate on this one for me? How does context-switch
mitigation work?

> In addition, the signals and interrupts are priority aware, which is
> useful for things like 802.1p networking where you may establish 8-tx
> and 8-rx queues for your virtio-net device.  x86 APIC really has no
> usable equivalent, so PCI is stuck here.

By the way, multiqueue support in virtio would be very nice to have,
and seems mostly unrelated to vbus.


-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  9:53               ` Michael S. Tsirkin
@ 2009-08-18 10:00                 ` Avi Kivity
  2009-08-18 10:09                   ` Michael S. Tsirkin
  2009-08-18 15:39                 ` Gregory Haskins
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On 08/18/2009 12:53 PM, Michael S. Tsirkin wrote:
> I'm not hung up on PCI, myself.  An idea that might help you get Avi
> on-board: do setup in userspace, over PCI.  Negotiate hypercall support
> (e.g.  with a PCI capability) and then switch to that for fastpath. Hmm?
>    

Hypercalls don't nest well.  When a nested guest issues a hypercall, you 
have to assume it is destined for the enclosing guest, so you can't 
assign a hypercall-capable device to a nested guest.

mmio and pio don't have this problem since the host can use the address 
to locate the destination.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 10:00                 ` Avi Kivity
@ 2009-08-18 10:09                   ` Michael S. Tsirkin
  2009-08-18 10:13                     ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 10:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 01:00:25PM +0300, Avi Kivity wrote:
> On 08/18/2009 12:53 PM, Michael S. Tsirkin wrote:
>> I'm not hung up on PCI, myself.  An idea that might help you get Avi
>> on-board: do setup in userspace, over PCI.  Negotiate hypercall support
>> (e.g.  with a PCI capability) and then switch to that for fastpath. Hmm?
>>    
>
> Hypercalls don't nest well.  When a nested guest issues a hypercall, you  
> have to assume it is destined to the enclosing guest, so you can't  
> assign a hypercall-capable device to a nested guest.
> 
> mmio and pio don't have this problem since the host can use the address  
> to locate the destination.

So userspace could map hypercall to address during setup and tell the
host kernel?

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 10:09                   ` Michael S. Tsirkin
@ 2009-08-18 10:13                     ` Avi Kivity
  2009-08-18 10:28                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 10:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On 08/18/2009 01:09 PM, Michael S. Tsirkin wrote:
>
>> mmio and pio don't have this problem since the host can use the address
>> to locate the destination.
>>      
> So userspace could map hypercall to address during setup and tell the
> host kernel?
>    

Suppose a nested guest has two devices.  One a virtual device backed by 
its host (our guest), and one a virtual device backed by us (the real 
host), and assigned by the guest to the nested guest.  If both devices 
use hypercalls, there is no way to distinguish between them.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 10:13                     ` Avi Kivity
@ 2009-08-18 10:28                       ` Michael S. Tsirkin
  2009-08-18 10:45                         ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 10:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 01:13:57PM +0300, Avi Kivity wrote:
> On 08/18/2009 01:09 PM, Michael S. Tsirkin wrote:
>>
>>> mmio and pio don't have this problem since the host can use the address
>>> to locate the destination.
>>>      
>> So userspace could map hypercall to address during setup and tell the
>> host kernel?
>>    
>
> Suppose a nested guest has two devices.  One a virtual device backed by  
> its host (our guest), and one a virtual device backed by us (the real  
> host), and assigned by the guest to the nested guest.  If both devices  
> use hypercalls, there is no way to distinguish between them.

Not sure I understand. What I had in mind is that devices would have to
either use different hypercalls and map hypercall to address during
setup, or pass address with each hypercall.  We get the hypercall,
translate the address as if it was pio access, and know the destination?

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 10:28                       ` Michael S. Tsirkin
@ 2009-08-18 10:45                         ` Avi Kivity
  2009-08-18 11:07                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 10:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>
>> Suppose a nested guest has two devices.  One a virtual device backed by
>> its host (our guest), and one a virtual device backed by us (the real
>> host), and assigned by the guest to the nested guest.  If both devices
>> use hypercalls, there is no way to distinguish between them.
>>      
> Not sure I understand. What I had in mind is that devices would have to
> either use different hypercalls and map hypercall to address during
> setup, or pass address with each hypercall.  We get the hypercall,
> translate the address as if it was pio access, and know the destination?
>    

There are no different hypercalls.  There's just one hypercall 
instruction, and there's no standard on how it's used.  If a nested guest 
issues a hypercall instruction, you have no idea if it's calling a 
Hyper-V hypercall or a vbus/virtio kick.

You could have a protocol where you register the hypercall instruction's 
address with its recipient, but it quickly becomes a tangled mess.  And 
for what?  pio and hypercalls have the same performance characteristics.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 10:45                         ` Avi Kivity
@ 2009-08-18 11:07                           ` Michael S. Tsirkin
  2009-08-18 11:15                             ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 11:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote:
> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>>
>>> Suppose a nested guest has two devices.  One a virtual device backed by
>>> its host (our guest), and one a virtual device backed by us (the real
>>> host), and assigned by the guest to the nested guest.  If both devices
>>> use hypercalls, there is no way to distinguish between them.
>>>      
>> Not sure I understand. What I had in mind is that devices would have to
>> either use different hypercalls and map hypercall to address during
>> setup, or pass address with each hypercall.  We get the hypercall,
>> translate the address as if it was pio access, and know the destination?
>>    
>
> There are no different hypercalls.  There's just one hypercall  
> instruction, and there's no standard on how it's used.  If a nested call  
> issues a hypercall instruction, you have no idea if it's calling a  
> Hyper-V hypercall or a vbus/virtio kick.

userspace will know which it is, because hypercall capability
in the device has been activated, and can tell the kernel, using
something similar to iosignalfd. No?

> You could have a protocol where you register the hypercall instruction's  
> address with its recipient, but it quickly becomes a tangled mess.

I really thought we could pass the io address in a register as an input
parameter. Is there a way to do this in a secure manner?

Hmm. Doesn't kvm use hypercalls now? How does this work with nesting?
For example, in this code in arch/x86/kvm/x86.c:

        switch (nr) {
        case KVM_HC_VAPIC_POLL_IRQ:
                ret = 0;
                break;
        case KVM_HC_MMU_OP:
                r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
                break;
        default:
                ret = -KVM_ENOSYS;
                break;
        }

how do we know that it's the guest and not the nested guest performing
the hypercall?

> And  for what?  pio and hypercalls have the same performance characteristics.

No idea about that. I'm assuming Gregory knows why he wants to use hypercalls;
I was just trying to help find a way that is also palatable and flexible.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 11:07                           ` Michael S. Tsirkin
@ 2009-08-18 11:15                             ` Avi Kivity
  2009-08-18 11:49                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 11:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On 08/18/2009 02:07 PM, Michael S. Tsirkin wrote:
> On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote:
>    
>> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>>      
>>>        
>>>> Suppose a nested guest has two devices.  One a virtual device backed by
>>>> its host (our guest), and one a virtual device backed by us (the real
>>>> host), and assigned by the guest to the nested guest.  If both devices
>>>> use hypercalls, there is no way to distinguish between them.
>>>>
>>>>          
>>> Not sure I understand. What I had in mind is that devices would have to
>>> either use different hypercalls and map hypercall to address during
>>> setup, or pass address with each hypercall.  We get the hypercall,
>>> translate the address as if it was pio access, and know the destination?
>>>
>>>        
>> There are no different hypercalls.  There's just one hypercall
>> instruction, and there's no standard on how it's used.  If a nested call
>> issues a hypercall instruction, you have no idea if it's calling a
>> Hyper-V hypercall or a vbus/virtio kick.
>>      
> userspace will know which it is, because hypercall capability
> in the device has been activated, and can tell kernel, using
> something similar to iosignalfd. No?
>    

The host kernel sees a hypercall vmexit.  How does it know if it's a 
nested-guest-to-guest hypercall or a nested-guest-to-host hypercall?  
The two are equally valid at the same time.


>> You could have a protocol where you register the hypercall instruction's
>> address with its recipient, but it quickly becomes a tangled mess.
>>      
> I really thought we could pass the io address in register as an input
> parameter. Is there a way to do this in a secure manner?
>
> Hmm. Doesn't kvm use hypercalls now? How does this work with nesting?
> For example, in this code in arch/x86/kvm/x86.c:
>
>          switch (nr) {
>          case KVM_HC_VAPIC_POLL_IRQ:
>                  ret = 0;
>                  break;
>          case KVM_HC_MMU_OP:
>                  r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2),&ret);
>                  break;
>          default:
>                  ret = -KVM_ENOSYS;
>                  break;
>          }
>
> how do we know that it's the guest and not the nested guest performing
> the hypercall?
>    

The host knows whether the guest or nested guest are running.  If the 
guest is running, it's a guest-to-host hypercall.  If the nested guest 
is running, it's a nested-guest-to-guest hypercall.  We don't have 
nested-guest-to-host hypercalls (and couldn't unless we get agreement on 
a protocol from all hypervisor vendors).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 11:15                             ` Avi Kivity
@ 2009-08-18 11:49                               ` Michael S. Tsirkin
  2009-08-18 11:54                                 ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 11:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 02:15:57PM +0300, Avi Kivity wrote:
> On 08/18/2009 02:07 PM, Michael S. Tsirkin wrote:
>> On Tue, Aug 18, 2009 at 01:45:05PM +0300, Avi Kivity wrote:
>>    
>>> On 08/18/2009 01:28 PM, Michael S. Tsirkin wrote:
>>>      
>>>>        
>>>>> Suppose a nested guest has two devices.  One a virtual device backed by
>>>>> its host (our guest), and one a virtual device backed by us (the real
>>>>> host), and assigned by the guest to the nested guest.  If both devices
>>>>> use hypercalls, there is no way to distinguish between them.
>>>>>
>>>>>          
>>>> Not sure I understand. What I had in mind is that devices would have to
>>>> either use different hypercalls and map hypercall to address during
>>>> setup, or pass address with each hypercall.  We get the hypercall,
>>>> translate the address as if it was pio access, and know the destination?
>>>>
>>>>        
>>> There are no different hypercalls.  There's just one hypercall
>>> instruction, and there's no standard on how it's used.  If a nested call
>>> issues a hypercall instruction, you have no idea if it's calling a
>>> Hyper-V hypercall or a vbus/virtio kick.
>>>      
>> userspace will know which it is, because hypercall capability
>> in the device has been activated, and can tell kernel, using
>> something similar to iosignalfd. No?
>>    
>
> The host kernel sees a hypercall vmexit.  How does it know if it's a  
> nested-guest-to-guest hypercall or a nested-guest-to-host hypercall?   
> The two are equally valid at the same time.

Here is how this can work - it is similar to MSI if you like:
- by default, the device uses pio kicks
- nested guest driver can enable hypercall capability in the device,
  probably with pci config cycle
- guest userspace (hypervisor running in guest) will see this request
  and perform pci config cycle on the "real" device, telling it to which
  nested guest this device is assigned
- host userspace (hypervisor running in host) will see this.
  it now knows both which guest the hypercalls will be for,
  and that the device in question is an emulated one,
  and can set up kvm appropriately


>>> You could have a protocol where you register the hypercall instruction's
>>> address with its recipient, but it quickly becomes a tangled mess.
>>>      
>> I really thought we could pass the io address in register as an input
>> parameter. Is there a way to do this in a secure manner?
>>
>> Hmm. Doesn't kvm use hypercalls now? How does this work with nesting?
>> For example, in this code in arch/x86/kvm/x86.c:
>>
>>          switch (nr) {
>>          case KVM_HC_VAPIC_POLL_IRQ:
>>                  ret = 0;
>>                  break;
>>          case KVM_HC_MMU_OP:
>>                  r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2),&ret);
>>                  break;
>>          default:
>>                  ret = -KVM_ENOSYS;
>>                  break;
>>          }
>>
>> how do we know that it's the guest and not the nested guest performing
>> the hypercall?
>>    
>
> The host knows whether the guest or nested guest are running.  If the  
> guest is running, it's a guest-to-host hypercall.  If the nested guest  
> is running, it's a nested-guest-to-guest hypercall.  We don't have  
> nested-guest-to-host hypercalls (and couldn't unless we get agreement on  
> a protocol from all hypervisor vendors).

Not necessarily. What I am saying is we could make this protocol part of
the guest paravirt driver.  The guest that loads the driver and enables
the capability has to agree to the protocol.  If it doesn't want to, it
does not have to use that driver.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 11:49                               ` Michael S. Tsirkin
@ 2009-08-18 11:54                                 ` Avi Kivity
  0 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 11:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On 08/18/2009 02:49 PM, Michael S. Tsirkin wrote:
>
>> The host kernel sees a hypercall vmexit.  How does it know if it's a
>> nested-guest-to-guest hypercall or a nested-guest-to-host hypercall?
>> The two are equally valid at the same time.
>>      
> Here is how this can work - it is similar to MSI if you like:
> - by default, the device uses pio kicks
> - nested guest driver can enable hypercall capability in the device,
>    probably with pci config cycle
> - guest userspace (hypervisor running in guest) will see this request
>    and perform pci config cycle on the "real" device, telling it to which
>    nested guest this device is assigned
>    

So far so good.

> - host userspace (hypervisor running in host) will see this.
>    it now knows both which guest the hypercalls will be for,
>    and that the device in question is an emulated one,
>    and can set up kvm appropriately
>    

No it doesn't.  The fact that one device uses hypercalls doesn't mean 
all hypercalls are for that device.  Hypercalls are a shared resource, 
and there's no way to tell for a given hypercall what device it is 
associated with (if any).

>> The host knows whether the guest or nested guest are running.  If the
>> guest is running, it's a guest-to-host hypercall.  If the nested guest
>> is running, it's a nested-guest-to-guest hypercall.  We don't have
>> nested-guest-to-host hypercalls (and couldn't unless we get agreement on
>> a protocol from all hypervisor vendors).
>>      
> Not necessarily. What I am saying is that we could make this protocol part
> of the guest paravirt driver. The guest that loads the driver and enables
> the capability has to agree to the protocol. If it doesn't want to, it does
> not have to use that driver.
>    

It would only work for kvm-on-kvm.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  1:08         ` Anthony Liguori
  2009-08-18  7:38           ` Avi Kivity
  2009-08-18  8:54           ` Michael S. Tsirkin
@ 2009-08-18 13:16           ` Gregory Haskins
  2009-08-18 13:45             ` Avi Kivity
  2 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-18 13:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Ingo Molnar, Gregory Haskins, kvm, Avi Kivity, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 8228 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Note: No one has ever proposed to change the virtio-ABI.
> 
> virtio-pci is part of the virtio ABI.  You are proposing changing that.

I'm sorry, but I respectfully disagree with you here.

virtio has an ABI...I am not modifying that.

virtio-pci has an ABI...I am not modifying that either.

The subsystem in question is virtio-vbus, and it is a completely standalone
addition to the virtio ecosystem.  By your argument, virtio and
virtio-pci should fuse together, and virtio-lguest and virtio-s390
should go away because they diverge from the virtio-pci ABI, right?

I seriously doubt you would agree with that statement.  The fact is, the
design of virtio not only permits modular replacement of its transport
ABI, it encourages it.

So how is virtio-vbus any different from the other three?  I understand
that it means you need to load a new driver in the guest, and I am ok
with that.  virtio-pci was once a non-upstream driver too and required
someone to explicitly load it, didn't it?  You gotta crawl before you
can walk...

> 
> You cannot add new kernel modules to guests and expect them to remain
> supported.

??? Of course you can.  How is this different from any other driver?


>  So there is value in reusing existing ABIs


Well, I won't argue with you on that one.  There is certainly value there.

My contention is that sometimes the liability of that ABI is greater
than its value, and that's when it's time to evaluate the design decisions
that led to re-use vs re-design.


> 
>>> I think the reason vbus gets better performance for networking today is
>>> that vbus' backends are in the kernel while virtio's backends are
>>> currently in userspace.
>>>     
>>
>> Well, with all due respect, you also said initially when I announced
>> vbus that in-kernel doesn't matter, and tried to make virtio-net run as
>> fast as venet from userspace ;)  Given that we never saw those userspace
>> patches from you that in fact equaled my performance, I assume you were
>> wrong about that statement.  Perhaps you were wrong about other things
>> too?
>>   
> 
> I'm wrong about a lot of things :-)  I haven't yet been convinced that
> I'm wrong here though.
> 
> One of the gray areas here is what constitutes an in-kernel backend. 
> tun/tap is a sort of an in-kernel backend.  Userspace is still involved
> in all of the paths.  vhost seems to be an intermediate step between
> tun/tap and vbus.  The fast paths avoid userspace completely.  Many of
> the slow paths involve userspace still (like migration apparently). 
> With vbus, userspace is avoided entirely.  In some ways, you could argue
> that slirp and vbus are opposite ends of the virtual I/O spectrum.
> 
> I believe strongly that we should avoid putting things in the kernel
> unless they absolutely have to be.


I would generally agree with you on that.  Particularly in the case of
kvm, having slow-path bus-management code in-kernel is not strictly
necessary because KVM has qemu in userspace.

The issue here is that vbus is designed to be a generic solution to
in-kernel virtual-IO.  It will support (via abstraction of key
subsystems) a variety of environments that may or may not be similar in
facilities to KVM, and therefore it represents the
least-common-denominator as far as what external dependencies it requires.

The bottom line is this: despite the tendency for people to jump at
"don't put much in the kernel!", the fact is that a "bus" designed for
software to software (such as vbus) is almost laughably trivial.  It's
essentially a list of objects that have an int (dev-id) and char*
(dev-type) attribute.  All the extra goo that you see me setting up in
something like the kvm-connector needs to be done for fast-path
_anyway_, so transporting the verbs to query this simple list is not
really a big deal.
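
To put that concretely, the device list amounts to little more than this
(rough sketch with invented names; not the actual vbus structures):

  #include <linux/kernel.h>
  #include <linux/list.h>

  /* sketch only: made-up names, not the real vbus ABI */
  struct example_vbus_device {
          int              id;    /* instance id within the container */
          const char      *type;  /* e.g. "virtual-ethernet" */
          struct list_head node;
  };

  /* enumerating the container is little more than walking a list */
  static void example_enumerate(struct list_head *devices)
  {
          struct example_vbus_device *dev;

          list_for_each_entry(dev, devices, node)
                  pr_info("vbus device %d: %s\n", dev->id, dev->type);
  }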

If we were talking about full ICH emulation for a PCI bus, I would agree
with you.  In the case of vbus, I think it's overstated.


>  I'm definitely interested in playing
> with vhost to see if there are ways to put even less in the kernel.  In
> particular, I think it would be a big win to avoid knowledge of slots in
> the kernel by doing ring translation in userspace.

Ultimately I think that would not be a very good proposition.  Ring
translation is actually not that hard, and that would definitely be a
measurable latency source to try and do as you propose.  But, I will not
discourage you from trying if that is what you want to do.

>  This implies a
> userspace transition in the fast path.  This may or may not be
> acceptable.  I think this is going to be a very interesting experiment
> and will ultimately determine whether my intuition about the cost of
> dropping to userspace is right or wrong.

I can already tell you it's wrong, just based on the fact that, from my
own experience playing in this area, even extra kthread switches can hurt...

> 
> 
>> Conversely, I am not afraid of requiring a new driver to optimize the
>> general PV interface.  In the long term, this will reduce the amount of
>> reimplementing the same code over and over, reduce system overhead, and
>> it adds new features not previously available (for instance, coalescing
>> and prioritizing interrupts).
>>   
> 
> I think you have a lot of ideas and I don't know that we've been able to
> really understand your vision.  Do you have any plans on writing a paper
> about vbus that goes into some of your thoughts in detail?

I really need to, I know...


> 
>>> If that's the case, then I don't see any
>>> reason to adopt vbus unless Greg things there are other compelling
>>> features over virtio.
>>>     
>>
>> Aside from the fact that this is another confusion of the vbus/virtio
>> relationship...yes, of course there are compelling features (IMHO) or I
>> wouldn't be expending effort ;)  They are at least compelling enough to
>> put in AlacrityVM.
> 
> This whole AlactricyVM thing is really hitting this nail with a
> sledgehammer.

Note that I didn't really want to go that route.  As you know, I tried
pushing this straight through kvm first since earlier this year, but I
was met with reluctance to even bother truly understanding what I was
proposing, comments like "tell me your ideas so I can steal them", and
"sorry, we are going to reinvent our own instead".  This isn't exactly
going to motivate someone to continue pushing these ideas within that
community.  I was made to feel (purposely?) unwelcome at times.  So I
can either roll over and die, or start my own project.

In addition, almost all of vbus is completely independent of kvm anyway
(I think there are only 3 patches that actually touch KVM, and they are
relatively minor).  And vbus doesn't really fit into any other category
of maintained subsystem either.  So it really calls for a new branch of
maintainership, of which I currently sit.  AlacrityVM will serve as the
collaboration point of that maintainership.

The bottom line is, there are people out there who are interested in
what we are doing (and that number grows everyday).  Starting a new
project wasn't what I wanted per se, but I don't think there was much
choice.


>  While the kernel needs to be very careful about what it
> pulls in, as long as you're willing to commit to ABI compatibility, we
> can pull code into QEMU to support vbus.  Then you can just offer vbus
> host and guest drivers instead of forking the kernel.

Ok, I will work on pushing those patches next.

> 
>>   If upstream KVM doesn't want them, that's KVMs
>> decision and I am fine with that.  Simply never apply my qemu patches to
>> qemu-kvm.git, and KVM will be blissfully unaware if vbus is present.
> 
> As I mentioned before, if you submit patches to upstream QEMU, we'll
> apply them (after appropriate review).  As I said previously, we want to
> avoid user confusion as much as possible.  Maybe this means limiting it
> to -device or a separate machine type.  I'm not sure, but that's
> something we can discussion on qemu-devel.

Ok.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 13:16           ` Gregory Haskins
@ 2009-08-18 13:45             ` Avi Kivity
  2009-08-18 15:51               ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 13:45 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

On 08/18/2009 04:16 PM, Gregory Haskins wrote:
> The issue here is that vbus is designed to be a generic solution to
> in-kernel virtual-IO.  It will support (via abstraction of key
> subsystems) a variety of environments that may or may not be similar in
> facilities to KVM, and therefore it represents the
> least-common-denominator as far as what external dependencies it requires.
>    

Maybe it will be easier to evaluate it in the context of these other 
environments.  It's difficult to assess this without an example.

> The bottom line is this: despite the tendency for people to jump at
> "don't put much in the kernel!", the fact is that a "bus" designed for
> software to software (such as vbus) is almost laughably trivial.  Its
> essentially a list of objects that have an int (dev-id) and char*
> (dev-type) attribute.  All the extra goo that you see me setting up in
> something like the kvm-connector needs to be done for fast-path
> _anyway_, so transporting the verbs to query this simple list is not
> really a big deal.
>    

It's not laughably trivial when you try to support the full feature set 
of kvm (for example, live migration will require dirty memory tracking, 
and exporting all state stored in the kernel to userspace).

> Note that I didn't really want to go that route.  As you know, I tried
> pushing this straight through kvm first since earlier this year, but I
> was met with reluctance to even bother truly understanding what I was
> proposing, comments like "tell me your ideas so I can steal them", and
>    

Oh come on, I wrote "steal" as a convenient shorthand for 
"cross-pollinate your ideas into our code according to the letter and 
spirit of the GNU General Public License".  Since we're all trying to 
improve Linux we may as well cooperate.

> "sorry, we are going to reinvent our own instead".

No.  Adopting venet/vbus would mean reinventing something that already 
existed.  Continuing to support virtio/pci is not reinventing anything.

> This isn't exactly
> going to motivate someone to continue pushing these ideas within that
> community.  I was made to feel (purposely?) unwelcome at times.  So I
> can either roll over and die, or start my own project.
>    

You haven't convinced me that your ideas are worth the effort of 
abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.  
I'm sorry if that made you feel unwelcome.  There's no reason to 
interpret disagreement as malice though.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  8:33               ` Avi Kivity
@ 2009-08-18 14:46                 ` Gregory Haskins
  2009-08-18 16:27                   ` Avi Kivity
  2009-08-18 18:20                   ` Arnd Bergmann
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-18 14:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 20159 bytes --]

Avi Kivity wrote:
> On 08/17/2009 10:33 PM, Gregory Haskins wrote:
>>
>> There is a secondary question of venet (a vbus native device) verses
>> virtio-net (a virtio native device that works with PCI or VBUS).  If
>> this contention is really around venet vs virtio-net, I may possibly
>> conceed and retract its submission to mainline.  I've been pushing it to
>> date because people are using it and I don't see any reason that the
>> driver couldn't be upstream.
>>    
> 
> That's probably the cause of much confusion.  The primary kvm pain point
> is now networking, so in any vbus discussion we're concentrating on that
> aspect.
> 
>>> Also, are you willing to help virtio to become faster?
>>>      
>> Yes, that is not a problem.  Note that virtio in general, and
>> virtio-net/venet in particular are not the primary goal here, however.
>> Improved 802.x and block IO are just positive side-effects of the
>> effort.  I started with 802.x networking just to demonstrate the IO
>> layer capabilities, and to test it.  It ended up being so good on
>> contrast to existing facilities, that developers in the vbus community
>> started using it for production development.
>>
>> Ultimately, I created vbus to address areas of performance that have not
>> yet been addressed in things like KVM.  Areas such as real-time guests,
>> or RDMA (host bypass) interfaces.
> 
> Can you explain how vbus achieves RDMA?
> 
> I also don't see the connection to real time guests.

Both of these are still in development.  Trying to stay true to the
"release early and often" mantra, the core vbus technology is being
pushed now so it can be reviewed.  Stay tuned for these other developments.

> 
>> I also designed it in such a way that
>> we could, in theory, write one set of (linux-based) backends, and have
>> them work across a variety of environments (such as containers/VMs like
>> KVM, lguest, openvz, but also physical systems like blade enclosures and
>> clusters, or even applications running on the host).
>>    
> 
> Sorry, I'm still confused.  Why would openvz need vbus?

It's just an example.  The point is that I abstracted what I think are
the key points of fast-io, memory routing, signal routing, etc, so that
it will work in a variety of (ideally, _any_) environments.

There may not be _performance_ motivations for certain classes of VMs
because they already have decent support, but they may want a connector
anyway to gain some of the new features available in vbus.

And looking forward, the idea is that we have commoditized the backend
so we don't need to redo this each time a new container comes along.


>  It already has
> zero-copy networking since it's a shared kernel.  Shared memory should
> also work seamlessly, you just need to expose the shared memory object
> on a shared part of the namespace.  And of course, anything in the
> kernel is already shared.
> 
>>> Or do you
>>> have arguments why that is impossible to do so and why the only
>>> possible solution is vbus? Avi says no such arguments were offered
>>> so far.
>>>      
>> Not for lack of trying.  I think my points have just been missed
>> everytime I try to describe them. ;)  Basically I write a message very
>> similar to this one, and the next conversation starts back from square
>> one.  But I digress, let me try again..
>>
>> Noting that this discussion is really about the layer *below* virtio,
>> not virtio itself (e.g. PCI vs vbus).  Lets start with a little
>> background:
>>
>> -- Background --
>>
>> So on one level, we have the resource-container technology called
>> "vbus".  It lets you create a container on the host, fill it with
>> virtual devices, and assign that container to some context (such as a
>> KVM guest).  These "devices" are LKMs, and each device has a very simple
>> verb namespace consisting of a synchronous "call()" method, and a
>> "shm()" method for establishing async channels.
>>
>> The async channels are just shared-memory with a signal path (e.g.
>> interrupts and hypercalls), which the device+driver can use to overlay
>> things like rings (virtqueues, IOQs), or other shared-memory based
>> constructs of their choosing (such as a shared table).  The signal path
>> is designed to minimize enter/exits and reduce spurious signals in a
>> unified way (see shm-signal patch).
>>
>> call() can be used both for config-space like details, as well as
>> fast-path messaging that require synchronous behavior (such as guest
>> scheduler updates).
>>
>> All of this is managed via sysfs/configfs.
>>    
> 
> One point of contention is that this is all managementy stuff and should
> be kept out of the host kernel.  Exposing shared memory, interrupts, and
> guest hypercalls can all be easily done from userspace (as virtio
> demonstrates).  True, some devices need kernel acceleration, but that's
> no reason to put everything into the host kernel.

See my last reply to Anthony.  My two points here are that:

a) having it in-kernel makes it a complete subsystem, which perhaps has
diminished value in kvm, but adds value in most other places that we are
looking to use vbus.

b) the in-kernel code is being overstated as "complex".  We are not
talking about your typical virt thing, like an emulated ICH/PCI chipset.
It's really a simple list of devices with a handful of attributes.  They
are managed using established linux interfaces, like sysfs/configfs.


> 
>> On the guest, we have a "vbus-proxy" which is how the guest gets access
>> to devices assigned to its container.  (as an aside, "virtio" devices
>> can be populated in the container, and then surfaced up to the
>> virtio-bus via that virtio-vbus patch I mentioned).
>>
>> There is a thing called a "vbus-connector" which is the guest specific
>> part.  Its job is to connect the vbus-proxy in the guest, to the vbus
>> container on the host.  How it does its job is specific to the connector
>> implementation, but its role is to transport messages between the guest
>> and the host (such as for call() and shm() invocations) and to handle
>> things like discovery and hotswap.
>>    
> 
> virtio has an exact parallel here (virtio-pci and friends).
> 
>> Out of all this, I think the biggest contention point is the design of
>> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
>> wrong and you object to other aspects as well).  I suspect that if I had
>> designed the vbus-connector to surface vbus devices as PCI devices via
>> QEMU, the patches would potentially have been pulled in a while ago.
>>    
> 
> Exposing devices as PCI is an important issue for me, as I have to
> consider non-Linux guests.

That's your prerogative, but obviously not everyone agrees with you.
Getting non-Linux guests to work is my problem if you choose not to be
part of the vbus community.

> Another issue is the host kernel management code which I believe is
> superfluous.

In your opinion, right?

> 
> But the biggest issue is compatibility.  virtio exists and has Windows
> and Linux drivers.  Without a fatal flaw in virtio we'll continue to
> support it.

So go ahead.

> Given that, why spread to a new model?

Note: I haven't asked you to (at least, not since April with the vbus-v3
release).  Spreading to a new model is currently the role of the
AlacrityVM project, since we disagree on the utility of a new model.

> 
> Of course, I understand you're interested in non-ethernet, non-block
> devices.  I can't comment on these until I see them.  Maybe they can fit
> the virtio model, and maybe they can't.

Yes, that I am not sure.  They may.  I will certainly explore that angle
at some point.

> 
>> There are, of course, reasons why vbus does *not* render as PCI, so this
>> is the meat of of your question, I believe.
>>
>> At a high level, PCI was designed for software-to-hardware interaction,
>> so it makes assumptions about that relationship that do not necessarily
>> apply to virtualization.
>>
>> For instance:
>>
>> A) hardware can only generate byte/word sized requests at a time because
>> that is all the pcb-etch and silicon support. So hardware is usually
>> expressed in terms of some number of "registers".
>>    
> 
> No, hardware happily DMAs to and fro main memory.

Yes, now walk me through how you set up DMA to do something like a call
when you do not know addresses a priori.  Hint: count the number of
MMIO/PIOs you need.  If the number is > 1, you've lost.


>  Some hardware of
> course uses mmio registers extensively, but not virtio hardware.  With
> the recent MSI support no registers are touched in the fast path.

Note we are not talking about virtio here.  Just raw PCI and why I
advocate vbus over it.


> 
>> C) the target end-point has no visibility into the CPU machine state
>> other than the parameters passed in the bus-cycle (usually an address
>> and data tuple).
>>    
> 
> That's not an issue.  Accessing memory is cheap.
> 
>> D) device-ids are in a fixed width register and centrally assigned from
>> an authority (e.g. PCI-SIG).
>>    
> 
> That's not an issue either.  Qumranet/Red Hat has donated a range of
> device IDs for use in virtio.

Yes, and to get one you have to do what?  Register it with kvm.git,
right?  Kind of like registering a MAJOR/MINOR, would you agree?  Maybe
you do not mind (especially given your relationship to kvm.git), but
there are disadvantages to that model for most of the rest of us.


>  Device IDs are how devices are associated
> with drivers, so you'll need something similar for vbus.

Nope, just like you don't need to do anything ahead of time for using a
dynamic misc-device name.  You just have both the driver and device know
what they are looking for (it's part of the ABI).

> 
>> E) Interrupt/MSI routing is per-device oriented
>>    
> 
> Please elaborate.  What is the issue?  How does vbus solve it?

There are no "interrupts" in vbus..only shm-signals.  You can establish
an arbitrary amount of shm regions, each with an optional shm-signal
associated with it.  To do this, the driver calls dev->shm(), and you
get back a shm_signal object.

Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides
how it maps real interrupts to shm-signals (on a system level, not per
device).  This can be 1:1, or any other scheme.  vbus-pcibridge uses one
system-wide interrupt per priority level (today this is 8 levels), each
with an IOQ based event channel.  "signals" come as an event on that
channel.

So the "issue" is that you have no real choice with PCI.  You just get
device oriented interrupts.  With vbus, its abstracted.  So you can
still get per-device standard MSI, or you can do fancier things like do
coalescing and prioritization.
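
In sketch form, the driver-side flow described above looks something like
this (invented types and signatures for illustration; not the actual
vbus-proxy prototypes):

  #include <linux/types.h>

  /* sketch only: made-up API, not the real vbus driver interface */
  struct shm_signal;                      /* opaque signal handle */

  struct example_vdev {
          /* synchronous verb channel (config and fast sync calls) */
          int (*call)(struct example_vdev *vdev, u32 func,
                      void *data, size_t len);
          /* establish a shared-memory region, with an optional signal */
          int (*shm)(struct example_vdev *vdev, int id, size_t len,
                     void **ptr, struct shm_signal **signal);
  };

  static int example_setup_ring(struct example_vdev *vdev)
  {
          void *ring;
          struct shm_signal *signal;

          /* one shm()+signal pair per ring; overlay a virtqueue, an
           * IOQ, or any other shared construct on top of "ring" */
          return vdev->shm(vdev, 0 /* ring id */, 4096, &ring, &signal);
  }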

> 
>> F) Interrupts/MSI are assumed cheap to inject
>>    
> 
> Interrupts are not assumed cheap; that's why interrupt mitigation is
> used (on real and virtual hardware).

It's all relative.  IDT dispatch and EOI overhead are "baseline" on real
hardware, whereas the vmenters and vmexits they require on virt are
significantly more expensive (and you have new exit causes, like
irq-windows, etc., that do not exist in real HW).


> 
>> G) Interrupts/MSI are non-priortizable.
>>    
> 
> They are prioritizable; Linux ignores this though (Windows doesn't). 
> Please elaborate on what the problem is and how vbus solves it.

It doesn't work right.  The x86 sense of interrupt priority is, sorry to
say it, half-assed at best.  I've worked with embedded systems that have
real interrupt priority support in the hardware, end to end, including
the PIC.  The LAPIC, on the other hand, is really weak in this dept, and,
as you said, Linux doesn't even attempt to use what's there.


> 
>> H) Interrupts/MSI are statically established
>>    
> 
> Can you give an example of why this is a problem?

Some of the things we are building use the model of having a device that
hands out shm-signals in response to guest events (say, the creation of
an IPC channel).  This would generally be handled by a specific device
model instance, and it would need to do this without pre-declaring the
MSI vectors (to use PCI as an example).


> 
>> These assumptions and constraints may be completely different or simply
>> invalid in a virtualized guest. For instance, the hypervisor is just
>> software, and therefore it's not restricted to "etch" constraints. IO
>> requests can be arbitrarily large, just as if you are invoking a library
>> function-call or OS system-call. Likewise, each one of those requests is
>> a branch and a context switch, so it often has greater performance
>> implications than a simple register bus-cycle in hardware.  If you use
>> an MMIO variant, it has to run through the page-fault code to be decoded.
>>
>> The result is typically decreased performance if you try to do the same
>> thing real hardware does. This is why you usually see hypervisor
>> specific drivers (e.g. virtio-net, vmnet, etc) a common feature.
>>
>> _Some_ performance oriented items can technically be accomplished in
>> PCI, albeit in a much more awkward way.  For instance, you can set up a
>> really fast, low-latency "call()" mechanism using a PIO port on a
>> PCI-model and ioeventfd.  As a matter of fact, this is exactly what the
>> vbus pci-bridge does:
>>    
> 
> What performance oriented items have been left unaddressed?

Well, the interrupt model to name one.

> 
> virtio and vbus use three communications channels:  call from guest to
> host (implemented as pio and reasonably fast), call from host to guest
> (implemented as msi and reasonably fast) and shared memory (as fast as
> it can be).  Where does PCI limit you in any way?
> 
>> The problem here is that this is incredibly awkward to setup.  You have
>> all that per-cpu goo and the registration of the memory on the guest.
>> And on the host side, you have all the vmapping of the registered
>> memory, and the file-descriptor to manage.  In short, its really painful.
>>
>> I would much prefer to do this *once*, and then let all my devices
>> simple re-use that infrastructure.  This is, in fact, what I do.  Here
>> is the device model that a guest sees:
>>    
> 
> virtio also reuses the pci code, on both guest and host.
> 
>> Moving on: _Other_ items cannot be replicated (at least, not without
>> hacking it into something that is no longer PCI.
>>
>> Things like the pci-id namespace are just silly for software.  I would
>> rather have a namespace that does not require central management so
>> people are free to create vbus-backends at will.  This is akin to
>> registering a device MAJOR/MINOR, verses using the various dynamic
>> assignment mechanisms.  vbus uses a string identifier in place of a
>> pci-id.  This is superior IMHO, and not compatible with PCI.
>>    
> 
> How do you handle conflicts?  Again you need a central authority to hand
> out names or prefixes.

Not really, no.  If you really wanted to be formal about it, you could
adopt any series of UUID schemes.  For instance, perhaps venet should be
"com.novell::virtual-ethernet".  Heck, I could use uuidgen.

> 
>> As another example, the connector design coalesces *all* shm-signals
>> into a single interrupt (by prio) that uses the same context-switch
>> mitigation techniques that help boost things like networking.  This
>> effectively means we can detect and optimize out ack/eoi cycles from the
>> APIC as the IO load increases (which is when you need it most).  PCI has
>> no such concept.
>>    
> 
> That's a bug, not a feature.  It means poor scaling as the number of
> vcpus increases and as the number of devices increases.

So the "avi-vbus-connector" can use 1:1, if you prefer.  Large vcpu
counts (which are not typical) and irq-affinity is not a target
application for my design, so I prefer the coalescing model in the
vbus-pcibridge included in this series. YMMV

Note: If you really wanted to, you could have priority queues per-cpu,
and get the best of both worlds (irq routing and coalescing/priority).


> 
> Note nothing prevents steering multiple MSIs into a single vector.  It's
> a bad idea though.

Yes, it is a bad idea...and not the same thing either.  This would
effectively create a shared-line scenario in the irq code, which is not
what happens in vbus.

> 
>> In addition, the signals and interrupts are priority aware, which is
>> useful for things like 802.1p networking where you may establish 8-tx
>> and 8-rx queues for your virtio-net device.  x86 APIC really has no
>> usable equivalent, so PCI is stuck here.
>>    
> 
> x86 APIC is priority aware.

Have you ever tried to use it?

> 
>> Also, the signals can be allocated on-demand for implementing things
>> like IPC channels in response to guest requests since there is no
>> assumption about device-to-interrupt mappings.  This is more flexible.
>>    
> 
> Yes.  However given that vectors are a scarce resource you're severely
> limited in that.

The connector I am pushing out does not have this limitation.

>  And if you're multiplexing everything on one vector,
> then you can just as well demultiplex your channels in the virtio driver
> code.

Only per-device, not system wide.

> 
>> And through all of this, this design would work in any guest even if it
>> doesn't have PCI (e.g. lguest, UML, physical systems, etc).
>>    
> 
> That is true for virtio which works on pci-less lguest and s390.

Yes, and lguest and s390 had to build their own bus-model to do it, right?

Thank you for bringing this up, because it is one of the main points
here.  What I am trying to do is generalize the bus to prevent the
proliferation of more of these isolated models in the future.  Build
one fast, in-kernel model so that we wouldn't need virtio-X and
virtio-Y in the future.  They can just reuse the (performance-optimized)
bus and models, and only need to build the connector to bridge them.


> 
>> -- Bottom Line --
>>
>> The idea here is to generalize all the interesting parts that are common
>> (fast sync+async io, context-switch mitigation, back-end models, memory
>> abstractions, signal-path routing, etc) that a variety of linux based
>> technologies can use (kvm, lguest, openvz, uml, physical systems) and
>> only require the thin "connector" code to port the system around.  The
>> idea is to try to get this aspect of PV right once, and at some point in
>> the future, perhaps vbus will be as ubiquitous as PCI.  Well, perhaps
>> not *that* ubiquitous, but you get the idea ;)
>>    
> 
> That is exactly the design goal of virtio (except it limits itself to
> virtualization).

No, virtio is only part of the picture.  It does not include the backend
models, or how to do memory/signal-path abstraction in-kernel, for
instance.  But otherwise, virtio as a device model is compatible with
vbus as a bus model.  They complement one another.



> 
>> Then device models like virtio can ride happily on top and we end up
>> with a really robust and high-performance Linux-based stack.  I don't
>> buy the argument that we already have PCI so lets use it.  I don't think
>> its the best design and I am not afraid to make an investment in a
>> change here because I think it will pay off in the long run.
>>    
> 
> Sorry, I don't think you've shown any quantifiable advantages.

We can agree to disagree then, eh?  There are certainly quantifiable
differences.  Waving your hand at the differences to say they are not
advantages is merely an opinion, one that is not shared universally.

The bottom line is all of these design distinctions are encapsulated
within the vbus subsystem and do not affect the kvm code-base.  So
agreement with kvm upstream is not a requirement, but would be
advantageous for collaboration.

Kind Regards,
-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  8:46             ` Michael S. Tsirkin
@ 2009-08-18 15:19               ` Gregory Haskins
  2009-08-18 16:25                 ` Michael S. Tsirkin
  2009-08-18 15:53               ` [Alacrityvm-devel] " Ira W. Snyder
  1 sibling, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-18 15:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, Anthony Liguori, Ingo Molnar, kvm, Avi Kivity,
	alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 6115 bytes --]

Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
>>>> Case in point: Take an upstream kernel and you can modprobe the
>>>> vbus-pcibridge in and virtio devices will work over that transport
>>>> unmodified.
>>>>
>>>> See http://lkml.org/lkml/2009/8/6/244 for details.
>>> The modprobe you are talking about would need
>>> to be done in guest kernel, correct?
>> Yes, and your point is? "unmodified" (pardon the pseudo pun) modifies
>> "virtio", not "guest".
>>  It means you can take an off-the-shelf kernel
>> with off-the-shelf virtio (ala distro-kernel) and modprobe
>> vbus-pcibridge and get alacrityvm acceleration.
> 
> Heh, by that logic ksplice does not modify running kernel either :)

Sigh...this is just fud.

Again, I never said I do not modify the guest.  I only said that virtio
is unmodified and all the existing devices can work unmodified.

I hardly think it's fair to compare loading a pci-bridge driver into a
running kernel with patching the kernel.  You just load a driver to get
access to your IO resources...standard stuff, really.

> 
>> It is not a design goal of mine to forbid the loading of a new driver,
>> so I am ok with that requirement.
>>
>>>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
>>>> and its likewise constrained by various limitations of that decision
>>>> (such as its reliance of the PCI model, and the kvm memory scheme).
>>> vhost is actually not related to PCI in any way. It simply leaves all
>>> setup for userspace to do.  And the memory scheme was intentionally
>>> separated from kvm so that it can easily support e.g. lguest.
>>>
>> I think you have missed my point. I mean that vhost requires a separate
>> bus-model (ala qemu-pci).
> 
> So? That can be in userspace, and can be anything including vbus.

-ENOPARSE

Can you elaborate?

> 
>> And no, your memory scheme is not separated,
>> at least, not very well.  It still assumes memory-regions and
>> copy_to_user(), which is very kvm-esque.
> 
> I don't think so: works for lguest, kvm, UML and containers

kvm _esque_, meaning anything that follows the region+copy_to_user
model.  Not all things do.

> 
>> Vbus has people using things
>> like userspace containers (no regions),
> 
> vhost by default works without regions

That's a start, but not good enough if you were trying to achieve the
same thing as vbus.  As I said before, I've never said you had to
achieve the same thing, but do note they are distinctly different, with
different goals.  You are solving a directed problem.  I am solving a
general problem, and trying to solve it once.

> 
>> and physical hardware (dma
>> controllers, so no regions or copy_to_user) so your scheme quickly falls
>> apart once you get away from KVM.
> 
> Someone took a driver and is building hardware for it ... so what?

What is your point?

> 
>> Don't get me wrong:  That design may have its place.  Perhaps you only
>> care about fixing KVM, which is a perfectly acceptable strategy.
>> Its just not a strategy that I think is the best approach.  Essentially you
>> are promoting the proliferation of competing backends, and I am trying
>> to unify them (which is ironic that this thread started with concerns I
>> was fragmenting things ;).
> 
> So, you don't see how venet fragments things? It's pretty obvious ...

I never said it doesn't.  venet started as a test harness, but now it is
inadvertently fragmenting the virtio-net effort.  I admit it.  It wasn't
intentional, but just worked out that way.  Until your vhost idea is
vetted and benchmarked, its not even in the running.  Venet is currently
the highest performing 802.x acceleration for KVM that I am aware of, so
it will continue to garner interest from users concerned with performance.

But likewise, vhost has the potential to fragment the back-end model.
That was my point.

> 
>> The bottom line is, you have a simpler solution that is more finely
>> targeted at KVM and virtio-networking.  It fixes probably a lot of
>> problems with the existing implementation, but it still has limitations.
>>
>> OTOH, what I am promoting is more complex, but more flexible.  That is
>> the tradeoff.  You can't have both ;)
> 
> We can. connect eventfds to hypercalls, and vhost will work with vbus.

-ENOPARSE

vbus doesnt use hypercalls, and I do not see why or how you would
connect two backend models together like this.  Can you elaborate.

> 
>> So do not for one second think
>> that what you implemented is equivalent, because they are not.
>>
>> In fact, I believe I warned you about this potential problem when you
>> decided to implement your own version.  I think I said something to the
>> effect of "you will either have a subset of functionality, or you will
>> ultimately reinvent what I did".  Right now you are in the subset phase.
> 
> No. Unlike vbus, vhost supports unmodified guests and live migration.

By "subset", I am referring to your interfaces and the scope of its
applicability.  The things you need to do to make vhost work and a vbus
device work from a memory and signaling abstration POV are going to be
extremely similar.

The difference in how the guest sees these backends is all
contained in the vbus-connector.  Therefore, what you *could* have done
is simply write a connector that does something like only support
"virtio" backends, and surface them as regular PCI devices to the
guest.  Then you could have reused all the abstraction features in vbus,
instead of reinventing them (case in point, your region+copy_to_user
code).  And likewise, anyone using vbus could use your virtio-net backend.

Instead, I am still left with no virtio-net backend implemented, and you
were left designing, writing, and testing facilities that I've already
completed.  So it was duplicative effort.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  9:53               ` Michael S. Tsirkin
  2009-08-18 10:00                 ` Avi Kivity
@ 2009-08-18 15:39                 ` Gregory Haskins
  2009-08-18 16:39                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-18 15:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Ingo Molnar, kvm, Avi Kivity, alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 3914 bytes --]

Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote:
>> There is a secondary question of venet (a vbus native device) verses
>> virtio-net (a virtio native device that works with PCI or VBUS).  If
>> this contention is really around venet vs virtio-net, I may possibly
>> conceed and retract its submission to mainline.
> 
> For me yes, venet+ioq competing with virtio+virtqueue.
> 
>> I've been pushing it to date because people are using it and I don't
>> see any reason that the driver couldn't be upstream.
> 
> If virtio is just as fast, they can just use it without knowing it.
> Clearly, that's better since we support virtio anyway ...

More specifically: kvm can support whatever it wants.  I am not asking
kvm to support venet.

If we (the alacrityvm community) decide to keep maintaining venet, _we_
will support it, and I have no problem with that.

As of right now, we are doing some interesting things with it in the lab
and it's certainly more flexible for us as a platform, since we maintain
the ABI and feature set.  So for now, I do not think it's a big deal if
they both co-exist, and it has no bearing on KVM upstream.

> 
>> -- Issues --
>>
>> Out of all this, I think the biggest contention point is the design of
>> the vbus-connector that I use in AlacrityVM (Avi, correct me if I am
>> wrong and you object to other aspects as well).  I suspect that if I had
>> designed the vbus-connector to surface vbus devices as PCI devices via
>> QEMU, the patches would potentially have been pulled in a while ago.
>>
>> There are, of course, reasons why vbus does *not* render as PCI, so this
>> is the meat of of your question, I believe.
>>
>> At a high level, PCI was designed for software-to-hardware interaction,
>> so it makes assumptions about that relationship that do not necessarily
>> apply to virtualization.
> 
> I'm not hung up on PCI, myself.  An idea that might help you get Avi
> on-board: do setup in userspace, over PCI.

Note that this is exactly what I do.

In AlacrityVM, the guest learns of the available acceleration by the
presence of the PCI-BRIDGE.  It then uses that bridge, using standard
PCI mechanisms, to set everything up in the slow-path.
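
Concretely, the guest only needs an ordinary PCI driver to find the
bridge (sketch only; the IDs and names below are placeholders, not the
real ones):

  #include <linux/module.h>
  #include <linux/pci.h>

  static const struct pci_device_id example_bridge_ids[] = {
          { PCI_DEVICE(0x1234, 0x0001) },   /* placeholder vendor/device */
          { 0 },
  };

  static int example_bridge_probe(struct pci_dev *pdev,
                                  const struct pci_device_id *id)
  {
          /* standard slow-path setup: enable the device, map BARs, then
           * negotiate the fast path (signals, shm) over the bridge */
          return pci_enable_device(pdev);
  }

  static struct pci_driver example_bridge_driver = {
          .name     = "example-vbus-bridge",
          .id_table = example_bridge_ids,
          .probe    = example_bridge_probe,
  };

  static int __init example_bridge_init(void)
  {
          return pci_register_driver(&example_bridge_driver);
  }
  module_init(example_bridge_init);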


>  Negotiate hypercall support
> (e.g.  with a PCI capability) and then switch to that for fastpath. Hmm?
> 
>> As another example, the connector design coalesces *all* shm-signals
>> into a single interrupt (by prio) that uses the same context-switch
>> mitigation techniques that help boost things like networking.  This
>> effectively means we can detect and optimize out ack/eoi cycles from the
>> APIC as the IO load increases (which is when you need it most).  PCI has
>> no such concept.
> 
> Could you elaborate on this one for me? How does context-switch
> mitigation work?

What I did was commoditize the concept of signal mitigation.  I then
reuse that concept all over the place to do "NAPI"-like mitigation of
the signal path for everything: for individual interrupts, of course, but
also for things like hypercalls, kthread wakeups, and the interrupt
controller too.
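
In sketch form (hypothetical helper names; the real state machine is in
the shm-signal patch), the pattern is NAPI applied to an arbitrary
signal path:

  #include <linux/types.h>

  /* sketch only: invented helpers for illustration */
  struct example_signal;
  void example_signal_disable(struct example_signal *s);
  void example_signal_enable(struct example_signal *s);
  bool example_work_pending(struct example_signal *s);
  void example_process_one(struct example_signal *s);

  static void example_signal_handler(struct example_signal *s)
  {
          do {
                  example_signal_disable(s);  /* mask further notifications */

                  while (example_work_pending(s))
                          example_process_one(s);

                  example_signal_enable(s);   /* unmask */

                  /* re-check: work may have arrived between the last
                   * check and the re-enable, with no new notification */
          } while (example_work_pending(s));
  }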


> 
>> In addition, the signals and interrupts are priority aware, which is
>> useful for things like 802.1p networking where you may establish 8-tx
>> and 8-rx queues for your virtio-net device.  x86 APIC really has no
>> usable equivalent, so PCI is stuck here.
> 
> By the way, multiqueue support in virtio would be very nice to have,

Actually what I am talking about is a little different than MQ, but I
agree that both priority-based and concurrency-based MQ would require
similar facilities.

> and seems mostly unrelated to vbus.

Mostly, but not totally.  The priority stuff wouldn't work quite right
without similar provisions throughout the entire signal path, which vbus provides.

Kind Regards,
-Greg






[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 13:45             ` Avi Kivity
@ 2009-08-18 15:51               ` Gregory Haskins
  2009-08-18 16:14                 ` Ingo Molnar
                                   ` (2 more replies)
  0 siblings, 3 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-18 15:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 3807 bytes --]

Avi Kivity wrote:
> On 08/18/2009 04:16 PM, Gregory Haskins wrote:
>> The issue here is that vbus is designed to be a generic solution to
>> in-kernel virtual-IO.  It will support (via abstraction of key
>> subsystems) a variety of environments that may or may not be similar in
>> facilities to KVM, and therefore it represents the
>> least-common-denominator as far as what external dependencies it
>> requires.
>>    
> 
> Maybe it will be easier to evaluate it in the context of these other
> environments.  It's difficult to assess this without an example.

When they are ready, I will cross post the announcement to KVM.

> 
>> The bottom line is this: despite the tendency for people to jump at
>> "don't put much in the kernel!", the fact is that a "bus" designed for
>> software to software (such as vbus) is almost laughably trivial.  Its
>> essentially a list of objects that have an int (dev-id) and char*
>> (dev-type) attribute.  All the extra goo that you see me setting up in
>> something like the kvm-connector needs to be done for fast-path
>> _anyway_, so transporting the verbs to query this simple list is not
>> really a big deal.
>>    
> 
> It's not laughably trivial when you try to support the full feature set
> of kvm (for example, live migration will require dirty memory tracking,
> and exporting all state stored in the kernel to userspace).

Doesn't vhost suffer from the same issue?  If not, could I also apply
the same technique to support live-migration in vbus?

> 
>> Note that I didn't really want to go that route.  As you know, I tried
>> pushing this straight through kvm first since earlier this year, but I
>> was met with reluctance to even bother truly understanding what I was
>> proposing, comments like "tell me your ideas so I can steal them", and
>>    
> 
> Oh come on, I wrote "steal" as a convenient shorthand for
> "cross-pollinate your ideas into our code according to the letter and
> spirit of the GNU General Public License".

Is that supposed to make me feel better about working with you?  I mean,
writing, testing, polishing patches for LKML-type submission is time
consuming.  If all you are going to do is take those ideas and rewrite
them yourself, why should I go through that effort?

And it's not like that was the first time you have said that to me.

> Since we're all trying to improve Linux we may as well cooperate.

Well, I don't think anyone can say that I haven't been trying.

> 
>> "sorry, we are going to reinvent our own instead".
> 
> No.  Adopting venet/vbus would mean reinventing something that already
> existed.

But yet, it doesn't.


>  Continuing to support virtio/pci is not reinventing anything.

No one asked you to do otherwise.

> 
>> This isn't exactly
>> going to motivate someone to continue pushing these ideas within that
>> community.  I was made to feel (purposely?) unwelcome at times.  So I
>> can either roll over and die, or start my own project.
>>    
> 
> You haven't convinced me that your ideas are worth the effort of
> abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.

With all due respect, I didn't ask you to do anything, especially not
to abandon something you are happy with.

All I did was push guest drivers to LKML.  The code in question is
independent of KVM, and its proven to improve the experience of using
Linux as a platform.  There are people interested in using them (by
virtue of the number of people that have signed up for the AlacrityVM
list, and have mailed me privately about this work).

So where is the problem here?


> I'm sorry if that made you feel unwelcome.  There's no reason to
> interpret disagreement as malice though.
> 

Ok.

Kind Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18  8:46             ` Michael S. Tsirkin
  2009-08-18 15:19               ` Gregory Haskins
@ 2009-08-18 15:53               ` Ira W. Snyder
  2009-08-18 16:51                 ` Avi Kivity
  2009-08-18 20:57                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-18 15:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, kvm, netdev, linux-kernel, alacrityvm-devel,
	Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 11:46:06AM +0300, Michael S. Tsirkin wrote:
> On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
> > >> Case in point: Take an upstream kernel and you can modprobe the
> > >> vbus-pcibridge in and virtio devices will work over that transport
> > >> unmodified.
> > >>
> > >> See http://lkml.org/lkml/2009/8/6/244 for details.
> > > 
> > > The modprobe you are talking about would need
> > > to be done in guest kernel, correct?
> > 
> > Yes, and your point is? "unmodified" (pardon the pseudo pun) modifies
> > "virtio", not "guest".
> >  It means you can take an off-the-shelf kernel
> > with off-the-shelf virtio (ala distro-kernel) and modprobe
> > vbus-pcibridge and get alacrityvm acceleration.
> 
> Heh, by that logic ksplice does not modify running kernel either :)
> 
> > It is not a design goal of mine to forbid the loading of a new driver,
> > so I am ok with that requirement.
> > 
> > >> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> > >> and its likewise constrained by various limitations of that decision
> > >> (such as its reliance of the PCI model, and the kvm memory scheme).
> > > 
> > > vhost is actually not related to PCI in any way. It simply leaves all
> > > setup for userspace to do.  And the memory scheme was intentionally
> > > separated from kvm so that it can easily support e.g. lguest.
> > > 
> > 
> > I think you have missed my point. I mean that vhost requires a separate
> > bus-model (ala qemu-pci).
> 
> So? That can be in userspace, and can be anything including vbus.
> 
> > And no, your memory scheme is not separated,
> > at least, not very well.  It still assumes memory-regions and
> > copy_to_user(), which is very kvm-esque.
> 
> I don't think so: works for lguest, kvm, UML and containers
> 
> > Vbus has people using things
> > like userspace containers (no regions),
> 
> vhost by default works without regions
> 
> > and physical hardware (dma
> > controllers, so no regions or copy_to_user) so your scheme quickly falls
> > apart once you get away from KVM.
> 
> Someone took a driver and is building hardware for it ... so what?
> 

I think Greg is referring to something like my virtio-over-PCI patch.
I'm pretty sure that vhost is completely useless for my situation. I'd
like to see vhost work for my use, so I'll try to explain what I'm
doing.

I've got a system where I have about 20 computers connected via PCI. The
PCI master is a normal x86 system, and the PCI agents are PowerPC
systems. The PCI agents act just like any other PCI card, except they
are running Linux, and have their own RAM and peripherals.

I wrote a custom driver which imitated a network interface and a serial
port. I tried to push it towards mainline, and DavidM rejected it, with
the argument, "use virtio, don't add another virtualization layer to the
kernel." I think he has a decent argument, so I wrote virtio-over-PCI.

Now, there are some things about virtio that don't work over PCI.
Mainly, memory is not truly shared. It is extremely slow to access
memory that is "far away", meaning "across the PCI bus." This can be
worked around by using a DMA controller to transfer all data, along with
an intelligent scheme to perform only writes across the bus. If you're
careful, reads are never needed.
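
For example (rough sketch with made-up names; memcpy_toio()/iowrite32()
are just the normal kernel MMIO helpers), the trick is to keep a local
mirror of the ring state and only ever issue writes across the bus:

  #include <linux/io.h>
  #include <linux/types.h>

  struct example_remote_ring {
          void __iomem *data;      /* buffer area in remote RAM (BAR mapping) */
          void __iomem *prod_idx;  /* remote copy of the producer index */
  };

  static void example_push(struct example_remote_ring *r,
                           const void *buf, size_t len, u32 *local_prod)
  {
          /* data goes out as posted writes (or as a DMA transfer) */
          memcpy_toio(r->data, buf, len);

          /* advance the local mirror, then publish it with a single
           * write; the remote side never has to be read */
          (*local_prod)++;
          iowrite32(*local_prod, r->prod_idx);
  }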

So, in my system, copy_(to|from)_user() is completely wrong. There is no
userspace, only a physical system. In fact, because normal x86 computers
do not have DMA controllers, the host system doesn't actually handle any
data transfer!

I used virtio-net in both the guest and host systems in my example
virtio-over-PCI patch, and succeeded in getting them to communicate.
However, the lack of any setup interface means that the devices must be
hardcoded into both drivers, when the decision could be up to userspace.
I think this is a problem that vbus could solve.

For my own selfish reasons (I don't want to maintain an out-of-tree
driver) I'd like to see *something* useful in mainline Linux. I'm happy
to answer questions about my setup, just ask.

Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:51               ` Gregory Haskins
@ 2009-08-18 16:14                 ` Ingo Molnar
  2009-08-19  4:27                   ` Gregory Haskins
  2009-08-18 16:47                 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Avi Kivity
  2009-08-18 16:51                 ` Michael S. Tsirkin
  2 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-18 16:14 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Anthony Liguori, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin


* Gregory Haskins <gregory.haskins@gmail.com> wrote:

> > You haven't convinced me that your ideas are worth the effort 
> > of abandoning virtio/pci or maintaining both venet/vbus and 
> > virtio/pci.
> 
> With all due respect, I didnt ask you do to anything, especially 
> not abandon something you are happy with.
> 
> All I did was push guest drivers to LKML.  The code in question 
> is independent of KVM, and its proven to improve the experience 
> of using Linux as a platform.  There are people interested in 
> using them (by virtue of the number of people that have signed up 
> for the AlacrityVM list, and have mailed me privately about this 
> work).

This thread started because I asked you about your technical 
arguments why we'd want vbus instead of virtio. Your answer above 
now basically boils down to: "because I want it so, why don't you 
leave me alone".

What you are doing here is, in essence, to fork KVM, regardless of 
the technical counter-arguments given against such a fork and 
regardless of the ample opportunity given to you to demonstrate the 
technical advantages of your code (in which case KVM would happily 
migrate to your code).

We all love faster code and better management interfaces, and tons 
of your prior patches got accepted by Avi. This time you didn't even 
_try_ to improve virtio. It's not like you posted a lot of virtio 
patches which were not applied. You didn't even try, and you need to 
try _much_ harder than that before forking a project.

And fragmentation matters quite a bit. To Linux users, developers, 
administrators, packagers it's a big deal whether two overlapping 
pieces of functionality for the same thing exist within the same 
kernel. The kernel is not an anarchy where everyone can have their 
own sys_fork() version or their own sys_write() version. Would you 
want to have two dozen read() variants, sys_read_oracle() and a 
sys_read_db2()?

I certainly dont want that. Instead we (at great expense and work) 
try to reach the best technical solution. That means we throw away 
inferior code and adopt the better one. (with a reasonable 
migration period)

You are ignoring that principle with hand-waving about 'the 
community wants this'. I can assure you, users _DONT WANT_ split 
interfaces and incompatible drivers for the same thing. They want 
stuff that works well.

If the community wants this then why cannot you convince one of the 
most prominent representatives of that community, the KVM 
developers?

Furthermore, 99% of your work is KVM, so why don't you respect that 
work by not forking it? Why don't you respect the KVM community and 
Linux in general by improving existing pieces of infrastructure 
instead of forcefully forking it?

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:19               ` Gregory Haskins
@ 2009-08-18 16:25                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 16:25 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, kvm, Avi Kivity, alacrityvm-devel,
	linux-kernel, netdev

On Tue, Aug 18, 2009 at 11:19:40AM -0400, Gregory Haskins wrote:
> >>>> OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
> >>>> and its likewise constrained by various limitations of that decision
> >>>> (such as its reliance of the PCI model, and the kvm memory scheme).
> >>> vhost is actually not related to PCI in any way. It simply leaves all
> >>> setup for userspace to do.  And the memory scheme was intentionally
> >>> separated from kvm so that it can easily support e.g. lguest.
> >>>
> >> I think you have missed my point. I mean that vhost requires a separate
> >> bus-model (ala qemu-pci).
> > 
> > So? That can be in userspace, and can be anything including vbus.
> 
> -ENOPARSE
> 
> Can you elaborate?

Write a device that signals an eventfd on virtio kick, and poll eventfd
for notifications, and you can use vhost-net.  vbus, surely, can do
this?
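(A minimal sketch of that plumbing, for illustration only: the eventfd(2)/poll(2)
calls are standard, but the step of handing the two descriptors to vhost-net is
left out, since that interface was still being settled at the time.)

#include <sys/eventfd.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

static int kick_fd;	/* guest -> host: "look at the ring" */
static int call_fd;	/* host -> guest: "I put something in the ring" */

static void setup_signalling(void)
{
	kick_fd = eventfd(0, 0);
	call_fd = eventfd(0, 0);
	/* hand both fds to vhost-net here (interface not shown) */
}

/* called whenever the guest rings its doorbell (PIO, hypercall, shm-signal, ...) */
static void on_guest_kick(void)
{
	uint64_t one = 1;
	write(kick_fd, &one, sizeof(one));	/* wakes the vhost worker */
}

/* event loop: turn call_fd events into guest notifications */
static void notification_loop(void)
{
	struct pollfd pfd = { .fd = call_fd, .events = POLLIN };
	uint64_t cnt;

	while (poll(&pfd, 1, -1) > 0) {
		read(call_fd, &cnt, sizeof(cnt));	/* consume the count */
		/* inject an interrupt (or shm-signal) into the guest here */
	}
}

The only transport-specific pieces are the doorbell hook and the interrupt
injection; everything in between is ordinary file-descriptor plumbing.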

> > 
> >> And no, your memory scheme is not separated,
> >> at least, not very well.  It still assumes memory-regions and
> >> copy_to_user(), which is very kvm-esque.
> > 
> > I don't think so: works for lguest, kvm, UML and containers
> 
> kvm _esque_ , meaning anything that follows the region+copy_to_user
> model.  Not all things do.

Pretty much all things where it makes sense to share code with
vhost-net.  If there's hardware that wants direct access to descriptor
rings, it just needs a driver.

> >> Vbus has people using things
> >> like userspace containers (no regions),
> > 
> > vhost by default works without regions
> 
> Thats a start, but not good enough if you were trying to achieve the
> same thing as vbus.  As I said before, I've never said you had to
> achieve the same thing, but do note they are distinctly different with
> different goals.  You are solving a directed problem.  I am solving a
> general problem, and trying to solve it once.

Heh. A good demonstration of vbus generality would be a solution that
speeds up virtio in guests.  What venet seems to illustrate instead is
that one has to rework all of host, guest and hypervisor to use vbus.
Maybe it does not need to be that way - it just seems so.

> >> and physical hardware (dma
> >> controllers, so no regions or copy_to_user) so your scheme quickly falls
> >> apart once you get away from KVM.
> > 
> > Someone took a driver and is building hardware for it ... so what?
> 
> What is your point?

OK, can we forget about that physical hardware then?

> >> Don't get me wrong:  That design may have its place.  Perhaps you only
> >> care about fixing KVM, which is a perfectly acceptable strategy.
> >> Its just not a strategy that I think is the best approach.  Essentially you
> >> are promoting the proliferation of competing backends, and I am trying
> >> to unify them (which is ironic that this thread started with concerns I
> >> was fragmenting things ;).
> > 
> > So, you don't see how venet fragments things? It's pretty obvious ...
> 
> I never said it doesn't.  venet started as a test harness, but now it is
> inadvertently fragmenting the virtio-net effort.  I admit it.  It wasn't
> intentional, but just worked out that way.  Until your vhost idea is
> vetted and benchmarked, its not even in the running.
>
> Venet is currently
> the highest performing 802.x acceleration for KVM that I am aware of, so
> it will continue to garner interest from users concerned with performance.
> 
> But likewise, vhost has the potential to fragment the back-end model.
> That was my point.

You don't see the difference? Long term vhost-net can just be enabled by
default whenever it is present, and there is a single guest driver to
support. OTOH, venet means that we have to support 2 guest drivers:
virtio and venet, for a long time.

> > 
> >> The bottom line is, you have a simpler solution that is more finely
> >> targeted at KVM and virtio-networking.  It fixes probably a lot of
> >> problems with the existing implementation, but it still has limitations.
> >>
> >> OTOH, what I am promoting is more complex, but more flexible.  That is
> >> the tradeoff.  You can't have both ;)
> > 
> > We can. connect eventfds to hypercalls, and vhost will work with vbus.
> 
> -ENOPARSE
> 
> vbus doesn't use hypercalls, and I do not see why or how you would 
> connect two backend models together like this.  Can you elaborate?

I think some older version did. But whatever: signal an eventfd on guest 
kick, poll an eventfd to notify the guest, and you can use vhost-net with vbus.

> > 
> >> So do not for one second think
> >> that what you implemented is equivalent, because they are not.
> >>
> >> In fact, I believe I warned you about this potential problem when you
> >> decided to implement your own version.  I think I said something to the
> >> effect of "you will either have a subset of functionality, or you will
> >> ultimately reinvent what I did".  Right now you are in the subset phase.
> > 
> > No. Unlike vbus, vhost supports unmodified guests and live migration.
> 
> By "subset", I am referring to your interfaces and the scope of its
> applicability.  The things you need to do to make vhost work and a vbus
> device work from a memory and signaling abstraction POV are going to be
> extremely similar.
> 
> The difference in how the guest sees these backends is all
> contained in the vbus-connector.  Therefore, what you *could* have done
> is simply write a connector that does something like only support
> "virtio" backends, and surfaced them as regular PCI devices to the
> guest.  Then you could have reused all the abstraction features in vbus,
> instead of reinventing them (case in point, your region+copy_to_user
> code).  And likewise, anyone using vbus could use your virtio-net backend.
> 
> Instead, I am still left with no virtio-net backend implemented, and you
> were left designing, writing, and testing facilities that I've already
> completed.  So it was duplicative effort.
> 
> Kind Regards,
> -Greg
> 

As I said, I couldn't reuse your code the way it's written.  But happily
you can reuse vhost - it's just a library, link with it - or even vhost
net as I explained above.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 14:46                 ` Gregory Haskins
@ 2009-08-18 16:27                   ` Avi Kivity
  2009-08-19  6:28                     ` Gregory Haskins
  2009-08-18 18:20                   ` Arnd Bergmann
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 16:27 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin

On 08/18/2009 05:46 PM, Gregory Haskins wrote:
>
>> Can you explain how vbus achieves RDMA?
>>
>> I also don't see the connection to real time guests.
>>      
> Both of these are still in development.  Trying to stay true to the
> "release early and often" mantra, the core vbus technology is being
> pushed now so it can be reviewed.  Stay tuned for these other developments.
>    

Hopefully you can outline how it works.  AFAICT, RDMA and kernel bypass 
will need device assignment.  If you're bypassing the call into the host 
kernel, it doesn't really matter how that call is made, does it?

>>> I also designed it in such a way that
>>> we could, in theory, write one set of (linux-based) backends, and have
>>> them work across a variety of environments (such as containers/VMs like
>>> KVM, lguest, openvz, but also physical systems like blade enclosures and
>>> clusters, or even applications running on the host).
>>>
>>>        
>> Sorry, I'm still confused.  Why would openvz need vbus?
>>      
> Its just an example.  The point is that I abstracted what I think are
> the key points of fast-io, memory routing, signal routing, etc, so that
> it will work in a variety of (ideally, _any_) environments.
>
> There may not be _performance_ motivations for certain classes of VMs
> because they already have decent support, but they may want a connector
> anyway to gain some of the new features available in vbus.
>
> And looking forward, the idea is that we have commoditized the backend
> so we don't need to redo this each time a new container comes along.
>    

I'll wait until a concrete example shows up as I still don't understand.

>> One point of contention is that this is all managementy stuff and should
>> be kept out of the host kernel.  Exposing shared memory, interrupts, and
>> guest hypercalls can all be easily done from userspace (as virtio
>> demonstrates).  True, some devices need kernel acceleration, but that's
>> no reason to put everything into the host kernel.
>>      
> See my last reply to Anthony.  My two points here are that:
>
> a) having it in-kernel makes it a complete subsystem, which perhaps has
> diminished value in kvm, but adds value in most other places that we are
> looking to use vbus.
>    

It's not a complete system unless you want users to administer VMs using 
echo and cat and configfs.  Some userspace support will always be necessary.

> b) the in-kernel code is being overstated as "complex".  We are not
> talking about your typical virt thing, like an emulated ICH/PCI chipset.
>   Its really a simple list of devices with a handful of attributes.  They
> are managed using established linux interfaces, like sysfs/configfs.
>    

They need to be connected to the real world somehow.  What about 
security?  can any user create a container and devices and link them to 
real interfaces?  If not, do you need to run the VM as root?

virtio and vhost-net solve these issues.  Does vbus?

The code may be simple to you.  But the question is whether it's 
necessary, not whether it's simple or complex.

>> Exposing devices as PCI is an important issue for me, as I have to
>> consider non-Linux guests.
>>      
> Thats your prerogative, but obviously not everyone agrees with you.
>    

I hope everyone agrees that it's an important issue for me and that I 
have to consider non-Linux guests.  I also hope that you're considering 
non-Linux guests since they have considerable market share.

> Getting non-Linux guests to work is my problem if you chose to not be
> part of the vbus community.
>    

I won't be writing those drivers in any case.

>> Another issue is the host kernel management code which I believe is
>> superfluous.
>>      
> In your opinion, right?
>    

Yes, this is why I wrote "I believe".


>> Given that, why spread to a new model?
>>      
> Note: I haven't asked you to (at least, not since April with the vbus-v3
> release).  Spreading to a new model is currently the role of the
> AlacrityVM project, since we disagree on the utility of a new model.
>    

Given I'm not the gateway to inclusion of vbus/venet, you don't need to 
ask me anything.  I'm still free to give my opinion.

>>> A) hardware can only generate byte/word sized requests at a time because
>>> that is all the pcb-etch and silicon support. So hardware is usually
>>> expressed in terms of some number of "registers".
>>>
>>>        
>> No, hardware happily DMAs to and fro main memory.
>>      
> Yes, now walk me through how you set up DMA to do something like a call
> when you do not know addresses a priori.  Hint: count the number of
> MMIO/PIOs you need.  If the number is > 1, you've lost.
>    

With virtio, the number is 1 (or less if you amortize).  Set up the ring 
entries and kick.
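(Roughly, that guest-side fast path has the following shape, written against the
2.6.31-era virtqueue operations from memory; it is a simplified sketch, not the
actual virtio-net code, and it omits the virtio-net header descriptor.)

#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/skbuff.h>

static int xmit_one(struct virtqueue *vq, struct sk_buff *skb)
{
	struct scatterlist sg[MAX_SKB_FRAGS + 1];
	int num;

	sg_init_table(sg, ARRAY_SIZE(sg));
	num = skb_to_sgvec(skb, sg, 0, skb->len);	/* descriptors live in guest memory */

	if (vq->vq_ops->add_buf(vq, sg, num, 0, skb) < 0)
		return -ENOSPC;				/* ring full; no exit taken at all */

	vq->vq_ops->kick(vq);				/* the single notification/exit */
	return 0;
}

Everything up to the kick is plain writes to shared memory; the host only sees
one exit per batch of descriptors.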

>>   Some hardware of
>> course uses mmio registers extensively, but not virtio hardware.  With
>> the recent MSI support no registers are touched in the fast path.
>>      
> Note we are not talking about virtio here.  Just raw PCI and why I
> advocate vbus over it.
>    

There's no such thing as raw PCI.  Every PCI device has a protocol.  The 
protocol virtio chose is optimized for virtualization.


>>> D) device-ids are in a fixed width register and centrally assigned from
>>> an authority (e.g. PCI-SIG).
>>>
>>>        
>> That's not an issue either.  Qumranet/Red Hat has donated a range of
>> device IDs for use in virtio.
>>      
> Yes, and to get one you have to do what?  Register it with kvm.git,
> right?  Kind of like registering a MAJOR/MINOR, would you agree?  Maybe
> you do not mind (especially given your relationship to kvm.git), but
> there are disadvantages to that model for most of the rest of us.
>    

Send an email, it's not that difficult.  There's also an experimental range.

>>   Device IDs are how devices are associated
>> with drivers, so you'll need something similar for vbus.
>>      
> Nope, just like you don't need to do anything ahead of time for using a
> dynamic misc-device name.  You just have both the driver and device know
> what they are looking for (its part of the ABI).
>    

If you get a device ID clash, you fail.  If you get a device name clash, 
you fail in the same way.

>>> E) Interrupt/MSI routing is per-device oriented
>>>
>>>        
>> Please elaborate.  What is the issue?  How does vbus solve it?
>>      
> There are no "interrupts" in vbus..only shm-signals.  You can establish
> an arbitrary amount of shm regions, each with an optional shm-signal
> associated with it.  To do this, the driver calls dev->shm(), and you
> get back a shm_signal object.
>
> Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides
> how it maps real interrupts to shm-signals (on a system level, not per
> device).  This can be 1:1, or any other scheme.  vbus-pcibridge uses one
> system-wide interrupt per priority level (today this is 8 levels), each
> with an IOQ based event channel.  "signals" come as an event on that
> channel.
>
> So the "issue" is that you have no real choice with PCI.  You just get
> device oriented interrupts.  With vbus, its abstracted.  So you can
> still get per-device standard MSI, or you can do fancier things like do
> coalescing and prioritization.
>    

As I've mentioned before, prioritization is available on x86, and 
coalescing scales badly.

>>> F) Interrupts/MSI are assumed cheap to inject
>>>
>>>        
>> Interrupts are not assumed cheap; that's why interrupt mitigation is
>> used (on real and virtual hardware).
>>      
> Its all relative.  IDT dispatch and EOI overhead are "baseline" on real
> hardware, whereas they are significantly more expensive to do the
> vmenters and vmexits on virt (and you have new exit causes, like
> irq-windows, etc, that do not exist in real HW).
>    

irq window exits ought to be pretty rare, so we're only left with 
injection vmexits.  At around 1us/vmexit, even 100,000 interrupts/vcpu 
(which is excessive) will only cost you 10% cpu time.

>>> G) Interrupts/MSI are non-prioritizable.
>>>
>>>        
>> They are prioritizable; Linux ignores this though (Windows doesn't).
>> Please elaborate on what the problem is and how vbus solves it.
>>      
> It doesn't work right.  The x86 sense of interrupt priority is, sorry to
> say it, half-assed at best.  I've worked with embedded systems that have
> real interrupt priority support in the hardware, end to end, including
> the PIC.  The LAPIC on the other hand is really weak in this dept, and
> as you said, Linux doesn't even attempt to use whats there.
>    

Maybe prioritization is not that important then.  If it is, it needs to 
be fixed at the lapic level, otherwise you have no real prioritization 
wrt non-vbus interrupts.

>>> H) Interrupts/MSI are statically established
>>>
>>>        
>> Can you give an example of why this is a problem?
>>      
> Some of the things we are building use the model of having a device that
> hands out shm-signal in response to guest events (say, the creation of
> an IPC channel).  This would generally be handled by a specific device
> model instance, and it would need to do this without pre-declaring the
> MSI vectors (to use PCI as an example).
>    

You're free to demultiplex an MSI to however many consumers you want, 
there's no need for a new bus for that.
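(Concretely, demultiplexing a single vector is just a dispatch loop in the
handler; the structure and callback names below are made up for illustration,
and the pending bitmap is assumed to be filled in by the device/host.)

#include <linux/interrupt.h>
#include <linux/bitops.h>

#define DEMO_NCHAN 32

struct demo_dev {
	unsigned long pending[BITS_TO_LONGS(DEMO_NCHAN)];	/* written by the device */
	void (*chan_cb[DEMO_NCHAN])(struct demo_dev *dev, int chan);
};

static irqreturn_t demo_msi_handler(int irq, void *data)
{
	struct demo_dev *dev = data;
	int chan, handled = 0;

	/* one MSI vector, many logical consumers */
	for (chan = 0; chan < DEMO_NCHAN; chan++) {
		if (test_and_clear_bit(chan, dev->pending)) {
			dev->chan_cb[chan](dev, chan);
			handled = 1;
		}
	}
	return handled ? IRQ_HANDLED : IRQ_NONE;
}

/* registered once: request_irq(pdev->irq, demo_msi_handler, 0, "demo", dev); */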

>> What performance oriented items have been left unaddressed?
>>      
> Well, the interrupt model to name one.
>    

Like I mentioned, you can merge MSI interrupts, but that's not 
necessarily a good idea.

>> How do you handle conflicts?  Again you need a central authority to hand
>> out names or prefixes.
>>      
> Not really, no.  If you really wanted to be formal about it, you could
> adopt any series of UUID schemes.  For instance, perhaps venet should be
> "com.novell::virtual-ethernet".  Heck, I could use uuidgen.
>    

Do you use DNS?  We use PCI-SIG.  If Novell is a PCI-SIG member, you can 
get a vendor ID and control your own virtio space.

>>> As another example, the connector design coalesces *all* shm-signals
>>> into a single interrupt (by prio) that uses the same context-switch
>>> mitigation techniques that help boost things like networking.  This
>>> effectively means we can detect and optimize out ack/eoi cycles from the
>>> APIC as the IO load increases (which is when you need it most).  PCI has
>>> no such concept.
>>>
>>>        
>> That's a bug, not a feature.  It means poor scaling as the number of
>> vcpus increases and as the number of devices increases.
>>      
> So the "avi-vbus-connector" can use 1:1, if you prefer.  Large vcpu
> counts (which are not typical) and irq-affinity is not a target
> application for my design, so I prefer the coalescing model in the
> vbus-pcibridge included in this series. YMMV
>    

So far you've left live migration, Windows, large guests, and 
multiqueue out of your design.  If you wish to position vbus/venet for 
large-scale use, you'll need to address all of them.

>> Note nothing prevents steering multiple MSIs into a single vector.  It's
>> a bad idea though.
>>      
> Yes, it is a bad idea...and not the same thing either.  This would
> effectively create a shared-line scenario in the irq code, which is not
> what happens in vbus.
>    

Ok.

>>> In addition, the signals and interrupts are priority aware, which is
>>> useful for things like 802.1p networking where you may establish 8-tx
>>> and 8-rx queues for your virtio-net device.  x86 APIC really has no
>>> usable equivalent, so PCI is stuck here.
>>>
>>>        
>> x86 APIC is priority aware.
>>      
> Have you ever tried to use it?
>    

I haven't, but Windows does.

>>> Also, the signals can be allocated on-demand for implementing things
>>> like IPC channels in response to guest requests since there is no
>>> assumption about device-to-interrupt mappings.  This is more flexible.
>>>
>>>        
>> Yes.  However given that vectors are a scarce resource you're severely
>> limited in that.
>>      
> The connector I am pushing out does not have this limitation.
>    

Okay.

>    
>>   And if you're multiplexing everything on one vector,
>> then you can just as well demultiplex your channels in the virtio driver
>> code.
>>      
> Only per-device, not system wide.
>    

Right.  I still think multiplexing interrupts is a bad idea in a large 
system.  In a small system... why would you do it at all?

>>> And through all of this, this design would work in any guest even if it
>>> doesn't have PCI (e.g. lguest, UML, physical systems, etc).
>>>
>>>        
>> That is true for virtio which works on pci-less lguest and s390.
>>      
> Yes, and lguest and s390 had to build their own bus-model to do it, right?
>    

They had to build connectors just like you propose to do.

> Thank you for bringing this up, because it is one of the main points
> here.  What I am trying to do is generalize the bus to prevent the
> proliferation of more of these isolated models in the future.  Build
> one, fast, in-kernel model so that we wouldn't need virtio-X, and
> virtio-Y in the future.  They can just reuse the (performance optimized)
> bus and models, and only need to build the connector to bridge them.
>    

But you still need vbus-connector-lguest and vbus-connector-s390 because 
they all talk to the host differently.  So what's changed?  The names?

>> That is exactly the design goal of virtio (except it limits itself to
>> virtualization).
>>      
> No, virtio is only part of the picture.  It does not include the backend
> models, or how to do memory/signal-path abstraction for the in-kernel case, for
> instance.  But otherwise, virtio as a device model is compatible with
> vbus as a bus model.  They complement one another.
>    

Well, venet doesn't complement virtio-net, and virtio-pci doesn't 
complement vbus-connector.

>>> Then device models like virtio can ride happily on top and we end up
>>> with a really robust and high-performance Linux-based stack.  I don't
>>> buy the argument that we already have PCI so lets use it.  I don't think
>>> its the best design and I am not afraid to make an investment in a
>>> change here because I think it will pay off in the long run.
>>>
>>>        
>> Sorry, I don't think you've shown any quantifiable advantages.
>>      
> We can agree to disagree then, eh?  There are certainly quantifiable
> differences.  Waving your hand at the differences to say they are not
> advantages is merely an opinion, one that is not shared universally.
>    

I've addressed them one by one.  We can agree to disagree on interrupt 
multiplexing, and the importance of compatibility, Windows, large 
guests, multiqueue, and DNS vs. PCI-SIG.

> The bottom line is all of these design distinctions are encapsulated
> within the vbus subsystem and do not affect the kvm code-base.  So
> agreement with kvm upstream is not a requirement, but would be
> advantageous for collaboration.
>    

Certainly.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:39                 ` Gregory Haskins
@ 2009-08-18 16:39                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 16:39 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, kvm, Avi Kivity, alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 11:39:25AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Aug 17, 2009 at 03:33:30PM -0400, Gregory Haskins wrote:
> >> There is a secondary question of venet (a vbus native device) verses
> >> virtio-net (a virtio native device that works with PCI or VBUS).  If
> >> this contention is really around venet vs virtio-net, I may possibly
> >> conceed and retract its submission to mainline.
> > 
> > For me yes, venet+ioq competing with virtio+virtqueue.
> > 
> >> I've been pushing it to date because people are using it and I don't
> >> see any reason that the driver couldn't be upstream.
> > 
> > If virtio is just as fast, they can just use it without knowing it.
> > Clearly, that's better since we support virtio anyway ...
> 
> More specifically: kvm can support whatever it wants.  I am not asking
> kvm to support venet.
> 
> If we (the alacrityvm community) decide to keep maintaining venet, _we_
> will support it, and I have no problem with that.
> 
> As of right now, we are doing some interesting things with it in the lab
> and its certainly more flexible for us as a platform since we maintain
> the ABI and feature set. So for now, I do not think its a big deal if
> they both co-exist, and it has no bearing on KVM upstream.

As someone who extended them recently, I can say that both the ABI and the
feature set of virtio are pretty flexible.  What's the problem?  Will every
single contributor now push a driver with an incompatible ABI upstream because
this way he maintains both the ABI and the feature set? Oh well ...


-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:51               ` Gregory Haskins
  2009-08-18 16:14                 ` Ingo Molnar
@ 2009-08-18 16:47                 ` Avi Kivity
  2009-08-18 16:51                 ` Michael S. Tsirkin
  2 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 16:47 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

On 08/18/2009 06:51 PM, Gregory Haskins wrote:
>
>> It's not laughably trivial when you try to support the full feature set
>> of kvm (for example, live migration will require dirty memory tracking,
>> and exporting all state stored in the kernel to userspace).
>>      
> Doesn't vhost suffer from the same issue?  If not, could I also apply
> the same technique to support live-migration in vbus?
>    

It does.  There are two possible solutions to that: dropping the entire 
protocol to userspace, or the one I prefer, proxying the ring and 
eventfds in userspace but otherwise letting vhost-net run normally.  
This way userspace gets to see descriptors and mark the pages as dirty.  
Both these approaches rely on vhost-net being an accelerator to a 
userspace based component, but maybe you can adapt venet to use 
something similar.

>> Oh come on, I wrote "steal" as a convenient shorthand for
>> "cross-pollinate your ideas into our code according to the letter and
>> spirit of the GNU General Public License".
>>      
> Is that supposed to make me feel better about working with you?  I mean,
> writing, testing, polishing patches for LKML-type submission is time
> consuming.  If all you are going to do is take those ideas and rewrite
> it yourself, why should I go through that effort?
>    

If you're posting your ideas for everyone to read in the form of code, 
why not post them in the form of design ideas as well?  In any case 
you've given up any secrets.  In the worst case you've lost nothing, in 
the best case you may get some hopefully constructive criticism and 
maybe improvements.

I'm perfectly happy picking up ideas from competing projects (and I 
have) and seeing my ideas picked up in competing projects (which I also 
have).

Really, isn't that the point of open source?  Share code, but also share 
ideas?

> And its not like that was the first time you have said that to me.
>    

And I meant it every time.

Haven't you just asked how vhost-net plans to do live migration?

>> Since we're all trying to improve Linux we may as well cooperate.
>>      
> Well, I don't think anyone can say that I haven't been trying.
>    

I'd be obliged if you reveal some of your secret sauce then (only the 
parts you plan to GPL anyway of course).

>>> "sorry, we are going to reinvent our own instead".
>>>        
>> No.  Adopting venet/vbus would mean reinventing something that already
>> existed.
>>      
> But yet, it doesn't.
>    

We'll need to do the agree to disagree thing again here.

>>   Continuing to support virtio/pci is not reinventing anything.
>>      
> No one asked you to do otherwise.
>    

Right, and I'm not keen on supporting both.  See why I want to stick to 
virtio/pci as long as I possibly can?

>> You haven't convinced me that your ideas are worth the effort of
>> abandoning virtio/pci or maintaining both venet/vbus and virtio/pci.
>>      
> With all due respect, I didn't ask you to do anything, especially not
> abandon something you are happy with.
>
> All I did was push guest drivers to LKML.  The code in question is
> independent of KVM, and its proven to improve the experience of using
> Linux as a platform.  There are people interested in using them (by
> virtue of the number of people that have signed up for the AlacrityVM
> list, and have mailed me privately about this work).
>
> So where is the problem here?
>    

I'm unhappy with the duplication of effort and potential fragmentation 
of the developer and user communities, that's all.  I'd rather see the 
work going into vbus/venet going into virtio.  I think it's a legitimate 
concern.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:53               ` [Alacrityvm-devel] " Ira W. Snyder
@ 2009-08-18 16:51                 ` Avi Kivity
  2009-08-18 17:27                   ` Ira W. Snyder
  2009-08-18 20:57                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 16:51 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/18/2009 06:53 PM, Ira W. Snyder wrote:
> So, in my system, copy_(to|from)_user() is completely wrong. There is no
> userspace, only a physical system. In fact, because normal x86 computers
> do not have DMA controllers, the host system doesn't actually handle any
> data transfer!
>    

In fact, modern x86s do have dma engines these days (google for Intel 
I/OAT), and one of our plans for vhost-net is to allow their use for 
packets above a certain size.  So a patch allowing vhost-net to 
optionally use a dma engine is a good thing.
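(For the curious, the dmaengine side of such a patch would look roughly like the
sketch below.  The threshold value is made up, channel setup, unmapping and
completion handling are omitted, and this is an editor's illustration rather
than anything from the actual vhost plans.)

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/string.h>

#define DMA_COPY_THRESHOLD 4096			/* illustrative cut-off only */

static int copy_packet(struct dma_chan *chan, void *dst, void *src, size_t len)
{
	struct dma_async_tx_descriptor *tx;
	dma_addr_t dst_dma, src_dma;

	if (len < DMA_COPY_THRESHOLD) {
		memcpy(dst, src, len);		/* small packets: CPU copy is cheaper */
		return 0;
	}

	dst_dma = dma_map_single(chan->device->dev, dst, len, DMA_FROM_DEVICE);
	src_dma = dma_map_single(chan->device->dev, src, len, DMA_TO_DEVICE);

	tx = chan->device->device_prep_dma_memcpy(chan, dst_dma, src_dma, len,
						  DMA_PREP_INTERRUPT);
	if (!tx)
		return -ENOMEM;			/* unmap omitted for brevity */

	tx->tx_submit(tx);			/* queue the copy */
	dma_async_issue_pending(chan);		/* and start the engine */
	return 0;
}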

> I used virtio-net in both the guest and host systems in my example
> virtio-over-PCI patch, and succeeded in getting them to communicate.
> However, the lack of any setup interface means that the devices must be
> hardcoded into both drivers, when the decision could be up to userspace.
> I think this is a problem that vbus could solve.
>    

Exposing a knob to userspace is not an insurmountable problem; vhost-net 
already allows changing the memory layout, for example.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:51               ` Gregory Haskins
  2009-08-18 16:14                 ` Ingo Molnar
  2009-08-18 16:47                 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Avi Kivity
@ 2009-08-18 16:51                 ` Michael S. Tsirkin
  2009-08-19  5:36                   ` Gregory Haskins
  2 siblings, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 16:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Tue, Aug 18, 2009 at 11:51:59AM -0400, Gregory Haskins wrote:
> > It's not laughably trivial when you try to support the full feature set
> > of kvm (for example, live migration will require dirty memory tracking,
> > and exporting all state stored in the kernel to userspace).
> 
> Doesn't vhost suffer from the same issue?  If not, could I also apply
> the same technique to support live-migration in vbus?

vhost does this by switching to userspace for the duration of live
migration. venet could do this I guess, but you'd need to write a
userspace implementation. vhost just reuses existing userspace virtio.

> With all due respect, I didn't ask you to do anything, especially not
> abandon something you are happy with.
> 
> All I did was push guest drivers to LKML.  The code in question is
> independent of KVM, and its proven to improve the experience of using
> Linux as a platform.  There are people interested in using them (by
> virtue of the number of people that have signed up for the AlacrityVM
> list, and have mailed me privately about this work).
> 
> So where is the problem here?

If virtio-net in the guest could be improved instead, everyone would
benefit. I am doing this, and I wish more people would join.  Instead,
you change the ABI in an incompatible way. So now, there's no single place
to work on KVM networking performance. It would all be understandable
if the reason were, e.g., better performance. But you say yourself it
isn't. See the problem?

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 16:51                 ` Avi Kivity
@ 2009-08-18 17:27                   ` Ira W. Snyder
  2009-08-18 17:47                     ` Avi Kivity
  2009-08-18 20:39                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-18 17:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote:
> On 08/18/2009 06:53 PM, Ira W. Snyder wrote:
>> So, in my system, copy_(to|from)_user() is completely wrong. There is no
>> userspace, only a physical system. In fact, because normal x86 computers
>> do not have DMA controllers, the host system doesn't actually handle any
>> data transfer!
>>    
>
> In fact, modern x86s do have dma engines these days (google for Intel  
> I/OAT), and one of our plans for vhost-net is to allow their use for  
> packets above a certain size.  So a patch allowing vhost-net to  
> optionally use a dma engine is a good thing.
>

Yes, I'm aware that very modern x86 PCs have general purpose DMA
engines, even though I don't have any capable hardware. However, I think
it is better to support using any PC (with or without DMA engine, any
architecture) as the PCI master, and just handle the DMA all from the
PCI agent, which is known to have DMA?

>> I used virtio-net in both the guest and host systems in my example
>> virtio-over-PCI patch, and succeeded in getting them to communicate.
>> However, the lack of any setup interface means that the devices must be
>> hardcoded into both drivers, when the decision could be up to userspace.
>> I think this is a problem that vbus could solve.
>>    
>
> Exposing a knob to userspace is not an insurmountable problem; vhost-net  
> already allows changing the memory layout, for example.
>

Let me explain the most obvious problem I ran into: setting the MAC
addresses used in virtio.

On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
address.

On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
address, aa:bb:cc:dd:ee:ff.

The virtio feature negotiation code handles this, by seeing the
VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do
not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC
address. This is because the feature negotiation code only accepts a
feature if it is offered by both sides of the connection.
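(Roughly, the guest side of that negotiation looks like the following,
paraphrased from memory of the virtio-net probe path of the time; treat it as a
sketch, not the actual driver code.)

#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

static void demo_get_mac(struct virtio_device *vdev, struct net_device *dev)
{
	/* this branch only runs if BOTH sides offered VIRTIO_NET_F_MAC */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
		vdev->config->get(vdev, offsetof(struct virtio_net_config, mac),
				  dev->dev_addr, dev->addr_len);
	else
		random_ether_addr(dev->dev_addr);	/* fall back to a random MAC */
}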

In this case, I must have the guest generate a random MAC address and
have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
space. This basically means hardcoding the MAC addresses in the Linux
drivers, which is a big no-no.

What would I expose to userspace to make this situation manageable?

Thanks for the response,
Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 17:27                   ` Ira W. Snyder
@ 2009-08-18 17:47                     ` Avi Kivity
  2009-08-18 18:27                       ` Ira W. Snyder
  2009-08-18 20:39                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 17:47 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/18/2009 08:27 PM, Ira W. Snyder wrote:
>> In fact, modern x86s do have dma engines these days (google for Intel
>> I/OAT), and one of our plans for vhost-net is to allow their use for
>> packets above a certain size.  So a patch allowing vhost-net to
>> optionally use a dma engine is a good thing.
>>      
> Yes, I'm aware that very modern x86 PCs have general purpose DMA
> engines, even though I don't have any capable hardware. However, I think
> it is better to support using any PC (with or without DMA engine, any
> architecture) as the PCI master, and just handle the DMA all from the
> PCI agent, which is known to have DMA?
>    

Certainly; but if your PCI agent will support the DMA API, then the same 
vhost code will work with both I/OAT and your specialized hardware.

>> Exposing a knob to userspace is not an insurmountable problem; vhost-net
>> already allows changing the memory layout, for example.
>>
>>      
> Let me explain the most obvious problem I ran into: setting the MAC
> addresses used in virtio.
>
> On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
> address.
>
> On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
> address, aa:bb:cc:dd:ee:ff.
>
> The virtio feature negotiation code handles this, by seeing the
> VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do
> not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC
> address. This is because the feature negotiation code only accepts a
> feature if it is offered by both sides of the connection.
>
> In this case, I must have the guest generate a random MAC address and
> have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
> space. This basically means hardcoding the MAC addresses in the Linux
> drivers, which is a big no-no.
>
> What would I expose to userspace to make this situation manageable?
>
>    

I think in this case you want one side to be virtio-net (I'm guessing 
the x86) and the other side vhost-net (the ppc boards with the dma 
engine).  virtio-net on x86 would communicate with userspace on the ppc 
board to negotiate features and get a mac address, the fast path would 
be between virtio-net and vhost-net (which would use the dma engine to 
push and pull data).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 14:46                 ` Gregory Haskins
  2009-08-18 16:27                   ` Avi Kivity
@ 2009-08-18 18:20                   ` Arnd Bergmann
  2009-08-18 19:08                     ` Avi Kivity
  2009-08-19  5:36                     ` Gregory Haskins
  1 sibling, 2 replies; 132+ messages in thread
From: Arnd Bergmann @ 2009-08-18 18:20 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Ingo Molnar, kvm, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

On Tuesday 18 August 2009, Gregory Haskins wrote:
> Avi Kivity wrote:
> > On 08/17/2009 10:33 PM, Gregory Haskins wrote:
> > 
> > One point of contention is that this is all managementy stuff and should
> > be kept out of the host kernel.  Exposing shared memory, interrupts, and
> > guest hypercalls can all be easily done from userspace (as virtio
> > demonstrates).  True, some devices need kernel acceleration, but that's
> > no reason to put everything into the host kernel.
> 
> See my last reply to Anthony.  My two points here are that:
> 
> a) having it in-kernel makes it a complete subsystem, which perhaps has
> diminished value in kvm, but adds value in most other places that we are
> looking to use vbus.
> 
> b) the in-kernel code is being overstated as "complex".  We are not
> talking about your typical virt thing, like an emulated ICH/PCI chipset.
>  Its really a simple list of devices with a handful of attributes.  They
> are managed using established linux interfaces, like sysfs/configfs.

IMHO the complexity of the code is not so much of a problem. What I
see as a problem is the complexity of a kernel/userspace interface that
manages the devices with global state.

One of the greatest features of Michael's vhost driver is that all
the state is associated with open file descriptors that either exist
already or belong to the vhost_net misc device. When a process dies,
all the file descriptors get closed and the whole state is cleaned
up implicitly.
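(That pattern is just the standard misc-device idiom: allocate per-open state in
->open(), free it in ->release().  The names below are invented for
illustration.)

#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/module.h>

struct demo_state {
	/* rings, eventfd references, etc. would live here */
	int dummy;
};

static int demo_open(struct inode *inode, struct file *file)
{
	struct demo_state *s = kzalloc(sizeof(*s), GFP_KERNEL);

	if (!s)
		return -ENOMEM;
	file->private_data = s;		/* state belongs to this open fd */
	return 0;
}

static int demo_release(struct inode *inode, struct file *file)
{
	/* runs on close() or when the owning process dies */
	kfree(file->private_data);
	return 0;
}

static const struct file_operations demo_fops = {
	.owner   = THIS_MODULE,
	.open    = demo_open,
	.release = demo_release,
};

static struct miscdevice demo_misc = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "demo",
	.fops  = &demo_fops,
};
/* registered with misc_register(&demo_misc) at module init */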

AFAICT, you can't do that with the vbus host model.

> > What performance oriented items have been left unaddressed?
> 
> Well, the interrupt model to name one.

The performance aspects of your interrupt model are independent
of the vbus proxy, or at least they should be. Let's assume for
now that your event notification mechanism gives significant
performance improvements (which we can't measure independently
right now). I don't see a reason why we could not get the
same performance out of a paravirtual interrupt controller
that uses the same method, and it would be straightforward
to implement one and use that together with all the existing
emulated PCI devices and virtio devices including vhost_net.

	Arnd <><

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 17:47                     ` Avi Kivity
@ 2009-08-18 18:27                       ` Ira W. Snyder
  2009-08-18 18:52                         ` Avi Kivity
  2009-08-18 20:35                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-18 18:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 08:47:04PM +0300, Avi Kivity wrote:
> On 08/18/2009 08:27 PM, Ira W. Snyder wrote:
>>> In fact, modern x86s do have dma engines these days (google for Intel
>>> I/OAT), and one of our plans for vhost-net is to allow their use for
>>> packets above a certain size.  So a patch allowing vhost-net to
>>> optionally use a dma engine is a good thing.
>>>      
>> Yes, I'm aware that very modern x86 PCs have general purpose DMA
>> engines, even though I don't have any capable hardware. However, I think
>> it is better to support using any PC (with or without DMA engine, any
>> architecture) as the PCI master, and just handle the DMA all from the
>> PCI agent, which is known to have DMA?
>>    
>
> Certainly; but if your PCI agent will support the DMA API, then the same  
> vhost code will work with both I/OAT and your specialized hardware.
>

Yes, that's true. My ppc is a Freescale MPC8349EMDS. It has a Linux
DMAEngine driver in mainline, which I've used. That's excellent.

>>> Exposing a knob to userspace is not an insurmountable problem; vhost-net
>>> already allows changing the memory layout, for example.
>>>
>>>      
>> Let me explain the most obvious problem I ran into: setting the MAC
>> addresses used in virtio.
>>
>> On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
>> address.
>>
>> On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
>> address, aa:bb:cc:dd:ee:ff.
>>
>> The virtio feature negotiation code handles this, by seeing the
>> VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do
>> not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC
>> address. This is because the feature negotiation code only accepts a
>> feature if it is offered by both sides of the connection.
>>
>> In this case, I must have the guest generate a random MAC address and
>> have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
>> space. This basically means hardcoding the MAC addresses in the Linux
>> drivers, which is a big no-no.
>>
>> What would I expose to userspace to make this situation manageable?
>>
>>    
>
> I think in this case you want one side to be virtio-net (I'm guessing  
> the x86) and the other side vhost-net (the ppc boards with the dma  
> engine).  virtio-net on x86 would communicate with userspace on the ppc  
> board to negotiate features and get a mac address, the fast path would  
> be between virtio-net and vhost-net (which would use the dma engine to  
> push and pull data).
>

Ah, that seems backwards, but it should work after vhost-net learns how
to use the DMAEngine API.

I haven't studied vhost-net very carefully yet. As soon as I saw the
copy_(to|from)_user() I stopped reading, because it seemed useless for
my case. I'll look again and try to find where vhost-net supports
setting MAC addresses and other features.

Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
vhost-net capable of this?

I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
about devices they are working with that would benefit from something
like virtio-over-PCI. I'd like to see vhost-net be merged with the
capability to support my use case. There are plenty of others that would
benefit, not just myself.

I'm not sure vhost-net is being written with this kind of future use in
mind. I'd hate to see it get merged, and then have to change the ABI to
support physical-device-to-device usage. It would be better to keep
future use in mind now, rather than try and hack it in later.

Thanks for the comments.
Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 18:27                       ` Ira W. Snyder
@ 2009-08-18 18:52                         ` Avi Kivity
  2009-08-18 20:59                           ` Ira W. Snyder
  2009-08-18 20:35                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 18:52 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/18/2009 09:27 PM, Ira W. Snyder wrote:
>> I think in this case you want one side to be virtio-net (I'm guessing
>> the x86) and the other side vhost-net (the ppc boards with the dma
>> engine).  virtio-net on x86 would communicate with userspace on the ppc
>> board to negotiate features and get a mac address, the fast path would
>> be between virtio-net and vhost-net (which would use the dma engine to
>> push and pull data).
>>
>>      
>
> Ah, that seems backwards, but it should work after vhost-net learns how
> to use the DMAEngine API.
>
> I haven't studied vhost-net very carefully yet. As soon as I saw the
> copy_(to|from)_user() I stopped reading, because it seemed useless for
> my case. I'll look again and try to find where vhost-net supports
> setting MAC addresses and other features.
>    

It doesn't; all it does is pump the rings, leaving everything else to 
userspace.

> Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
> vhost-net capable of this?
>    

It's just another network interface.  You'd need an initramfs though to 
contain the needed userspace.

> I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
> about devices they are working with that would benefit from something
> like virtio-over-PCI. I'd like to see vhost-net be merged with the
> capability to support my use case. There are plenty of others that would
> benefit, not just myself.
>
> I'm not sure vhost-net is being written with this kind of future use in
> mind. I'd hate to see it get merged, and then have to change the ABI to
> support physical-device-to-device usage. It would be better to keep
> future use in mind now, rather than try and hack it in later.
>    

Please review and comment then.  I'm fairly confident there won't be any 
ABI issues since vhost-net does so little outside pumping the rings.

Note the signalling paths go through eventfd: when vhost-net wants the 
other side to look at its ring, it tickles an eventfd which is supposed 
to trigger an interrupt on the other side.  Conversely, when another 
eventfd is signalled, vhost-net will look at the ring and process any 
data there.  You'll need to wire your signalling to those eventfds, 
either in userspace or in the kernel.
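(The in-kernel half of that wiring reduces to a couple of eventfd calls.  A
sketch follows, with the names invented here; the exact eventfd_signal()
signature has varied a little between kernel versions.)

#include <linux/eventfd.h>
#include <linux/err.h>

struct demo_queue {
	struct eventfd_ctx *kick;	/* we signal this; vhost-net polls it */
	struct eventfd_ctx *call;	/* vhost-net signals this; forwarding it to the
					 * guest needs a poll/waitqueue hookup not shown */
};

static int demo_set_kick(struct demo_queue *q, int fd)
{
	struct eventfd_ctx *ctx = eventfd_ctx_fdget(fd);

	if (IS_ERR(ctx))
		return PTR_ERR(ctx);
	q->kick = ctx;
	return 0;
}

/* called from the transport's doorbell handler (PIO exit, PCI interrupt, ...) */
static void demo_doorbell(struct demo_queue *q)
{
	eventfd_signal(q->kick, 1);	/* tells vhost-net to look at the ring */
}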

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 18:20                   ` Arnd Bergmann
@ 2009-08-18 19:08                     ` Avi Kivity
  2009-08-19  5:36                     ` Gregory Haskins
  1 sibling, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 19:08 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Gregory Haskins, Ingo Molnar, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin

On 08/18/2009 09:20 PM, Arnd Bergmann wrote:
>> Well, the interrupt model to name one.
>>      
> The performance aspects of your interrupt model are independent
> of the vbus proxy, or at least they should be. Let's assume for
> now that your event notification mechanism gives significant
> performance improvements (which we can't measure independently
> right now). I don't see a reason why we could not get the
> same performance out of a paravirtual interrupt controller
> that uses the same method, and it would be straightforward
> to implement one and use that together with all the existing
> emulated PCI devices and virtio devices including vhost_net.
>    

Interesting.  You could even configure those vectors using the standard 
MSI configuration mechanism; simply replace the address/data pair with 
something meaningful to the paravirt interrupt controller.

I'd have to see really hard numbers to be tempted to merge something 
like this though.  We've merged paravirt mmu, for example, and now it 
underperforms both hardware two-level paging and software shadow paging.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 18:27                       ` Ira W. Snyder
  2009-08-18 18:52                         ` Avi Kivity
@ 2009-08-18 20:35                         ` Michael S. Tsirkin
  2009-08-18 21:04                           ` Arnd Bergmann
  1 sibling, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 20:35 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Avi Kivity, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 11:27:35AM -0700, Ira W. Snyder wrote:
> I haven't studied vhost-net very carefully yet. As soon as I saw the
> copy_(to|from)_user() I stopped reading, because it seemed useless for
> my case. I'll look again and try to find where vhost-net supports
> setting MAC addresses and other features.

vhost-net doesn't do this at all. You bind a raw socket to a network
device and program it with the usual userspace interfaces.
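(Concretely, "bind a raw socket to a network device" is the standard AF_PACKET
sequence below; handing the resulting fd to vhost-net goes through whatever
ioctl the final vhost ABI exposes, which is not shown here.)

#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

int open_backend(const char *ifname)
{
	struct sockaddr_ll sll;
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0)
		return -1;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family   = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex  = if_nametoindex(ifname);	/* e.g. "eth0" */

	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		close(fd);
		return -1;
	}

	/* the device itself is configured with the usual tools (ip, ethtool, ...) */
	return fd;
}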

> Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
> vhost-net capable of this?
> 
> I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
> about devices they are working with that would benefit from something
> like virtio-over-PCI. I'd like to see vhost-net be merged with the
> capability to support my use case. There are plenty of others that would
> benefit, not just myself.
> 
> I'm not sure vhost-net is being written with this kind of future use in
> mind. I'd hate to see it get merged, and then have to change the ABI to
> support physical-device-to-device usage. It would be better to keep
> future use in mind now, rather than try and hack it in later.

I still need to think your usage over. I am not so sure this fits what
vhost is trying to do. If not, possibly it's better to just have a
separate driver for your device.


> Thanks for the comments.
> Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 17:27                   ` Ira W. Snyder
  2009-08-18 17:47                     ` Avi Kivity
@ 2009-08-18 20:39                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 20:39 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Avi Kivity, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote:
> On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote:
> > On 08/18/2009 06:53 PM, Ira W. Snyder wrote:
> >> So, in my system, copy_(to|from)_user() is completely wrong. There is no
> >> userspace, only a physical system. In fact, because normal x86 computers
> >> do not have DMA controllers, the host system doesn't actually handle any
> >> data transfer!
> >>    
> >
> > In fact, modern x86s do have dma engines these days (google for Intel  
> > I/OAT), and one of our plans for vhost-net is to allow their use for  
> > packets above a certain size.  So a patch allowing vhost-net to  
> > optionally use a dma engine is a good thing.
> >
> 
> Yes, I'm aware that very modern x86 PCs have general purpose DMA
> engines, even though I don't have any capable hardware. However, I think
> it is better to support using any PC (with or without DMA engine, any
> architecture) as the PCI master, and just handle the DMA all from the
> PCI agent, which is known to have DMA?
> 
> >> I used virtio-net in both the guest and host systems in my example
> >> virtio-over-PCI patch, and succeeded in getting them to communicate.
> >> However, the lack of any setup interface means that the devices must be
> >> hardcoded into both drivers, when the decision could be up to userspace.
> >> I think this is a problem that vbus could solve.
> >>    
> >
> > Exposing a knob to userspace is not an insurmountable problem; vhost-net  
> > already allows changing the memory layout, for example.
> >
> 
> Let me explain the most obvious problem I ran into: setting the MAC
> addresses used in virtio.
> 
> On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
> address.
> 
> On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
> address, aa:bb:cc:dd:ee:ff.
> 
> The virtio feature negotiation code handles this, by seeing the
> VIRTIO_NET_F_MAC feature in it's configuration space. If BOTH drivers do
> not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC
> address. This is because the feature negotiation code only accepts a
> feature if it is offered by both sides of the connection.
> 
> In this case, I must have the guest generate a random MAC address and
> have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
> space. This basically means hardcoding the MAC addresses in the Linux
> drivers, which is a big no-no.
> 
> What would I expose to userspace to make this situation manageable?
> 
> Thanks for the response,
> Ira

This calls for some kind of change in guest virtio.  vhost, being a
host-kernel-only feature, does not deal with this problem.  But assuming
virtio in the guest supports this somehow, vhost will not interfere: you do
the setup in qemu userspace anyway, and vhost will happily use a network
device however you choose to set it up.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 15:53               ` [Alacrityvm-devel] " Ira W. Snyder
  2009-08-18 16:51                 ` Avi Kivity
@ 2009-08-18 20:57                 ` Michael S. Tsirkin
  2009-08-18 23:24                   ` Ira W. Snyder
  1 sibling, 1 reply; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-18 20:57 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Gregory Haskins, kvm, netdev, linux-kernel, alacrityvm-devel,
	Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote:
> I think Greg is referring to something like my virtio-over-PCI patch.
> I'm pretty sure that vhost is completely useless for my situation. I'd
> like to see vhost work for my use, so I'll try to explain what I'm
> doing.
> 
> I've got a system where I have about 20 computers connected via PCI. The
> PCI master is a normal x86 system, and the PCI agents are PowerPC
> systems. The PCI agents act just like any other PCI card, except they
> are running Linux, and have their own RAM and peripherals.
> 
> I wrote a custom driver which imitated a network interface and a serial
> port. I tried to push it towards mainline, and DavidM rejected it, with
> the argument, "use virtio, don't add another virtualization layer to the
> kernel." I think he has a decent argument, so I wrote virtio-over-PCI.
> 
> Now, there are some things about virtio that don't work over PCI.
> Mainly, memory is not truly shared. It is extremely slow to access
> memory that is "far away", meaning "across the PCI bus." This can be
> worked around by using a DMA controller to transfer all data, along with
> an intelligent scheme to perform only writes across the bus. If you're
> careful, reads are never needed.
> 
> So, in my system, copy_(to|from)_user() is completely wrong.
> There is no userspace, only a physical system.

Can guests do DMA to random host memory? Or is there some kind of IOMMU
and DMA API involved? If the latter, then note that you'll still need
some kind of driver for your device. The question we need to ask
ourselves then is whether this driver can reuse bits from vhost.

> In fact, because normal x86 computers
> do not have DMA controllers, the host system doesn't actually handle any
> data transfer!

Is it true that the PPC has to initiate all DMA then? How do you
manage to avoid DMA reads?

> I used virtio-net in both the guest and host systems in my example
> virtio-over-PCI patch, and succeeded in getting them to communicate.
> However, the lack of any setup interface means that the devices must be
> hardcoded into both drivers, when the decision could be up to userspace.
> I think this is a problem that vbus could solve.

What you describe (passing setup from host to guest) seems like
a feature that guest devices need to support. It seems unlikely that
vbus, being a transport layer, can address this.

> 
> For my own selfish reasons (I don't want to maintain an out-of-tree
> driver) I'd like to see *something* useful in mainline Linux. I'm happy
> to answer questions about my setup, just ask.
> 
> Ira

Thanks Ira, I'll think about it.
A couple of questions:
- Could you please describe what kind of communication needs to happen?
- I'm not familiar with DMA engine in question. I'm guessing it's the
  usual thing: in/out buffers need to be kernel memory, interface is
  asynchronous, small limited number of outstanding requests? Is there a
  userspace interface for it and if yes how does it work?



-- 
MST


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 18:52                         ` Avi Kivity
@ 2009-08-18 20:59                           ` Ira W. Snyder
  2009-08-18 21:26                             ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-18 20:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 09:52:48PM +0300, Avi Kivity wrote:
> On 08/18/2009 09:27 PM, Ira W. Snyder wrote:
>>> I think in this case you want one side to be virtio-net (I'm guessing
>>> the x86) and the other side vhost-net (the ppc boards with the dma
>>> engine).  virtio-net on x86 would communicate with userspace on the ppc
>>> board to negotiate features and get a mac address, the fast path would
>>> be between virtio-net and vhost-net (which would use the dma engine to
>>> push and pull data).
>>>
>>>      
>>
>> Ah, that seems backwards, but it should work after vhost-net learns how
>> to use the DMAEngine API.
>>
>> I haven't studied vhost-net very carefully yet. As soon as I saw the
>> copy_(to|from)_user() I stopped reading, because it seemed useless for
>> my case. I'll look again and try to find where vhost-net supports
>> setting MAC addresses and other features.
>>    
>
> It doesn't; all it does is pump the rings, leaving everything else to  
> userspace.
>

Ok.

On a non shared-memory system (where the guest's RAM is not just a chunk
of userspace RAM in the host system), virtio's management model seems to
fall apart. Feature negotiation doesn't work as one would expect.

This does appear to be solved by vbus, though I haven't written a
vbus-over-PCI implementation, so I cannot be completely sure.

I'm not at all clear on how to get feature negotiation to work on a
system like mine. From my study of lguest and kvm (see below) it looks
like userspace will need to be involved, via a miscdevice.

>> Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
>> vhost-net capable of this?
>>    
>
> It's just another network interface.  You'd need an initramfs though to  
> contain the needed userspace.
>

Ok. I'm using an initramfs already, so adding some more userspace to it
isn't a problem.

>> I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
>> about devices they are working with that would benefit from something
>> like virtio-over-PCI. I'd like to see vhost-net be merged with the
>> capability to support my use case. There are plenty of others that would
>> benefit, not just myself.
>>
>> I'm not sure vhost-net is being written with this kind of future use in
>> mind. I'd hate to see it get merged, and then have to change the ABI to
>> support physical-device-to-device usage. It would be better to keep
>> future use in mind now, rather than try and hack it in later.
>>    
>
> Please review and comment then.  I'm fairly confident there won't be any  
> ABI issues since vhost-net does so little outside pumping the rings.
>

Ok. I thought I should at least express my concerns while we're
discussing this, rather than being too late after finding the time to
study the driver.

Off the top of my head, I would think that transporting userspace
addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
(for DMAEngine) might be a problem. Pinning userspace pages into memory
for DMA is a bit of a pain, though it is possible.

There is also the problem of different endianness between host and guest
in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h)
defines its fields in host byte order, which totally breaks if the guest has
a different endianness. This is a virtio-net problem though, and is not
transport specific.

> Note the signalling paths go through eventfd: when vhost-net wants the  
> other side to look at its ring, it tickles an eventfd which is supposed  
> to trigger an interrupt on the other side.  Conversely, when another  
> eventfd is signalled, vhost-net will look at the ring and process any  
> data there.  You'll need to wire your signalling to those eventfds,  
> either in userspace or in the kernel.
>

Ok. I've never used eventfd before, so that'll take yet more studying.

I've browsed over both the kvm and lguest code, and it looks like they
each re-invent a mechanism for transporting interrupts between the host
and guest, using eventfd. They both do this by implementing a
miscdevice, which is basically their management interface.

See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and
kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via
kvm_dev_ioctl()) for how they hook up eventfd's.

I can now imagine how two userspace programs (host and guest) could work
together to implement a management interface, including hotplug of
devices, etc. Of course, this would basically reinvent the vbus
management interface into a specific driver.

I think this is partly what Greg is trying to abstract out into generic
code. I haven't studied the actual data transport mechanisms in vbus,
though I have studied virtio's transport mechanism. I think a generic
management interface for virtio might be a good thing to consider,
because it seems there are at least two implementations already: kvm and
lguest.

Thanks for answering my questions. It helps to talk with someone more
familiar with the issues than I am.

Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 20:35                         ` Michael S. Tsirkin
@ 2009-08-18 21:04                           ` Arnd Bergmann
  0 siblings, 0 replies; 132+ messages in thread
From: Arnd Bergmann @ 2009-08-18 21:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Ira W. Snyder, Avi Kivity, Gregory Haskins, kvm, netdev,
	linux-kernel, alacrityvm-devel, Anthony Liguori, Ingo Molnar,
	Gregory Haskins

On Tuesday 18 August 2009 20:35:22 Michael S. Tsirkin wrote:
> On Tue, Aug 18, 2009 at 10:27:52AM -0700, Ira W. Snyder wrote:
> > Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
> > vhost-net capable of this?
> > 
> > I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
> > about devices they are working with that would benefit from something
> > like virtio-over-PCI. I'd like to see vhost-net be merged with the
> > capability to support my use case. There are plenty of others that would
> > benefit, not just myself.

yes.

> > I'm not sure vhost-net is being written with this kind of future use in
> > mind. I'd hate to see it get merged, and then have to change the ABI to
> > support physical-device-to-device usage. It would be better to keep
> > future use in mind now, rather than try and hack it in later.
> 
> I still need to think your usage over. I am not so sure this fits what
> vhost is trying to do. If not, possibly it's better to just have a
> separate driver for your device.

I now think we need both. virtio-over-PCI does it the right way for its
purpose and can be rather generic. It could certainly be extended to
support virtio-net on both sides (host and guest) of KVM, but I think
it better fits the use where a kernel wants to communicate with some
other machine where you normally wouldn't think of using qemu.

Vhost-net OTOH is great in the way that it serves as an easy way to
move the virtio-net code from qemu into the kernel, without changing
its behaviour. It should even be straightforward to do live migration between
hosts with and without it, something that would be much harder with
the virtio-over-PCI logic. Also, its internal state is local to the
process owning its file descriptor, which makes it much easier to
manage permissions and cleanup of its resources.

	Arnd <><

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 20:59                           ` Ira W. Snyder
@ 2009-08-18 21:26                             ` Avi Kivity
  2009-08-18 22:06                               ` Avi Kivity
  2009-08-19  0:38                               ` Ira W. Snyder
  0 siblings, 2 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 21:26 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
> On a non shared-memory system (where the guest's RAM is not just a chunk
> of userspace RAM in the host system), virtio's management model seems to
> fall apart. Feature negotiation doesn't work as one would expect.
>    

In your case, virtio-net on the main board accesses PCI config space 
registers to perform the feature negotiation; software on your PCI cards 
needs to trap these config space accesses and respond to them according 
to virtio ABI.

(There's no real guest on your setup, right?  just a kernel running on 
an x86 system and other kernels running on the PCI cards?)

> This does appear to be solved by vbus, though I haven't written a
> vbus-over-PCI implementation, so I cannot be completely sure.
>    

Even if virtio-pci doesn't work out for some reason (though it should), 
you can write your own virtio transport and implement its config space 
however you like.

> I'm not at all clear on how to get feature negotiation to work on a
> system like mine. From my study of lguest and kvm (see below) it looks
> like userspace will need to be involved, via a miscdevice.
>    

I don't see why.  Is the kernel on the PCI cards in full control of all 
accesses?

> Ok. I thought I should at least express my concerns while we're
> discussing this, rather than being too late after finding the time to
> study the driver.
>
> Off the top of my head, I would think that transporting userspace
> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
> (for DMAEngine) might be a problem. Pinning userspace pages into memory
> for DMA is a bit of a pain, though it is possible.
>    

Oh, the ring doesn't transport userspace addresses.  It transports guest 
addresses, and it's up to vhost to do something with them.

Currently vhost supports two translation modes:

1. virtio address == host virtual address (using copy_to_user)
2. virtio address == offsetted host virtual address (using copy_to_user)

The latter mode is used for kvm guests (with multiple offsets, skipping 
some details).

I think you need to add a third mode, virtio address == host physical 
address (using dma engine).  Once you do that, and wire up the 
signalling, things should work.
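
Roughly like this (not actual vhost code, just a sketch of the three
modes with invented names):

#include <linux/uaccess.h>
#include <linux/errno.h>
#include <linux/types.h>

enum ring_xlate_mode {
        XLATE_HVA,              /* 1: virtio addr == host virtual addr  */
        XLATE_HVA_OFFSET,       /* 2: virtio addr == offsetted host VA  */
        XLATE_PHYS_DMA,         /* 3: virtio addr == host physical addr */
};

static long pull_from_ring(enum ring_xlate_mode mode, u64 addr,
                           void *dst, size_t len, unsigned long offset)
{
        switch (mode) {
        case XLATE_HVA:
                return copy_from_user(dst,
                        (void __user *)(unsigned long)addr, len);
        case XLATE_HVA_OFFSET:
                return copy_from_user(dst,
                        (void __user *)(unsigned long)(addr + offset), len);
        case XLATE_PHYS_DMA:
                /* mode 3 would queue (addr, len) on the dma engine
                 * instead of touching the data with the CPU at all */
                return -ENOSYS;
        }
        return -EINVAL;
}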

> There is also the problem of different endianness between host and guest
> in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h)
> defines fields in host byte order. Which totally breaks if the guest has
> a different endianness. This is a virtio-net problem though, and is not
> transport specific.
>    

Yeah.  You'll need to add byteswaps.

> I've browsed over both the kvm and lguest code, and it looks like they
> each re-invent a mechanism for transporting interrupts between the host
> and guest, using eventfd. They both do this by implementing a
> miscdevice, which is basically their management interface.
>
> See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and
> kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via
> kvm_dev_ioctl()) for how they hook up eventfd's.
>
> I can now imagine how two userspace programs (host and guest) could work
> together to implement a management interface, including hotplug of
> devices, etc. Of course, this would basically reinvent the vbus
> management interface into a specific driver.
>    

You don't need anything in the guest userspace (virtio-net) side.

> I think this is partly what Greg is trying to abstract out into generic
> code. I haven't studied the actual data transport mechanisms in vbus,
> though I have studied virtio's transport mechanism. I think a generic
> management interface for virtio might be a good thing to consider,
> because it seems there are at least two implementations already: kvm and
> lguest.
>    

Management code in the kernel doesn't really help unless you plan to 
manage things with echo and cat.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 21:26                             ` Avi Kivity
@ 2009-08-18 22:06                               ` Avi Kivity
  2009-08-19  0:44                                 ` Ira W. Snyder
  2009-08-19  0:38                               ` Ira W. Snyder
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-18 22:06 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/19/2009 12:26 AM, Avi Kivity wrote:
>>
>> Off the top of my head, I would think that transporting userspace
>> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
>> (for DMAEngine) might be a problem. Pinning userspace pages into memory
>> for DMA is a bit of a pain, though it is possible.
>
>
> Oh, the ring doesn't transport userspace addresses.  It transports 
> guest addresses, and it's up to vhost to do something with them.
>
> Currently vhost supports two translation modes:
>
> 1. virtio address == host virtual address (using copy_to_user)
> 2. virtio address == offsetted host virtual address (using copy_to_user)
>
> The latter mode is used for kvm guests (with multiple offsets, 
> skipping some details).
>
> I think you need to add a third mode, virtio address == host physical 
> address (using dma engine).  Once you do that, and wire up the 
> signalling, things should work.


In fact, you don't need a third mode.  You can mmap the x86 address space
into your ppc userspace and use the second mode.  All you need then is 
the dma engine glue and byte swapping.
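
Something along these lines on the ppc side; WINDOW_BASE/WINDOW_SIZE are
placeholders for wherever the outbound window onto x86 RAM sits in the
ppc physical map, and /dev/mem is just one way to get the mapping:

#define _FILE_OFFSET_BITS 64    /* window offsets can exceed 2GB */
#include <fcntl.h>
#include <sys/mman.h>

#define WINDOW_BASE 0x80000000UL        /* placeholder */
#define WINDOW_SIZE 0x40000000UL        /* placeholder: 1GB of host RAM */

static void *map_host_ram(void)
{
        int fd = open("/dev/mem", O_RDWR | O_SYNC);

        if (fd < 0)
                return MAP_FAILED;
        return mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, WINDOW_BASE);
}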

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 20:57                 ` Michael S. Tsirkin
@ 2009-08-18 23:24                   ` Ira W. Snyder
  0 siblings, 0 replies; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-18 23:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, kvm, netdev, linux-kernel, alacrityvm-devel,
	Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Tue, Aug 18, 2009 at 11:57:48PM +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote:
> > I think Greg is referring to something like my virtio-over-PCI patch.
> > I'm pretty sure that vhost is completely useless for my situation. I'd
> > like to see vhost work for my use, so I'll try to explain what I'm
> > doing.
> > 
> > I've got a system where I have about 20 computers connected via PCI. The
> > PCI master is a normal x86 system, and the PCI agents are PowerPC
> > systems. The PCI agents act just like any other PCI card, except they
> > are running Linux, and have their own RAM and peripherals.
> > 
> > I wrote a custom driver which imitated a network interface and a serial
> > port. I tried to push it towards mainline, and DavidM rejected it, with
> > the argument, "use virtio, don't add another virtualization layer to the
> > kernel." I think he has a decent argument, so I wrote virtio-over-PCI.
> > 
> > Now, there are some things about virtio that don't work over PCI.
> > Mainly, memory is not truly shared. It is extremely slow to access
> > memory that is "far away", meaning "across the PCI bus." This can be
> > worked around by using a DMA controller to transfer all data, along with
> > an intelligent scheme to perform only writes across the bus. If you're
> > careful, reads are never needed.
> > 
> > So, in my system, copy_(to|from)_user() is completely wrong.
> > There is no userspace, only a physical system.
> 
> Can guests do DMA to random host memory? Or is there some kind of IOMMU
> and DMA API involved? If the later, then note that you'll still need
> some kind of driver for your device. The question we need to ask
> ourselves then is whether this driver can reuse bits from vhost.
> 

Mostly. All of my systems are 32-bit (both x86 and ppc). From the point
of view of the ppc (and DMAEngine), only the first 1GB of host memory is
visible.

This limited view is due to address space limitations on the ppc. The
view of PCI memory must live somewhere in the ppc address space, along
with the ppc's SDRAM, flash, and other peripherals. Since this is a
32-bit processor, I only have 4GB of address space to work with.

The PCI address space could be up to 4GB in size. If I tried to allow
the ppc boards to view all 4GB of PCI address space, then they would
have no address space left for their onboard SDRAM, etc.

Hopefully that makes sense.

I use dma_set_mask(dev, DMA_BIT_MASK(30)) on the host system to ensure
that when dma_map_sg() is called, it returns addresses that can be
accessed directly by the device.
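
In other words, something like this on the x86 side before mapping any
buffers ("pdev" standing in for the ppc board's struct pci_dev):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* keep streaming DMA mappings inside the low 1GB the ppc can reach */
static int limit_dma_to_low_1gb(struct pci_dev *pdev)
{
        return dma_set_mask(&pdev->dev, DMA_BIT_MASK(30));
}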

The DMAEngine can access any local (ppc) memory without any restriction.

I have used the Linux DMAEngine API (include/linux/dmaengine.h) to
handle all data transfer across the PCI bus. The Intel I/OAT (and many
others) use the same API.

> > In fact, because normal x86 computers
> > do not have DMA controllers, the host system doesn't actually handle any
> > data transfer!
> 
> Is it true that PPC has to initiate all DMA then? How do you
> manage not to do DMA reads then?
> 

Yes, the ppc initiates all DMA. It handles all data transfer (both reads
and writes) across the PCI bus, for speed reasons. A CPU cannot create
burst transactions on the PCI bus. This is the reason that most (all?)
network cards (as a familiar example) use DMA to transfer packet
contents into RAM.

Sorry if I made a confusing statement ("no reads are necessary")
earlier. What I meant to say was: If you are very careful, it is not
necessary for the CPU to do any reads over the PCI bus to maintain
state. Writes are the only necessary CPU-initiated transaction.

I implemented this in my virtio-over-PCI patch, copying as much as
possible from the virtio vring structure. The descriptors in the rings
are only changed by one "side" of the connection, therefore they can be
cached as they are written (via the CPU) across the PCI bus, with the
knowledge that both sides will have a consistent view.
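
In code form the idea is roughly this (layout invented for illustration,
not my actual driver):

#include <linux/io.h>
#include <linux/types.h>

struct ring_shadow {
        u16 avail_idx;                  /* local, authoritative copy */
        void __iomem *remote_avail_idx; /* peer's view, write-only   */
};

static void publish_avail(struct ring_shadow *r, u16 new_idx)
{
        r->avail_idx = new_idx;                  /* CPU-cached state */
        iowrite16(new_idx, r->remote_avail_idx); /* posted PCI write */
}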

I'm sorry, this is hard to explain via email. It is much easier in a
room with a whiteboard. :)

> > I used virtio-net in both the guest and host systems in my example
> > virtio-over-PCI patch, and succeeded in getting them to communicate.
> > However, the lack of any setup interface means that the devices must be
> > hardcoded into both drivers, when the decision could be up to userspace.
> > I think this is a problem that vbus could solve.
> 
> What you describe (passing setup from host to guest) seems like
> a feature that guest devices need to support. It seems unlikely that
> vbus, being a transport layer, can address this.
> 

I think I explained this poorly as well.

Virtio needs two things to function:
1) a set of descriptor rings (1 or more)
2) a way to kick each ring.
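
The "kick" part is just the virtqueue notify hook; as a sketch, with an
invented doorbell register layout in one of the ppc's PCI BARs:

#include <linux/io.h>
#include <linux/virtio.h>

static void __iomem *doorbell_base;     /* mapped from a PCI BAR */

/* notify callback for a virtqueue whose index was stashed in vq->priv
 * by the transport when it created the queue */
static void pci_agent_kick(struct virtqueue *vq)
{
        unsigned long qidx = (unsigned long)vq->priv;

        /* a single posted 32-bit write across the PCI bus */
        iowrite32(1, doorbell_base + qidx * 4);
}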

With the amount of space available in the ppc's PCI BARs (which point
at a small chunk of SDRAM), I could potentially make ~6 virtqueues + 6
kick interrupts available.

Right now, my virtio-over-PCI driver hardcodes the first and second
virtqueues to be for virtio-net only, and nothing else.

What if the user wanted 2 virtio-console and 2 virtio-net? They'd have
to change the driver, because virtio doesn't have much of a management
interface. Vbus does have a management interface: you create devices via
configfs. The vbus-connector on the guest notices new devices, and
triggers hotplug events on the guest.

As far as I understand it, vbus is a bus model, not just a transport
layer.

> > 
> > For my own selfish reasons (I don't want to maintain an out-of-tree
> > driver) I'd like to see *something* useful in mainline Linux. I'm happy
> > to answer questions about my setup, just ask.
> > 
> > Ira
> 
> Thanks Ira, I'll think about it.
> A couple of questions:
> - Could you please describe what kind of communication needs to happen?
> - I'm not familiar with DMA engine in question. I'm guessing it's the
>   usual thing: in/out buffers need to be kernel memory, interface is
>   asynchronous, small limited number of outstanding requests? Is there a
>   userspace interface for it and if yes how does it work?
> 

The DMA engine can handle transfers between any two physical addresses,
as seen from the ppc address map. The things of interest are:
1) ppc sdram
2) host sdram (first 1GB only, explained above)

The Linux DMAEngine API allows you to do sync or async requests with
callbacks, and an unlimited number of outstanding requests (until you
exhaust memory).

The interface is in-kernel only. See include/linux/dmaengine.h for the
details, but the most important part is dma_async_memcpy_buf_to_buf(),
which will copy between two kernel virtual addresses.

It is trivial to code up an implementation which will transfer between
physical addresses instead, which I found much more convenient in my
code. I'm happy to provide the function if/when needed.
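
Roughly like this, as an untested sketch against the DMAEngine API
(completion handling omitted; the caller would poll the cookie or use
the descriptor callback):

#include <linux/dmaengine.h>

static dma_cookie_t dma_memcpy_phys(struct dma_chan *chan,
                                    dma_addr_t dst, dma_addr_t src,
                                    size_t len)
{
        struct dma_async_tx_descriptor *tx;
        dma_cookie_t cookie;

        /* prep a copy between two bus/physical addresses */
        tx = chan->device->device_prep_dma_memcpy(chan, dst, src, len,
                                                  DMA_CTRL_ACK);
        if (!tx)
                return -ENOMEM;

        cookie = tx->tx_submit(tx);
        dma_async_issue_pending(chan);
        return cookie;
}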

Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 21:26                             ` Avi Kivity
  2009-08-18 22:06                               ` Avi Kivity
@ 2009-08-19  0:38                               ` Ira W. Snyder
  2009-08-19  5:40                                 ` Avi Kivity
  1 sibling, 1 reply; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-19  0:38 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote:
> On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
>> On a non shared-memory system (where the guest's RAM is not just a chunk
>> of userspace RAM in the host system), virtio's management model seems to
>> fall apart. Feature negotiation doesn't work as one would expect.
>>    
>
> In your case, virtio-net on the main board accesses PCI config space  
> registers to perform the feature negotiation; software on your PCI cards  
> needs to trap these config space accesses and respond to them according  
> to virtio ABI.
>

Is this "real PCI" (physical hardware) or "fake PCI" (software PCI
emulation) that you are describing?

The host (x86, PCI master) must use "real PCI" to actually configure the
boards, enable bus mastering, etc. Just like any other PCI device, such
as a network card.

On the guests (ppc, PCI agents) I cannot add/change PCI functions (the
last .[0-9] in the PCI address), nor can I change PCI BARs once the
board has started. I'm pretty sure that would violate the PCI spec,
since the PCI master would need to re-scan the bus, and re-assign
addresses, which is a task for the BIOS.

> (There's no real guest on your setup, right?  just a kernel running on  
> an x86 system and other kernels running on the PCI cards?)
>

Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's
(PCI agents) also run Linux (booted via U-Boot). They are independent
Linux systems, with a physical PCI interconnect.

The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's
PCI stack does bad things as a PCI agent. It always assumes it is a PCI
master.

It is possible for me to enable CONFIG_PCI=y on the ppc's by removing
the PCI bus from their list of devices provided by OpenFirmware. They
can not access PCI via normal methods. PCI drivers cannot work on the
ppc's, because Linux assumes it is a PCI master.

To the best of my knowledge, I cannot trap configuration space accesses
on the PCI agents. I haven't needed that for anything I've done thus
far.

>> This does appear to be solved by vbus, though I haven't written a
>> vbus-over-PCI implementation, so I cannot be completely sure.
>>    
>
> Even if virtio-pci doesn't work out for some reason (though it should),  
> you can write your own virtio transport and implement its config space  
> however you like.
>

This is what I did with virtio-over-PCI. The way virtio-net negotiates
features makes this work non-intuitively.

>> I'm not at all clear on how to get feature negotiation to work on a
>> system like mine. From my study of lguest and kvm (see below) it looks
>> like userspace will need to be involved, via a miscdevice.
>>    
>
> I don't see why.  Is the kernel on the PCI cards in full control of all  
> accesses?
>

I'm not sure what you mean by this. Could you be more specific? This is
a normal, unmodified vanilla Linux kernel running on the PCI agents.

>> Ok. I thought I should at least express my concerns while we're
>> discussing this, rather than being too late after finding the time to
>> study the driver.
>>
>> Off the top of my head, I would think that transporting userspace
>> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
>> (for DMAEngine) might be a problem. Pinning userspace pages into memory
>> for DMA is a bit of a pain, though it is possible.
>>    
>
> Oh, the ring doesn't transport userspace addresses.  It transports guest  
> addresses, and it's up to vhost to do something with them.
>
> Currently vhost supports two translation modes:
>
> 1. virtio address == host virtual address (using copy_to_user)
> 2. virtio address == offsetted host virtual address (using copy_to_user)
>
> The latter mode is used for kvm guests (with multiple offsets, skipping  
> some details).
>
> I think you need to add a third mode, virtio address == host physical  
> address (using dma engine).  Once you do that, and wire up the  
> signalling, things should work.
>

Ok.

In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote
an algorithm to pair the tx/rx queues together. Since virtio-net
pre-fills its rx queues with buffers, I was able to use the DMA engine
to copy from the tx queue into the pre-allocated memory in the rx queue.

I have an intuitive idea about how I think vhost-net works in this case.

>> There is also the problem of different endianness between host and guest
>> in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h)
>> defines fields in host byte order. Which totally breaks if the guest has
>> a different endianness. This is a virtio-net problem though, and is not
>> transport specific.
>>    
>
> Yeah.  You'll need to add byteswaps.
>

I wonder if Rusty would accept a new feature:
VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to
use LE for all of its multi-byte fields.

I don't think the transport should have to care about the endianness.
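
Something like this is what I have in mind for the guest driver; note
that both the feature bit and this helper are hypothetical, nothing like
it exists today:

#include <linux/virtio_net.h>
#include <linux/types.h>
#include <asm/byteorder.h>

/* big-endian guest: fix up a header the (little-endian) peer wrote,
 * gated on the hypothetical VIRTIO_F_NET_LITTLE_ENDIAN feature */
static void vnet_hdr_from_le(struct virtio_net_hdr *hdr)
{
        hdr->hdr_len     = le16_to_cpu((__force __le16)hdr->hdr_len);
        hdr->gso_size    = le16_to_cpu((__force __le16)hdr->gso_size);
        hdr->csum_start  = le16_to_cpu((__force __le16)hdr->csum_start);
        hdr->csum_offset = le16_to_cpu((__force __le16)hdr->csum_offset);
}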

>> I've browsed over both the kvm and lguest code, and it looks like they
>> each re-invent a mechanism for transporting interrupts between the host
>> and guest, using eventfd. They both do this by implementing a
>> miscdevice, which is basically their management interface.
>>
>> See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and
>> kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via
>> kvm_dev_ioctl()) for how they hook up eventfd's.
>>
>> I can now imagine how two userspace programs (host and guest) could work
>> together to implement a management interface, including hotplug of
>> devices, etc. Of course, this would basically reinvent the vbus
>> management interface into a specific driver.
>>    
>
> You don't need anything in the guest userspace (virtio-net) side.
>
>> I think this is partly what Greg is trying to abstract out into generic
>> code. I haven't studied the actual data transport mechanisms in vbus,
>> though I have studied virtio's transport mechanism. I think a generic
>> management interface for virtio might be a good thing to consider,
>> because it seems there are at least two implementations already: kvm and
>> lguest.
>>    
>
> Management code in the kernel doesn't really help unless you plan to  
> manage things with echo and cat.
>

True. It's slowpath setup, so I don't care how fast it is. For reasons
outside my control, the x86 (PCI master) is running a RHEL5 system. This
means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try
and push for an upgrade. This obviously makes cat/echo really nice, since it
doesn't depend on glibc, only on the kernel version.

I don't give much weight to the above, because I can use the eventfd
syscalls directly, without glibc support. It is just more painful.
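
For example, just wrapping the syscall directly works as long as the
kernel itself provides eventfd (glibc is not involved at all):

#include <unistd.h>
#include <sys/syscall.h>

/* no eventfd() wrapper in glibc-2.5, so call the syscall directly
 * (assuming the kernel headers define __NR_eventfd; otherwise the
 * number can be supplied by hand) */
static int raw_eventfd(unsigned int initval)
{
        return syscall(__NR_eventfd, initval);
}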

Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 22:06                               ` Avi Kivity
@ 2009-08-19  0:44                                 ` Ira W. Snyder
  2009-08-19  5:26                                   ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-19  0:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Wed, Aug 19, 2009 at 01:06:45AM +0300, Avi Kivity wrote:
> On 08/19/2009 12:26 AM, Avi Kivity wrote:
>>>
>>> Off the top of my head, I would think that transporting userspace
>>> addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
>>> (for DMAEngine) might be a problem. Pinning userspace pages into memory
>>> for DMA is a bit of a pain, though it is possible.
>>
>>
>> Oh, the ring doesn't transport userspace addresses.  It transports  
>> guest addresses, and it's up to vhost to do something with them.
>>
>> Currently vhost supports two translation modes:
>>
>> 1. virtio address == host virtual address (using copy_to_user)
>> 2. virtio address == offsetted host virtual address (using copy_to_user)
>>
>> The latter mode is used for kvm guests (with multiple offsets,  
>> skipping some details).
>>
>> I think you need to add a third mode, virtio address == host physical  
>> address (using dma engine).  Once you do that, and wire up the  
>> signalling, things should work.
>
>
> You don't need in fact a third mode.  You can mmap the x86 address space  
> into your ppc userspace and use the second mode.  All you need then is  
> the dma engine glue and byte swapping.
>

Hmm, I'll have to think about that.

The ppc is a 32-bit processor, so it has 4GB of address space for
everything, including PCI, SDRAM, flash memory, and all other
peripherals.

This is exactly like 32-bit x86, where you cannot have a PCI card that
exposes a 4GB PCI BAR. The system would have no address space left for
its own SDRAM.

On my x86 computers, I only have 1GB of physical RAM, and so the ppc's
have plenty of room in their address spaces to map the entire x86 RAM
into their own address space. That is exactly what I do now. Accesses to
ppc physical address 0x80000000 "magically" hit x86 physical address
0x0.
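
So translating a host physical address into something the ppc (or its
DMA engine) can use is just an offset; a trivial sketch with the numbers
from my setup:

#include <linux/kernel.h>
#include <linux/types.h>

#define HOST_RAM_WINDOW_BASE    0x80000000UL
#define HOST_RAM_WINDOW_SIZE    0x40000000UL    /* 1GB of host RAM */

static inline dma_addr_t host_phys_to_ppc_phys(u32 host_phys)
{
        BUG_ON(host_phys >= HOST_RAM_WINDOW_SIZE);
        return HOST_RAM_WINDOW_BASE + host_phys;
}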

Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 16:14                 ` Ingo Molnar
@ 2009-08-19  4:27                   ` Gregory Haskins
  2009-08-19  5:22                     ` Avi Kivity
  2009-08-21 10:55                     ` vbus design points: shm and shm-signals Gregory Haskins
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19  4:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Avi Kivity, Anthony Liguori, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 9045 bytes --]

Ingo Molnar wrote:
> * Gregory Haskins <gregory.haskins@gmail.com> wrote:
> 
>>> You haven't convinced me that your ideas are worth the effort 
>>> of abandoning virtio/pci or maintaining both venet/vbus and 
>>> virtio/pci.
>> With all due respect, I didnt ask you do to anything, especially 
>> not abandon something you are happy with.
>>
>> All I did was push guest drivers to LKML.  The code in question 
>> is independent of KVM, and its proven to improve the experience 
>> of using Linux as a platform.  There are people interested in 
>> using them (by virtue of the number of people that have signed up 
>> for the AlacrityVM list, and have mailed me privately about this 
>> work).
> 
> This thread started because i asked you about your technical 
> arguments why we'd want vbus instead of virtio.

(You mean vbus vs pci, right?  virtio works fine, is untouched, and is
out-of-scope here)

Right, and I do believe I answered your questions.  Do you feel as
though this was not a satisfactory response?


> Your answer above 
> now basically boils down to: "because I want it so, why don't you 
> leave me alone".

Well, with all due respect, please do not put words in my mouth.  This
is not what I am saying at all.

What I *am* saying is:

fact: this thread is about linux guest drivers to support vbus

fact: these drivers do not touch kvm code.

fact: these drivers do not force kvm to alter its operation in any way.

fact: these drivers do not alter ABIs that KVM currently supports.

Therefore, all this talk about "abandoning", "supporting", and
"changing" things in KVM is, premature, irrelevant, and/or, FUD.  No one
proposed such changes, so I am highlighting this fact to bring the
thread back on topic.  That KVM talk is merely a distraction at this
point in time.

> 
> What you are doing here is to in essence to fork KVM,

No, that is incorrect.  What I am doing here is a downstream development
point for the integration of KVM and vbus.  It's akin to kvm.git or
tip.git to develop a subsystem intended for eventual inclusion upstream.
 If and when the code goes upstream in a manner acceptable to all
parties involved, and AlacrityVM exceeds its utility as a separate
project, I will _gladly_ dissolve it and migrate to use upstream KVM
instead.

As stated on the project wiki: "It is a goal of AlacrityVM to work
towards upstream acceptance of the project on a timeline that suits the
community. In the meantime, this wiki will serve as the central
coordination point for development and discussion of the technology"

(citation: http://developer.novell.com/wiki/index.php/AlacrityVM)

And I meant it when I said it.

Until then, the project is a much more efficient way for us (the vbus
developers) to work together than pointing people at my patch series
posted to kvm@vger.  I tried that way first.  It sucked, and didn't
work.  Users were having trouble patching the various pieces, building,
etc.  Now I can offer a complete solution from a central point, with all
the proper pieces in place to play around with it.

Ultimately, it is up to upstream to decide if this is to become merged
or remain out of tree forever as a "fork".  Not me.  I will continue to
make every effort to find common ground with my goals coincident with
the blessing of upstream, as I have been from the beginning.  Now I have
a more official forum to do it in.

> regardless of 
> the technical counter arguments given against such a fork and 
> regardless of the ample opportunity given to you to demostrate the 
> technical advantages of your code. (in which case KVM would happily 
> migrate to your code)

In an ideal world, perhaps.  Avi and I currently have a fundamental
disagreement about the best way to do PV.  He sees the world through PCI
glasses, and I don't.  Despite attempts on both sides to rectify this
disagreement, we currently do not see eye to eye on every front.

This doesn't mean he is right, and I am wrong per se.  It just means we
disagree.  Period.  Avi is a sharp guy, and I respect him.  But upstream
KVM doesn't have a corner on "correct" ;)  The community as a whole will
ultimately decide if my ideas live or die, wouldn't you agree?

Avi can correct me if I am wrong, but what we _do_ agree on is that core
KVM doesn't need to be directly involved in this vbus (or vhost)
discussion, per se.  It just wants to have the hooks to support various
PV solutions (such as irqfd/ioeventfd), and vbus is one such solution.

> 
> We all love faster code and better management interfaces and tons 
> of your prior patches got accepted by Avi. This time you didnt even 
> _try_ to improve virtio.

Im sorry, but you are mistaken:

http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html

> It's not like you posted a lot of virtio 
> patches which were not applied. You didn't even try, and you need to 
> try _much_ harder than that before forking a project.

I really do not think you are in a position to say when someone can or
cannot fork a project, so please do not try to lecture on that.  Perhaps
you could offer advice on when someone, in your opinion, *should* or
*should not* fork, because that would be interesting to hear.

You are also wrong to say that I didn't try to avoid creating a
downstream effort first.   I believe the public record of the mailing
lists will back me up that I tried politely pushing this directly through
kvm first.  It was only after Avi recently informed me that they would
be building their own version of an in-kernel backend in lieu of working
with me to adapt vbus to their needs that I decided to put my own
project together.

What should I have done otherwise, in your opinion?

> 
> And fragmentation matters quite a bit. To Linux users, developers, 
> administrators, packagers it's a big deal whether two overlapping 
> pieces of functionality for the same thing exist within the same 
> kernel.

So the only thing that could be construed as overlapping here is venet
vs virtio-net. If I dropped the contentious venet and focused on making
a virtio-net backend that we can all re-use, do you see that as a path
of compromise here?

> The kernel is not an anarchy where everyone can have their 
> own sys_fork() version or their own sys_write() version. Would you 
> want to have two dozen read() variants, sys_read_oracle() and a 
> sys_read_db2()?

No, and I am not advocating that either.

> 
> I certainly dont want that. Instead we (at great expense and work) 
> try to reach the best technical solution.

This is all I want, as well.

> That means we throw away 
> inferior code and adopt the better one. (with a reasonable 
> migration period)
> 
> You are ignoring that principle with hand-waving about 'the 
> community wants this'.

I call it like I see it.  I get private emails all the time encouraging
my efforts and asking about the project.  I'm sorry if you see this as
hand-waving.  Perhaps the people involved will become more vocal in the
community to back me up, perhaps not.  Time will tell.


> I can assure you, users _DONT WANT_ split 
> interfaces and incompatible drivers for the same thing. They want 
> stuff that works well.

And I can respect that, and am trying to provide that.

> 
> If the community wants this then why cannot you convince one of the 
> most prominent representatives of that community, the KVM 
> developers?

It's a chicken-and-egg problem at times.  Perhaps the KVM developers do not have
the motivation or time to properly consider such a proposal _until_ the
community presents its demand.  And sometimes you cannot build demand
unless you have an easy way to use the idea, such as a project to back
it.  Since vbus+kvm has many moving parts (guest side, host-side,
userspace-side, etc), it's difficult to use as a patch series pulled in
from a mailing list.

This is the role of the AlacrityVM project.  Make it easy to use and
develop.  If it draws a community, perhaps KVM will reconsider its
stance.  If it does not draw a community, it will naturally die.  End of
story.

But please do not confuse one particular group's opinion with the sole
validation of an idea, no matter how "prominent".  There are numerous
reasons why one group may hold an opinion that have nothing to do with
the actual technical merits of the idea, or the community demand for it.

> 
> Furthermore, 99% of your work is KVM

Actually, no.  Almost none of it is.  I think there are about 2-3
patches in the series that touch KVM, the rest are all original (and
primarily stand-alone code).  AlacrityVM is the application of kvm and
vbus (and, of course, Linux) together as a complete unit, but I do not
try to hide this relationship.

By your argument, KVM is 99% QEMU+Linux. ;)

> why dont you respect that work by not forking it?

Lighten up on the fork FUD, please.  It's counter productive.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  4:27                   ` Gregory Haskins
@ 2009-08-19  5:22                     ` Avi Kivity
  2009-08-19 13:27                       ` Gregory Haskins
  2009-08-21 10:55                     ` vbus design points: shm and shm-signals Gregory Haskins
  1 sibling, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  5:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Anthony Liguori, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin

On 08/19/2009 07:27 AM, Gregory Haskins wrote:
>
>> This thread started because i asked you about your technical
>> arguments why we'd want vbus instead of virtio.
>>      
> (You mean vbus vs pci, right?  virtio works fine, is untouched, and is
> out-of-scope here)
>    

I guess he meant venet vs virtio-net.  Without venet vbus is currently 
userless.

> Right, and I do believe I answered your questions.  Do you feel as
> though this was not a satisfactory response?
>    

Others and I have shown you that it's wrong.  There's no inherent performance
problem in pci.  The vbus approach has inherent problems (the biggest of
which is compatibility, the second manageability).

>> Your answer above
>> now basically boils down to: "because I want it so, why don't you
>> leave me alone".
>>      
> Well, with all due respect, please do not put words in my mouth.  This
> is not what I am saying at all.
>
> What I *am* saying is:
>
> fact: this thread is about linux guest drivers to support vbus
>
> fact: these drivers do not touch kvm code.
>
> fact: these drivers do not force kvm to alter its operation in any way.
>
> fact: these drivers do not alter ABIs that KVM currently supports.
>
> Therefore, all this talk about "abandoning", "supporting", and
> "changing" things in KVM is, premature, irrelevant, and/or, FUD.  No one
> proposed such changes, so I am highlighting this fact to bring the
> thread back on topic.  That KVM talk is merely a distraction at this
> point in time.
>    

s/kvm/kvm stack/.  virtio/pci is part of the kvm stack, even if it is 
not part of kvm itself.  If vbus/venet were to be merged, users and 
developers would have to choose one or the other.  That's the 
fragmentation I'm worried about.  And you can prefix that with "fact:" 
as well.

>> We all love faster code and better management interfaces and tons
>> of your prior patches got accepted by Avi. This time you didn't even
>> _try_ to improve virtio.
>>      
> Im sorry, but you are mistaken:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html
>    

That does nothing to improve virtio.  Existing guests (Linux and 
Windows) which support virtio will cease to work if the host moves to 
vbus-virtio.  Existing hosts (running virtio-pci) won't be able to talk 
to newer guests running virtio-vbus.  The patch doesn't improve 
performance without the entire vbus stack in the host kernel and a 
vbus-virtio-net-host host kernel driver.

Perhaps if you posted everything needed to make vbus-virtio work and 
perform we could compare that to vhost-net and you'll see another reason 
why vhost-net is the better approach.

> You are also wrong to say that I didn't try to avoid creating a
> downstream effort first.   I believe the public record of the mailing
> lists will back me up that I tried politely pushing this directly though
> kvm first.  It was only after Avi recently informed me that they would
> be building their own version of an in-kernel backend in lieu of working
> with me to adapt vbus to their needs that I decided to put my own
> project together.
>    

There's no way we can adapt vbus to our needs.  Don't you think we'd have
preferred that to writing our own?  The current virtio-net issues
are hurting us.

Our needs are compatibility, performance, and manageability.  vbus fails 
all three, your impressive venet numbers notwithstanding.

> What should I have done otherwise, in your opinion?
>    

You could come up with uses where vbus truly is superior to 
virtio/pci/whatever (not words about etch constraints).  Showing some of 
those non-virt uses, for example.  The fact that your only user 
duplicates existing functionality doesn't help.


>> And fragmentation matters quite a bit. To Linux users, developers,
>> administrators, packagers it's a big deal whether two overlapping
>> pieces of functionality for the same thing exist within the same
>> kernel.
>>      
> So the only thing that could be construed as overlapping here is venet
> vs virtio-net. If I dropped the contentious venet and focused on making
> a virtio-net backend that we can all re-use, do you see that as a path
> of compromise here?
>    

That's a step in the right direction.

>> I certainly dont want that. Instead we (at great expense and work)
>> try to reach the best technical solution.
>>      
> This is all I want, as well.
>    

Note whenever I mention migration, large guests, or Windows you say 
these are not your design requirements.  The best technical solution 
will have to consider those.

>> If the community wants this then why cannot you convince one of the
>> most prominent representatives of that community, the KVM
>> developers?
>>      
> Its a chicken and egg at times.  Perhaps the KVM developers do not have
> the motivation or time to properly consider such a proposal _until_ the
> community presents its demand.

I've spent quite a lot of time arguing with you, no doubt influenced by 
the fact that you can write a lot faster than I can read.

>> Furthermore, 99% of your work is KVM
>>      
> Actually, no.  Almost none of it is.  I think there are about 2-3
> patches in the series that touch KVM, the rest are all original (and
> primarily stand-alone code).  AlacrityVM is the application of kvm and
> vbus (and, of course, Linux) together as a complete unit, but I do not
> try to hide this relationship.
>
> By your argument, KVM is 99% QEMU+Linux. ;)
>    

That's one of the kvm strong points...

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  0:44                                 ` Ira W. Snyder
@ 2009-08-19  5:26                                   ` Avi Kivity
  0 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  5:26 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/19/2009 03:44 AM, Ira W. Snyder wrote:
>> You don't need in fact a third mode.  You can mmap the x86 address space
>> into your ppc userspace and use the second mode.  All you need then is
>> the dma engine glue and byte swapping.
>>
>>      
> Hmm, I'll have to think about that.
>
> The ppc is a 32-bit processor, so it has 4GB of address space for
> everything, including PCI, SDRAM, flash memory, and all other
> peripherals.
>
> This is exactly like 32bit x86, where you cannot have a PCI card that
> exposes a 4GB PCI BAR. The system would have no address space left for
> its own SDRAM.
>    

(you actually can, since x86 has a 36-40 bit physical address space even 
with a 32-bit virtual address space, but that doesn't help you).

> On my x86 computers, I only have 1GB of physical RAM, and so the ppc's
> have plenty of room in their address spaces to map the entire x86 RAM
> into their own address space. That is exactly what I do now. Accesses to
> ppc physical address 0x80000000 "magically" hit x86 physical address
> 0x0.
>    

So if you mmap() that, you could work with virtual addresses.  It may be 
more efficient to work with physical addresses directly though.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 16:51                 ` Michael S. Tsirkin
@ 2009-08-19  5:36                   ` Gregory Haskins
  2009-08-19  5:48                     ` Avi Kivity
                                       ` (2 more replies)
  0 siblings, 3 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19  5:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 3234 bytes --]

Michael S. Tsirkin wrote:
> On Tue, Aug 18, 2009 at 11:51:59AM -0400, Gregory Haskins wrote:
>>> It's not laughably trivial when you try to support the full feature set
>>> of kvm (for example, live migration will require dirty memory tracking,
>>> and exporting all state stored in the kernel to userspace).
>> Doesn't vhost suffer from the same issue?  If not, could I also apply
>> the same technique to support live-migration in vbus?
> 
> vhost does this by switching to userspace for the duration of live
> migration. venet could do this I guess, but you'd need to write a
> userspace implementation. vhost just reuses existing userspace virtio.
> 
>> With all due respect, I didnt ask you do to anything, especially not
>> abandon something you are happy with.
>>
>> All I did was push guest drivers to LKML.  The code in question is
>> independent of KVM, and it's proven to improve the experience of using
>> Linux as a platform.  There are people interested in using them (by
>> virtue of the number of people that have signed up for the AlacrityVM
>> list, and have mailed me privately about this work).
>>
>> So where is the problem here?
> 
> If virtio net in guest could be improved instead, everyone would
> benefit.

So if I whip up a virtio-net backend for vbus with a PCI-compliant
connector, will you be happy?


> I am doing this, and I wish more people would join.  Instead,
> you change the ABI in an incompatible way.

Only by choice of my particular connector.  The ABI is a function of the
connector design.  So one such model is to terminate the connector in
qemu, and surface the resulting objects as PCI devices.  I choose not to
use this particular design for my connector that I am pushing upstream
because I am of the opinion that I can do better by terminating it in
the guest directly as a PV optimized bus.  However, both connectors can
theoretically coexist peacefully.

The advantage that this would give us is that one in-kernel virtio-net
model could be surfaced to all vbus users (pci, or otherwise), which
will hopefully be growing over time.  This would have gained vbus a
virtio-net backend, and it would have saved you from re-inventing the
various abstractions and management interfaces that vbus has in place.


> So now, there's no single place to
> work on kvm networking performance. Now, it would all be understandable
> if the reason was e.g. better performance. But you say yourself it
> isn't.

Actually, I really didn't say that.  As far as I know, your patch hasn't
been proven performance-wise, but I just gave you the benefit
of the doubt.  What I said was that for a limited type of benchmark, it
*may* get similar numbers if you implemented vhost optimally.  For
others (for instance, when we can start to take advantage of priority,
or scaling the number of interfaces) it may not since my proposed
connector was designed to optimize this over raw PCI facilities.

But I digress.  Please post results when you have numbers, as I had to
give up my 10GE rig in the lab.  I suspect you will have performance
issues until you at least address GSO, but you may already be there by now.

Kind Regards,
-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 18:20                   ` Arnd Bergmann
  2009-08-18 19:08                     ` Avi Kivity
@ 2009-08-19  5:36                     ` Gregory Haskins
  1 sibling, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19  5:36 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Avi Kivity, Ingo Molnar, kvm, alacrityvm-devel, linux-kernel,
	netdev, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 2718 bytes --]

Arnd Bergmann wrote:
> On Tuesday 18 August 2009, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>> On 08/17/2009 10:33 PM, Gregory Haskins wrote:
>>>
>>> One point of contention is that this is all managementy stuff and should
>>> be kept out of the host kernel.  Exposing shared memory, interrupts, and
>>> guest hypercalls can all be easily done from userspace (as virtio
>>> demonstrates).  True, some devices need kernel acceleration, but that's
>>> no reason to put everything into the host kernel.
>> See my last reply to Anthony.  My two points here are that:
>>
>> a) having it in-kernel makes it a complete subsystem, which perhaps has
>> diminished value in kvm, but adds value in most other places that we are
>> looking to use vbus.
>>
>> b) the in-kernel code is being overstated as "complex".  We are not
>> talking about your typical virt thing, like an emulated ICH/PCI chipset.
>>  Its really a simple list of devices with a handful of attributes.  They
>> are managed using established linux interfaces, like sysfs/configfs.
> 
> IMHO the complexity of the code is not so much of a problem. What I
> see as a problem is the complexity of a kernel/user space interface that
> manages the devices with global state.
> 
> One of the greatest features of Michael's vhost driver is that all
> the state is associated with open file descriptors that either exist
> already or belong to the vhost_net misc device. When a process dies,
> all the file descriptors get closed and the whole state is cleaned
> up implicitly.
> 
> AFAICT, you can't do that with the vbus host model.

It should work the same.  When a driver opens a vbus device, it calls
"interface->connect()" and gets back a "connection" object.  The
connection->release() method is invoked when the driver "goes away",
which would include the scenario you present.  This gives the
device-model the opportunity to clean up in the same way.
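
To make that concrete, the host-side shape is roughly like this (a
minimal sketch only; the "demo_*" names are illustrative stand-ins,
not the actual vbus symbols):

#include <linux/slab.h>

/*
 * The device-model hands out a per-client "connection"; its release()
 * is the single cleanup path, whether the driver disconnected cleanly
 * or the client simply went away.
 */
struct demo_connection {
	struct demo_device *dev;
	/* per-client state: rings, shm-signals, references, ... */
};

static struct demo_connection *demo_connect(struct demo_device *dev)
{
	struct demo_connection *conn = kzalloc(sizeof(*conn), GFP_KERNEL);

	if (!conn)
		return NULL;

	conn->dev = dev;
	return conn;
}

static void demo_release(struct demo_connection *conn)
{
	/* tear down rings/signals and drop device references, then free */
	kfree(conn);
}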


> 
>>> What performance oriented items have been left unaddressed?
>> Well, the interrupt model to name one.
> 
> The performance aspects of your interrupt model are independent
> of the vbus proxy, or at least they should be. Let's assume for
> now that your event notification mechanism gives significant
> performance improvements (which we can't measure independently
> right now). I don't see a reason why we could not get the
> same performance out of a paravirtual interrupt controller
> that uses the same method, and it would be straightforward
> to implement one and use that together with all the existing
> emulated PCI devices and virtio devices including vhost_net.

Agreed.  I proposed this before and Avi rejected the idea.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  0:38                               ` Ira W. Snyder
@ 2009-08-19  5:40                                 ` Avi Kivity
  2009-08-19 15:28                                   ` Ira W. Snyder
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  5:40 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/19/2009 03:38 AM, Ira W. Snyder wrote:
> On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote:
>    
>> On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
>>      
>>> On a non shared-memory system (where the guest's RAM is not just a chunk
>>> of userspace RAM in the host system), virtio's management model seems to
>>> fall apart. Feature negotiation doesn't work as one would expect.
>>>
>>>        
>> In your case, virtio-net on the main board accesses PCI config space
>> registers to perform the feature negotiation; software on your PCI cards
>> needs to trap these config space accesses and respond to them according
>> to virtio ABI.
>>
>>      
> Is this "real PCI" (physical hardware) or "fake PCI" (software PCI
> emulation) that you are describing?
>
>    

Real PCI.

> The host (x86, PCI master) must use "real PCI" to actually configure the
> boards, enable bus mastering, etc. Just like any other PCI device, such
> as a network card.
>
> On the guests (ppc, PCI agents) I cannot add/change PCI functions (the
> last .[0-9] in the PCI address) nor can I change PCI BAR's once the
> board has started. I'm pretty sure that would violate the PCI spec,
> since the PCI master would need to re-scan the bus, and re-assign
> addresses, which is a task for the BIOS.
>    

Yes.  Can the boards respond to PCI config space cycles coming from the 
host, or is the config space implemented in silicon and immutable?  
(reading on, I see the answer is no).  virtio-pci uses the PCI config 
space to configure the hardware.

>> (There's no real guest on your setup, right?  just a kernel running on
>> and x86 system and other kernels running on the PCI cards?)
>>
>>      
> Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's
> (PCI agents) also run Linux (booted via U-Boot). They are independent
> Linux systems, with a physical PCI interconnect.
>
> The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's
> PCI stack does bad things as a PCI agent. It always assumes it is a PCI
> master.
>
> It is possible for me to enable CONFIG_PCI=y on the ppc's by removing
> the PCI bus from their list of devices provided by OpenFirmware. They
> can not access PCI via normal methods. PCI drivers cannot work on the
> ppc's, because Linux assumes it is a PCI master.
>
> To the best of my knowledge, I cannot trap configuration space accesses
> on the PCI agents. I haven't needed that for anything I've done thus
> far.
>
>    

Well, if you can't do that, you can't use virtio-pci on the host.  
You'll need another virtio transport (equivalent to the "fake pci" you 
mentioned above).
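
Concretely, that means backing struct virtio_config_ops with your
shared-memory window.  A rough sketch (shm_config_window() is a
stand-in for however you map the region, and the member list is
abbreviated -- check include/linux/virtio_config.h for the full set):

#include <linux/io.h>
#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* stand-in: returns the ioremap()ed config window for this device */
void __iomem *shm_config_window(struct virtio_device *vdev);

static void shm_get(struct virtio_device *vdev, unsigned offset,
		    void *buf, unsigned len)
{
	memcpy_fromio(buf, shm_config_window(vdev) + offset, len);
}

static void shm_set(struct virtio_device *vdev, unsigned offset,
		    const void *buf, unsigned len)
{
	memcpy_toio(shm_config_window(vdev) + offset, buf, len);
}

static struct virtio_config_ops shm_config_ops = {
	.get	= shm_get,
	.set	= shm_set,
	/* .get_status, .set_status, .reset, .find_vqs, ... likewise */
};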

>>> This does appear to be solved by vbus, though I haven't written a
>>> vbus-over-PCI implementation, so I cannot be completely sure.
>>>
>>>        
>> Even if virtio-pci doesn't work out for some reason (though it should),
>> you can write your own virtio transport and implement its config space
>> however you like.
>>
>>      
> This is what I did with virtio-over-PCI. The way virtio-net negotiates
> features makes this work non-intuitively.
>    

I think you tried to take two virtio-nets and make them talk together?  
That won't work.  You need the code from qemu to talk to virtio-net 
config space, and vhost-net to pump the rings.

>>> I'm not at all clear on how to get feature negotiation to work on a
>>> system like mine. From my study of lguest and kvm (see below) it looks
>>> like userspace will need to be involved, via a miscdevice.
>>>
>>>        
>> I don't see why.  Is the kernel on the PCI cards in full control of all
>> accesses?
>>
>>      
> I'm not sure what you mean by this. Could you be more specific? This is
> a normal, unmodified vanilla Linux kernel running on the PCI agents.
>    

I meant, does board software implement the config space accesses issued 
from the host, and it seems the answer is no.


> In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote
> an algorithm to pair the tx/rx queues together. Since virtio-net
> pre-fills its rx queues with buffers, I was able to use the DMA engine
> to copy from the tx queue into the pre-allocated memory in the rx queue.
>
>    

Please find a name other than virtio-over-PCI since it conflicts with 
virtio-pci.  You're tunnelling virtio config cycles (which are usually 
done on pci config cycles) on a new protocol which is itself tunnelled 
over PCI shared memory.

>>>
>>>        
>> Yeah.  You'll need to add byteswaps.
>>
>>      
> I wonder if Rusty would accept a new feature:
> VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to
> use LE for all of its multi-byte fields.
>
> I don't think the transport should have to care about the endianness.
>    

Given this is not mainstream use, it would have to have zero impact when 
configured out.
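
E.g. something along these lines, where both VIRTIO_F_NET_LITTLE_ENDIAN
and the Kconfig symbol are hypothetical (nothing like them exists
today), and the accessor collapses to a plain load on normal builds:

#include <linux/virtio_config.h>

#define VIRTIO_F_NET_LITTLE_ENDIAN	30	/* hypothetical feature bit */

#ifdef CONFIG_VIRTIO_CROSS_ENDIAN		/* hypothetical Kconfig knob */
static inline u16 vnet16_to_cpu(struct virtio_device *vdev, u16 v)
{
	if (virtio_has_feature(vdev, VIRTIO_F_NET_LITTLE_ENDIAN))
		return le16_to_cpu((__force __le16)v);
	return v;			/* legacy: guest-native byte order */
}
#else
static inline u16 vnet16_to_cpu(struct virtio_device *vdev, u16 v)
{
	return v;			/* zero impact when configured out */
}
#endif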

> True. It's slowpath setup, so I don't care how fast it is. For reasons
> outside my control, the x86 (PCI master) is running a RHEL5 system. This
> means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try
> and push for an upgrade. This obviously makes cat/echo really nice: it
> doesn't depend on glibc, only on the kernel version.
>
> I don't give much weight to the above, because I can use the eventfd
> syscalls directly, without glibc support. It is just more painful.
>    

The x86 side only needs to run virtio-net, which is present in RHEL 
5.3.  You'd only need to run virtio-tunnel or however it's called.  All 
the eventfd magic takes place on the PCI agents.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:36                   ` Gregory Haskins
@ 2009-08-19  5:48                     ` Avi Kivity
  2009-08-19  6:40                       ` Gregory Haskins
  2009-08-19 14:33                     ` Michael S. Tsirkin
  2009-08-20 12:12                     ` Michael S. Tsirkin
  2 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  5:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, Anthony Liguori, Ingo Molnar,
	Gregory Haskins, kvm, alacrityvm-devel, linux-kernel, netdev

On 08/19/2009 08:36 AM, Gregory Haskins wrote:
>> If virtio net in guest could be improved instead, everyone would
>> benefit.
>>      
> So if I whip up a virtio-net backend for vbus with a PCI compliant
> connector, you are happy?
>    

This doesn't improve virtio-net in any way.

>> I am doing this, and I wish more people would join.  Instead,
>> you change ABI in a incompatible way.
>>      
> Only by choice of my particular connector.  The ABI is a function of the
> connector design.  So one such model is to terminate the connector in
> qemu, and surface the resulting objects as PCI devices.  I choose not to
> use this particular design for my connector that I am pushing upstream
> because I am of the opinion that I can do better by terminating it in
> the guest directly as a PV optimized bus.  However, both connectors can
> theoretically coexist peacefully.
>    

virtio already supports this model; see lguest and s390.  Transporting 
virtio over vbus and vbus over something else doesn't gain anything over 
directly transporting virtio over that something else.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-18 16:27                   ` Avi Kivity
@ 2009-08-19  6:28                     ` Gregory Haskins
  2009-08-19  7:11                       ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19  6:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 20993 bytes --]

Avi Kivity wrote:
> On 08/18/2009 05:46 PM, Gregory Haskins wrote:
>>
>>> Can you explain how vbus achieves RDMA?
>>>
>>> I also don't see the connection to real time guests.
>>>      
>> Both of these are still in development.  Trying to stay true to the
>> "release early and often" mantra, the core vbus technology is being
>> pushed now so it can be reviewed.  Stay tuned for these other
>> developments.
>>    
> 
> Hopefully you can outline how it works.  AFAICT, RDMA and kernel bypass
> will need device assignment.  If you're bypassing the call into the host
> kernel, it doesn't really matter how that call is made, does it?

This is for things like the setup of queue-pairs, and the transport of
door-bells, and ib-verbs.  I am not on the team doing that work, so I am
not an expert in this area.  What I do know is having a flexible and
low-latency signal-path was deemed a key requirement.

For real-time, a big part of it is relaying the guest scheduler state to
the host, but in a smart way.  For instance, the cpu priority for each
vcpu is in a shared-table.  When the priority is raised, we can simply
update the table without taking a VMEXIT.  When it is lowered, we need
to inform the host of the change in case the underlying task needs to
reschedule.

This is where the really fast call() type mechanism is important.
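
To sketch what I mean (illustrative only -- this is not the actual
interface, and notify_host() stands in for whatever fast call()/
shm-signal path the connector provides):

#include <linux/types.h>
#include <linux/threads.h>

void notify_host(int vcpu);			/* hypothetical fast call */

/* one entry per vcpu, in memory shared read/write with the host */
struct shared_prio_table {
	u32 prio[NR_CPUS];
};

static struct shared_prio_table *prio_tbl;	/* mapped at init (not shown) */

static void vcpu_set_prio(int vcpu, u32 prio)
{
	u32 old = prio_tbl->prio[vcpu];

	prio_tbl->prio[vcpu] = prio;	/* raising: just a store, no VMEXIT */

	if (prio < old)			/* lowering: host task may need to run */
		notify_host(vcpu);
}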

It's also about having the priority flow end-to-end, and having the vcpu
interrupt state affect the task priority, etc. (e.g. pending interrupts
affect the vcpu task prio).

etc, etc.

I can go on and on (as you know ;), but will wait till this work is more
concrete and proven.

> 
>>>> I also designed it in such a way that
>>>> we could, in theory, write one set of (linux-based) backends, and have
>>>> them work across a variety of environments (such as containers/VMs like
>>>> KVM, lguest, openvz, but also physical systems like blade enclosures
>>>> and
>>>> clusters, or even applications running on the host).
>>>>
>>>>        
>>> Sorry, I'm still confused.  Why would openvz need vbus?
>>>      
>> Its just an example.  The point is that I abstracted what I think are
>> the key points of fast-io, memory routing, signal routing, etc, so that
>> it will work in a variety of (ideally, _any_) environments.
>>
>> There may not be _performance_ motivations for certain classes of VMs
>> because they already have decent support, but they may want a connector
>> anyway to gain some of the new features available in vbus.
>>
>> And looking forward, the idea is that we have commoditized the backend
>> so we don't need to redo this each time a new container comes along.
>>    
> 
> I'll wait until a concrete example shows up as I still don't understand.

Ok.

> 
>>> One point of contention is that this is all managementy stuff and should
>>> be kept out of the host kernel.  Exposing shared memory, interrupts, and
>>> guest hypercalls can all be easily done from userspace (as virtio
>>> demonstrates).  True, some devices need kernel acceleration, but that's
>>> no reason to put everything into the host kernel.
>>>      
>> See my last reply to Anthony.  My two points here are that:
>>
>> a) having it in-kernel makes it a complete subsystem, which perhaps has
>> diminished value in kvm, but adds value in most other places that we are
>> looking to use vbus.
>>    
> 
> It's not a complete system unless you want users to administer VMs using
> echo and cat and configfs.  Some userspace support will always be
> necessary.

Well, more specifically, it doesn't require a userspace app to hang
around.  For instance, you can set up your devices with udev scripts, or
whatever.

But that is kind of a silly argument, since the kernel always needs
userspace around to give it something interesting, right? ;)

Basically, what it comes down to is both vbus and vhost need
configuration/management.  Vbus does it with sysfs/configfs, and vhost
does it with ioctls.  I ultimately decided to go with sysfs/configfs
because, at least at the time I looked, it seemed like the "blessed"
way to do user->kernel interfaces.

> 
>> b) the in-kernel code is being overstated as "complex".  We are not
>> talking about your typical virt thing, like an emulated ICH/PCI chipset.
>>   Its really a simple list of devices with a handful of attributes.  They
>> are managed using established linux interfaces, like sysfs/configfs.
>>    
> 
> They need to be connected to the real world somehow.  What about
> security?  can any user create a container and devices and link them to
> real interfaces?  If not, do you need to run the VM as root?

Today it has to be root as a result of weak mode support in configfs, so
you have me there.  I am looking for help patching this limitation, though.

Also, venet-tap uses a bridge, which of course is not as slick as a
raw-socket w.r.t. perms.


> 
> virtio and vhost-net solve these issues.  Does vbus?
> 
> The code may be simple to you.  But the question is whether it's
> necessary, not whether it's simple or complex.
> 
>>> Exposing devices as PCI is an important issue for me, as I have to
>>> consider non-Linux guests.
>>>      
>> Thats your prerogative, but obviously not everyone agrees with you.
>>    
> 
> I hope everyone agrees that it's an important issue for me and that I
> have to consider non-Linux guests.  I also hope that you're considering
> non-Linux guests since they have considerable market share.

I didn't mean non-Linux guests are not important.  I was disagreeing
with your assertion that it only works if it's PCI.  There are numerous
examples of IHV/ISV "bridge" implementations deployed in Windows, no?
If vbus is exposed as a PCI-BRIDGE, how is this different?

> 
>> Getting non-Linux guests to work is my problem if you chose to not be
>> part of the vbus community.
>>    
> 
> I won't be writing those drivers in any case.

Ok.

> 
>>> Another issue is the host kernel management code which I believe is
>>> superfluous.
>>>      
>> In your opinion, right?
>>    
> 
> Yes, this is why I wrote "I believe".

Fair enough.

> 
> 
>>> Given that, why spread to a new model?
>>>      
>> Note: I haven't asked you to (at least, not since April with the vbus-v3
>> release).  Spreading to a new model is currently the role of the
>> AlacrityVM project, since we disagree on the utility of a new model.
>>    
> 
> Given I'm not the gateway to inclusion of vbus/venet, you don't need to
> ask me anything.  I'm still free to give my opinion.

Agreed, and I didn't mean to suggest otherwise.  It's not clear if you are
wearing the "kvm maintainer" hat, or the "lkml community member" hat at
times, so it's important to make that distinction.  Otherwise, it's not
clear if this is an edict from my superior, or input from my peer. ;)

> 
>>>> A) hardware can only generate byte/word sized requests at a time
>>>> because
>>>> that is all the pcb-etch and silicon support. So hardware is usually
>>>> expressed in terms of some number of "registers".
>>>>
>>>>        
>>> No, hardware happily DMAs to and fro main memory.
>>>      
>> Yes, now walk me through how you set up DMA to do something like a call
>> when you do not know addresses a priori.  Hint: count the number of
>> MMIO/PIOs you need.  If the number is > 1, you've lost.
>>    
> 
> With virtio, the number is 1 (or less if you amortize).  Set up the ring
> entries and kick.

Again, I am just talking about basic PCI here, not the things we build
on top.

The point is: the things we build on top have costs associated with
them, and I aim to minimize it.  For instance, to do a "call()" kind of
interface, you generally need to pre-setup some per-cpu mappings so that
you can just do a single iowrite32() to kick the call off.  Those
per-cpu mappings have a cost if you want them to be high-performance, so
my argument is that you ideally want to limit the number of times you
have to do this.  My current design reduces this to "once".
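
The shape I have in mind is something like this (a sketch only, not the
actual pci-bridge code; the names are illustrative):

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/io.h>

struct fastcall_channel {
	void __iomem *doorbell;		/* mapped once, at bridge probe time */
};

static DEFINE_PER_CPU(struct fastcall_channel, fastcall_chan);

static void fastcall(u32 op)
{
	struct fastcall_channel *chan = &get_cpu_var(fastcall_chan);

	/* parameters/descriptors go into shared memory first (no exits)... */

	iowrite32(op, chan->doorbell);	/* the one and only MMIO/PIO exit */

	put_cpu_var(fastcall_chan);
}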


> 
>>>   Some hardware of
>>> course uses mmio registers extensively, but not virtio hardware.  With
>>> the recent MSI support no registers are touched in the fast path.
>>>      
>> Note we are not talking about virtio here.  Just raw PCI and why I
>> advocate vbus over it.
>>    
> 
> There's no such thing as raw PCI.  Every PCI device has a protocol.  The
> protocol virtio chose is optimized for virtualization.

And its a question of how that protocol scales, more than how the
protocol works.

Obviously the general idea of the protocol works, as vbus itself is
implemented as a PCI-BRIDGE and is therefore limited to the underlying
characteristics that I can get out of PCI (like PIO latency).

> 
> 
>>>> D) device-ids are in a fixed width register and centrally assigned from
>>>> an authority (e.g. PCI-SIG).
>>>>
>>>>        
>>> That's not an issue either.  Qumranet/Red Hat has donated a range of
>>> device IDs for use in virtio.
>>>      
>> Yes, and to get one you have to do what?  Register it with kvm.git,
>> right?  Kind of like registering a MAJOR/MINOR, would you agree?  Maybe
>> you do not mind (especially given your relationship to kvm.git), but
>> there are disadvantages to that model for most of the rest of us.
>>    
> 
> Send an email, it's not that difficult.  There's also an experimental
> range.

Ugly....


> 
>>>   Device IDs are how devices are associated
>>> with drivers, so you'll need something similar for vbus.
>>>      
>> Nope, just like you don't need to do anything ahead of time for using a
>> dynamic misc-device name.  You just have both the driver and device know
>> what they are looking for (its part of the ABI).
>>    
> 
> If you get a device ID clash, you fail.  If you get a device name clash,
> you fail in the same way.

No argument here.


> 
>>>> E) Interrupt/MSI routing is per-device oriented
>>>>
>>>>        
>>> Please elaborate.  What is the issue?  How does vbus solve it?
>>>      
>> There are no "interrupts" in vbus..only shm-signals.  You can establish
>> an arbitrary amount of shm regions, each with an optional shm-signal
>> associated with it.  To do this, the driver calls dev->shm(), and you
>> get back a shm_signal object.
>>
>> Underneath the hood, the vbus-connector (e.g. vbus-pcibridge) decides
>> how it maps real interrupts to shm-signals (on a system level, not per
>> device).  This can be 1:1, or any other scheme.  vbus-pcibridge uses one
>> system-wide interrupt per priority level (today this is 8 levels), each
>> with an IOQ based event channel.  "signals" come as an event on that
>> channel.
>>
>> So the "issue" is that you have no real choice with PCI.  You just get
>> device oriented interrupts.  With vbus, its abstracted.  So you can
>> still get per-device standard MSI, or you can do fancier things like do
>> coalescing and prioritization.
>>    
> 
> As I've mentioned before, prioritization is available on x86

But as I've mentioned, it doesn't work very well.


>, and coalescing scales badly.

Depends on what is scaling.  Scaling vcpus?  Yes, you are right.
Scaling the number of devices?  No, this is where it improves.

> 
>>>> F) Interrupts/MSI are assumed cheap to inject
>>>>
>>>>        
>>> Interrupts are not assumed cheap; that's why interrupt mitigation is
>>> used (on real and virtual hardware).
>>>      
>> Its all relative.  IDT dispatch and EOI overhead are "baseline" on real
>> hardware, whereas they are significantly more expensive to do the
>> vmenters and vmexits on virt (and you have new exit causes, like
>> irq-windows, etc, that do not exist in real HW).
>>    
> 
> irq window exits ought to be pretty rare, so we're only left with
> injection vmexits.  At around 1us/vmexit, even 100,000 interrupts/vcpu
> (which is excessive) will only cost you 10% cpu time.

1us is too much for what I am building, IMHO.  Besides, there are a slew
of older machines (like Woodcrests) that are more like 2+us per exit, so
1us is a best-case scenario.

> 
>>>> G) Interrupts/MSI are non-priortizable.
>>>>
>>>>        
>>> They are prioritizable; Linux ignores this though (Windows doesn't).
>>> Please elaborate on what the problem is and how vbus solves it.
>>>      
>> It doesn't work right.  The x86 sense of interrupt priority is, sorry to
>> say it, half-assed at best.  I've worked with embedded systems that have
>> real interrupt priority support in the hardware, end to end, including
>> the PIC.  The LAPIC on the other hand is really weak in this dept, and
>> as you said, Linux doesn't even attempt to use whats there.
>>    
> 
> Maybe prioritization is not that important then.  If it is, it needs to
> be fixed at the lapic level, otherwise you have no real prioritization
> wrt non-vbus interrupts.

While this is true, I am generally not worried about it.  For the
environments that care, I plan on having it be predominantly vbus
devices and using an -rt kernel (with irq-threads).

> 
>>>> H) Interrupts/MSI are statically established
>>>>
>>>>        
>>> Can you give an example of why this is a problem?
>>>      
>> Some of the things we are building use the model of having a device that
>> hands out shm-signal in response to guest events (say, the creation of
>> an IPC channel).  This would generally be handled by a specific device
>> model instance, and it would need to do this without pre-declaring the
>> MSI vectors (to use PCI as an example).
>>    
> 
> You're free to demultiplex an MSI to however many consumers you want,
> there's no need for a new bus for that.

Hmmm...can you elaborate?


> 
>>> What performance oriented items have been left unaddressed?
>>>      
>> Well, the interrupt model to name one.
>>    
> 
> Like I mentioned, you can merge MSI interrupts, but that's not
> necessarily a good idea.
> 
>>> How do you handle conflicts?  Again you need a central authority to hand
>>> out names or prefixes.
>>>      
>> Not really, no.  If you really wanted to be formal about it, you could
>> adopt any series of UUID schemes.  For instance, perhaps venet should be
>> "com.novell::virtual-ethernet".  Heck, I could use uuidgen.
>>    
> 
> Do you use DNS.  We use PCI-SIG.  If Novell is a PCI-SIG member you can
> get a vendor ID and control your own virtio space.

Yeah, we have our own id.  I am more concerned about making this design
make sense outside of PCI oriented environments.

> 
>>>> As another example, the connector design coalesces *all* shm-signals
>>>> into a single interrupt (by prio) that uses the same context-switch
>>>> mitigation techniques that help boost things like networking.  This
>>>> effectively means we can detect and optimize out ack/eoi cycles from
>>>> the
>>>> APIC as the IO load increases (which is when you need it most).  PCI
>>>> has
>>>> no such concept.
>>>>
>>>>        
>>> That's a bug, not a feature.  It means poor scaling as the number of
>>> vcpus increases and as the number of devices increases.

vcpu increases, I agree (and am ok with, as I expect low vcpu count
machines to be typical).  nr of devices, I disagree.  can you elaborate?

>>>      
>> So the "avi-vbus-connector" can use 1:1, if you prefer.  Large vcpu
>> counts (which are not typical) and irq-affinity is not a target
>> application for my design, so I prefer the coalescing model in the
>> vbus-pcibridge included in this series. YMMV
>>    
> 
> So far you've left out live migration

guilty as charged.

> Windows,

Work in progress.

> large guests

Can you elaborate?  I am not familiar with the term.

> and multiqueue out of your design.

AFAICT, multiqueue should work quite nicely with vbus.  Can you
elaborate on where you see the problem?

>  If you wish to position vbus/venet for
> large scale use you'll need to address all of them.
> 
>>> Note nothing prevents steering multiple MSIs into a single vector.  It's
>>> a bad idea though.
>>>      
>> Yes, it is a bad idea...and not the same thing either.  This would
>> effectively create a shared-line scenario in the irq code, which is not
>> what happens in vbus.
>>    
> 
> Ok.
> 
>>>> In addition, the signals and interrupts are priority aware, which is
>>>> useful for things like 802.1p networking where you may establish 8-tx
>>>> and 8-rx queues for your virtio-net device.  x86 APIC really has no
>>>> usable equivalent, so PCI is stuck here.
>>>>
>>>>        
>>> x86 APIC is priority aware.
>>>      
>> Have you ever tried to use it?
>>    
> 
> I haven't, but Windows does.

Yeah, it doesn't really work well.  It's an extremely rigid model that
(IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one
level, 16-31 are another, etc).  Most of the embedded PICs I have worked
with supported direct remapping, etc.  But in any case, Linux doesn't
support it so we are hosed no matter how good it is.
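
(To be precise about the "groups": the class the LAPIC compares against
the TPR is just the upper nibble of the vector, i.e. (illustrative)

	#define APIC_PRIO_CLASS(vec)	((vec) >> 4)	/* 16 coarse classes */

so vectors 0x20-0x2f are all one class and you cannot rank within it.)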


> 
>>>> Also, the signals can be allocated on-demand for implementing things
>>>> like IPC channels in response to guest requests since there is no
>>>> assumption about device-to-interrupt mappings.  This is more flexible.
>>>>
>>>>        
>>> Yes.  However given that vectors are a scarce resource you're severely
>>> limited in that.
>>>      
>> The connector I am pushing out does not have this limitation.
>>    
> 
> Okay.
> 
>>   
>>>   And if you're multiplexing everything on one vector,
>>> then you can just as well demultiplex your channels in the virtio driver
>>> code.
>>>      
>> Only per-device, not system wide.
>>    
> 
> Right.  I still think multiplexing interrupts is a bad idea in a large
> system.  In a small system... why would you do it at all?

device scaling, like for running a device-domain / bridge in a guest.

> 
>>>> And through all of this, this design would work in any guest even if it
>>>> doesn't have PCI (e.g. lguest, UML, physical systems, etc).
>>>>
>>>>        
>>> That is true for virtio which works on pci-less lguest and s390.
>>>      
>> Yes, and lguest and s390 had to build their own bus-model to do it,
>> right?
>>    
> 
> They had to build connectors just like you propose to do.

More importantly, they had to build back-end busses too, no?

> 
>> Thank you for bringing this up, because it is one of the main points
>> here.  What I am trying to do is generalize the bus to prevent the
>> proliferation of more of these isolated models in the future.  Build
>> one, fast, in-kernel model so that we wouldn't need virtio-X, and
>> virtio-Y in the future.  They can just reuse the (performance optimized)
>> bus and models, and only need to build the connector to bridge them.
>>    
> 
> But you still need vbus-connector-lguest and vbus-connector-s390 because
> they all talk to the host differently.  So what's changed?  the names?

The fact that they don't need to redo most of the in-kernel backend
stuff.  Just the connector.

> 
>>> That is exactly the design goal of virtio (except it limits itself to
>>> virtualization).
>>>      
>> No, virtio is only part of the picture.  It does not include the backend
>> models, or how to do memory/signal-path abstraction in-kernel, for
>> instance.  But otherwise, virtio as a device model is compatible with
>> vbus as a bus model.  They complement one another.
>>    
> 
> Well, venet doesn't complement virtio-net, and virtio-pci doesn't
> complement vbus-connector.

Agreed, but virtio complements vbus by virtue of virtio-vbus.

> 
>>>> Then device models like virtio can ride happily on top and we end up
>>>> with a really robust and high-performance Linux-based stack.  I don't
>>>> buy the argument that we already have PCI so lets use it.  I don't
>>>> think
>>>> its the best design and I am not afraid to make an investment in a
>>>> change here because I think it will pay off in the long run.
>>>>
>>>>        
>>> Sorry, I don't think you've shown any quantifiable advantages.
>>>      
>> We can agree to disagree then, eh?  There are certainly quantifiable
>> differences.  Waving your hand at the differences to say they are not
>> advantages is merely an opinion, one that is not shared universally.
>>    
> 
> I've addressed them one by one.  We can agree to disagree on interrupt
> multiplexing, and the importance of compatibility, Windows, large
> guests, multiqueue, and DNS vs. PCI-SIG.
> 
>> The bottom line is all of these design distinctions are encapsulated
>> within the vbus subsystem and do not affect the kvm code-base.  So
>> agreement with kvm upstream is not a requirement, but would be
>> advantageous for collaboration.
>>    
> 
> Certainly.
> 

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:48                     ` Avi Kivity
@ 2009-08-19  6:40                       ` Gregory Haskins
  2009-08-19  7:13                         ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19  6:40 UTC (permalink / raw)
  To: Gregory Haskins, Avi Kivity
  Cc: Anthony Liguori, Ingo Molnar, alacrityvm-devel,
	Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/19/2009 at  1:48 AM, in message <4A8B9241.20300@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/19/2009 08:36 AM, Gregory Haskins wrote:
>>> If virtio net in guest could be improved instead, everyone would
>>> benefit.
>>>      
>> So if I whip up a virtio-net backend for vbus with a PCI compliant
>> connector, you are happy?
>>    
> 
> This doesn't improve virtio-net in any way.

Any why not?  (Did you notice I said "PCI compliant", i.e. over virtio-pci)


> 
>>> I am doing this, and I wish more people would join.  Instead,
>>> you change ABI in a incompatible way.
>>>      
>> Only by choice of my particular connector.  The ABI is a function of the
>> connector design.  So one such model is to terminate the connector in
>> qemu, and surface the resulting objects as PCI devices.  I choose not to
>> use this particular design for my connector that I am pushing upstream
>> because I am of the opinion that I can do better by terminating it in
>> the guest directly as a PV optimized bus.  However, both connectors can
>> theoretically coexist peacefully.
>>    
> 
> virtio already supports this model; see lguest and s390.  Transporting 
> virtio over vbus and vbus over something else doesn't gain anything over 
> directly transporting virtio over that something else.

This is not what I am advocating.

Kind Regards,
-Greg






^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  6:28                     ` Gregory Haskins
@ 2009-08-19  7:11                       ` Avi Kivity
  2009-08-19 18:23                         ` Nicholas A. Bellinger
  2009-08-19 18:26                         ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
  0 siblings, 2 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  7:11 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin

On 08/19/2009 09:28 AM, Gregory Haskins wrote:
> Avi Kivity wrote:
>    
>> On 08/18/2009 05:46 PM, Gregory Haskins wrote:
>>      
>>>        
>>>> Can you explain how vbus achieves RDMA?
>>>>
>>>> I also don't see the connection to real time guests.
>>>>
>>>>          
>>> Both of these are still in development.  Trying to stay true to the
>>> "release early and often" mantra, the core vbus technology is being
>>> pushed now so it can be reviewed.  Stay tuned for these other
>>> developments.
>>>
>>>        
>> Hopefully you can outline how it works.  AFAICT, RDMA and kernel bypass
>> will need device assignment.  If you're bypassing the call into the host
>> kernel, it doesn't really matter how that call is made, does it?
>>      
> This is for things like the setup of queue-pairs, and the transport of
> door-bells, and ib-verbs.  I am not on the team doing that work, so I am
> not an expert in this area.  What I do know is having a flexible and
> low-latency signal-path was deemed a key requirement.
>    

That's not a full bypass, then.  AFAIK kernel bypass has userspace 
talking directly to the device.

Given that both virtio and vbus can use ioeventfds, I don't see how one 
can perform better than the other.

> For real-time, a big part of it is relaying the guest scheduler state to
> the host, but in a smart way.  For instance, the cpu priority for each
> vcpu is in a shared-table.  When the priority is raised, we can simply
> update the table without taking a VMEXIT.  When it is lowered, we need
> to inform the host of the change in case the underlying task needs to
> reschedule.
>    

This is best done using cr8/tpr so you don't have to exit at all.  See 
also my vtpr support for Windows which does this in software, generally 
avoiding the exit even when lowering priority.
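
Roughly (illustrative, x86-64 only):

/*
 * Publish the current priority class (0-15) in the local APIC TPR via
 * CR8.  With hardware TPR shadowing this is just a register write in
 * guest mode -- no exit in the common case, in either direction.
 */
static inline void guest_set_prio_class(unsigned long prio_class)
{
	asm volatile("mov %0, %%cr8" : : "r" (prio_class) : "memory");
}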

> This is where the really fast call() type mechanism is important.
>
> It's also about having the priority flow end-to-end, and having the vcpu
> interrupt state affect the task priority, etc. (e.g. pending interrupts
> affect the vcpu task prio).
>
> etc, etc.
>
> I can go on and on (as you know ;), but will wait till this work is more
> concrete and proven.
>    

Generally cpu state shouldn't flow through a device but rather through 
MSRs, hypercalls, and cpu registers.

> Basically, what it comes down to is both vbus and vhost need
> configuration/management.  Vbus does it with sysfs/configfs, and vhost
> does it with ioctls.  I ultimately decided to go with sysfs/configfs
> because, at least at the time I looked, it seemed like the "blessed"
> way to do user->kernel interfaces.
>    

I really dislike that trend but that's an unrelated discussion.

>> They need to be connected to the real world somehow.  What about
>> security?  can any user create a container and devices and link them to
>> real interfaces?  If not, do you need to run the VM as root?
>>      
> Today it has to be root as a result of weak mode support in configfs, so
> you have me there.  I am looking for help patching this limitation, though.
>
>    

Well, do you plan to address this before submission for inclusion?

>> I hope everyone agrees that it's an important issue for me and that I
>> have to consider non-Linux guests.  I also hope that you're considering
>> non-Linux guests since they have considerable market share.
>>      
> I didn't mean non-Linux guests are not important.  I was disagreeing
> with your assertion that it only works if it's PCI.  There are numerous
> examples of IHV/ISV "bridge" implementations deployed in Windows, no?
>    

I don't know.

> If vbus is exposed as a PCI-BRIDGE, how is this different?
>    

Technically it would work, but given you're not interested in Windows, 
who would write a driver?

>> Given I'm not the gateway to inclusion of vbus/venet, you don't need to
>> ask me anything.  I'm still free to give my opinion.
>>      
> Agreed, and I didn't mean to suggest otherwise.  It's not clear if you are
> wearing the "kvm maintainer" hat, or the "lkml community member" hat at
> times, so it's important to make that distinction.  Otherwise, it's not
> clear if this is an edict from my superior, or input from my peer. ;)
>    

When I wear a hat, it is a Red Hat.  However I am bareheaded most often.

(that is, look at the contents of my message, not who wrote it or his role).

>> With virtio, the number is 1 (or less if you amortize).  Set up the ring
>> entries and kick.
>>      
> Again, I am just talking about basic PCI here, not the things we build
> on top.
>    

Whatever that means, it isn't interesting.  Performance is measured for 
the whole stack.

> The point is: the things we build on top have costs associated with
> them, and I aim to minimize it.  For instance, to do a "call()" kind of
> interface, you generally need to pre-setup some per-cpu mappings so that
> you can just do a single iowrite32() to kick the call off.  Those
> per-cpu mappings have a cost if you want them to be high-performance, so
> my argument is that you ideally want to limit the number of times you
> have to do this.  My current design reduces this to "once".
>    

Do you mean minimizing the setup cost?  Seriously?

>> There's no such thing as raw PCI.  Every PCI device has a protocol.  The
>> protocol virtio chose is optimized for virtualization.
>>      
> And its a question of how that protocol scales, more than how the
> protocol works.
>
> Obviously the general idea of the protocol works, as vbus itself is
> implemented as a PCI-BRIDGE and is therefore limited to the underlying
> characteristics that I can get out of PCI (like PIO latency).
>    

I thought we agreed that was insignificant?

>> As I've mentioned before, prioritization is available on x86
>>      
> But as I've mentioned, it doesn't work very well.
>    

I guess it isn't that important then.  I note that clever prioritization 
in a guest is pointless if you can't do the same prioritization in the host.

>> , and coalescing scales badly.
>>      
> Depends on what is scaling.  Scaling vcpus?  Yes, you are right.
> Scaling the number of devices?  No, this is where it improves.
>    

If you queue pending messages instead of walking the device list, you 
may be right.  Still, if hard interrupt processing takes 10% of your 
time you'll only have coalesced 10% of interrupts on average.

>> irq window exits ought to be pretty rare, so we're only left with
>> injection vmexits.  At around 1us/vmexit, even 100,000 interrupts/vcpu
>> (which is excessive) will only cost you 10% cpu time.
>>      
> 1us is too much for what I am building, IMHO.

You can't use current hardware then.

>> You're free to demultiplex an MSI to however many consumers you want,
>> there's no need for a new bus for that.
>>      
> Hmmm...can you elaborate?
>    

Point all those MSIs at one vector.  Its handler will have to poll all 
the attached devices though.
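
I.e. the standard shared-handler pattern; the mydev_* names below are
stand-ins for whatever per-device "was it us / drain it" hooks you have:

#include <linux/interrupt.h>

static irqreturn_t mydev_isr(int irq, void *data)
{
	struct mydev *dev = data;

	if (!mydev_has_work(dev))	/* stand-in: "was it us?" check */
		return IRQ_NONE;

	mydev_process(dev);		/* stand-in: drain this device */
	return IRQ_HANDLED;
}

/* each consumer: request_irq(irq, mydev_isr, IRQF_SHARED, "mydev", dev); */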

>> Do you use DNS.  We use PCI-SIG.  If Novell is a PCI-SIG member you can
>> get a vendor ID and control your own virtio space.
>>      
> Yeah, we have our own id.  I am more concerned about making this design
> make sense outside of PCI oriented environments.
>    

IIRC we reuse the PCI IDs for non-PCI.




>>>> That's a bug, not a feature.  It means poor scaling as the number of
>>>> vcpus increases and as the number of devices increases.
>>>>          
> vcpu increases, I agree (and am ok with, as I expect low vcpu count
> machines to be typical).

I'm not okay with it.  If you wish people to adopt vbus over virtio 
you'll have to address all concerns, not just yours.

> nr of devices, I disagree.  can you elaborate?
>    

With message queueing, I retract my remark.

>> Windows,
>>      
> Work in progress.
>    

Interesting.  Do you plan to open source the code?  If not, will the 
binaries be freely available?

>    
>> large guests
>>      
> Can you elaborate?  I am not familiar with the term.
>    

Many vcpus.

>    
>> and multiqueue out of your design.
>>      
> AFAICT, multiqueue should work quite nicely with vbus.  Can you
> elaborate on where you see the problem?
>    

You said previously that you weren't interested in it, IIRC.

>>>> x86 APIC is priority aware.
>>>>
>>>>          
>>> Have you ever tried to use it?
>>>
>>>        
>> I haven't, but Windows does.
>>      
> Yeah, it doesn't really work well.  It's an extremely rigid model that
> (IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one
> level, 16-31 are another, etc).  Most of the embedded PICs I have worked
> with supported direct remapping, etc.  But in any case, Linux doesn't
> support it so we are hosed no matter how good it is.
>    

I agree that it isn't very clever (not that I am a real time expert) but 
I disagree about dismissing Linux support so easily.  If prioritization 
is such a win it should be a win on the host too, and we should make 
it work there as well.  Further I don't see how priorities on the 
guest can work if they don't on the host.

>>>
>>>        
>> They had to build connectors just like you propose to do.
>>      
> More importantly, they had to build back-end busses too, no?
>    

They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and 
something similar for lguest.

>> But you still need vbus-connector-lguest and vbus-connector-s390 because
>> they all talk to the host differently.  So what's changed?  the names?
>>      
> The fact that they don't need to redo most of the in-kernel backend
> stuff.  Just the connector.
>    

So they save 414 lines but have to write a connector which is... how large?

>> Well, venet doesn't complement virtio-net, and virtio-pci doesn't
>> complement vbus-connector.
>>      
> Agreed, but virtio complements vbus by virtue of virtio-vbus.
>    

I don't see what vbus adds to virtio-net.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for  vbus_driver objects
  2009-08-19  6:40                       ` Gregory Haskins
@ 2009-08-19  7:13                         ` Avi Kivity
  2009-08-19 11:40                           ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19  7:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Gregory Haskins, Anthony Liguori, Ingo Molnar, alacrityvm-devel,
	Michael S. Tsirkin, kvm, linux-kernel, netdev

On 08/19/2009 09:40 AM, Gregory Haskins wrote:
>
>    
>>> So if I whip up a virtio-net backend for vbus with a PCI compliant
>>> connector, you are happy?
>>>
>>>        
>> This doesn't improve virtio-net in any way.
>>      
> Any why not?  (Did you notice I said "PCI compliant", i.e. over virtio-pci)
>    

Because virtio-net will have gained nothing that it didn't have before.




>> virtio already supports this model; see lguest and s390.  Transporting
>> virtio over vbus and vbus over something else doesn't gain anything over
>> directly transporting virtio over that something else.
>>      
> This is not what I am advocating.
>
>    

What are you advocating?  As far as I can tell your virtio-vbus 
connector plus the vbus-kvm connector is just that.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for  vbus_driver objects
  2009-08-19  7:13                         ` Avi Kivity
@ 2009-08-19 11:40                           ` Gregory Haskins
  2009-08-19 11:49                             ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 11:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, alacrityvm-devel,
	Michael S. Tsirkin, kvm, linux-kernel, netdev

>>> On 8/19/2009 at  3:13 AM, in message <4A8BA635.9010902@redhat.com>, Avi Kivity
<avi@redhat.com> wrote: 
> On 08/19/2009 09:40 AM, Gregory Haskins wrote:
>>
>>    
>>>> So if I whip up a virtio-net backend for vbus with a PCI compliant
>>>> connector, you are happy?
>>>>
>>>>        
>>> This doesn't improve virtio-net in any way.
>>>      
>> Any why not?  (Did you notice I said "PCI compliant", i.e. over virtio-pci)
>>    
> 
> Because virtio-net will have gained nothing that it didn't have before.

??

*) ABI is virtio-pci compatible, as you like
*) fast-path is in-kernel, as we all like
*) model is in vbus so it would work in all environments that vbus supports.

> 
> 
> 
> 
>>> virtio already supports this model; see lguest and s390.  Transporting
>>> virtio over vbus and vbus over something else doesn't gain anything over
>>> directly transporting virtio over that something else.
>>>      
>> This is not what I am advocating.
>>
>>    
> 
> What are you advocating?  As far as I can tell your virtio-vbus 
> connector plus the vbus-kvm connector is just that.

I wouldn't classify it anything like that, no.  Its just virtio over vbus.

-Greg






^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for   vbus_driver objects
  2009-08-19 11:40                           ` Gregory Haskins
@ 2009-08-19 11:49                             ` Avi Kivity
  2009-08-19 11:52                               ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 11:49 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Ingo Molnar, Gregory Haskins, alacrityvm-devel,
	Michael S. Tsirkin, kvm, linux-kernel, netdev

On 08/19/2009 02:40 PM, Gregory Haskins wrote:
>
>>>>> So if I whip up a virtio-net backend for vbus with a PCI compliant
>>>>> connector, you are happy?
>>>>>
>>>>>
>>>>>            
>>>> This doesn't improve virtio-net in any way.
>>>>
>>>>          
>>> Any why not?  (Did you notice I said "PCI compliant", i.e. over virtio-pci)
>>>
>>>        
>> Because virtio-net will have gained nothing that it didn't have before.
>>      
> ??
>
> *) ABI is virtio-pci compatible, as you like
>    

That's not a gain, that's staying in the same place.

> *) fast-path is in-kernel, as we all like
>    

That's not a gain as we have vhost-net (sure, in development, but your 
proposed backend isn't even there yet).

> *) model is in vbus so it would work in all environments that vbus supports.
>    

The ABI can be virtio-pci compatible or it can be vbus-compatible.  How 
can it be both?  The ABIs are different.

Note that if you had submitted a virtio-net backend I'd have asked you 
to strip away all the management / bus layers and we'd have ended up 
with vhost-net.

>>>> virtio already supports this model; see lguest and s390.  Transporting
>>>> virtio over vbus and vbus over something else doesn't gain anything over
>>>> directly transporting virtio over that something else.
>>>>
>>>>          
>>> This is not what I am advocating.
>>>
>>>
>>>        
>> What are you advocating?  As far as I can tell your virtio-vbus
>> connector plus the vbus-kvm connector is just that.
>>      
> I wouldn't classify it anything like that, no.  Its just virtio over vbus.
>    

We're in a loop.  Doesn't virtio over vbus need a virtio-vbus 
connector?  and doesn't vbus need a connector to talk to the hypervisor?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for   vbus_driver objects
  2009-08-19 11:49                             ` Avi Kivity
@ 2009-08-19 11:52                               ` Gregory Haskins
  0 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 11:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Anthony Liguori, Ingo Molnar, alacrityvm-devel,
	Michael S. Tsirkin, kvm, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

Avi Kivity wrote:
> On 08/19/2009 02:40 PM, Gregory Haskins wrote:
>>
>>>>>> So if I whip up a virtio-net backend for vbus with a PCI compliant
>>>>>> connector, you are happy?
>>>>>>
>>>>>>
>>>>>>            
>>>>> This doesn't improve virtio-net in any way.
>>>>>
>>>>>          
>>>> Any why not?  (Did you notice I said "PCI compliant", i.e. over
>>>> virtio-pci)
>>>>
>>>>        
>>> Because virtio-net will have gained nothing that it didn't have before.
>>>      
>> ??
>>
>> *) ABI is virtio-pci compatible, as you like
>>    
> 
> That's not a gain, that's staying in the same place.
> 
>> *) fast-path is in-kernel, as we all like
>>    
> 
> That's not a gain as we have vhost-net (sure, in development, but your
> proposed backend isn't even there yet).
> 
>> *) model is in vbus so it would work in all environments that vbus
>> supports.
>>    
> 
> The ABI can be virtio-pci compatible or it can be vbus-compatible.  How
> can it be both?  The ABIs are different.
> 
> Note that if you had submitted a virtio-net backend I'd have asked you
> to strip away all the management / bus layers and we'd have ended up
> with vhost-net.

Sigh...


> 
>>>>> virtio already supports this model; see lguest and s390.  Transporting
>>>>> virtio over vbus and vbus over something else doesn't gain anything
>>>>> over
>>>>> directly transporting virtio over that something else.
>>>>>
>>>>>          
>>>> This is not what I am advocating.
>>>>
>>>>
>>>>        
>>> What are you advocating?  As far as I can tell your virtio-vbus
>>> connector plus the vbus-kvm connector is just that.
>>>      
>> I wouldn't classify it anything like that, no.  Its just virtio over
>> vbus.
>>    
> 
> We're in a loop.  Doesn't virtio over vbus need a virtio-vbus
> connector?  and doesn't vbus need a connector to talk to the hypervisor?
> 

No, it doesn't work like that.  There is only one connector.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:22                     ` Avi Kivity
@ 2009-08-19 13:27                       ` Gregory Haskins
  2009-08-19 14:35                         ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 13:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder

[-- Attachment #1: Type: text/plain, Size: 11351 bytes --]

Avi Kivity wrote:
> On 08/19/2009 07:27 AM, Gregory Haskins wrote:
>>
>>> This thread started because i asked you about your technical
>>> arguments why we'd want vbus instead of virtio.
>>>      
>> (You mean vbus vs pci, right?  virtio works fine, is untouched, and is
>> out-of-scope here)
>>    
> 
> I guess he meant venet vs virtio-net.  Without venet vbus is currently
> userless.
> 
>> Right, and I do believe I answered your questions.  Do you feel as
>> though this was not a satisfactory response?
>>    
> 
> Others and I have shown you its wrong.

No, you have shown me that you disagree.  I'm sorry, but do not assume
they are the same.

Case in point: You also said that threading the ethernet model was wrong
when I proposed it, and later conceded that you were wrong when I showed
you the numbers.  I don't say this to be a jerk.  I am wrong myself
all the time too.

I only say it to highlight that perhaps we just don't (yet) see each
others POV.  Therefore, do not be so quick to put a "wrong" label on
something, especially when the line of questioning/debate indicates to
me that there are still fundamental issues in understanding exactly how
things work.

>  There's no inherent performance
> problem in pci.  The vbus approach has inherent problems (the biggest of
> which is compatibility

Trying to be backwards compatible in all dimensions is not a design
goal, as already stated.


> , the second manageability).
> 

Where are the management problems?


>>> Your answer above
>>> now basically boils down to: "because I want it so, why dont you
>>> leave me alone".
>>>      
>> Well, with all due respect, please do not put words in my mouth.  This
>> is not what I am saying at all.
>>
>> What I *am* saying is:
>>
>> fact: this thread is about linux guest drivers to support vbus
>>
>> fact: these drivers do not touch kvm code.
>>
fact: these drivers do not force kvm to alter its operation in any way.
>>
>> fact: these drivers do not alter ABIs that KVM currently supports.
>>
>> Therefore, all this talk about "abandoning", "supporting", and
>> "changing" things in KVM is, premature, irrelevant, and/or, FUD.  No one
>> proposed such changes, so I am highlighting this fact to bring the
>> thread back on topic.  That KVM talk is merely a distraction at this
>> point in time.
>>    
> 
> s/kvm/kvm stack/.  virtio/pci is part of the kvm stack, even if it is
> not part of kvm itself.  If vbus/venet were to be merged, users and
> developers would have to choose one or the other.  That's the
> fragmentation I'm worried about.  And you can prefix that with "fact:"
> as well.

Noted

> 
>>> We all love faster code and better management interfaces and tons
>>> of your prior patches got accepted by Avi. This time you didnt even
>>> _try_ to improve virtio.
>>>      
>> Im sorry, but you are mistaken:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html
>>    
> 
> That does nothing to improve virtio.

I'm sorry, but that's just plain false.

> Existing guests (Linux and
> Windows) which support virtio will cease to work if the host moves to
> vbus-virtio.

Sigh... please re-read the "fact" section.  And even if this work is accepted
upstream as it is, how you configure the host and guest is just that: a
configuration.  If your guest and host both speak vbus, use it.  If they
don't, don't use it.  Simple as that.  Saying anything else is just more
FUD, and I can say the same thing about a variety of other configuration
options currently available.


> Existing hosts (running virtio-pci) won't be able to talk
> to newer guests running virtio-vbus.  The patch doesn't improve
> performance without the entire vbus stack in the host kernel and a
> vbus-virtio-net-host host kernel driver.

<rewind years=2>Existing hosts (running realtek emulation) won't be able
to talk to newer guests running virtio-net.  Virtio-net doesn't do
anything to improve realtek emulation without the entire virtio stack in
the host.</rewind>

You gotta start somewhere.   Your argument buys you nothing other than
backwards compat, which I've already stated is not a specific goal here.
 I am not against "modprobe vbus-pcibridge", and I am sure there are
users out there that do not object to this either.

> 
> Perhaps if you posted everything needed to make vbus-virtio work and
> perform we could compare that to vhost-net and you'll see another reason
> why vhost-net is the better approach.

Yet, you must recognize that an alternative outcome is that we can look
at issues outside of virtio-net on KVM and perhaps you will see vbus is
a better approach.

> 
>> You are also wrong to say that I didn't try to avoid creating a
>> downstream effort first.   I believe the public record of the mailing
>> lists will back me up that I tried politely pushing this directly though
>> kvm first.  It was only after Avi recently informed me that they would
>> be building their own version of an in-kernel backend in lieu of working
>> with me to adapt vbus to their needs that I decided to put my own
>> project together.
>>    
> 
> There's no way we can adapt vbus to our needs.

Really?  Did you ever bother to ask how?  I'm pretty sure you can.  And
if you couldn't, I would have considered changes to make it work.


> Don't you think we'd preferred it rather than writing our own?

Honestly, I am not so sure based on your responses.

>  the current virtio-net issues
> are hurting us.

Indeed.

> 
> Our needs are compatibility, performance, and managability.  vbus fails
> all three, your impressive venet numbers notwithstanding.
> 
>> What should I have done otherwise, in your opinion?
>>    
> 
> You could come up with uses where vbus truly is superior to
> virtio/pci/whatever

I've already listed numerous examples of why I advocate vbus over PCI,
and have already stated I am not competing against virtio.

> (not words about etch constraints).

I was asked about the design, and that was background on some of my
motivations.   Don't try to spin that into something it's not.

> Showing some of those non-virt uses, for example.

Actually, Ira's chassis discussed earlier is a classic example.  Vbus
actually fits neatly into his model, I believe (and much better than the
vhost proposals, IMO).

Basically, IMO we want to invert Ira's bus (so that the PPC boards see
host-based devices, instead of the other way around).  You write a
connector that transports the vbus verbs over the PCI link.  You write a
udev rule that responds to the PPC board "arrival" event to create a new
vbus container, and assign the board to that context.

Then, whatever devices you instantiate in the vbus container will
surface on the PPC board's "vbus-proxy" bus.  This can include "virtio"
type devices which are serviced by the virtio-vbus code to render these
devices to the virtio-bus.  Finally, drivers like virtio-net and
virtio-console load and run normally.

The host-side administers the available inventory on a per-board basis
and its configuration using sysfs operations.
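
To make the "connector" idea concrete, here is a purely illustrative
sketch in C.  None of these names come from the actual vbus patches
(they are hypothetical); the point is only the shape of the job: ferry
the generic vbus verbs (device calls and shared-memory signals) across
whatever physical link you have, in this case the PCI link to the PPC
boards.

struct vbus_pci_link;		/* opaque handle for the PCI transport */

struct demo_connector_ops {
	/* issue a device "call" verb on behalf of the remote board */
	int  (*devcall)(struct vbus_pci_link *link, unsigned long devid,
			unsigned long func, void *data, unsigned long len);

	/* raise a shared-memory signal (e.g. a ring notification) */
	void (*signal)(struct vbus_pci_link *link, unsigned long handle);
};

The idea is that the transport details stay inside the connector, so the
device models themselves never need to know they are talking over PCI.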

> The fact that your only user duplicates existing functionality doesn't help.

Certainly at some level, that is true and is unfortunate, I agree.  In
retrospect, I wish I had started with something non-overlapping with
virtio as the demo, just to avoid this aspect of controversy.

At another level, it's the highest-performance 802.x interface for KVM at
the moment, since we still have not seen benchmarks for vhost.  Given
that I have spent a lot of time lately optimizing KVM, I can tell you
it's not trivial to get it to work better than the userspace virtio.
Michael is clearly a smart guy, so the odds are in his favor.  But do
not count your chickens before they hatch, because success is not
guaranteed.

Long story short, my patches are not duplicative on all levels (i.e.
performance).  It's just another ethernet driver, of which there are
probably hundreds of alternatives in the kernel already.  You could also
argue that we already have multiple models in qemu (realtek, e1000,
virtio-net, etc) so this is not without precedent.  So really all this
"fragmentation" talk is FUD.  Let's stay on-point, please.

> 
> 
>>> And fragmentation matters quite a bit. To Linux users, developers,
>>> administrators, packagers it's a big deal whether two overlapping
>>> pieces of functionality for the same thing exist within the same
>>> kernel.
>>>      
>> So the only thing that could be construed as overlapping here is venet
>> vs virtio-net. If I dropped the contentious venet and focused on making
>> a virtio-net backend that we can all re-use, do you see that as a path
>> of compromise here?
>>    
> 
> That's a step in the right direction.

Ok.  I am concerned it would be a waste of my time given your current
statements regarding the backend aspects of my design.

Can we talk more about that at some point?  I think you will see it's not
some "evil, heavy duty" infrastructure that some comments seem to be
trying to paint it as.  I think it's similar in concept to what you need
to do for a vhost-like design, but (with all due respect to Michael) with
a little more thought put into the necessary abstraction points to allow
broader application.

> 
>>> I certainly dont want that. Instead we (at great expense and work)
>>> try to reach the best technical solution.
>>>      
>> This is all I want, as well.
>>    
> 
> Note whenever I mention migration, large guests, or Windows you say
> these are not your design requirements.

Actually, I don't think I've ever said that, per se.  I said that those
things are not a priority for me, personally.  I never made a design
decision that I knew would preclude the support for such concepts.  In
fact, afaict, the design would support them just fine, given the
resources to develop them.

For the record: I never once said "vbus is done".  There is plenty of
work left to do.  This is natural (kvm I'm sure wasn't 100% when it went
in either, nor is it today)


> The best technical solution will have to consider those.

We are on the same page here.

> 
>>> If the community wants this then why cannot you convince one of the
>>> most prominent representatives of that community, the KVM
>>> developers?
>>>      
>> Its a chicken and egg at times.  Perhaps the KVM developers do not have
>> the motivation or time to properly consider such a proposal _until_ the
>> community presents its demand.
> 
> I've spent quite a lot of time arguing with you, no doubt influenced by
> the fact that you can write a lot faster than I can read.

:)

> 
>>> Furthermore, 99% of your work is KVM
>>>      
>> Actually, no.  Almost none of it is.  I think there are about 2-3
>> patches in the series that touch KVM, the rest are all original (and
>> primarily stand-alone code).  AlacrityVM is the application of kvm and
>> vbus (and, of course, Linux) together as a complete unit, but I do not
>> try to hide this relationship.
>>
>> By your argument, KVM is 99% QEMU+Linux. ;)
>>    
> 
> That's one of the kvm strong points...

As is AlacrityVM's, as well ;)

Kind Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:36                   ` Gregory Haskins
  2009-08-19  5:48                     ` Avi Kivity
@ 2009-08-19 14:33                     ` Michael S. Tsirkin
  2009-08-20 12:12                     ` Michael S. Tsirkin
  2 siblings, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-19 14:33 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Wed, Aug 19, 2009 at 01:36:14AM -0400, Gregory Haskins wrote:
> Please post results when you have numbers, as I had to
> give up my 10GE rig in the lab.
> I suspect you will have performance
> issues until you at least address GSO, but you may already be there by now.

Yes, measuring streaming bandwidth probably does not make sense yet, as
I do not have GSO, and I do not have VM exit mitigation. But RSN.

Meanwhile udp_rr does not need any of these, so I checked that and the
numbers look like what you'd expect.  My systems seem slower than yours,
but the virtualization overhead is the same: around 20us (sometimes it's
a bit higher, up to 25us).

host to host:

[root@virtlab18 netperf-2.4.5]# ~mst/netperf-2.4.5/bin/netperf -H 20.1.50.1 -t udp_rr
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 20.1.50.1 (20.1.50.1) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    13890.41
124928 124928

host to guest:

[root@virtlab18 linux-2.6]# ~mst/netperf-2.4.5/bin/netperf -H 20.1.50.3 -t udp_rr
UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 20.1.50.3 (20.1.50.3) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    10884.78
124928 124928
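
(Sanity-checking the ~20us figure against the numbers above: udp_rr
reports transactions per second, so the per-round-trip times and their
difference fall out directly.)

#include <stdio.h>

int main(void)
{
	double host_to_host  = 13890.41;	/* transactions/sec */
	double host_to_guest = 10884.78;	/* transactions/sec */

	double rtt_hh = 1e6 / host_to_host;	/* ~72.0 us per round trip */
	double rtt_hg = 1e6 / host_to_guest;	/* ~91.9 us per round trip */

	printf("overhead: %.1f us\n", rtt_hg - rtt_hh);	/* ~19.9 us */
	return 0;
}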


-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 13:27                       ` Gregory Haskins
@ 2009-08-19 14:35                         ` Avi Kivity
  0 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 14:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder

On 08/19/2009 04:27 PM, Gregory Haskins wrote:
>>   There's no inherent performance
>> problem in pci.  The vbus approach has inherent problems (the biggest of
>> which is compatibility
>>      
> Trying to be backwards compatible in all dimensions is not a design
> goal, as already stated.
>    

It's important to me.  If you ignore what's important to me, don't
expect me to support your code.

>
> , the second managability).
>    
>>      
> Where are the management problems?
>    

Requiring root, and negotiation in the kernel (making it harder to set
up a compatible "migration pool"; but wait, you don't care about
migration either).


> No, you have shown me that you disagree.  I'm sorry, but do not assume
> they are the same.

[...]

> I'm sorry, but that's just plain false.
>    

Don't you mean, "I disagree, but that's completely different from you
being wrong"?

>> Existing guests (Linux and
>> Windows) which support virtio will cease to work if the host moves to
>> vbus-virtio.
>>      
> Sigh...please re-read "fact" section.  And even if this work is accepted
> upstream as it is, how you configure the host and guest is just that: a
> configuration.  If your guest and host both speak vbus, use it.  If they
> don't, don't use it.  Simple as that.  Saying anything else is just more
> FUD, and I can say the same thing about a variety of other configuration
> options currently available.
>    

The host, yes.  The guest, no.  I have RHEL 5.3 and Windows guests that 
work with virtio now, and I'd like to keep it that way.  Given that I 
need to keep the current virtio-net/pci ABI, I have no motivation to add 
other ABIs.  Given that host userspace configuration works, I have no 
motivation to move it into a kernel configfs/vbus based system.  The 
only thing that's hurting me is virtio-net's performance problems and 
we're addressing it by moving the smallest possible component into the 
kernel: vhost-net.


>> Existing hosts (running virtio-pci) won't be able to talk
>> to newer guests running virtio-vbus.  The patch doesn't improve
>> performance without the entire vbus stack in the host kernel and a
>> vbus-virtio-net-host host kernel driver.
>>      
> <rewind years=2>Existing hosts (running realtek emulation) won't be able
> to talk to newer guests running virtio-net.  Virtio-net doesn't do
> anything to improve realtek emulation without the entire virtio stack in
> the host.</rewind>
>
> You gotta start somewhere.   Your argument buys you nothing other than
> backwards compat, which I've already stated is not a specific goal here.
>   I am not against "modprobe vbus-pcibridge", and I am sure there are
> users out there that do not object to this either.
>    

Two years ago we had something that was set in stone and had a very 
limited performance future.  That's not the case now.  If every two 
years we start from scratch we'll be in a pretty pickle fairly soon.

virtio-net/pci is here to stay.  I see no convincing reason to pour 
efforts into a competitor and then have to support both.

>> Perhaps if you posted everything needed to make vbus-virtio work and
>> perform we could compare that to vhost-net and you'll see another reason
>> why vhost-net is the better approach.
>>      
> Yet, you must recognize that an alternative outcome is that we can look
> at issues outside of virtio-net on KVM and perhaps you will see vbus is
> a better approach.
>    

We won't know until that experiment takes place.

>>> You are also wrong to say that I didn't try to avoid creating a
>>> downstream effort first.   I believe the public record of the mailing
>>> lists will back me up that I tried politely pushing this directly though
>>> kvm first.  It was only after Avi recently informed me that they would
>>> be building their own version of an in-kernel backend in lieu of working
>>> with me to adapt vbus to their needs that I decided to put my own
>>> project together.
>>>
>>>        
>> There's no way we can adapt vbus to our needs.
>>      
> Really?  Did you ever bother to ask how?  I'm pretty sure you can.  And
> if you couldn't, I would have considered changes to make it work.
>    

Our needs are: compatibility, live migration, Windows, manageability
(non-root, userspace control over configuration).  Non-requirements but
highly desirable: minimal kernel impact.

>> Don't you think we'd preferred it rather than writing our own?
>>      
> Honestly, I am not so sure based on your responses.
>    

Does your experience indicate that I reject patches from others in 
favour of writing my own?

Look for your own name in the kernel's git log.

> I've already listed numerous examples on why I advocate vbus over PCI,
> and have already stated I am not competing against virtio.
>    

Well, your examples didn't convince me, and vbus's deficiencies
(compatibility, live migration, Windows, manageability, kernel impact)
aren't helping.

>> Showing some of those non-virt uses, for example.
>>      
> Actually, Ira's chassis discussed earlier is a classic example.  Vbus
> actually fits neatly into his model, I believe (and much better than the
> vhost proposals, IMO).
>
> Basically, IMO we want to invert Ira's bus (so that the PPC boards see
> host-based devices, instead of the other way around).  You write a
> connector that transports the vbus verbs over the PCI link.  You write a
> udev rule that responds to the PPC board "arrival" event to create a new
> vbus container, and assign the board to that context.
>    

It's not inverted at all.  vhost-net corresponds to the device side, 
where a real NIC's DMA engine lives, while virtio-net is the guest side 
which drives the device and talks only to its main memory (and device 
registers).  It may seem backwards but it's quite natural when you 
consider DMA.

If you wish to push vbus for non-virt uses, I have nothing to say.  If
you wish to push vbus for some other hypervisor (like AlacrityVM),
that's the other hypervisor's maintainer's turf.  But vbus as I
understand it doesn't suit kvm's needs (compatibility, live migration,
Windows, manageability, kernel impact).

>> The fact that your only user duplicates existing functionality doesn't help.
>>      
> Certainly at some level, that is true and is unfortunate, I agree.  In
> retrospect, I wish I started with something non-overlapping with virtio
> as the demo, just to avoid this aspect of controversy.
>
> At another level, its the highest-performance 802.x interface for KVM at
> the moment, since we still have not seen benchmarks for vhost.  Given
> that I have spent a lot of time lately optimizing KVM, I can tell you
> its not trivial to get it to work better than the userspace virtio.
> Michael is clearly a smart guy, so the odds are in his favor.  But do
> not count your chickens before they hatch, because its not guaranteed
> success.
>    

Well the latency numbers seem to match (after normalizing for host-host 
baseline).  Obviously throughput needs more work, but I have confidence 
we'll see pretty good results.

> Long story short, my patches are not duplicative on all levels (i.e.
> performance).  Its just another ethernet driver, of which there are
> probably hundreds of alternatives in the kernel already.  You could also
> argue that we already have multiple models in qemu (realtek, e1000,
> virtio-net, etc) so this is not without precedent.  So really all this
> "fragmentation" talk is FUD.  Lets stay on-point, please.
>    

It's not FUD, and please talk technically rather than throwing words
around.  If there are a limited number of kvm developers, then every new
device
dilutes the effort.  Further, e1000 and friends don't need drivers for a 
bunch of OSs, v* do.

> Can we talk more about that at some point?  I think you will see its not
> some "evil, heavy duty" infrastructure that some comments seem to be
> trying to paint it as.  I think its similar in concept to what you need
> to do for a vhost like design, but (with all due respect to Michael) a
> little bit more thought into the necessary abstraction points to allow
> broader application.
>    

vhost-net only pumps the rings.  It leaves everything else for 
userspace.  vbus/venet leave almost nothing to userspace.

vbus redoes everything that the guest's native bus provides; virtio-pci
relies on pci.  I haven't called it evil or heavy duty, just unnecessary.

(btw, your current alacrityvm patch is larger than kvm when it was first 
merged into Linux)

>>>
>>>        
>> Note whenever I mention migration, large guests, or Windows you say
>> these are not your design requirements.
>>      
> Actually, I don't think I've ever said that, per se.  I said that those
> things are not a priority for me, personally.  I never made a design
> decision that I knew would preclude the support for such concepts.  In
> fact, afaict, the design would support them just fine, given resources
> the develop them.
>    

So given three choices:

1. merge vbus without those things that we need
2. merge vbus and start working on them
3. not merge vbus

As choice 1 gives me nothing and choice 2 takes away development effort, 
choice 3 is the winner.

> For the record: I never once said "vbus is done".  There is plenty of
> work left to do.  This is natural (kvm I'm sure wasn't 100% when it went
> in either, nor is it today)
>    

Which is why I want to concentrate effort in one direction, not wander 
off in many.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:40                                 ` Avi Kivity
@ 2009-08-19 15:28                                   ` Ira W. Snyder
  2009-08-19 15:37                                     ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-19 15:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Wed, Aug 19, 2009 at 08:40:33AM +0300, Avi Kivity wrote:
> On 08/19/2009 03:38 AM, Ira W. Snyder wrote:
>> On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote:
>>    
>>> On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
>>>      
>>>> On a non shared-memory system (where the guest's RAM is not just a chunk
>>>> of userspace RAM in the host system), virtio's management model seems to
>>>> fall apart. Feature negotiation doesn't work as one would expect.
>>>>
>>>>        
>>> In your case, virtio-net on the main board accesses PCI config space
>>> registers to perform the feature negotiation; software on your PCI cards
>>> needs to trap these config space accesses and respond to them according
>>> to virtio ABI.
>>>
>>>      
>> Is this "real PCI" (physical hardware) or "fake PCI" (software PCI
>> emulation) that you are describing?
>>
>>    
>
> Real PCI.
>
>> The host (x86, PCI master) must use "real PCI" to actually configure the
>> boards, enable bus mastering, etc. Just like any other PCI device, such
>> as a network card.
>>
>> On the guests (ppc, PCI agents) I cannot add/change PCI functions (the
>> last .[0-9] in the PCI address) nor can I change PCI BAR's once the
>> board has started. I'm pretty sure that would violate the PCI spec,
>> since the PCI master would need to re-scan the bus, and re-assign
>> addresses, which is a task for the BIOS.
>>    
>
> Yes.  Can the boards respond to PCI config space cycles coming from the  
> host, or is the config space implemented in silicon and immutable?   
> (reading on, I see the answer is no).  virtio-pci uses the PCI config  
> space to configure the hardware.
>

Yes, the PCI config space is implemented in silicon. I can change a few
things (mostly PCI BAR attributes), but not much.

>>> (There's no real guest on your setup, right?  just a kernel running on
>>> and x86 system and other kernels running on the PCI cards?)
>>>
>>>      
>> Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's
>> (PCI agents) also run Linux (booted via U-Boot). They are independent
>> Linux systems, with a physical PCI interconnect.
>>
>> The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's
>> PCI stack does bad things as a PCI agent. It always assumes it is a PCI
>> master.
>>
>> It is possible for me to enable CONFIG_PCI=y on the ppc's by removing
>> the PCI bus from their list of devices provided by OpenFirmware. They
>> can not access PCI via normal methods. PCI drivers cannot work on the
>> ppc's, because Linux assumes it is a PCI master.
>>
>> To the best of my knowledge, I cannot trap configuration space accesses
>> on the PCI agents. I haven't needed that for anything I've done thus
>> far.
>>
>>    
>
> Well, if you can't do that, you can't use virtio-pci on the host.   
> You'll need another virtio transport (equivalent to "fake pci" you  
> mentioned above).
>

Ok.

Is there something similar that I can study as an example? Should I look
at virtio-pci?

>>>> This does appear to be solved by vbus, though I haven't written a
>>>> vbus-over-PCI implementation, so I cannot be completely sure.
>>>>
>>>>        
>>> Even if virtio-pci doesn't work out for some reason (though it should),
>>> you can write your own virtio transport and implement its config space
>>> however you like.
>>>
>>>      
>> This is what I did with virtio-over-PCI. The way virtio-net negotiates
>> features makes this work non-intuitively.
>>    
>
> I think you tried to take two virtio-nets and make them talk together?   
> That won't work.  You need the code from qemu to talk to virtio-net  
> config space, and vhost-net to pump the rings.
>

It *is* possible to make two unmodified virtio-nets talk together. I've
done it, and it is exactly what the virtio-over-PCI patch does. Study it
and you'll see how I connected the rx/tx queues together.

The feature negotiation code also works, but in a very unintuitive
manner. I made it work in the virtio-over-PCI patch, but the devices are
hardcoded into the driver. It would be quite a bit of work to swap
virtio-net and virtio-console, for example.

>>>> I'm not at all clear on how to get feature negotiation to work on a
>>>> system like mine. From my study of lguest and kvm (see below) it looks
>>>> like userspace will need to be involved, via a miscdevice.
>>>>
>>>>        
>>> I don't see why.  Is the kernel on the PCI cards in full control of all
>>> accesses?
>>>
>>>      
>> I'm not sure what you mean by this. Could you be more specific? This is
>> a normal, unmodified vanilla Linux kernel running on the PCI agents.
>>    
>
> I meant, does board software implement the config space accesses issued  
> from the host, and it seems the answer is no.
>
>
>> In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote
>> an algorithm to pair the tx/rx queues together. Since virtio-net
>> pre-fills its rx queues with buffers, I was able to use the DMA engine
>> to copy from the tx queue into the pre-allocated memory in the rx queue.
>>
>>    
>
> Please find a name other than virtio-over-PCI since it conflicts with  
> virtio-pci.  You're tunnelling virtio config cycles (which are usually  
> done on pci config cycles) on a new protocol which is itself tunnelled  
> over PCI shared memory.
>

Sorry about that. Do you have suggestions for a better name?

I called it virtio-over-PCI in my previous postings to LKML, so until a
new patch is written and posted, I'll keep referring to it by the name
used in the past, so people can search for it.

When I post virtio patches, should I CC another mailing list in addition
to LKML?

>>>>
>>>>        
>>> Yeah.  You'll need to add byteswaps.
>>>
>>>      
>> I wonder if Rusty would accept a new feature:
>> VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to
>> use LE for all of its multi-byte fields.
>>
>> I don't think the transport should have to care about the endianness.
>>    
>
> Given this is not mainstream use, it would have to have zero impact when  
> configured out.
>

Yes, of course.

That said, I'm not sure how qemu-system-ppc running on x86 could
possibly communicate using virtio-net. This would mean the guest is an
emulated big-endian PPC, while the host is a little-endian x86. I
haven't actually tested this situation, so perhaps I am wrong.

>> True. It's slowpath setup, so I don't care how fast it is. For reasons
>> outside my control, the x86 (PCI master) is running a RHEL5 system. This
>> means glibc-2.5, which doesn't have eventfd support, AFAIK. I could try
>> and push for an upgrade. This obviously makes cat/echo really nice, it
>> doesn't depend on glibc, only the kernel version.
>>
>> I don't give much weight to the above, because I can use the eventfd
>> syscalls directly, without glibc support. It is just more painful.
>>    
>
> The x86 side only needs to run virtio-net, which is present in RHEL 5.3.  
> You'd only need to run virtio-tunnel or however it's called.  All the 
> eventfd magic takes place on the PCI agents.
>

I can upgrade the kernel to anything I want on both the x86 and ppc's.
I'd like to avoid changing the x86 (RHEL5) userspace, though. On the
ppc's, I have full control over the userspace environment.
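
(On the glibc point quoted above: even without the wrapper, the syscall
itself is a one-liner.  A minimal sketch, assuming the installed kernel
headers define __NR_eventfd:)

#include <unistd.h>
#include <sys/syscall.h>	/* syscall(), __NR_eventfd */
#include <stdio.h>

static int raw_eventfd(unsigned int initval)
{
	return syscall(__NR_eventfd, initval);
}

int main(void)
{
	int fd = raw_eventfd(0);

	if (fd < 0) {
		perror("eventfd");
		return 1;
	}
	/* fd can now be handed to the kernel side as usual */
	return 0;
}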

Thanks,
Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 15:28                                   ` Ira W. Snyder
@ 2009-08-19 15:37                                     ` Avi Kivity
  2009-08-19 16:29                                       ` Ira W. Snyder
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 15:37 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/19/2009 06:28 PM, Ira W. Snyder wrote:
>
>> Well, if you can't do that, you can't use virtio-pci on the host.
>> You'll need another virtio transport (equivalent to "fake pci" you
>> mentioned above).
>>
>>      
> Ok.
>
> Is there something similar that I can study as an example? Should I look
> at virtio-pci?
>
>    

There's virtio-lguest, virtio-s390, and virtio-vbus.
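
Roughly, a transport boils down to implementing struct virtio_config_ops
(config space get/set, status byte, feature bits, virtqueue setup) and
registering a struct virtio_device on top of it.  A skeleton, with
hypothetical "phys" names and from-memory prototypes (check
include/linux/virtio_config.h for the exact signatures in your tree):

#include <linux/virtio.h>
#include <linux/virtio_config.h>

static void phys_get(struct virtio_device *vdev, unsigned offset,
		     void *buf, unsigned len)
{
	/* copy 'len' bytes of device config space out of the shared region */
}

static void phys_set(struct virtio_device *vdev, unsigned offset,
		     const void *buf, unsigned len)
{
	/* write device config space */
}

static u8 phys_get_status(struct virtio_device *vdev)
{
	return 0;	/* fetch the status byte from the shared region */
}

static void phys_set_status(struct virtio_device *vdev, u8 status)
{
	/* store the status byte, kick the other side if needed */
}

static void phys_reset(struct virtio_device *vdev)
{
	/* reset the device */
}

static u32 phys_get_features(struct virtio_device *vdev)
{
	return 0;	/* feature bits offered by the device side */
}

static void phys_finalize_features(struct virtio_device *vdev)
{
	/* publish the bits the driver accepted */
}

static struct virtio_config_ops phys_config_ops = {
	.get			= phys_get,
	.set			= phys_set,
	.get_status		= phys_get_status,
	.set_status		= phys_set_status,
	.reset			= phys_reset,
	.get_features		= phys_get_features,
	.finalize_features	= phys_finalize_features,
	/* .find_vqs/.del_vqs would map virtqueues onto your shared memory */
};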

>> I think you tried to take two virtio-nets and make them talk together?
>> That won't work.  You need the code from qemu to talk to virtio-net
>> config space, and vhost-net to pump the rings.
>>
>>      
> It *is* possible to make two unmodified virtio-net's talk together. I've
> done it, and it is exactly what the virtio-over-PCI patch does. Study it
> and you'll see how I connected the rx/tx queues together.
>    

Right, crossing the cables works, but feature negotiation is screwed up, 
and both sides think the data is in their RAM.

vhost-net doesn't do negotiation and doesn't assume the data lives in 
its address space.

>> Please find a name other than virtio-over-PCI since it conflicts with
>> virtio-pci.  You're tunnelling virtio config cycles (which are usually
>> done on pci config cycles) on a new protocol which is itself tunnelled
>> over PCI shared memory.
>>
>>      
> Sorry about that. Do you have suggestions for a better name?
>
>    

virtio-$yourhardware or maybe virtio-dma

> I called it virtio-over-PCI in my previous postings to LKML, so until a
> new patch is written and posted, I'll keep referring to it by the name
> used in the past, so people can search for it.
>
> When I post virtio patches, should I CC another mailing list in addition
> to LKML?
>    

virtualization@lists.linux-foundation.org is virtio's home.

> That said, I'm not sure how qemu-system-ppc running on x86 could
> possibly communicate using virtio-net. This would mean the guest is an
> emulated big-endian PPC, while the host is a little-endian x86. I
> haven't actually tested this situation, so perhaps I am wrong.
>    

I'm confused now.  You don't actually have any guest, do you, so why 
would you run qemu at all?

>> The x86 side only needs to run virtio-net, which is present in RHEL 5.3.
>> You'd only need to run virtio-tunnel or however it's called.  All the
>> eventfd magic takes place on the PCI agents.
>>
>>      
> I can upgrade the kernel to anything I want on both the x86 and ppc's.
> I'd like to avoid changing the x86 (RHEL5) userspace, though. On the
> ppc's, I have full control over the userspace environment.
>    

You don't need any userspace on virtio-net's side.

Your ppc boards emulate a virtio-net device, so all you need is the 
virtio-net module (and virtio bindings).  If you chose to emulate, say, 
an e1000 card all you'd need is the e1000 driver.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 15:37                                     ` Avi Kivity
@ 2009-08-19 16:29                                       ` Ira W. Snyder
  2009-08-19 16:38                                         ` Avi Kivity
  0 siblings, 1 reply; 132+ messages in thread
From: Ira W. Snyder @ 2009-08-19 16:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins

On Wed, Aug 19, 2009 at 06:37:06PM +0300, Avi Kivity wrote:
> On 08/19/2009 06:28 PM, Ira W. Snyder wrote:
>>
>>> Well, if you can't do that, you can't use virtio-pci on the host.
>>> You'll need another virtio transport (equivalent to "fake pci" you
>>> mentioned above).
>>>
>>>      
>> Ok.
>>
>> Is there something similar that I can study as an example? Should I look
>> at virtio-pci?
>>
>>    
>
> There's virtio-lguest, virtio-s390, and virtio-vbus.
>
>>> I think you tried to take two virtio-nets and make them talk together?
>>> That won't work.  You need the code from qemu to talk to virtio-net
>>> config space, and vhost-net to pump the rings.
>>>
>>>      
>> It *is* possible to make two unmodified virtio-net's talk together. I've
>> done it, and it is exactly what the virtio-over-PCI patch does. Study it
>> and you'll see how I connected the rx/tx queues together.
>>    
>
> Right, crossing the cables works, but feature negotiation is screwed up,  
> and both sides think the data is in their RAM.
>
> vhost-net doesn't do negotiation and doesn't assume the data lives in  
> its address space.
>

Yes, that is exactly what I did: crossed the cables (in software).

I'll take a closer look at vhost-net now, and make sure I understand how
it works.

>>> Please find a name other than virtio-over-PCI since it conflicts with
>>> virtio-pci.  You're tunnelling virtio config cycles (which are usually
>>> done on pci config cycles) on a new protocol which is itself tunnelled
>>> over PCI shared memory.
>>>
>>>      
>> Sorry about that. Do you have suggestions for a better name?
>>
>>    
>
> virtio-$yourhardware or maybe virtio-dma
>

How about virtio-phys?

Arnd and BenH are both looking at PPC systems (similar to mine). Grant
Likely is looking at talking to a processor core running on an FPGA,
IIRC. Most of the code can be shared, very little should need to be
board-specific, I hope.

>> I called it virtio-over-PCI in my previous postings to LKML, so until a
>> new patch is written and posted, I'll keep referring to it by the name
>> used in the past, so people can search for it.
>>
>> When I post virtio patches, should I CC another mailing list in addition
>> to LKML?
>>    
>
> virtualization@lists.linux-foundation.org is virtio's home.
>
>> That said, I'm not sure how qemu-system-ppc running on x86 could
>> possibly communicate using virtio-net. This would mean the guest is an
>> emulated big-endian PPC, while the host is a little-endian x86. I
>> haven't actually tested this situation, so perhaps I am wrong.
>>    
>
> I'm confused now.  You don't actually have any guest, do you, so why  
> would you run qemu at all?
>

I do not run qemu. I am just stating a problem with virtio-net that I
noticed. This is just so someone more knowledgeable can be aware of the
problem.

>>> The x86 side only needs to run virtio-net, which is present in RHEL 5.3.
>>> You'd only need to run virtio-tunnel or however it's called.  All the
>>> eventfd magic takes place on the PCI agents.
>>>
>>>      
>> I can upgrade the kernel to anything I want on both the x86 and ppc's.
>> I'd like to avoid changing the x86 (RHEL5) userspace, though. On the
>> ppc's, I have full control over the userspace environment.
>>    
>
> You don't need any userspace on virtio-net's side.
>
> Your ppc boards emulate a virtio-net device, so all you need is the  
> virtio-net module (and virtio bindings).  If you chose to emulate, say,  
> an e1000 card all you'd need is the e1000 driver.
>

Thanks for the replies.
Ira

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 16:29                                       ` Ira W. Snyder
@ 2009-08-19 16:38                                         ` Avi Kivity
  2009-08-19 21:05                                           ` Hollis Blanchard
  0 siblings, 1 reply; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 16:38 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Michael S. Tsirkin, Gregory Haskins, kvm, netdev, linux-kernel,
	alacrityvm-devel, Anthony Liguori, Ingo Molnar, Gregory Haskins,
	Hollis Blanchard

On 08/19/2009 07:29 PM, Ira W. Snyder wrote:
>
>    
>> virtio-$yourhardware or maybe virtio-dma
>>
>>      
> How about virtio-phys?
>    

Could work.

> Arnd and BenH are both looking at PPC systems (similar to mine). Grant
> Likely is looking at talking to a processor core running on an FPGA,
> IIRC. Most of the code can be shared, very little should need to be
> board-specific, I hope.
>    

Excellent.

>>> That said, I'm not sure how qemu-system-ppc running on x86 could
>>> possibly communicate using virtio-net. This would mean the guest is an
>>> emulated big-endian PPC, while the host is a little-endian x86. I
>>> haven't actually tested this situation, so perhaps I am wrong.
>>>
>>>        
>> I'm confused now.  You don't actually have any guest, do you, so why
>> would you run qemu at all?
>>
>>      
> I do not run qemu. I am just stating a problem with virtio-net that I
> noticed. This is just so someone more knowledgeable can be aware of the
> problem.
>
>    

Ah, it certainly doesn't byteswap.  Maybe nobody tried it.  Hollis?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  7:11                       ` Avi Kivity
@ 2009-08-19 18:23                         ` Nicholas A. Bellinger
  2009-08-19 18:39                           ` Gregory Haskins
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
  2009-08-19 18:26                         ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
  1 sibling, 2 replies; 132+ messages in thread
From: Nicholas A. Bellinger @ 2009-08-19 18:23 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote:
> On 08/19/2009 09:28 AM, Gregory Haskins wrote:
> > Avi Kivity wrote:

<SNIP>

> > Basically, what it comes down to is both vbus and vhost need
> > configuration/management.  Vbus does it with sysfs/configfs, and vhost
> > does it with ioctls.  I ultimately decided to go with sysfs/configfs
> > because, at least that the time I looked, it seemed like the "blessed"
> > way to do user->kernel interfaces.
> >    
> 
> I really dislike that trend but that's an unrelated discussion.
> 
> >> They need to be connected to the real world somehow.  What about
> >> security?  can any user create a container and devices and link them to
> >> real interfaces?  If not, do you need to run the VM as root?
> >>      
> > Today it has to be root as a result of weak mode support in configfs, so
> > you have me there.  I am looking for help patching this limitation, though.
> >
> >    
> 
> Well, do you plan to address this before submission for inclusion?
> 

Greetings Avi and Co,

I have been following this thread, and although I cannot say that I am
intimately familiar with all of the virtualization considerations
involved to really add anything useful to that side of the discussion, I
think you guys are doing a good job of explaining the technical issues
for the non-virtualization wizards following this thread.  :-)

Anyways, I was wondering if you might be interested in sharing your
concerns wrt configfs (configfs maintainer CC'ed) at some point?
As you may recall, I have been using configfs extensively for the 3.x
generic target core infrastructure and iSCSI fabric modules living in
lio-core-2.6.git/drivers/target/target_core_configfs.c and
lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
it to be extraordinarily useful for the purposes of implementing a
complex kernel-level target mode stack that is expected to manage
massive amounts of metadata, allow for real-time configuration, share
data structures (eg: SCSI Target Ports) between other kernel fabric
modules and manage the entire set of fabrics using only interpreted
userspace code.

Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target
Endpoints inside of a KVM Guest (from the results in May posted with
IOMMU-aware 10 Gb on modern Nehalem hardware, see
http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
dump the entire running target fabric configfs hierarchy to a single
struct file on a KVM Guest root device using python code on the order of
~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
this means:

*) 7 configfs groups (directories), ~50 configfs attributes (files) per
Virtual HBA+FILEIO LUN
*) 15 configfs groups (directories), ~60 configfs attributes (files) per
iSCSI fabric Endpoint

Which comes out to a total of ~220000 groups and ~1100000 attributes as
active configfs objects living in the configfs_dir_cache that are being
dumped inside of the single KVM guest instance, including symlinks
between the fabric modules to establish the SCSI ports containing the
complete set of SPC-4 and RFC-3720 features, et al.

Also on the kernel <-> user API interaction compatibility side, I have
found the 3.x configfs-enabled code advantageous over the LIO 2.9 code
(that used an ioctl for everything) because it allows us to do backwards
compat for future versions without using any userspace C code, which
IMHO makes maintaining userspace packages for complex kernel stacks with
massive amounts of metadata + real-time configuration considerations
much easier.  No longer having ioctl compatibility issues between LIO
versions as the structures passed via ioctl change, and being able to do
backwards compat against configfs layout changes with small amounts of
interpreted code, has made maintaining the kernel <-> user API that much
easier for me.

Anyways, I thought these might be useful to the discussion as it relates
to potential uses of configfs on the KVM Host or other projects where it
really makes sense, and/or to improving the upstream implementation so
that other users (like myself) can benefit from improvements to configfs.

Many thanks for your most valuable time,

--nab



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  7:11                       ` Avi Kivity
  2009-08-19 18:23                         ` Nicholas A. Bellinger
@ 2009-08-19 18:26                         ` Gregory Haskins
  2009-08-19 20:37                           ` Avi Kivity
  1 sibling, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 18:26 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin, Patrick Mullaney

[-- Attachment #1: Type: text/plain, Size: 16186 bytes --]

Avi Kivity wrote:
> On 08/19/2009 09:28 AM, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>   
>>> On 08/18/2009 05:46 PM, Gregory Haskins wrote:
>>>     
>>>>       
>>>>> Can you explain how vbus achieves RDMA?
>>>>>
>>>>> I also don't see the connection to real time guests.
>>>>>
>>>>>          
>>>> Both of these are still in development.  Trying to stay true to the
>>>> "release early and often" mantra, the core vbus technology is being
>>>> pushed now so it can be reviewed.  Stay tuned for these other
>>>> developments.
>>>>
>>>>        
>>> Hopefully you can outline how it works.  AFAICT, RDMA and kernel bypass
>>> will need device assignment.  If you're bypassing the call into the host
>>> kernel, it doesn't really matter how that call is made, does it?
>>>      
>> This is for things like the setup of queue-pairs, and the transport of
>> door-bells, and ib-verbs.  I am not on the team doing that work, so I am
>> not an expert in this area.  What I do know is having a flexible and
>> low-latency signal-path was deemed a key requirement.
>>    
> 
> That's not a full bypass, then.  AFAIK kernel bypass has userspace
> talking directly to the device.

Like I said, I am not an expert on the details here.  I only work on the
vbus plumbing.  FWIW, the work is derived from the "Xen-IB" project

http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf

There were issues with getting Xen-IB to map well into the Xen model.
Vbus was specifically designed to address some of those shortcomings.

> 
> Given that both virtio and vbus can use ioeventfds, I don't see how one
> can perform better than the other.
> 
>> For real-time, a big part of it is relaying the guest scheduler state to
>> the host, but in a smart way.  For instance, the cpu priority for each
>> vcpu is in a shared-table.  When the priority is raised, we can simply
>> update the table without taking a VMEXIT.  When it is lowered, we need
>> to inform the host of the change in case the underlying task needs to
>> reschedule.
>>    
> 
> This is best done using cr8/tpr so you don't have to exit at all.  See
> also my vtpr support for Windows which does this in software, generally
> avoiding the exit even when lowering priority.

You can think of vTPR as a good model, yes.  Generally, you can't
actually use it for our purposes for several reasons, however:

1) the prio granularity is too coarse (16 levels, -rt has 100)

2) it is too limited in scope (it covers only interrupts; we need to have
additional considerations, like nested guest/host scheduling algorithms
against the vcpu, and prio-remap policies)

3) I use "priority" generally..there may be other non-priority based
policies that need to add state to the table (such as EDF deadlines, etc).

but, otherwise, the idea is the same.  Besides, this was one example.
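
To make the shared-table idea above concrete, a minimal sketch (the
names here are hypothetical, not the actual vbus code; only the protocol
matters):

/* One entry per vcpu, in memory shared with the host scheduler. */
struct vcpu_shared {
	int prio;	/* higher value == higher priority, SCHED_FIFO-like */
};

/* Placeholder for "notify the host"; in reality a hypercall or PIO kick. */
static void prio_kick(void)
{
}

/*
 * Raising priority is just a store into the shared table (no VMEXIT).
 * Lowering it notifies the host, since the underlying task may now need
 * to be rescheduled.
 */
static void vcpu_set_prio(struct vcpu_shared *s, int newprio)
{
	int old = s->prio;

	s->prio = newprio;

	if (newprio < old)
		prio_kick();
}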


> 
>> This is where the really fast call() type mechanism is important.
>>
>> Its also about having the priority flow-end to end, and having the vcpu
>> interrupt state affect the task-priority, etc (e.g. pending interrupts
>> affect the vcpu task prio).
>>
>> etc, etc.
>>
>> I can go on and on (as you know ;), but will wait till this work is more
>> concrete and proven.
>>    
> 
> Generally cpu state shouldn't flow through a device but rather through
> MSRs, hypercalls, and cpu registers.


Well, you can blame yourself for that one ;)

The original vbus was implemented as cpuid+hypercalls, partly for that
reason.  You kicked me out of kvm.ko, so I had to make do with plan B
via a less direct PCI-BRIDGE route.

But in reality, it doesn't matter much.  You can certainly have "system"
devices sitting on vbus that fit a similar role as "MSRs", so the access
method is more of an implementation detail.  The key is it needs to be
fast, and optimize out extraneous exits when possible.

> 
>> Basically, what it comes down to is both vbus and vhost need
>> configuration/management.  Vbus does it with sysfs/configfs, and vhost
>> does it with ioctls.  I ultimately decided to go with sysfs/configfs
>> because, at least that the time I looked, it seemed like the "blessed"
>> way to do user->kernel interfaces.
>>    
> 
> I really dislike that trend but that's an unrelated discussion.


Ok

> 
>>> They need to be connected to the real world somehow.  What about
>>> security?  can any user create a container and devices and link them to
>>> real interfaces?  If not, do you need to run the VM as root?
>>>      
>> Today it has to be root as a result of weak mode support in configfs, so
>> you have me there.  I am looking for help patching this limitation,
>> though.
>>
>>    
> 
> Well, do you plan to address this before submission for inclusion?

Maybe, maybe not.  It's workable for now (i.e. run as root), so its
inclusion is not predicated on the availability of the fix, per se (at
least IMHO).  If I can get it working before I get to pushing the core,
great!  Patches welcome.


> 
>>> I hope everyone agrees that it's an important issue for me and that I
>>> have to consider non-Linux guests.  I also hope that you're considering
>>> non-Linux guests since they have considerable market share.
>>>      
>> I didn't mean non-Linux guests are not important.  I was disagreeing
>> with your assertion that it only works if its PCI.  There are numerous
>> examples of IHV/ISV "bridge" implementations deployed in Windows, no?
>>    
> 
> I don't know.
> 
>> If vbus is exposed as a PCI-BRIDGE, how is this different?
>>    
> 
> Technically it would work, but given you're not interested in Windows,

s/interested in/prioritizing/

For the time being, Windows will not be RT, and Windows can fall back to
use virtio-net, etc.  So I am ok with this.  It will come in due time.

> who would write a driver?

Someone from the vbus community who is motivated enough and has the time
to do it, I suppose.  We have people interested in looking at this
internally, but other items have pushed it primarily to the back-burner.


> 
>>> Given I'm not the gateway to inclusion of vbus/venet, you don't need to
>>> ask me anything.  I'm still free to give my opinion.
>>>      
>> Agreed, and I didn't mean to suggest otherwise.  It's not clear if you are
>> wearing the "kvm maintainer" hat, or the "lkml community member" hat at
>> times, so its important to make that distinction.  Otherwise, its not
>> clear if this is edict as my superior, or input as my peer. ;)
>>    
> 
> When I wear a hat, it is a Red Hat.  However I am bareheaded most often.
> 
> (that is, look at the contents of my message, not who wrote it or his
> role).

Like it or not, maintainers always carry more weight when they opine
what can and can't be done w.r.t. what can be perceived as their
relevant subsystem.

> 
>>> With virtio, the number is 1 (or less if you amortize).  Set up the ring
>>> entries and kick.
>>>      
>> Again, I am just talking about basic PCI here, not the things we build
>> on top.
>>    
> 
> Whatever that means, it isn't interesting.  Performance is measure for
> the whole stack.
> 
>> The point is: the things we build on top have costs associated with
>> them, and I aim to minimize it.  For instance, to do a "call()" kind of
>> interface, you generally need to pre-setup some per-cpu mappings so that
>> you can just do a single iowrite32() to kick the call off.  Those
>> per-cpu mappings have a cost if you want them to be high-performance, so
>> my argument is that you ideally want to limit the number of times you
>> have to do this.  My current design reduces this to "once".
>>    
> 
> Do you mean minimizing the setup cost?  Seriously?

Not the time-to-complete-setup overhead.  The residual costs, like
heap/vmap usage at run-time.  You generally have to set up per-cpu
mappings to gain maximum performance.  You would need them per-device; I
do it per-system.  It's not a big deal in the grand scheme of things,
really.  But chalk that up as an advantage to my approach over yours,
nonetheless.
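
For context, the kind of fast-path "call" I mean looks roughly like this
(illustrative only; the per-cpu doorbell variable is hypothetical, and
the caller is assumed to have preemption disabled):

#include <linux/percpu.h>
#include <linux/io.h>

/* One premapped doorbell window per cpu, set up once at bridge probe time. */
static DEFINE_PER_CPU(void __iomem *, vbus_doorbell);

/* Fast path: the whole "call" collapses to a single MMIO/PIO write. */
static inline void vbus_fast_call(u32 handle)
{
	iowrite32(handle, __get_cpu_var(vbus_doorbell));
}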

> 
>>> There's no such thing as raw PCI.  Every PCI device has a protocol.  The
>>> protocol virtio chose is optimized for virtualization.
>>>      
>> And its a question of how that protocol scales, more than how the
>> protocol works.
>>
>> Obviously the general idea of the protocol works, as vbus itself is
>> implemented as a PCI-BRIDGE and is therefore limited to the underlying
>> characteristics that I can get out of PCI (like PIO latency).
>>    
> 
> I thought we agreed that was insignificant?

I think I was agreeing with you, there. (e.g. obviously PIO latency is
acceptable, as I use it to underpin vbus)


> 
>>> As I've mentioned before, prioritization is available on x86
>>>      
>> But as Ive mentioned, it doesn't work very well.
>>    
> 
> I guess it isn't that important then.  I note that clever prioritization
> in a guest is pointless if you can't do the same prioritization in the
> host.

I answer this below...


> 
>>> , and coalescing scales badly.
>>>      
>> Depends on what is scaling.  Scaling vcpus?  Yes, you are right.
>> Scaling the number of devices?  No, this is where it improves.
>>    
> 
> If you queue pending messages instead of walking the device list, you
> may be right.  Still, if hard interrupt processing takes 10% of your
> time you'll only have coalesced 10% of interrupts on average.
> 
>>> irq window exits ought to be pretty rare, so we're only left with
>>> injection vmexits.  At around 1us/vmexit, even 100,000 interrupts/vcpu
>>> (which is excessive) will only cost you 10% cpu time.
>>>      
>> 1us is too much for what I am building, IMHO.
> 
> You can't use current hardware then.

The point is that I am eliminating as many exits as possible, so 1us,
2us, whatever...it doesn't matter.  The fastest exit is the one you
don't have to take.

> 
>>> You're free to demultiplex an MSI to however many consumers you want,
>>> there's no need for a new bus for that.
>>>      
>> Hmmm...can you elaborate?
>>    
> 
> Point all those MSIs at one vector.  Its handler will have to poll all
> the attached devices though.

Right, that's broken.


> 
>>> Do you use DNS.  We use PCI-SIG.  If Novell is a PCI-SIG member you can
>>> get a vendor ID and control your own virtio space.
>>>      
>> Yeah, we have our own id.  I am more concerned about making this design
>> make sense outside of PCI oriented environments.
>>    
> 
> IIRC we reuse the PCI IDs for non-PCI.


You already know how I feel about this gem.


> 
> 
> 
> 
>>>>> That's a bug, not a feature.  It means poor scaling as the number of
>>>>> vcpus increases and as the number of devices increases.
>>>>>          
>> vcpu increases, I agree (and am ok with, as I expect low vcpu count
>> machines to be typical).
> 
> I'm not okay with it.  If you wish people to adopt vbus over virtio
> you'll have to address all concerns, not just yours.

By building a community around the development of vbus, isn't this what I
am doing?  Working towards making it usable for all?

> 
>> nr of devices, I disagree.  can you elaborate?
>>    
> 
> With message queueing, I retract my remark.

Ok.

> 
>>> Windows,
>>>      
>> Work in progress.
>>    
> 
> Interesting.  Do you plan to open source the code?  If not, will the
> binaries be freely available?

Ideally, yeah.  But I guess that has to go through legal, etc.  Right
now it's primarily back-burnered.  If someone wants to submit code to
support this, great!


> 
>>   
>>> large guests
>>>      
>> Can you elaborate?  I am not familiar with the term.
>>    
> 
> Many vcpus.
> 
>>   
>>> and multiqueue out of your design.
>>>      
>> AFAICT, multiqueue should work quite nicely with vbus.  Can you
>> elaborate on where you see the problem?
>>    
> 
> You said you aren't interested in it previously IIRC.
> 

I don't think so, no.  Perhaps I misspoke or was misunderstood.  I
actually think it's a good idea and will be looking to do this.


>>>>> x86 APIC is priority aware.
>>>>>
>>>>>          
>>>> Have you ever tried to use it?
>>>>
>>>>        
>>> I haven't, but Windows does.
>>>      
>> Yeah, it doesn't really work well.  Its an extremely rigid model that
>> (IIRC) only lets you prioritize in 16 groups spaced by IDT (0-15 are one
>> level, 16-31 are another, etc).  Most of the embedded PICs I have worked
>> with supported direct remapping, etc.  But in any case, Linux doesn't
>> support it so we are hosed no matter how good it is.
>>    
> 
> I agree that it isn't very clever (not that I am a real time expert) but
> I disagree about dismissing Linux support so easily.  If prioritization
> is such a win it should be a win on the host as well and we should make
> it work on the host as well.  Further I don't see how priorities on the
> guest can work if they don't on the host.

It's more about task priority in the case of real-time.  We do stuff with
802.1p as well for control messages, etc.  But for the most part, this
is an orthogonal effort.  And yes, you are right, it would be nice to
have this interrupt classification capability in the host.

Generally this is mitigated by the use of irq-threads.  You could argue
that if irq-threads help the host without a prioritized interrupt
controller, why can't the guest?  The answer is simply that the host can
afford sub-optimal behavior w.r.t. IDT injection here, where the guest
cannot (due to the disparity of hw-injection vs guest-injection overheads).

IOW: The cost of an IDT dispatch on real hardware adds minimal latency,
even if a low-priority IDT vector preempts a high-priority interrupt
thread.  The cost of an IDT dispatch in a guest, OTOH, especially when
you factor in the complete picture (IPI-exit, inject, eoi exit,
re-enter), is greater...too great, in fact.  So if you can get the
guest's interrupts priority-aware, you can avoid even the IDT preempting
the irq-thread until the system is in the ideal state.

> 
>>>>
>>>>        
>>> They had to build connectors just like you propose to do.
>>>      
>> More importantly, they had to build back-end busses too, no?
>>    
> 
> They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and
> something similar for lguest.

Well, then I retract that statement.  I think the small amount of code
is probably because they are re-using the qemu device-models, however.
Note that I am essentially advocating the same basic idea here.


> 
>>> But you still need vbus-connector-lguest and vbus-connector-s390 because
>>> they all talk to the host differently.  So what's changed?  the names?
>>>      
>> The fact that they don't need to redo most of the in-kernel backend
>> stuff.  Just the connector.
>>    
> 
> So they save 414 lines but have to write a connector which is... how large?

I guess that depends on the features they want.  A pci-based connector
would probably be pretty thin, since you don't need event channels like
I use in the pci-bridge connector.

The idea, of course, is that the vbus can become your whole bus if you
want.  So you wouldn't need to tunnel, say, vbus over some lguest bus.
You just base the design on vbus outright.

Note that this was kind of what the first pass of vbus did for KVM.  The
bus was exposed via cpuid and hypercalls as kind of a system-service.
It wasn't until later that I surfaced it as a bridge model.

> 
>>> Well, venet doesn't complement virtio-net, and virtio-pci doesn't
>>> complement vbus-connector.
>>>      
>> Agreed, but virtio complements vbus by virtue of virtio-vbus.
>>    
> 
> I don't see what vbus adds to virtio-net.

Well, as you stated in your last reply, you don't want it.  So I guess
that doesn't matter much at this point.  I will continue developing
vbus, and pushing things your way.  You can opt to accept or reject
those things at your own discretion.

Kind Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 18:23                         ` Nicholas A. Bellinger
@ 2009-08-19 18:39                           ` Gregory Haskins
  2009-08-19 19:19                             ` Nicholas A. Bellinger
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
  1 sibling, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 18:39 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Avi Kivity, Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

[-- Attachment #1: Type: text/plain, Size: 5681 bytes --]

Hi Nicholas

Nicholas A. Bellinger wrote:
> On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote:
>> On 08/19/2009 09:28 AM, Gregory Haskins wrote:
>>> Avi Kivity wrote:
> 
> <SNIP>
> 
>>> Basically, what it comes down to is both vbus and vhost need
>>> configuration/management.  Vbus does it with sysfs/configfs, and vhost
>>> does it with ioctls.  I ultimately decided to go with sysfs/configfs
>>> because, at least that the time I looked, it seemed like the "blessed"
>>> way to do user->kernel interfaces.
>>>    
>> I really dislike that trend but that's an unrelated discussion.
>>
>>>> They need to be connected to the real world somehow.  What about
>>>> security?  can any user create a container and devices and link them to
>>>> real interfaces?  If not, do you need to run the VM as root?
>>>>      
>>> Today it has to be root as a result of weak mode support in configfs, so
>>> you have me there.  I am looking for help patching this limitation, though.
>>>
>>>    
>> Well, do you plan to address this before submission for inclusion?
>>
> 
> Greetings Avi and Co,
> 
> I have been following this thread, and although I cannot say that I am
> intimately fimilar with all of the virtualization considerations
> involved to really add anything use to that side of the discussion, I
> think you guys are doing a good job of explaining the technical issues
> for the non virtualization wizards following this thread.  :-)
> 
> Anyways, I was wondering if you might be interesting in sharing your
> concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?

So for those tuning in, the reference here is the use of configfs for
the management of this component of AlacrityVM, called "virtual-bus"

http://developer.novell.com/wiki/index.php/Virtual-bus

> As you may recall, I have been using configfs extensively for the 3.x
> generic target core infrastructure and iSCSI fabric modules living in
> lio-core-2.6.git/drivers/target/target_core_configfs.c and
> lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
> it to be extraordinarly useful for the purposes of a implementing a
> complex kernel level target mode stack that is expected to manage
> massive amounts of metadata, allow for real-time configuration, share
> data structures (eg: SCSI Target Ports) between other kernel fabric
> modules and manage the entire set of fabrics using only intrepetered
> userspace code.

I concur.  Configfs provided me a very natural model to express
resource-containers and their respective virtual-device objects.

> 
> Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target
> Endpoints inside of a KVM Guest (from the results in May posted with
> IOMMU aware 10 Gb on modern Nahelem hardware, see
> http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
> dump the entire running target fabric configfs hierarchy to a single
> struct file on a KVM Guest root device using python code on the order of
> ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
> this means:
> 
> *) 7 configfs groups (directories), ~50 configfs attributes (files) per
> Virtual HBA+FILEIO LUN
> *) 15 configfs groups (directories), ~60 configfs attributes (files per
> iSCSI fabric Endpoint
> 
> Which comes out to a total of ~220000 groups and ~1100000 attributes
> active configfs objects living in the configfs_dir_cache that are being
> dumped inside of the single KVM guest instances, including symlinks
> between the fabric modules to establish the SCSI ports containing
> complete set of SPC-4 and RFC-3720 features, et al.
> 
> Also on the kernel <-> user API interaction compatibility side, I have
> found the 3.x configfs enabled code adventagous over the LIO 2.9 code
> (that used an ioctl for everything) because it allows us to do backwards
> compat for future versions without using any userspace C code, which in
> IMHO makes maintaining userspace packages for complex kernel stacks with
> massive amounts of metadata + real-time configuration considerations.
> No longer having ioctl compatibility issues between LIO versions as the
> structures passed via ioctl change, and being able to do backwards
> compat with small amounts of interpreted code against configfs layout
> changes makes maintaining the kernel <-> user API really have made this
> that much easier for me.
> 
> Anyways, I though these might be useful to the discussion as it releates
> to potental uses of configfs on the KVM Host or other projects that
> really make sense, and/or to improve the upstream implementation so that
> other users (like myself) can benefit from improvements to configfs.
> 
> Many thanks for your most valuable of time,

Thank you for the explanation of your setup.

Configfs mostly works for the vbus project "as is".  As Avi pointed out,
I currently have a limitation w.r.t. perms.  Forgive me if what I am
about to say is overly simplistic.  It's been quite a few months since I
worked on the configfs portion of the code, so my details may be fuzzy.

What it boiled down to is that I need a way to better manage perms (and
being able to do it across both sysfs and configfs would be ideal).

For instance, I would like to be able to assign groups to configfs
directories, like /config/vbus/devices, such that

mkdir /config/vbus/devices/foo

would not require root if that GID was permitted.

Are there ways to do this (now, or in upcoming releases)?  If not, I may
be interested in helping to add this feature, so please advise how best
to achieve this.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 18:39                           ` Gregory Haskins
@ 2009-08-19 19:19                             ` Nicholas A. Bellinger
  2009-08-19 19:34                               ` Nicholas A. Bellinger
  0 siblings, 1 reply; 132+ messages in thread
From: Nicholas A. Bellinger @ 2009-08-19 19:19 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

On Wed, 2009-08-19 at 14:39 -0400, Gregory Haskins wrote:
> Hi Nicholas
> 
> Nicholas A. Bellinger wrote:
> > On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote:
> >> On 08/19/2009 09:28 AM, Gregory Haskins wrote:
> >>> Avi Kivity wrote:
> > 
> > <SNIP>
> > 
> >>> Basically, what it comes down to is both vbus and vhost need
> >>> configuration/management.  Vbus does it with sysfs/configfs, and vhost
> >>> does it with ioctls.  I ultimately decided to go with sysfs/configfs
> >>> because, at least that the time I looked, it seemed like the "blessed"
> >>> way to do user->kernel interfaces.
> >>>    
> >> I really dislike that trend but that's an unrelated discussion.
> >>
> >>>> They need to be connected to the real world somehow.  What about
> >>>> security?  can any user create a container and devices and link them to
> >>>> real interfaces?  If not, do you need to run the VM as root?
> >>>>      
> >>> Today it has to be root as a result of weak mode support in configfs, so
> >>> you have me there.  I am looking for help patching this limitation, though.
> >>>
> >>>    
> >> Well, do you plan to address this before submission for inclusion?
> >>
> > 
> > Greetings Avi and Co,
> > 
> > I have been following this thread, and although I cannot say that I am
> > intimately fimilar with all of the virtualization considerations
> > involved to really add anything use to that side of the discussion, I
> > think you guys are doing a good job of explaining the technical issues
> > for the non virtualization wizards following this thread.  :-)
> > 
> > Anyways, I was wondering if you might be interesting in sharing your
> > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> 
> So for those tuning in, the reference here is the use of configfs for
> the management of this component of AlacrityVM, called "virtual-bus"
> 
> http://developer.novell.com/wiki/index.php/Virtual-bus
> 
> > As you may recall, I have been using configfs extensively for the 3.x
> > generic target core infrastructure and iSCSI fabric modules living in
> > lio-core-2.6.git/drivers/target/target_core_configfs.c and
> > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
> > it to be extraordinarly useful for the purposes of a implementing a
> > complex kernel level target mode stack that is expected to manage
> > massive amounts of metadata, allow for real-time configuration, share
> > data structures (eg: SCSI Target Ports) between other kernel fabric
> > modules and manage the entire set of fabrics using only intrepetered
> > userspace code.
> 
> I concur.  Configfs provided me a very natural model to express
> resource-containers and their respective virtual-device objects.
> 
> > 
> > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target
> > Endpoints inside of a KVM Guest (from the results in May posted with
> > IOMMU aware 10 Gb on modern Nahelem hardware, see
> > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
> > dump the entire running target fabric configfs hierarchy to a single
> > struct file on a KVM Guest root device using python code on the order of
> > ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
> > this means:
> > 
> > *) 7 configfs groups (directories), ~50 configfs attributes (files) per
> > Virtual HBA+FILEIO LUN
> > *) 15 configfs groups (directories), ~60 configfs attributes (files per
> > iSCSI fabric Endpoint
> > 
> > Which comes out to a total of ~220000 groups and ~1100000 attributes
> > active configfs objects living in the configfs_dir_cache that are being
> > dumped inside of the single KVM guest instances, including symlinks
> > between the fabric modules to establish the SCSI ports containing
> > complete set of SPC-4 and RFC-3720 features, et al.
> > 
> > Also on the kernel <-> user API interaction compatibility side, I have
> > found the 3.x configfs enabled code adventagous over the LIO 2.9 code
> > (that used an ioctl for everything) because it allows us to do backwards
> > compat for future versions without using any userspace C code, which in
> > IMHO makes maintaining userspace packages for complex kernel stacks with
> > massive amounts of metadata + real-time configuration considerations.
> > No longer having ioctl compatibility issues between LIO versions as the
> > structures passed via ioctl change, and being able to do backwards
> > compat with small amounts of interpreted code against configfs layout
> > changes makes maintaining the kernel <-> user API really have made this
> > that much easier for me.
> > 
> > Anyways, I though these might be useful to the discussion as it releates
> > to potental uses of configfs on the KVM Host or other projects that
> > really make sense, and/or to improve the upstream implementation so that
> > other users (like myself) can benefit from improvements to configfs.
> > 
> > Many thanks for your most valuable of time,
> 
> Thank you for the explanation of your setup.
> 
> Configfs mostly works for the vbus project "as is".  As Avi pointed out,
> I currently have a limitation w.r.t. perms.  Forgive me if what I am
> about to say is overly simplistic.  Its been quite a few months since I
> worked on the configfs portion of the code, so my details may be fuzzy.
> 
> What it boiled down to is I need is a way to better manage perms

I have not looked at implementing this personally, so I am not sure how
this would look in fs/configfs/ off the top of my head..  Joel, have you
had any thoughts on this..?

>  (and to
> be able to do it cross sysfs and configfs would be ideal).
> 

I had coded up a patch last year to allow configfs to access sysfs
symlinks in the context of target_core_mod storage object (Linux/SCSI,
Linux/Block, Linux/FILEIO) registration, which did work but ended up not
really making sense and was (thankfully) rejected by GregKH; more of that
discussion here:

http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-10/msg06559.html

I am not sure if the sharing of permissions between sysfs and configfs
would run into the same types of limitations as the above..

> For instance, I would like to be able to assign groups to configfs
> directories, like /config/vbus/devices, such that
> 
> mkdir /config/vbus/devices/foo
> 
> would not require root if that GID was permitted.
> 
> Are there ways to do this (now, or in upcoming releases)?  If not, I may
> be interested in helping to add this feature, so please advise how best
> to achieve this.
> 

Not that I am aware of.  However, I think this would be useful for
generic configfs, and I think user/group permissions on configfs
groups/dirs and attributes/items would be quite useful for the LIO 3.x
configfs enabled generic target engine.

Many thanks for your most valuable of time,

--nab

> Kind Regards,
> -Greg
> 



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 19:19                             ` Nicholas A. Bellinger
@ 2009-08-19 19:34                               ` Nicholas A. Bellinger
  0 siblings, 0 replies; 132+ messages in thread
From: Nicholas A. Bellinger @ 2009-08-19 19:34 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

On Wed, 2009-08-19 at 12:19 -0700, Nicholas A. Bellinger wrote:
> On Wed, 2009-08-19 at 14:39 -0400, Gregory Haskins wrote:
> > Hi Nicholas
> > 
> > Nicholas A. Bellinger wrote:
> > > On Wed, 2009-08-19 at 10:11 +0300, Avi Kivity wrote:
> > >> On 08/19/2009 09:28 AM, Gregory Haskins wrote:
> > >>> Avi Kivity wrote:
> > > 
> > > <SNIP>
> > > 
> > >>> Basically, what it comes down to is both vbus and vhost need
> > >>> configuration/management.  Vbus does it with sysfs/configfs, and vhost
> > >>> does it with ioctls.  I ultimately decided to go with sysfs/configfs
> > >>> because, at least that the time I looked, it seemed like the "blessed"
> > >>> way to do user->kernel interfaces.
> > >>>    
> > >> I really dislike that trend but that's an unrelated discussion.
> > >>
> > >>>> They need to be connected to the real world somehow.  What about
> > >>>> security?  can any user create a container and devices and link them to
> > >>>> real interfaces?  If not, do you need to run the VM as root?
> > >>>>      
> > >>> Today it has to be root as a result of weak mode support in configfs, so
> > >>> you have me there.  I am looking for help patching this limitation, though.
> > >>>
> > >>>    
> > >> Well, do you plan to address this before submission for inclusion?
> > >>
> > > 
> > > Greetings Avi and Co,
> > > 
> > > I have been following this thread, and although I cannot say that I am
> > > intimately fimilar with all of the virtualization considerations
> > > involved to really add anything use to that side of the discussion, I
> > > think you guys are doing a good job of explaining the technical issues
> > > for the non virtualization wizards following this thread.  :-)
> > > 
> > > Anyways, I was wondering if you might be interesting in sharing your
> > > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> > 
> > So for those tuning in, the reference here is the use of configfs for
> > the management of this component of AlacrityVM, called "virtual-bus"
> > 
> > http://developer.novell.com/wiki/index.php/Virtual-bus
> > 
> > > As you may recall, I have been using configfs extensively for the 3.x
> > > generic target core infrastructure and iSCSI fabric modules living in
> > > lio-core-2.6.git/drivers/target/target_core_configfs.c and
> > > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
> > > it to be extraordinarly useful for the purposes of a implementing a
> > > complex kernel level target mode stack that is expected to manage
> > > massive amounts of metadata, allow for real-time configuration, share
> > > data structures (eg: SCSI Target Ports) between other kernel fabric
> > > modules and manage the entire set of fabrics using only intrepetered
> > > userspace code.
> > 
> > I concur.  Configfs provided me a very natural model to express
> > resource-containers and their respective virtual-device objects.
> > 
> > > 
> > > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs <-> iSCSI Target
> > > Endpoints inside of a KVM Guest (from the results in May posted with
> > > IOMMU aware 10 Gb on modern Nahelem hardware, see
> > > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
> > > dump the entire running target fabric configfs hierarchy to a single
> > > struct file on a KVM Guest root device using python code on the order of
> > > ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
> > > this means:
> > > 
> > > *) 7 configfs groups (directories), ~50 configfs attributes (files) per
> > > Virtual HBA+FILEIO LUN
> > > *) 15 configfs groups (directories), ~60 configfs attributes (files per
> > > iSCSI fabric Endpoint
> > > 
> > > Which comes out to a total of ~220000 groups and ~1100000 attributes
> > > active configfs objects living in the configfs_dir_cache that are being
> > > dumped inside of the single KVM guest instances, including symlinks
> > > between the fabric modules to establish the SCSI ports containing
> > > complete set of SPC-4 and RFC-3720 features, et al.
> > > 
> > > Also on the kernel <-> user API interaction compatibility side, I have
> > > found the 3.x configfs enabled code adventagous over the LIO 2.9 code
> > > (that used an ioctl for everything) because it allows us to do backwards
> > > compat for future versions without using any userspace C code, which in
> > > IMHO makes maintaining userspace packages for complex kernel stacks with
> > > massive amounts of metadata + real-time configuration considerations.
> > > No longer having ioctl compatibility issues between LIO versions as the
> > > structures passed via ioctl change, and being able to do backwards
> > > compat with small amounts of interpreted code against configfs layout
> > > changes makes maintaining the kernel <-> user API really have made this
> > > that much easier for me.
> > > 
> > > Anyways, I though these might be useful to the discussion as it releates
> > > to potental uses of configfs on the KVM Host or other projects that
> > > really make sense, and/or to improve the upstream implementation so that
> > > other users (like myself) can benefit from improvements to configfs.
> > > 
> > > Many thanks for your most valuable of time,
> > 
> > Thank you for the explanation of your setup.
> > 
> > Configfs mostly works for the vbus project "as is".  As Avi pointed out,
> > I currently have a limitation w.r.t. perms.  Forgive me if what I am
> > about to say is overly simplistic.  Its been quite a few months since I
> > worked on the configfs portion of the code, so my details may be fuzzy.
> > 
> > What it boiled down to is I need is a way to better manage perms
> 
> I have not looked at implementing this personally, so I am not sure how
> this would look in fs/configfs/ off the top of my head..  Joel, have you
> had any thoughts on this..?
> 

Actually, something that I have been using for simple stuff is:

	if (!capable(CAP_SYS_ADMIN))

for controlling I/O to configfs attributes from non-privileged users
for iSCSI authentication information living in struct
config_item_operations lio_target_nacl_auth_cit, the code is here:

http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=drivers/lio-core/iscsi_target_configfs.c;h=1230b74577076a184b756b3883fb56c6050c7d87;hb=HEAD#l803

I am also using the CONFIGFS Extended Macros, CONFIGFS_EATTR(), which I
created to allow me to use more than one struct config_group per parent
structure, and use fewer lines of code when defining configfs attributes
using generic store() and show() functions:

http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=blob;f=include/target/configfs_macros.h
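
In sketch form (names invented, and the CONFIGFS_EATTR()/item-type
plumbing omitted), that check just sits at the top of the attribute's
store() handler:

#include <linux/capability.h>
#include <linux/string.h>
#include <linux/types.h>

struct foo_auth {                       /* hypothetical private data */
        char password[64];
};

static ssize_t foo_auth_password_store(struct foo_auth *auth,
                                       const char *page, size_t count)
{
        if (!capable(CAP_SYS_ADMIN))    /* only privileged users may write */
                return -EPERM;

        if (count >= sizeof(auth->password))
                return -EINVAL;
        memcpy(auth->password, page, count);
        auth->password[count] = '\0';

        return count;
}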

--nab



^ permalink raw reply	[flat|nested] 132+ messages in thread

* configfs/sysfs
  2009-08-19 18:23                         ` Nicholas A. Bellinger
  2009-08-19 18:39                           ` Gregory Haskins
@ 2009-08-19 20:12                           ` Avi Kivity
  2009-08-19 20:48                             ` configfs/sysfs Ingo Molnar
                                               ` (3 more replies)
  1 sibling, 4 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 20:12 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> Anyways, I was wondering if you might be interesting in sharing your
> concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
>    

My concerns aren't specifically with configfs, but with all the text 
based pseudo filesystems that the kernel exposes.

My high level concern is that we're optimizing for the active sysadmin, 
not for libraries and management programs.  configfs and sysfs are easy 
to use from the shell, discoverable, and easily scripted.  But they 
discourage documentation, the text format is ambiguous, and they require 
a lot of boilerplate to use in code.

You could argue that you can wrap *fs in a library that hides the 
details of accessing it, but that's the wrong approach IMO.  We should 
make the information easy to use and manipulate for programs; one of 
these programs can be a fuse filesystem for the active sysadmin if 
someone thinks it's important.

Now for the low level concerns:

- efficiency

Each attribute access requires an open/read/close triplet and 
binary->ascii->binary conversions.  In contrast an ordinary 
syscall/ioctl interface can fetch all attributes of an object, or even 
all attributes of all objects, in one call.
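
As a rough userspace illustration (the sysfs path, ioctl number and
struct below are invented, not from any project in this thread): the
text interface pays an open/read/close plus an ascii conversion per
attribute, while a binary interface returns everything in one call.

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* text interface: one open/read/close + strtoull() per attribute */
static uint64_t read_one_attr(const char *path)
{
        char buf[64];
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return 0;
        n = read(fd, buf, sizeof(buf) - 1);
        close(fd);
        if (n <= 0)
                return 0;
        buf[n] = '\0';
        return strtoull(buf, NULL, 0);
}

/* binary interface: every attribute of the object in a single call */
struct foo_attrs {                              /* hypothetical */
        uint64_t bytes;
        uint64_t packets;
        uint64_t errors;
};
#define FOO_GET_ATTRS _IOR('f', 1, struct foo_attrs)    /* hypothetical */

static int read_all_attrs(int fd, struct foo_attrs *a)
{
        return ioctl(fd, FOO_GET_ATTRS, a);
}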

- atomicity

One attribute per file means that, lacking userspace-visible 
transactions, there is no way to change several attributes at once.  
When you read attributes, there is no way to read several attributes 
atomically so you can be sure their values correlate.  Another example 
of a problem is when an object disappears while reading its attributes.  
Sure, openat() can mitigate this, but it's better to avoid introducing 
the problem than to have to fix it.

- ambiguity

What format is the attribute?  does it accept lowercase or uppercase hex 
digits?  is there a newline at the end?  how many digits can it take 
before the attribute overflows?  All of this has to be documented and 
checked by the OS, otherwise we risk regressions later.  In contrast, 
__u64 says everything in a binary interface.

- lifetime and access control

If a process brings an object into being (using mkdir) and then dies, 
the object remains behind.  The syscall/ioctl approach ties the object 
into an fd, which will be destroyed when the process dies, and which can 
be passed around using SCM_RIGHTS, allowing a server process to create 
and configure an object before passing it to an unprivileged program.
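
For reference, that hand-off is just the standard SCM_RIGHTS ancillary
message; a minimal sketch (error handling trimmed):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send an already-configured fd to an unprivileged peer over a unix socket */
static int send_fd(int unix_sock, int fd)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = cbuf,
                .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}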

- notifications

It's hard to notify users about changes in attributes.  Sure, you can 
use inotify, but that limits you to watching subtrees.  Once you do get 
the notification, you run into the atomicity problem.  When do you know 
all attributes are valid?  This can be solved using sequence counters, 
but that's just gratuitous complexity.  Netlink type interfaces are much 
more robust and flexible.

- readdir

You can either list everything, or nothing.  Sure, you can have trees to 
ease searching, even multiple views of the same data, but it's painful.

You may argue, correctly, that syscalls and ioctls are not as flexible.  
But this is because no one has invested the effort in making them so.  A 
struct passed as an argument to a syscall is not extensible.  But if you 
pass the size of the structure, and also a bitmap of which attributes 
are present, you gain extensibility and retain the atomicity property of 
a syscall interface.  I don't think a lot of effort is needed to make an 
extensible syscall interface just as usable and a lot more efficient 
than configfs/sysfs.  It should also be simple to bolt a fuse interface 
on top to expose it to us commandline types.
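
In other words, something along these lines (a sketch; the names are
made up): the caller passes the sizeof() it was compiled against plus a
bitmap of which fields it actually filled in, and the interface can grow
new fields later without breaking either direction.

#include <linux/types.h>

#define FOO_ATTR_MTU            (1ULL << 0)
#define FOO_ATTR_QUEUES         (1ULL << 1)
#define FOO_ATTR_PRIO           (1ULL << 2)
/* new attributes get new bits appended here over time */

struct foo_config {
        __u32 size;             /* sizeof(struct foo_config) the caller knows */
        __u32 pad;
        __u64 attr_mask;        /* which of the fields below are actually set */
        __u32 mtu;
        __u32 queues;
        __u32 prio;
        __u32 reserved;
};

An old binary passes a smaller .size and never sets the newer bits, so
the kernel can zero-extend the structure and still apply all the
requested attributes atomically in one call.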

> As you may recall, I have been using configfs extensively for the 3.x
> generic target core infrastructure and iSCSI fabric modules living in
> lio-core-2.6.git/drivers/target/target_core_configfs.c and
> lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
> it to be extraordinarly useful for the purposes of a implementing a
> complex kernel level target mode stack that is expected to manage
> massive amounts of metadata, allow for real-time configuration, share
> data structures (eg: SCSI Target Ports) between other kernel fabric
> modules and manage the entire set of fabrics using only intrepetered
> userspace code.
>
> Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs<->  iSCSI Target
> Endpoints inside of a KVM Guest (from the results in May posted with
> IOMMU aware 10 Gb on modern Nahelem hardware, see
> http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
> dump the entire running target fabric configfs hierarchy to a single
> struct file on a KVM Guest root device using python code on the order of
> ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
> this means:
>
> *) 7 configfs groups (directories), ~50 configfs attributes (files) per
> Virtual HBA+FILEIO LUN
> *) 15 configfs groups (directories), ~60 configfs attributes (files per
> iSCSI fabric Endpoint
>
> Which comes out to a total of ~220000 groups and ~1100000 attributes
> active configfs objects living in the configfs_dir_cache that are being
> dumped inside of the single KVM guest instances, including symlinks
> between the fabric modules to establish the SCSI ports containing
> complete set of SPC-4 and RFC-3720 features, et al.
>    

You achieved 3 million syscalls/sec from Python code?  That's very 
impressive.

Note with syscalls you could have done it with 10K syscalls (Python 
supports packing and unpacking structs quite well, and also directly 
calling C code IIRC).

> Also on the kernel<->  user API interaction compatibility side, I have
> found the 3.x configfs enabled code adventagous over the LIO 2.9 code
> (that used an ioctl for everything) because it allows us to do backwards
> compat for future versions without using any userspace C code, which in
> IMHO makes maintaining userspace packages for complex kernel stacks with
> massive amounts of metadata + real-time configuration considerations.
> No longer having ioctl compatibility issues between LIO versions as the
> structures passed via ioctl change, and being able to do backwards
> compat with small amounts of interpreted code against configfs layout
> changes makes maintaining the kernel<->  user API really have made this
> that much easier for me.
>    

configfs is more maintainable than a bunch of hand-maintained ioctls.  
But if we put some effort into an extendable syscall infrastructure 
(perhaps to the point of using an IDL) I'm sure we can improve on that 
without the problems pseudo filesystems introduce.

> Anyways, I though these might be useful to the discussion as it releates
> to potental uses of configfs on the KVM Host or other projects that
> really make sense, and/or to improve the upstream implementation so that
> other users (like myself) can benefit from improvements to configfs.
>    

I can't really fault a project for using configfs; it's an accepted and 
recommended (by the community) interface.  I'd much prefer it though if 
there were an effort to create a usable fd/struct based alternative.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 18:26                         ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
@ 2009-08-19 20:37                           ` Avi Kivity
  2009-08-19 20:53                             ` Ingo Molnar
                                               ` (2 more replies)
  0 siblings, 3 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 20:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin, Patrick Mullaney

On 08/19/2009 09:26 PM, Gregory Haskins wrote:
>>> This is for things like the setup of queue-pairs, and the transport of
>>> door-bells, and ib-verbs.  I am not on the team doing that work, so I am
>>> not an expert in this area.  What I do know is having a flexible and
>>> low-latency signal-path was deemed a key requirement.
>>>
>>>        
>> That's not a full bypass, then.  AFAIK kernel bypass has userspace
>> talking directly to the device.
>>      
> Like I said, I am not an expert on the details here.  I only work on the
> vbus plumbing.  FWIW, the work is derivative from the "Xen-IB" project
>
> http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf
>
> There were issues with getting Xen-IB to map well into the Xen model.
> Vbus was specifically designed to address some of those short-comings.
>    

Well I'm not an Infiniband expert.  But from what I understand VMM 
bypass means avoiding the call to the VMM entirely by exposing hardware 
registers directly to the guest.

>> This is best done using cr8/tpr so you don't have to exit at all.  See
>> also my vtpr support for Windows which does this in software, generally
>> avoiding the exit even when lowering priority.
>>      
> You can think of vTPR as a good model, yes.  Generally, you can't
> actually use it for our purposes for several reasons, however:
>
> 1) the prio granularity is too coarse (16 levels, -rt has 100)
>
> 2) it is too scope limited (it covers only interrupts, we need to have
> additional considerations, like nested guest/host scheduling algorithms
> against the vcpu, and prio-remap policies)
>
> 3) I use "priority" generally..there may be other non-priority based
> policies that need to add state to the table (such as EDF deadlines, etc).
>
> but, otherwise, the idea is the same.  Besides, this was one example.
>    

Well, if priority is so important then I'd recommend exposing it via a 
virtual interrupt controller.  A bus is the wrong model to use, because 
its scope is only the devices it contains, and because it is system-wide 
in nature, not per-cpu.

>>> This is where the really fast call() type mechanism is important.
>>>
>>> Its also about having the priority flow-end to end, and having the vcpu
>>> interrupt state affect the task-priority, etc (e.g. pending interrupts
>>> affect the vcpu task prio).
>>>
>>> etc, etc.
>>>
>>> I can go on and on (as you know ;), but will wait till this work is more
>>> concrete and proven.
>>>
>>>        
>> Generally cpu state shouldn't flow through a device but rather through
>> MSRs, hypercalls, and cpu registers.
>>      
>
> Well, you can blame yourself for that one ;)
>
> The original vbus was implemented as cpuid+hypercalls, partly for that
> reason.  You kicked me out of kvm.ko, so I had to make due with plan B
> via a less direct PCI-BRIDGE route.
>    

A bus has no business doing these things.  But cpu state definitely 
needs to be manipulated using hypercalls, see the pvmmu and vtpr 
hypercalls or the pvclock msr.

> But in reality, it doesn't matter much.  You can certainly have "system"
> devices sitting on vbus that fit a similar role as "MSRs", so the access
> method is more of an implementation detail.  The key is it needs to be
> fast, and optimize out extraneous exits when possible.
>    

No, percpu state belongs in the vcpu model, not the device model.  cpu 
priority is logically a cpu register or state, not device state.



>> Well, do you plan to address this before submission for inclusion?
>>      
> Maybe, maybe not.  Its workable for now (i.e. run as root), so its
> inclusion is not predicated on the availability of the fix, per se (at
> least IMHO).  If I can get it working before I get to pushing the core,
> great!  Patches welcome.
>    

The lack of so many features indicates the whole thing is immature.  That 
would be fine if it were the first of its kind, but it isn't.

> For the time being, windows will not be RT, and windows can fall-back to
> use virtio-net, etc.  So I am ok with this.  It will come in due time.
>
>    

So we need to work on optimizing both virtio-net and venet.  Great.

>>> The point is: the things we build on top have costs associated with
>>> them, and I aim to minimize it.  For instance, to do a "call()" kind of
>>> interface, you generally need to pre-setup some per-cpu mappings so that
>>> you can just do a single iowrite32() to kick the call off.  Those
>>> per-cpu mappings have a cost if you want them to be high-performance, so
>>> my argument is that you ideally want to limit the number of times you
>>> have to do this.  My current design reduces this to "once".
>>>
>>>        
>> Do you mean minimizing the setup cost?  Seriously?
>>      
> Not the time-to-complete-setup overhead.  The residual costs, like
> heap/vmap usage at run-time.  You generally have to set up per-cpu
> mappings to gain maximum performance.  You would need it per-device, I
> do it per-system.  Its not a big deal in the grand-scheme of things,
> really.  But chalk that up as an advantage to my approach over yours,
> nonetheless.
>    

Without measurements, it's just handwaving.

>> I guess it isn't that important then.  I note that clever prioritization
>> in a guest is pointless if you can't do the same prioritization in the
>> host.
>>      
> I answer this below...
>
> The point is that I am eliminating as many exits as possible, so 1us,
> 2us, whatever...it doesn't matter.  The fastest exit is the one you
> don't have to take.
>    

You'll still have to exit if the host takes a low priority interrupt, 
schedule the irq thread according to its priority, and return to the 
guest.  At this point you may as well inject the interrupt and let the 
guest do the same thing.

>> IIRC we reuse the PCI IDs for non-PCI.
>>      
>
> You already know how I feel about this gem.
>    

The earth keeps rotating despite the widespread use of PCI IDs.

>> I'm not okay with it.  If you wish people to adopt vbus over virtio
>> you'll have to address all concerns, not just yours.
>>      
> By building a community around the development of vbus, isnt this what I
> am doing?  Working towards making it usable for all?
>    

I've no idea if you're actually doing that.  Maybe inclusion should be 
predicated on achieving feature parity.

>>>> and multiqueue out of your design.
>>>>
>>>>          
>>> AFAICT, multiqueue should work quite nicely with vbus.  Can you
>>> elaborate on where you see the problem?
>>>
>>>        
>> You said you aren't interested in it previously IIRC.
>>
>>      
> I don't think so, no.  Perhaps I misspoke or was misunderstood.  I
> actually think its a good idea and will be looking to do this.
>    

When I pointed out that multiplexing all interrupts onto a single vector 
is bad for per-vcpu multiqueue, you said you're not interested in that.

>> I agree that it isn't very clever (not that I am a real time expert) but
>> I disagree about dismissing Linux support so easily.  If prioritization
>> is such a win it should be a win on the host as well and we should make
>> it work on the host as well.  Further I don't see how priorities on the
>> guest can work if they don't on the host.
>>      
> Its more about task priority in the case of real-time.  We do stuff with
> 802.1p as well for control messages, etc.  But for the most part, this
> is an orthogonal effort.  And yes, you are right, it would be nice to
> have this interrupt classification capability in the host.
>
> Generally this is mitigated by the use of irq-threads.  You could argue
> that if irq-threads help the host without a prioritized interrupt
> controller, why cant the guest?  The answer is simply that the host can
> afford sub-optimal behavior w.r.t. IDT injection here, where the guest
> cannot (due to the disparity of hw-injection vs guest-injection overheads).
>    

Guest injection overhead is not too bad; most of the cost is the exit 
itself, and you can't avoid that without host task priorities.




>> They had to write 414 lines in drivers/s390/kvm/kvm_virtio.c and
>> something similar for lguest.
>>      
> Well, then I retract that statement.  I think the small amount of code
> is probably because they are re-using the qemu device-models, however.
>    

No, that's guest code; it isn't related to qemu.

> Note that I am essentially advocating the same basic idea here.
>    

Right, duplicating existing infrastructure.

>> I don't see what vbus adds to virtio-net.
>>      
> Well, as you stated in your last reply, you don't want it.  So I guess
> that doesn't matter much at this point.  I will continue developing
> vbus, and pushing things your way.  You can opt to accept or reject
> those things at your own discretion.
>    

I'm not the one to merge it.  However my opinion is that it shouldn't be 
merged.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
@ 2009-08-19 20:48                             ` Ingo Molnar
  2009-08-19 20:53                               ` configfs/sysfs Avi Kivity
  2009-08-19 21:19                             ` configfs/sysfs Nicholas A. Bellinger
                                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2009-08-19 20:48 UTC (permalink / raw)
  To: Avi Kivity, Peter Zijlstra
  Cc: Nicholas A. Bellinger, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker


* Avi Kivity <avi@redhat.com> wrote:

> You may argue, correctly, that syscalls and ioctls are 
> not as flexible.  But this is because no one has 
> invested the effort in making them so.  A struct passed 
> as an argument to a syscall is not extensible.  But if 
> you pass the size of the structure, and also a bitmap 
> of which attributes are present, you gain extensibility 
> and retain the atomicity property of a syscall 
> interface.  I don't think a lot of effort is needed to 
> make an extensible syscall interface just as usable and 
> a lot more efficient than configfs/sysfs.  It should 
> also be simple to bolt a fuse interface on top to 
> expose it to us commandline types.

FYI, an example of such a syscall design and 
implementation has been merged upstream in the .31 merge 
window, see:

 kernel/perf_counter.c::sys_perf_counter_open()

SYSCALL_DEFINE5(perf_counter_open,
                struct perf_counter_attr __user *, attr_uptr,
                pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)

We embed a '.size' field in struct perf_counter_attr. We 
copy the attribute from user-space in an 
'auto-extend-to-zero' way:

        ret = perf_copy_attr(attr_uptr, &attr);
        if (ret)
                return ret;

where perf_copy_attr() extends the possibly-smaller 
user-space structure to the in-kernel structure and 
zeroes out remaining fields.

This means that older binaries can pass in older 
(smaller) versions of the structure.
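
Simplified, the copy step amounts to something like the sketch below
(the real perf_copy_attr() does more checking, e.g. it tolerates a
larger user structure as long as the tail the kernel doesn't know
about is all zeroes):

static int copy_attr(struct perf_counter_attr __user *uattr,
                     struct perf_counter_attr *attr)
{
        u32 size;

        memset(attr, 0, sizeof(*attr));

        if (get_user(size, &uattr->size))
                return -EFAULT;
        if (!size)
                size = sizeof(*attr);   /* old default */
        if (size > sizeof(*attr))
                return -E2BIG;          /* (simplification, see above) */

        /* copy what the caller gave us; the rest stays zeroed */
        if (copy_from_user(attr, uattr, size))
                return -EFAULT;

        attr->size = sizeof(*attr);
        return 0;
}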

This syscall ABI design works very well and has a lot of 
advantages:

 - is extensible in a flexible way

 - it is forwards ABI compatible

 - the kernel is backwards compatible with applications

 - extensions to the ABI don't uglify the interface.

 - new applications can fall back gracefully to older ABI 
   versions if they so choose. (the kernel will reject 
   overlarge attr.size) So full forwards and backwards 
   compatibility can be implemented, if an app wants to.

 - 'same version' ABI uses don't have any interface quirk 
   or performance penalty. (i.e. there's no increasingly 
   complex maze of add-on ABI details for the syscall to 
   multiplex through)

 - the system call stays nice and readable

We've made use of this property of the perfcounters ABI 
and extended it in a compatible way several times 
already, with great success.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 20:37                           ` Avi Kivity
@ 2009-08-19 20:53                             ` Ingo Molnar
  2009-08-20 17:25                             ` Muli Ben-Yehuda
  2009-08-20 20:58                               ` Caitlin Bestler
  2 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2009-08-19 20:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, kvm, alacrityvm-devel, linux-kernel, netdev,
	Michael S. Tsirkin, Patrick Mullaney


* Avi Kivity <avi@redhat.com> wrote:

>>> IIRC we reuse the PCI IDs for non-PCI.
>>>      
>>
>> You already know how I feel about this gem.
>
> The earth keeps rotating despite the widespread use of 
> PCI IDs.

Btw., PCI IDs are a great way to arbitrate interfaces 
planet-wide, in an OS-neutral, depoliticized and 
well-established way.

It's a bit like CPUID for CPUs, just on a much larger 
scope.

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 20:48                             ` configfs/sysfs Ingo Molnar
@ 2009-08-19 20:53                               ` Avi Kivity
  0 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-19 20:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Nicholas A. Bellinger, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder, Joel Becker

On 08/19/2009 11:48 PM, Ingo Molnar wrote:
>
> FYI, an example of such a syscall design and
> implementation has been merged upstream in the .31 merge
> window, see:
>
> <big snip>
>
>    

Exactly.  It's beautiful.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 16:38                                         ` Avi Kivity
@ 2009-08-19 21:05                                           ` Hollis Blanchard
  2009-08-20  9:57                                               ` Stefan Hajnoczi
  0 siblings, 1 reply; 132+ messages in thread
From: Hollis Blanchard @ 2009-08-19 21:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, Gregory Haskins, kvm, netdev,
	linux-kernel, alacrityvm-devel, Anthony Liguori, Ingo Molnar,
	Gregory Haskins, Stefan Hajnoczi

On Wed, 2009-08-19 at 19:38 +0300, Avi Kivity wrote:
> On 08/19/2009 07:29 PM, Ira W. Snyder wrote:
> >
> >    
> >> virtio-$yourhardware or maybe virtio-dma
> >>
> >>      
> > How about virtio-phys?
> >    
> 
> Could work.
> 
> > Arnd and BenH are both looking at PPC systems (similar to mine). Grant
> > Likely is looking at talking to an processor core running on an FPGA,
> > IIRC. Most of the code can be shared, very little should need to be
> > board-specific, I hope.
> >    
> 
> Excellent.
> 
> >>> That said, I'm not sure how qemu-system-ppc running on x86 could
> >>> possibly communicate using virtio-net. This would mean the guest is an
> >>> emulated big-endian PPC, while the host is a little-endian x86. I
> >>> haven't actually tested this situation, so perhaps I am wrong.
> >>>
> >>>        
> >> I'm confused now.  You don't actually have any guest, do you, so why
> >> would you run qemu at all?
> >>
> >>      
> > I do not run qemu. I am just stating a problem with virtio-net that I
> > noticed. This is just so someone more knowledgeable can be aware of the
> > problem.
> >
> >    
> 
> Ah, it certainly doesn't byteswap.  Maybe nobody tried it.  Hollis?

I've never tried it. I've only used virtio with matching guest/host
architectures.

-- 
Hollis Blanchard
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
  2009-08-19 20:48                             ` configfs/sysfs Ingo Molnar
@ 2009-08-19 21:19                             ` Nicholas A. Bellinger
  2009-08-19 22:15                             ` configfs/sysfs Gregory Haskins
  2009-08-19 22:16                             ` configfs/sysfs Joel Becker
  3 siblings, 0 replies; 132+ messages in thread
From: Nicholas A. Bellinger @ 2009-08-19 21:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder,
	Joel Becker

On Wed, 2009-08-19 at 23:12 +0300, Avi Kivity wrote:
> On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > Anyways, I was wondering if you might be interesting in sharing your
> > concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> >    
> 
> My concerns aren't specifically with configfs, but with all the text 
> based pseudo filesystems that the kernel exposes.
> 

<nod>

> My high level concern is that we're optimizing for the active sysadmin, 
> not for libraries and management programs.  configfs and sysfs are easy 
> to use from the shell, discoverable, and easily scripted.  But they 
> discourage documentation, the text format is ambiguous, and they require 
> a lot of boilerplate to use in code.
> 
> You could argue that you can wrap *fs in a library that hides the 
> details of accessing it, but that's the wrong approach IMO.  We should 
> make the information easy to use and manipulate for programs; one of 
> these programs can be a fuse filesystem for the active sysadmin if 
> someone thinks it's important.
> 
> Now for the low level concerns:
> 
> - efficiency
> 
> Each attribute access requires an open/read/close triplet and 
> binary->ascii->binary conversions.  In contrast an ordinary 
> syscall/ioctl interface can fetch all attributes of an object, or even 
> all attributes of all objects, in one call.
> 

I agree that syscalls/ioctls can, given enough coding effort, use a
potentially much smaller number of total syscalls than a pseudo
filesystem such as configfs.  In the case of the configfs enabled
generic target engine, I have not found this to be particularly limiting
in terms of management on modern x86_64 virtualized hardware inside of
KVM Guests with my development so far..

> - atomicity
> 
> One attribute per file means that, lacking userspace-visible 
> transactions, there is no way to change several attributes at once.  
> When you read attributes,

Actually, something like this can be done in struct
config_item_type->ct_attrs[] by changing the attributes you want, but
not making them active until pulling a separate configfs 'trigger' item
in the group to make the changes take effect.

I am doing something similar to this now during fabric bringup: each
iSCSI Target module is configured, and then an enable trigger is thrown
to allow iSCSI Initiators to actually login to the endpoint, and to
prevent endpoints from being active before all of the Ports and ACLs
have been configured for each iSCSI endpoint.

This logic is not built into ConfigFS of course, but it does give the
same effect.
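
Roughly, the pattern looks like this (a sketch with made-up names, not
the actual LIO code): the regular attribute stores only stage values,
and the 'enable' attribute applies everything under one lock.

#include <linux/kernel.h>
#include <linux/mutex.h>

struct foo_endpoint {
        struct mutex    lock;
        u32             staged_queue_depth;     /* written by normal stores */
        u32             queue_depth;            /* live value */
        bool            enabled;
};

static ssize_t foo_enable_store(struct foo_endpoint *ep,
                                const char *page, size_t count)
{
        unsigned long val;

        if (strict_strtoul(page, 0, &val))
                return -EINVAL;

        mutex_lock(&ep->lock);
        if (val) {
                /* apply the staged configuration, then go live */
                ep->queue_depth = ep->staged_queue_depth;
                ep->enabled = true;
        } else {
                ep->enabled = false;
        }
        mutex_unlock(&ep->lock);

        return count;
}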

>  there is no way to read several attributes 
> atomically so you can be sure their values correlate.

In this case, even though adding multiple values per attribute is
discouraged per the upstream sysfs layout, using a single configfs
attribute to read multiple values from other individual attributes that
need to be read atomically is the primary option today wrt existing code.

Not ideal with configfs, but it is easy to do.
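
For example, a single show() can print several correlated values under
one lock so the reader gets a consistent snapshot (again just a sketch):

#include <linux/kernel.h>
#include <linux/mutex.h>

struct foo_stats {
        struct mutex    lock;
        u64             reads, writes, errors;
};

static ssize_t foo_stats_show(struct foo_stats *s, char *page)
{
        ssize_t len;

        mutex_lock(&s->lock);
        len = sprintf(page, "reads=%llu writes=%llu errors=%llu\n",
                      (unsigned long long)s->reads,
                      (unsigned long long)s->writes,
                      (unsigned long long)s->errors);
        mutex_unlock(&s->lock);

        return len;
}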

>   Another example 
> of a problem is when an object disappears while reading its attributes.  
> Sure, openat() can mitigate this, but it's better to avoid introducing 
> problem than having a fix.
> 

<not sure on this one..>

> - ambiguity
> 
> What format is the attribute?  does it accept lowercase or uppercase hex 
> digits?  is there a newline at the end?  how many digits can it take 
> before the attribute overflows?  All of this has to be documented and 
> checked by the OS, otherwise we risk regressions later.  In contrast, 
> __u64 says everything in a binary interface.
> 

Yes, you need to make strict_str*() calls on the configfs attribute
store() functions with casts to locally defined variable types.  Using
strtoul() and strtoull() has been working fine for me in the context of
the generic target engine, but point taken about the usefulness in
having access to the format metadata of a given attribute.

> - lifetime and access control
> 
> If a process brings an object into being (using mkdir) and then dies, 
> the object remains behind. 

I think this depends on how struct configfs_group_operations->make_group()
and ->drop_item() are being used.  For example, I typically allocate a
TCM related data structure during the make_group() call containing a
struct config_group member that is registered with
config_group_init_type_name() upon a successful mkdir(2) call.

When drop_item() is called via rmdir(2) on the item that references the
struct config_group, the original data structure containing that struct
config_group is released with config_item_put(), and the TCM-allocated
data structure is freed.

While in use, the registered struct config_group can be pinned with
configfs_depend_item(), which has some interesting limitations of its
own.
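
Stripped down, that lifecycle looks roughly like this (hypothetical
names; the real TCM code does quite a bit more in both paths):

#include <linux/configfs.h>
#include <linux/slab.h>

struct foo_dev {
        struct config_group group;
        /* ... private state ... */
};

static struct config_item_type foo_dev_type;    /* its release() kfree()s foo_dev */

/* mkdir(2) under the parent group ends up here */
static struct config_group *foo_make_group(struct config_group *parent,
                                           const char *name)
{
        struct foo_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

        if (!dev)
                return NULL;

        config_group_init_type_name(&dev->group, name, &foo_dev_type);
        return &dev->group;
}

/* rmdir(2): drop the final reference; the item's release() frees foo_dev */
static void foo_drop_item(struct config_group *parent,
                          struct config_item *item)
{
        config_item_put(item);
}

static struct configfs_group_operations foo_group_ops = {
        .make_group     = foo_make_group,
        .drop_item      = foo_drop_item,
};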

>  The syscall/ioctl approach ties the object 
> into an fd, which will be destroyed when the process dies, and which can 
> be passed around using SCM_RIGHTS, allowing a server process to create 
> and configure an object before passing it to an unprivileged program
> 

<nod>  I have not personally had this requirement so I can't add much
here..

> - notifications
> 
> It's hard to notify users about changes in attributes.  Sure, you can 
> use inotify, but that limits you to watching subtrees.  Once you do get 
> the notification, you run into the atomicity problem.  When do you know 
> all attributes are valid?  This can be solved using sequence counters, 
> but that's just gratuitous complexity.  Netlink type interfaces are much 
> more robust and flexible.
> 

nor the notify case either..

> - readdir
> 
> You can either list everything, or nothing.  Sure, you can have trees to 
> ease searching, even multiple views of the same data, but it's painful.
> 
> You may argue, correctly, that syscalls and ioctls are not as flexible.  
> But this is because no one has invested the effort in making them so.

I think that new syscalls are great when you can get them merged (as KVM
is quite important, that means not a problem), and I am sure you guys
can make an ioctl contort into all manner of positions.

Perhaps it is just that I think that the code to manage complex ioctl
interaction can get quite ugly from my experience, and doing backwards
compat with interpreted code makes life easier, at least for me.

>   A 
> struct passed as an argument to a syscall is not extensible.  But if you 
> pass the size of the structure, and also a bitmap of which attributes 
> are present, you gain extensibility and retain the atomicity property of 
> a syscall interface.  I don't think a lot of effort is needed to make an 
> extensible syscall interface just as usable and a lot more efficient 
> than configfs/sysfs.

Good point; however, in terms of typical management scenarios in my
experience with TCM/LIO 3.x, I have not found the loss of efficiency of
using configfs compared to the legacy IOCTL for controlling the fabric
to be a problem in typical usage cases.

That said, I am sure there must be particular cases in the
virtualization world where having those syscalls is critical, for which
a configfs enabled generic target does not make sense.

>   It should also be simple to bolt a fuse interface 
> on top to expose it to us commandline types.
>

That would be interesting..

> > As you may recall, I have been using configfs extensively for the 3.x
> > generic target core infrastructure and iSCSI fabric modules living in
> > lio-core-2.6.git/drivers/target/target_core_configfs.c and
> > lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
> > it to be extraordinarly useful for the purposes of a implementing a
> > complex kernel level target mode stack that is expected to manage
> > massive amounts of metadata, allow for real-time configuration, share
> > data structures (eg: SCSI Target Ports) between other kernel fabric
> > modules and manage the entire set of fabrics using only intrepetered
> > userspace code.
> >
> > Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs<->  iSCSI Target
> > Endpoints inside of a KVM Guest (from the results in May posted with
> > IOMMU aware 10 Gb on modern Nahelem hardware, see
> > http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
> > dump the entire running target fabric configfs hierarchy to a single
> > struct file on a KVM Guest root device using python code on the order of
> > ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
> > this means:
> >
> > *) 7 configfs groups (directories), ~50 configfs attributes (files) per
> > Virtual HBA+FILEIO LUN
> > *) 15 configfs groups (directories), ~60 configfs attributes (files per
> > iSCSI fabric Endpoint
> >
> > Which comes out to a total of ~220000 groups and ~1100000 attributes
> > active configfs objects living in the configfs_dir_cache that are being
> > dumped inside of the single KVM guest instances, including symlinks
> > between the fabric modules to establish the SCSI ports containing
> > complete set of SPC-4 and RFC-3720 features, et al.
> >    
> 
> You achieved 3 million syscalls/sec from Python code?  That's very 
> impressive.

Well, that is dumping the running configfs for everything.  In more
typical usage cases of the TCM/LIO configfs fabric, specific Virtual
HBAs+LUNs and iSCSI Fabric endpoints would be changing individually, as
each Virtual HBA and iSCSI endpoint is completely independent of the
others and is intended to be administered that way.

You can even run multiple for loops from different shell processes to
create the endpoints in parallel, using UUID and iSCSI WWN naming for
multithreaded configfs fabric bringup.

> 
> Note with syscalls you could have done it with 10K syscalls (Python 
> supports packing and unpacking structs quite well, and also directly 
> calling C code IIRC).
> 
> > Also on the kernel<->  user API interaction compatibility side, I have
> > found the 3.x configfs enabled code adventagous over the LIO 2.9 code
> > (that used an ioctl for everything) because it allows us to do backwards
> > compat for future versions without using any userspace C code, which in
> > IMHO makes maintaining userspace packages for complex kernel stacks with
> > massive amounts of metadata + real-time configuration considerations.
> > No longer having ioctl compatibility issues between LIO versions as the
> > structures passed via ioctl change, and being able to do backwards
> > compat with small amounts of interpreted code against configfs layout
> > changes makes maintaining the kernel<->  user API really have made this
> > that much easier for me.
> >    
> 
> configfs is more maintainable that a bunch of hand-maintained ioctls.  

<nod>

> But if we put some effort into an extendable syscall infrastructure 
> (perhaps to the point of using an IDL) I'm sure we can improve on that 
> without the problems pseudo filesystems introduce.
> 

Understood.  While I think configfs is grand for a number of purposes,
I am certainly not foolish enough to think it is perfect for
everything.

> > Anyways, I though these might be useful to the discussion as it releates
> > to potental uses of configfs on the KVM Host or other projects that
> > really make sense, and/or to improve the upstream implementation so that
> > other users (like myself) can benefit from improvements to configfs.
> >    
> 
> I can't really fault a project for using configfs; it's an accepted and 
> recommented (by the community) interface.  I'd much prefer it though if 
> there was an effort to create a usable fd/struct based alternative.
> 

Thanks for your great comments Avi!

--nab





^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
  2009-08-19 20:48                             ` configfs/sysfs Ingo Molnar
  2009-08-19 21:19                             ` configfs/sysfs Nicholas A. Bellinger
@ 2009-08-19 22:15                             ` Gregory Haskins
  2009-08-19 22:16                             ` configfs/sysfs Joel Becker
  3 siblings, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-19 22:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder, Joel Becker

[-- Attachment #1: Type: text/plain, Size: 11227 bytes --]

Avi Kivity wrote:
> On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
>> Anyways, I was wondering if you might be interesting in sharing your
>> concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
>>    
> 
> My concerns aren't specifically with configfs, but with all the text
> based pseudo filesystems that the kernel exposes.
> 
> My high level concern is that we're optimizing for the active sysadmin,
> not for libraries and management programs.  configfs and sysfs are easy
> to use from the shell, discoverable, and easily scripted.  But they
> discourage documentation, the text format is ambiguous, and they require
> a lot of boilerplate to use in code.
> 
> You could argue that you can wrap *fs in a library that hides the
> details of accessing it, but that's the wrong approach IMO.  We should
> make the information easy to use and manipulate for programs; one of
> these programs can be a fuse filesystem for the active sysadmin if
> someone thinks it's important.
> 
> Now for the low level concerns:
> 
> - efficiency
> 
> Each attribute access requires an open/read/close triplet and
> binary->ascii->binary conversions.  In contrast an ordinary
> syscall/ioctl interface can fetch all attributes of an object, or even
> all attributes of all objects, in one call.

I can only speak for vbus, but *fs access efficiency is not a problem.
It's all slow path anyway.

> 
> - atomicity
> 
> One attribute per file means that, lacking userspace-visible
> transactions, there is no way to change several attributes at once. 

Actually, I do think configfs has some rudimentary (but incomplete,
IIUC) support for transactional commits of updates.  In lieu of formal
support, this is also not generally a problem:  you can just add your
own transaction in the form of an explicit attribute.  For instance,
see the "enabled" attribute in venet-tap.  This lets you set all the
parameters and then hit "enabled" to make it act on the other settings
atomically.

For sysfs kernel updates, I think you can update the values under a
lock.  For sysfs userspace updates, I suppose you could do a similar
"explicit commit" attribute if it were needed.

> When you read attributes, there is no way to read several attributes
> atomically so you can be sure their values correlate.

This isn't a valid concern for configfs, unless you have multiple
userspace applications updating concurrently.  IIUC, configfs is only
changed by userspace, not the kernel.  So I suppose if you were
concerned about supporting this, you could use an advisory flock or
something.

For sysfs, this is a valid concern.  Generally, I do not think *fs
interfaces are a good match if you need that type of behavior (atomic
read of rapidly changing attributes), however.  FWIW, vbus does not need
this (the parameters do not generally change once established).

> Another example
> of a problem is when an object disappears while reading its attributes. 
> Sure, openat() can mitigate this, but it's better to avoid introducing
> problem than having a fix.

Again, that can only happen if another userspace app did that to you.
A possible solution might be advisory locking.
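
One way that might look from a cooperating management tool, assuming
flock() is used as a purely advisory convention between the tools
themselves (the path would be hypothetical):

#include <fcntl.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive advisory lock around the update so other
 * cooperating tools see a consistent object. */
static int update_attr_locked(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    int ret = -1;

    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) == 0) {
        ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
        flock(fd, LOCK_UN);
    }
    close(fd);
    return ret;
}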


> 
> - ambiguity
> 
> What format is the attribute?  does it accept lowercase or uppercase hex
> digits?  is there a newline at the end?  how many digits can it take
> before the attribute overflows?  All of this has to be documented and
> checked by the OS, otherwise we risk regressions later.  In contrast,
> __u64 says everything in a binary interface.

I don't think this is a legitimate concern.  I would think you have to
understand the ABI to use the interface regardless, no matter the
transport.  And either way, the kernel has to validate the input.

> 
> - lifetime and access control
> 
> If a process brings an object into being (using mkdir) and then dies,
> the object remains behind.

This is one of the big problems with configfs, I agree.  I guess you
could argue that the ioctl approach has the opposite problem (the
resource goes away when its owner does), which is to say it requires
the app to hang around.  A syscall is kind of in the middle, since it
doesn't expressly impose a policy on a given resource when a task dies.
You could certainly modify kernel/exit.c to add such a policy, I
suppose.  But ioctl has a distinct advantage in this regard.

All in all, I think ioctl wins here.

> The syscall/ioctl approach ties the object
> into an fd, which will be destroyed when the process dies, and which can
> be passed around using SCM_RIGHTS, allowing a server process to create
> and configure an object before passing it to an unprivileged program
> 
> - notifications
> 
> It's hard to notify users about changes in attributes.  Sure, you can
> use inotify, but that limits you to watching subtrees.

What's worse, inotify doesn't seem to work very well against *fs from
what I hear.

>  Once you do get
> the notification, you run into the atomicity problem.  When do you know
> all attributes are valid?  This can be solved using sequence counters,
> but that's just gratuitous complexity.  Netlink type interfaces are much
> more robust and flexible.
> 
> - readdir
> 
> You can either list everything, or nothing.  Sure, you can have trees to
> ease searching, even multiple views of the same data, but it's painful.

I do not see the problem here.  *fs structures dirs as objects, and
files as attributes.  A logical presentation of the data from that
perspective ensues.  Why is "readdir" a problem?  It gets all the
attributes of an "object" (sans potential consistency problems, as you
point out above).

> 
> You may argue, correctly, that syscalls and ioctls are not as flexible. 
> But this is because no one has invested the effort in making them so.  A
> struct passed as an argument to a syscall is not extensible.  But if you
> pass the size of the structure, and also a bitmap of which attributes
> are present, you gain extensibility and retain the atomicity property of
> a syscall interface.  I don't think a lot of effort is needed to make an
> extensible syscall interface just as usable and a lot more efficient
> than configfs/sysfs.  It should also be simple to bolt a fuse interface
> on top to expose it to us commandline types.

I think the strongest argument for having *fs-like models is that they
are a way to keep the "management tool" coupled with the kernel that
understands it.  This is quite nice in practice.

It's true that the interface exposed by *fs could be construed as an
"ABI", but that is generally more of an issue for userspace tools that
would turn around and read it, as opposed to a human sitting at the
shell.  Therefore, both the *fs and syscall/ioctl approaches suffer
from ABI mis-sync issues w.r.t. tools.  But *fs wins here because a
human can generally adapt dynamically to the change (e.g. by running
'tree' and looking for something recognizable), whereas syscall/ioctl
users have no choice...they are hosed.

It's true you could make an extensible syscall/ioctl interface, but do
note that you can use similar techniques (e.g. only add new attributes
to existing objects) on the *fs front as well.  So to me it comes down
to more or less the lifetime question (ioctl wins) versus the
auto-synchronized-tool benefit (*fs wins).  I am honestly not sure
which is better.
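
For what it's worth, the "size + attribute bitmap" idea quoted above
would presumably look something like the following; the struct and flag
names are invented for illustration, and this is not an existing ABI:

#include <stdint.h>

#define FOO_ATTR_MTU    (1u << 0)
#define FOO_ATTR_MAC    (1u << 1)
/* New attributes get new bits; old callers simply never set them. */

struct foo_config {
    uint32_t size;       /* sizeof() as the caller was compiled with */
    uint32_t attr_mask;  /* which of the fields below are valid */
    uint32_t mtu;
    uint8_t  mac[6];
    uint8_t  pad[2];
    /* Fields appended later grow the struct; 'size' tells the kernel
     * how much the caller actually passed in. */
};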


> 
>> As you may recall, I have been using configfs extensively for the 3.x
>> generic target core infrastructure and iSCSI fabric modules living in
>> lio-core-2.6.git/drivers/target/target_core_configfs.c and
>> lio-core-2.6.git/drivers/lio-core/iscsi_target_config.c, and have found
>> it to be extraordinarly useful for the purposes of a implementing a
>> complex kernel level target mode stack that is expected to manage
>> massive amounts of metadata, allow for real-time configuration, share
>> data structures (eg: SCSI Target Ports) between other kernel fabric
>> modules and manage the entire set of fabrics using only intrepetered
>> userspace code.
>>
>> Using the 10000 1:1 mapped TCM Virtual HBA+FILEIO LUNs<->  iSCSI Target
>> Endpoints inside of a KVM Guest (from the results in May posted with
>> IOMMU aware 10 Gb on modern Nahelem hardware, see
>> http://linux-iscsi.org/index.php/KVM-LIO-Target), we have been able to
>> dump the entire running target fabric configfs hierarchy to a single
>> struct file on a KVM Guest root device using python code on the order of
>> ~30 seconds for those 10000 active iSCSI endpoints.  In configfs terms,
>> this means:
>>
>> *) 7 configfs groups (directories), ~50 configfs attributes (files) per
>> Virtual HBA+FILEIO LUN
>> *) 15 configfs groups (directories), ~60 configfs attributes (files per
>> iSCSI fabric Endpoint
>>
>> Which comes out to a total of ~220000 groups and ~1100000 attributes
>> active configfs objects living in the configfs_dir_cache that are being
>> dumped inside of the single KVM guest instances, including symlinks
>> between the fabric modules to establish the SCSI ports containing
>> complete set of SPC-4 and RFC-3720 features, et al.
>>    
> 
> You achieved 3 million syscalls/sec from Python code?  That's very
> impressive.
> 
> Note with syscalls you could have done it with 10K syscalls (Python
> supports packing and unpacking structs quite well, and also directly
> calling C code IIRC).
> 
>> Also on the kernel<->  user API interaction compatibility side, I have
>> found the 3.x configfs enabled code adventagous over the LIO 2.9 code
>> (that used an ioctl for everything) because it allows us to do backwards
>> compat for future versions without using any userspace C code, which in
>> IMHO makes maintaining userspace packages for complex kernel stacks with
>> massive amounts of metadata + real-time configuration considerations.
>> No longer having ioctl compatibility issues between LIO versions as the
>> structures passed via ioctl change, and being able to do backwards
>> compat with small amounts of interpreted code against configfs layout
>> changes makes maintaining the kernel<->  user API really have made this
>> that much easier for me.
>>    
> 
> configfs is more maintainable that a bunch of hand-maintained ioctls. 
> But if we put some effort into an extendable syscall infrastructure
> (perhaps to the point of using an IDL) I'm sure we can improve on that
> without the problems pseudo filesystems introduce.
> 
>> Anyways, I though these might be useful to the discussion as it releates
>> to potental uses of configfs on the KVM Host or other projects that
>> really make sense, and/or to improve the upstream implementation so that
>> other users (like myself) can benefit from improvements to configfs.
>>    
> 
> I can't really fault a project for using configfs; it's an accepted and
> recommented (by the community) interface.  I'd much prefer it though if
> there was an effort to create a usable fd/struct based alternative.

Yeah, doing it manually with all the CAP bits gets old, fast, so I
agree that improvement here is welcome.

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
                                               ` (2 preceding siblings ...)
  2009-08-19 22:15                             ` configfs/sysfs Gregory Haskins
@ 2009-08-19 22:16                             ` Joel Becker
  2009-08-19 23:48                               ` [Alacrityvm-devel] configfs/sysfs Alex Tsariounov
                                                 ` (3 more replies)
  3 siblings, 4 replies; 132+ messages in thread
From: Joel Becker @ 2009-08-19 22:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder

On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> >Anyways, I was wondering if you might be interesting in sharing your
> >concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> 
> My concerns aren't specifically with configfs, but with all the text
> based pseudo filesystems that the kernel exposes.

	Phew!  It's not just me :-)

> My high level concern is that we're optimizing for the active
> sysadmin, not for libraries and management programs.  configfs and
> sysfs are easy to use from the shell, discoverable, and easily
> scripted.  But they discourage documentation, the text format is
> ambiguous, and they require a lot of boilerplate to use in code.

	I don't think they "discourage documentation" any more than any
ioctl we've ever had.  At least you can look at the names and values and
take a good stab at it (configfs is better than sysfs at this, by virtue
of what it does, but discoverability is certainly not as good as real
documentation).
	With an ioctl() that isn't (well) documented, you have to go
read the structure and probably even read the code that uses the
structure to be sure what you are doing.

> You could argue that you can wrap *fs in a library that hides the
> details of accessing it, but that's the wrong approach IMO.  We
> should make the information easy to use and manipulate for programs;
> one of these programs can be a fuse filesystem for the active
> sysadmin if someone thinks it's important.

	You are absolutely correct that they are a boon to the sysadmin,
whereas in theory programs can do better with binary interfaces.  Except
what programs?  I can't do an ioctl or a syscall from a shell script
(no, using bash's network capabilities to talk to netlink does not
count).  Same with perl/python/whatever where you have to write
boilerplate to create binary structures.
	These interfaces have two opposing forces acting on them.  They
provide a reasonably nice way to cross the user<->kernel boundary, so
people want to use them.  Programmatic things, like a power management
daemon for example, don't want sysadmins touching anything.  It's just
an interface for the daemon.  Conversely, some things are really knobs
for the sysadmin.  There's nothing else to it.  Why should they have to
code up a C program just to turn a knob?  Configfs, as its name implies,
really does exist for that second case.  It turns out that it's quite
nice to use for the first case too, but if folks wanted to go the
syscall route, no worries.
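
(To make the "knob" point concrete: this is roughly the boilerplate a
binary interface asks of the admin, versus a one-line echo.  The device
path, request number and struct below are invented for illustration.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct knob_arg {
    unsigned int value;
};

#define KNOB_SET _IOW('k', 1, struct knob_arg)  /* hypothetical */

int main(void)
{
    struct knob_arg arg = { .value = 42 };
    int fd = open("/dev/knob", O_RDWR);         /* hypothetical */

    if (fd < 0 || ioctl(fd, KNOB_SET, &arg) < 0) {
        perror("knob");
        return 1;
    }
    close(fd);
    return 0;
}
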
	I've said it many times.  We will never come up with one
over-arching solution to all the disparate use cases.  Instead, we
should use each facility - syscalls, ioctls, sysfs, configfs, etc - as
appropriate.  Even in the same program or subsystem.

> - atomicity
> 
> One attribute per file means that, lacking userspace-visible
> transactions, there is no way to change several attributes at once.
> When you read attributes, there is no way to read several attributes
> atomically so you can be sure their values correlate.  Another
> example of a problem is when an object disappears while reading its
> attributes.  Sure, openat() can mitigate this, but it's better to
> avoid introducing problem than having a fix.

	configfs has some atomicity capabilities, but not full
atomicity.  It's not the right tool for that sort of thing.

> - ambiguity
> 
> What format is the attribute?  does it accept lowercase or uppercase
> hex digits?  is there a newline at the end?  how many digits can it
> take before the attribute overflows?  All of this has to be
> documented and checked by the OS, otherwise we risk regressions
> later.  In contrast, __u64 says everything in a binary interface.

	Um, is that __u64 a pointer to a userspace object?  A key to a
lookup table?  A file descriptor that is padded out?  It's no less
ambiguous.

> - lifetime and access control
> 
> If a process brings an object into being (using mkdir) and then
> dies, the object remains behind.  The syscall/ioctl approach ties
> the object into an fd, which will be destroyed when the process
> dies, and which can be passed around using SCM_RIGHTS, allowing a
> server process to create and configure an object before passing it
> to an unprivileged program

	Most things here do *not* want to be tied to the lifetime of one
process.  We don't want our cpu_freq governor changing just because the
power manager died.

 
> You may argue, correctly, that syscalls and ioctls are not as
> flexible.  But this is because no one has invested the effort in
> making them so.  A struct passed as an argument to a syscall is not
> extensible.  But if you pass the size of the structure, and also a
> bitmap of which attributes are present, you gain extensibility and
> retain the atomicity property of a syscall interface.  I don't think
> a lot of effort is needed to make an extensible syscall interface
> just as usable and a lot more efficient than configfs/sysfs.  It
> should also be simple to bolt a fuse interface on top to expose it
> to us commandline types.

	Your extensible syscall still needs to be known.  The
flexibility provided by configfs and sysfs is of generic access to
non-generic things.  It's different.
	The follow-ups regarding the perf_counter call are a good
example.  If you know the perf_counter call, you can code up a C program
that asks what attributes or things are there.  But if you don't, you've
first got to find out that there's a perf_counter call, then learn how
to use it.  With configfs/sysfs, you notice that there's now a
perf_counter directory under a tree, and you can figure out what
attributes and items are there.
	But this is not the be-all-end-all.  Our syscalls should be more
flexible in the perf_counter way.  Not everything really needs to be
listable by some yokel sysadmin.

> configfs is more maintainable that a bunch of hand-maintained
> ioctls.  But if we put some effort into an extendable syscall
> infrastructure (perhaps to the point of using an IDL) I'm sure we
> can improve on that without the problems pseudo filesystems
> introduce.

	Oh, boy, IDL :-)  Seriously, if you can solve the "how do I just
poke around without actually writing C code or installing a
domain-specific binary" problem, you will probably get somewhere.
 
> I can't really fault a project for using configfs; it's an accepted
> and recommented (by the community) interface.  I'd much prefer it
> though if there was an effort to create a usable fd/struct based
> alternative.

	Oh, and configfs was explicitly designed to be interface
agnostic to the client.  The filesystem portions, to the best of my
ability, are not exposed to client drivers.  So you can replace the
configfs filesystem interface with a system call set that does the same
operations, and no configfs user will actually need to change their
code (if you want to change from text values to non-text, that would
require changing the show/store operation prototypes, but that's about
it).

Joel

-- 

A good programming language should have features that make the
kind of people who use the phrase "software engineering" shake
their heads disapprovingly.
	- Paul Graham

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] configfs/sysfs
  2009-08-19 22:16                             ` configfs/sysfs Joel Becker
@ 2009-08-19 23:48                               ` Alex Tsariounov
  2009-08-19 23:54                               ` configfs/sysfs Nicholas A. Bellinger
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 132+ messages in thread
From: Alex Tsariounov @ 2009-08-19 23:48 UTC (permalink / raw)
  To: Joel Becker
  Cc: Avi Kivity, kvm, Michael S. Tsirkin, netdev, linux-kernel,
	Nicholas A. Bellinger, alacrityvm-devel, Anthony Liguori,
	Ingo Molnar

On Wed, 2009-08-19 at 15:16 -0700, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > >Anyways, I was wondering if you might be interesting in sharing your
> > >concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> > 
> > My concerns aren't specifically with configfs, but with all the text
> > based pseudo filesystems that the kernel exposes.
> 
> 	Phew!  It's not just me :-)

The points on *fs vs. ioctl are interesting.  I think both have their
benefits and both have their downfalls, for example efficiency vs. ease
of (human) use.  I suppose it comes down to whether you're in the fast
path or not for the most part.  However, just because an interface is
not as efficient as it could be does not necessarily mean that it's not
a good one.

As as an example, many moons ago, I worked on implementing some serial
comms between an embedded speed controller and its command console.
Being young and efficiency starved ;) I disregarded our other
controllers which implemented these serial comms with ASCII strings, and
used binary blobs instead.  I indeed got some respectable performance
out of doing this, even to the effect of creating a "real time" status
monitor that updated multiple times a second via the hand-held terminal.

However, I totally missed the point of intentionally doing things
"inefficiently."  For example, our serial debugging setup consisted of
two VT100 terminals wired up with a custom serial cable that went
between two communicating units.  Each term would show what each end was
saying.  Kinda crude, but effective.  Of course with my new and improved
controller that "spoke the binary language of moisture evaporators,"
well, all one saw was garbage.  :)

Additionally, someone debugging the controllers could just use a term
to talk to one that used simple ASCII commands, taking terminals, hosts
and other software out of the picture; but for my controller, well, you
could only use the custom-programmed hand-held term.

I ended up supporting both ASCII and binary communications on the
controller for these (and other) reasons.  However, in the end, I
ditched the binary comms since they really didn't add efficiency in the
fast path, where it would have mattered.  (Well, I also ran out of
eprom space... :)

In any case, having a human-understandable communications protocol (or
ABI) can be extremely useful, and just because it's not efficient
doesn't automatically mean that it's a bad thing, especially if it's in
the slow path.  It does have its downsides, as mentioned in this
thread, so we really need both types.  Because of that, the fuse layer
on top of a binary ABI is an interesting idea.

Alex


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 22:16                             ` configfs/sysfs Joel Becker
  2009-08-19 23:48                               ` [Alacrityvm-devel] configfs/sysfs Alex Tsariounov
@ 2009-08-19 23:54                               ` Nicholas A. Bellinger
  2009-08-20  6:09                                 ` configfs/sysfs Avi Kivity
  2009-08-20  6:09                               ` configfs/sysfs Avi Kivity
  3 siblings, 0 replies; 132+ messages in thread
From: Nicholas A. Bellinger @ 2009-08-19 23:54 UTC (permalink / raw)
  To: Joel Becker
  Cc: Avi Kivity, Ingo Molnar, Anthony Liguori, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Ira W. Snyder

On Wed, 2009-08-19 at 15:16 -0700, Joel Becker wrote:
> On Wed, Aug 19, 2009 at 11:12:43PM +0300, Avi Kivity wrote:
> > On 08/19/2009 09:23 PM, Nicholas A. Bellinger wrote:
> > >Anyways, I was wondering if you might be interesting in sharing your
> > >concerns wrt to configfs (conigfs maintainer CC'ed), at some point..?
> > 
> > My concerns aren't specifically with configfs, but with all the text
> > based pseudo filesystems that the kernel exposes.
> 
> 	Phew!  It's not just me :-)
> 
> > My high level concern is that we're optimizing for the active
> > sysadmin, not for libraries and management programs.  configfs and
> > sysfs are easy to use from the shell, discoverable, and easily
> > scripted.  But they discourage documentation, the text format is
> > ambiguous, and they require a lot of boilerplate to use in code.
> 
> 	I don't think they "discourage documentation" anymore than any
> ioctl we've ever had.  At least you can look at the names and values and
> take a good stab at it (configfs is better than sysfs at this, by virtue
> of what it does, but discoverability is certainly not as good as real
> documentation).
> 	With an ioctl() that isn't (well) documented, you have to go
> read the structure and probably even read the code that uses the
> structure to be sure what you are doing.
> 

Good point..

> 
> > You could argue that you can wrap *fs in a library that hides the
> > details of accessing it, but that's the wrong approach IMO.  We
> > should make the information easy to use and manipulate for programs;
> > one of these programs can be a fuse filesystem for the active
> > sysadmin if someone thinks it's important.
> 
> 	You are absolutely correct that they are a boon to the sysadmin,
> where in theory programs can do better with binary interfaces.  Except
> what programs?  I can't do an ioctl or a syscall from a shell script
> (no, using bash's network capabilities to talk to netlink does not
> count).  Same with perl/python/whatever where you have to write
> boilerplate to create binary structures.

<nod>, then I suppose it begins to come down to how easily that
boilerplate can be used to add new groups and attributes by
developers.  In my experience using the CONFIGFS_EATTR() macros with
multiple struct config_groups hanging off the same make_group()
allocated internal TCM structure, this has been very easy for me once I
figured out why I really needed the extended macro set (again, to hang
multiple differently named struct config_groups off a single internally
allocated structure).  Joel, I know that you have been keeping the
configfs macros in sync with the parameters used for the original
matching sysfs macros (and that I have been using my own configfs macro
that can be used together with existing code), but I really do think
the extended macro set has benefit for users of configfs who put a
little bit of effort into understanding how they work.


> 	These interfaces have two opposing forces acting on them.  They
> provide a reasonably nice way to cross the user<->kernel boundary, so
> people want to use them.  Programmatic things, like a power management
> daemon for example, don't want sysadmins touching anything.  It's just
> an interface for the daemon.  Conversely, some things are really knobs
> for the sysadmin.  There's nothing else to it.  Why should they have to
> code up a C program just to turn a knob?  Configfs, as its name implies,
> really does exist for that second case. 

I think this is a very good point that really shows the benefits of a
configfs based design for real-world admin usability and
configurability (CLI building blocks for higher level UIs).  Admins get
the ability to modify non-compiled code to suit their needs on top of a
user-defined configfs directory structure of groups/directories
(assuming config groups have some sort of project-defined naming
requirements in each defined struct
configfs_item_operations->make_group()), with synchronization done on
an individual configfs group context for creation/deletion and
optionally for the I/O access of attributes within said group.

>  It turns out that it's quite
> nice to use for the first case too, but if folks wanted to go the
> syscall route, no worries.
> 	I've said it many times.  We will never come up with one
> over-arching solution to all the disparate use cases.  Instead, we
> should use each facility - syscalls, ioctls, sysfs, configfs, etc - as
> appropriate.  Even in the same program or subsystem.
> 
> > - atomicity
> > 
> > One attribute per file means that, lacking userspace-visible
> > transactions, there is no way to change several attributes at once.
> > When you read attributes, there is no way to read several attributes
> > atomically so you can be sure their values correlate.  Another
> > example of a problem is when an object disappears while reading its
> > attributes.  Sure, openat() can mitigate this, but it's better to
> > avoid introducing problem than having a fix.
> 
> 	configfs has some atomicity capabilities, but not full
> atomicity.  It's not the right too for that sort of thing.
> 
> > - ambiguity
> > 
> > What format is the attribute?  does it accept lowercase or uppercase
> > hex digits?  is there a newline at the end?  how many digits can it
> > take before the attribute overflows?  All of this has to be
> > documented and checked by the OS, otherwise we risk regressions
> > later.  In contrast, __u64 says everything in a binary interface.
> 
> 	Um, is that __u64 a pointer to a userspace object?  A key to a
> lookup table?  A file descriptor that is padded out?  It's no less
> ambiguous.
> 
> > - lifetime and access control
> > 
> > If a process brings an object into being (using mkdir) and then
> > dies, the object remains behind.  The syscall/ioctl approach ties
> > the object into an fd, which will be destroyed when the process
> > dies, and which can be passed around using SCM_RIGHTS, allowing a
> > server process to create and configure an object before passing it
> > to an unprivileged program
> 
> 	Most things here do *not* want to be tied to the lifetime of one
> process.  We don't want our cpu_freq governor changing just because the
> power manager died.
> 
>  
> > You may argue, correctly, that syscalls and ioctls are not as
> > flexible.  But this is because no one has invested the effort in
> > making them so.  A struct passed as an argument to a syscall is not
> > extensible.  But if you pass the size of the structure, and also a
> > bitmap of which attributes are present, you gain extensibility and
> > retain the atomicity property of a syscall interface.  I don't think
> > a lot of effort is needed to make an extensible syscall interface
> > just as usable and a lot more efficient than configfs/sysfs.  It
> > should also be simple to bolt a fuse interface on top to expose it
> > to us commandline types.
> 
> 	Your extensible syscall still needs to be known.  The
> flexibility provided by configfs and sysfs is of generic access to
> non-generic things.  It's different.
> 	The follow-ups regarding the perf_counter call are a good
> example.  If you know the perf_counter call, you can code up a C program
> that asks what attributes or things are there.  But if you don't, you've
> first got to find out that there's a perf_counter call, then learn how
> to use it.  With configfs/sysfs, you notice that there's now a
> perf_counter directory under a tree, and you can figure out what
> attributes and items are there.
> 	But this is not the be-all-end-all.  Our syscalls should be more
> flexible in the perf_counter way.  Not everything really needs to be
> listable by some yokel sysadmin.
> 
> > configfs is more maintainable that a bunch of hand-maintained
> > ioctls.  But if we put some effort into an extendable syscall
> > infrastructure (perhaps to the point of using an IDL) I'm sure we
> > can improve on that without the problems pseudo filesystems
> > introduce.
> 
> 	Oh, boy, IDL :-)  Seriously, if you can solve the "how do I just
> poke around without actually writing C code or installing a
> domain-specific binary" problem, you will probably get somewhere.
>  

Also, having a configfs directory hierarchy based on names provided by
the user, which can be accessed by higher level code or directly from
the shell with 'tree' and friends, is pretty nice too if you are the
admin running the box.  ;-)

> > I can't really fault a project for using configfs; it's an accepted
> > and recommented (by the community) interface.  I'd much prefer it
> > though if there was an effort to create a usable fd/struct based
> > alternative.
> 
> 	Oh, and configfs was explicitly designed to be interface
> agnostic to the client.  The filesystem portions, to the best of my
> ability, are not exposed to client drivers.  So you can replace the
> configfs filesystem interface with a system call set that does the same
> operations, and no configfs user will actually need to change their
> code (if you want to change from text values to non-text, that would
> require changing the show/store operation prototypes, but that's about
> it).
> 

Wow really..?  I was wondering if something like this was possible in
terms of different client interfaces for configfs ops, and where it
would (ever..?) make sense..

--nab

> Joel
> 


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-19 22:16                             ` configfs/sysfs Joel Becker
@ 2009-08-20  6:09                                 ` Avi Kivity
  2009-08-19 23:54                               ` configfs/sysfs Nicholas A. Bellinger
                                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-20  6:09 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder

On 08/20/2009 01:16 AM, Joel Becker wrote:
>> My high level concern is that we're optimizing for the active
>> sysadmin, not for libraries and management programs.  configfs and
>> sysfs are easy to use from the shell, discoverable, and easily
>> scripted.  But they discourage documentation, the text format is
>> ambiguous, and they require a lot of boilerplate to use in code.
>>      
> 	I don't think they "discourage documentation" anymore than any
> ioctl we've ever had.  At least you can look at the names and values and
> take a good stab at it (configfs is better than sysfs at this, by virtue
> of what it does, but discoverability is certainly not as good as real
> documentation).
> 	With an ioctl() that isn't (well) documented, you have to go
> read the structure and probably even read the code that uses the
> structure to be sure what you are doing.
>    

An ioctl structure and a configfs/sysfs readdir provide similar 
information (the structure also provides the types of fields and isn't 
able to hide some of these fields).

"Looking at the values" is what I meant by discouraging documentation.  
That implies looking at a self-documenting live system.  But that tells 
you nothing about which fields were added in which versions, or fields 
which are hidden because your hardware doesn't support them or because 
you didn't echo 1 > somewhere.

>> You could argue that you can wrap *fs in a library that hides the
>> details of accessing it, but that's the wrong approach IMO.  We
>> should make the information easy to use and manipulate for programs;
>> one of these programs can be a fuse filesystem for the active
>> sysadmin if someone thinks it's important.
>>      
> 	You are absolutely correct that they are a boon to the sysadmin,
> where in theory programs can do better with binary interfaces.  Except
> what programs?  I can't do an ioctl or a syscall from a shell script
> (no, using bash's network capabilities to talk to netlink does not
> count).  Same with perl/python/whatever where you have to write
> boilerplate to create binary structures.
>    

The maintainer of the subsystem should provide a library that talks to 
the binary interface and a CLI program that talks to the library.  
Boring nonkernely work.  Alternatively a fuse filesystem to talk to the 
library, or an IDL can replace the library.

> 	These interfaces have two opposing forces acting on them.  They
> provide a reasonably nice way to cross the user<->kernel boundary, so
> people want to use them.  Programmatic things, like a power management
> daemon for example, don't want sysadmins touching anything.  It's just
> an interface for the daemon.

Many things start oriented at people and then, if they're useful, cross 
the lines to machines.  You can convert a machine interface to a human 
interface at the cost of some work, but it's difficult to undo the 
deficiencies of a human oriented interface so it can be used by a program.

> Conversely, some things are really knobs
> for the sysadmin.

I disagree.  If it's useful for a human, it's useful for a machine.

Moreover, *fs+bash is a user interface.  It happens that bash is good at 
processing files, and filesystems are easily discoverable, so we code to 
that.  But we make it more difficult to provide other interfaces to the 
same controls.


> There's nothing else to it.  Why should they have to
> code up a C program just to turn a knob?

Many kernel developers believe that userspace is burned into ROM and the 
only thing they can change is the kernel.  That turns out to be 
incorrect.  If you don't want users to write C programs to access your 
interface, write your own library+CLI.  That will have the added benefit 
of providing meaningful errors as well ("Invalid argument" vs "frob must 
be between 52 and 91").  The program can have a configuration file so 
you don't need to reecho the values on boot.  It can have a --daemon 
mode and do something when an event occurs.

> Configfs, as its name implies,
> really does exist for that second case.  It turns out that it's quite
> nice to use for the first case too, but if folks wanted to go the
> syscall route, no worries.
>    

Eventually everything is used in the first case.  For example in the 
virtualization space it is common to have a zillion nodes running 
virtual machine that are only accessed by a management node.

> 	I've said it many times.  We will never come up with one
> over-arching solution to all the disparate use cases.  Instead, we
> should use each facility - syscalls, ioctls, sysfs, configfs, etc - as
> appropriate.  Even in the same program or subsystem.
>    

configfs is optional, but sysfs is not.  Everything exposed via sysfs 
needs to continue to be exposed via sysfs, and new things as well for 
consistency.  So now if someone wants a syscall interface they must 
duplicate the syscall interface, not replace it.

>> - ambiguity
>>
>> What format is the attribute?  does it accept lowercase or uppercase
>> hex digits?  is there a newline at the end?  how many digits can it
>> take before the attribute overflows?  All of this has to be
>> documented and checked by the OS, otherwise we risk regressions
>> later.  In contrast, __u64 says everything in a binary interface.
>>      
> 	Um, is that __u64 a pointer to a userspace object?  A key to a
> lookup table?  A file descriptor that is padded out?  It's no less
> ambiguous.
>    

__u64 says everything about the type and space requirements of a field.  
It doesn't describe everything (like the name of the field or what it 
means) but it does provide a bunch of boring information that people 
rarely document in other ways.

If my program reads a *fs field into a u32 and it later turns out the 
field was a u64, I'll get an overflow.  It's a lot harder to get that 
wrong with a typed interface.
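
A minimal sketch of the mismatch (the sysfs path is hypothetical):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char buf[64];
    FILE *f = fopen("/sys/class/foo/foo0/byte_count", "r");

    if (f && fgets(buf, sizeof(buf), f)) {
        /* The reader guessed 32 bits; if the attribute is really
         * 64-bit, the high bits are silently dropped here.  A binary
         * interface declaring the field __u64 fixes the width. */
        uint32_t count = (uint32_t)strtoull(buf, NULL, 0);
        printf("%u\n", count);
    }
    if (f)
        fclose(f);
    return 0;
}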

>> - lifetime and access control
>>
>> If a process brings an object into being (using mkdir) and then
>> dies, the object remains behind.  The syscall/ioctl approach ties
>> the object into an fd, which will be destroyed when the process
>> dies, and which can be passed around using SCM_RIGHTS, allowing a
>> server process to create and configure an object before passing it
>> to an unprivileged program
>>      
> 	Most things here do *not* want to be tied to the lifetime of one
> process.  We don't want our cpu_freq governor changing just because the
> power manager died.
>    

Using file descriptors doesn't force you to tie their lifetime to the 
fd; it only allows it.
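
For reference, the SCM_RIGHTS hand-off mentioned in the quoted text is
roughly the following; a minimal sketch with error handling omitted and
an illustrative function name:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass an already-configured fd to another process over a connected
 * UNIX domain socket. */
static int send_fd(int sock, int fd_to_pass)
{
    char dummy = '\0';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}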

>> You may argue, correctly, that syscalls and ioctls are not as
>> flexible.  But this is because no one has invested the effort in
>> making them so.  A struct passed as an argument to a syscall is not
>> extensible.  But if you pass the size of the structure, and also a
>> bitmap of which attributes are present, you gain extensibility and
>> retain the atomicity property of a syscall interface.  I don't think
>> a lot of effort is needed to make an extensible syscall interface
>> just as usable and a lot more efficient than configfs/sysfs.  It
>> should also be simple to bolt a fuse interface on top to expose it
>> to us commandline types.
>>      
> 	Your extensible syscall still needs to be known.  The
> flexibility provided by configfs and sysfs is of generic access to
> non-generic things.  It's different.
> 	The follow-ups regarding the perf_counter call are a good
> example.  If you know the perf_counter call, you can code up a C program
> that asks what attributes or things are there.  But if you don't, you've
> first got to find out that there's a perf_counter call, then learn how
> to use it.  With configfs/sysfs, you notice that there's now a
> perf_counter directory under a tree, and you can figure out what
> attributes and items are there.
>    

Right, that's the great allure of *fs, discoverability.  Everything is 
at your fingertips.  Except if you're writing a program to manage 
things.  The program can't explore *fs until it's run and usually does 
not want to present nongeneric things in a generic way.  Ultimately most 
of our users are behind programs.


>> configfs is more maintainable that a bunch of hand-maintained
>> ioctls.  But if we put some effort into an extendable syscall
>> infrastructure (perhaps to the point of using an IDL) I'm sure we
>> can improve on that without the problems pseudo filesystems
>> introduce.
>>      
> 	Oh, boy, IDL :-)  Seriously, if you can solve the "how do I just
> poke around without actually writing C code or installing a
> domain-specific binary" problem, you will probably get somewhere.
>    

IDL is very unpleasant to work with but it gets the work done.  I don't 
see an issue with domain specific binaries (except that you have to 
write them).  Some say there's the problem of distribution, but if the 
kernel distributed itself to the user somehow then the tool can be 
distributed just as well (maybe via tools/).

>> I can't really fault a project for using configfs; it's an accepted
>> and recommented (by the community) interface.  I'd much prefer it
>> though if there was an effort to create a usable fd/struct based
>> alternative.
>>      
> 	Oh, and configfs was explicitly designed to be interface
> agnostic to the client.  The filesystem portions, to the best of my
> ability, are not exposed to client drivers.  So you can replace the
> configfs filesystem interface with a system call set that does the same
> operations, and no configfs user will actually need to change their
> code (if you want to change from text values to non-text, that would
> require changing the show/store operation prototypes, but that's about
> it).
>
>
>    

But the user visible part is now ABI.  I have no issues with the kernel 
internals.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
@ 2009-08-20  6:09                                 ` Avi Kivity
  0 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-20  6:09 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linu

On 08/20/2009 01:16 AM, Joel Becker wrote:
>> My high level concern is that we're optimizing for the active
>> sysadmin, not for libraries and management programs.  configfs and
>> sysfs are easy to use from the shell, discoverable, and easily
>> scripted.  But they discourage documentation, the text format is
>> ambiguous, and they require a lot of boilerplate to use in code.
>>      
> 	I don't think they "discourage documentation" anymore than any
> ioctl we've ever had.  At least you can look at the names and values and
> take a good stab at it (configfs is better than sysfs at this, by virtue
> of what it does, but discoverability is certainly not as good as real
> documentation).
> 	With an ioctl() that isn't (well) documented, you have to go
> read the structure and probably even read the code that uses the
> structure to be sure what you are doing.
>    

An ioctl structure and a configfs/sysfs readdir provide similar 
information (the structure also provides the types of fields and isn't 
able to hide some of these fields).

"Looking at the values" is what I meant by discouraging documentation.  
That implies looking at a self-documenting live system.  But that tells 
you nothing about which fields were added in which versions, or fields 
which are hidden because your hardware doesn't support them or because 
you didn't echo 1 > somewhere.

>> You could argue that you can wrap *fs in a library that hides the
>> details of accessing it, but that's the wrong approach IMO.  We
>> should make the information easy to use and manipulate for programs;
>> one of these programs can be a fuse filesystem for the active
>> sysadmin if someone thinks it's important.
>>      
> 	You are absolutely correct that they are a boon to the sysadmin,
> where in theory programs can do better with binary interfaces.  Except
> what programs?  I can't do an ioctl or a syscall from a shell script
> (no, using bash's network capabilities to talk to netlink does not
> count).  Same with perl/python/whatever where you have to write
> boilerplate to create binary structures.
>    

The maintainer of the subsystem should provide a library that talks to 
the binary interface and a CLI program that talks to the library.  
Boring nonkernely work.  Alternatively a fuse filesystem to talk to the 
library, or an IDL can replace the library.

> 	These interfaces have two opposing forces acting on them.  They
> provide a reasonably nice way to cross the user<->kernel boundary, so
> people want to use them.  Programmatic things, like a power management
> daemon for example, don't want sysadmins touching anything.  It's just
> an interface for the daemon.

Many things start oriented at people and then, if they're useful, cross 
the lines to machines.  You can convert a machine interface to a human 
interface at the cost of some work, but it's difficult to undo the 
deficiencies of a human oriented interface so it can be used by a program.

> Conversely, some things are really knobs
> for the sysadmin.

I disagree.  If it's useful for a human, it's useful for a machine.

Moreover, *fs+bash is a user interface.  It happens that bash is good at 
processing files, and filesystems are easily discoverable, so we code to 
that.  But we make it more difficult to provide other interfaces to the 
same controls.


> There's nothing else to it.  Why should they have to
> code up a C program just to turn a knob?

Many kernel developers believe that userspace is burned into ROM and the 
only thing they can change is the kernel.  That turns out to be 
incorrect.  If you don't want users to write C programs to access your 
interface, write your own library+CLI.  That will have the added benefit 
of providing meaningful errors as well ("Invalid argument" vs "frob must 
be between 52 and 91").  The program can have a configuration file so 
you don't need to re-echo the values on boot.  It can have a --daemon 
mode and do something when an event occurs.

> Configfs, as its name implies,
> really does exist for that second case.  It turns out that it's quite
> nice to use for the first case too, but if folks wanted to go the
> syscall route, no worries.
>    

Eventually everything is used in the first case.  For example in the 
virtualization space it is common to have a zillion nodes running 
virtual machines that are only accessed by a management node.

> 	I've said it many times.  We will never come up with one
> over-arching solution to all the disparate use cases.  Instead, we
> should use each facility - syscalls, ioctls, sysfs, configfs, etc - as
> appropriate.  Even in the same program or subsystem.
>    

configfs is optional, but sysfs is not.  Everything exposed via sysfs 
needs to continue to be exposed via sysfs, and new things as well for 
consistency.  So now if someone wants a syscall interface they must 
duplicate the sysfs interface, not replace it.

>> - ambiguity
>>
>> What format is the attribute?  does it accept lowercase or uppercase
>> hex digits?  is there a newline at the end?  how many digits can it
>> take before the attribute overflows?  All of this has to be
>> documented and checked by the OS, otherwise we risk regressions
>> later.  In contrast, __u64 says everything in a binary interface.
>>      
> 	Um, is that __u64 a pointer to a userspace object?  A key to a
> lookup table?  A file descriptor that is padded out?  It's no less
> ambiguous.
>    

__u64 says everything about the type and space requirements of a field.  
It doesn't describe everything (like the name of the field or what it 
means) but it does provide a bunch of boring information that people 
rarely document in other ways.

If my program reads a *fs field into a u32 and it later turns out the 
field was a u64, I'll get an overflow.  It's a lot harder to get that 
wrong with a typed interface.
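
A small illustration of the silent truncation being described, assuming a
hypothetical text attribute whose value outgrew 32 bits:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *attr = "4294967296\n";          /* 2^32: needs a u64 */
	uint32_t narrow = (uint32_t)strtoull(attr, NULL, 10);  /* wraps to 0 */
	uint64_t wide   = strtoull(attr, NULL, 10);            /* 4294967296 */

	printf("narrow=%u wide=%llu\n", narrow, (unsigned long long)wide);
	return 0;
}

/* A binary interface with a __u64 field forces the caller to pick the
 * right width at compile time instead of discovering it in production. */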

>> - lifetime and access control
>>
>> If a process brings an object into being (using mkdir) and then
>> dies, the object remains behind.  The syscall/ioctl approach ties
>> the object into an fd, which will be destroyed when the process
>> dies, and which can be passed around using SCM_RIGHTS, allowing a
>> server process to create and configure an object before passing it
>> to an unprivileged program
>>      
> 	Most things here do *not* want to be tied to the lifetime of one
> process.  We don't want our cpu_freq governor changing just because the
> power manager died.
>    

Using file descriptors doesn't force you to tie their lifetime to the 
fd; it only allows it.
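
A minimal sketch of the fd hand-off described in the quote above: a
privileged process configures an object behind a file descriptor and
passes it to an unprivileged one over an AF_UNIX socket (error handling
omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* 'sock' is an already-connected AF_UNIX socket */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0);  /* receiver recvmsg()s a duplicate of fd */
}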

>> You may argue, correctly, that syscalls and ioctls are not as
>> flexible.  But this is because no one has invested the effort in
>> making them so.  A struct passed as an argument to a syscall is not
>> extensible.  But if you pass the size of the structure, and also a
>> bitmap of which attributes are present, you gain extensibility and
>> retain the atomicity property of a syscall interface.  I don't think
>> a lot of effort is needed to make an extensible syscall interface
>> just as usable and a lot more efficient than configfs/sysfs.  It
>> should also be simple to bolt a fuse interface on top to expose it
>> to us commandline types.
>>      
> 	Your extensible syscall still needs to be known.  The
> flexibility provided by configfs and sysfs is of generic access to
> non-generic things.  It's different.
> 	The follow-ups regarding the perf_counter call are a good
> example.  If you know the perf_counter call, you can code up a C program
> that asks what attributes or things are there.  But if you don't, you've
> first got to find out that there's a perf_counter call, then learn how
> to use it.  With configfs/sysfs, you notice that there's now a
> perf_counter directory under a tree, and you can figure out what
> attributes and items are there.
>    

Right, that's the great allure of *fs, discoverability.  Everything is 
at your fingertips.  Except if you're writing a program to manage 
things.  The program can't explore *fs until it's run and usually does 
not want to present nongeneric things in a generic way.  Ultimately most 
of our users are behind programs.


>> configfs is more maintainable than a bunch of hand-maintained
>> ioctls.  But if we put some effort into an extendable syscall
>> infrastructure (perhaps to the point of using an IDL) I'm sure we
>> can improve on that without the problems pseudo filesystems
>> introduce.
>>      
> 	Oh, boy, IDL :-)  Seriously, if you can solve the "how do I just
> poke around without actually writing C code or installing a
> domain-specific binary" problem, you will probably get somewhere.
>    

IDL is very unpleasant to work with but it gets the work done.  I don't 
see an issue with domain specific binaries (except that you have to 
write them).  Some say there's the problem of distribution, but if the 
kernel distributed itself to the user somehow then the tool can be 
distributed just as well (maybe via tools/).

>> I can't really fault a project for using configfs; it's an accepted
>> and recommended (by the community) interface.  I'd much prefer it
>> though if there was an effort to create a usable fd/struct based
>> alternative.
>>      
> 	Oh, and configfs was explicitly designed to be interface
> agnostic to the client.  The filesystem portions, to the best of my
> ability, are not exposed to client drivers.  So you can replace the
> configfs filesystem interface with a system call set that does the same
> operations, and no configfs user will actually need to change their
> code (if you want to change from text values to non-text, that would
> require changing the show/store operation prototypes, but that's about
> it).
>
>
>    

But the user visible part is now ABI.  I have no issues with the kernel 
internals.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus  model for vbus_driver objects
  2009-08-19 21:05                                           ` Hollis Blanchard
@ 2009-08-20  9:57                                               ` Stefan Hajnoczi
  0 siblings, 0 replies; 132+ messages in thread
From: Stefan Hajnoczi @ 2009-08-20  9:57 UTC (permalink / raw)
  To: Hollis Blanchard
  Cc: Avi Kivity, Ira W. Snyder, Michael S. Tsirkin, Gregory Haskins,
	kvm, netdev, linux-kernel, alacrityvm-devel, Anthony Liguori,
	Ingo Molnar, Gregory Haskins

On Wed, Aug 19, 2009 at 10:05 PM, Hollis Blanchard<hollisb@us.ibm.com> wrote:
> On Wed, 2009-08-19 at 19:38 +0300, Avi Kivity wrote:
>> On 08/19/2009 07:29 PM, Ira W. Snyder wrote:
>> >>> That said, I'm not sure how qemu-system-ppc running on x86 could
>> >>> possibly communicate using virtio-net. This would mean the guest is an
>> >>> emulated big-endian PPC, while the host is a little-endian x86. I
>> >>> haven't actually tested this situation, so perhaps I am wrong.

Cross-platform virtio works when endianness is known in advance.  For
a hypervisor and a guest:
1. virtio-pci I/O registers use PCI endianness
2. vring uses guest endianness (hypervisor must byteswap)
3. guest memory buffers use guest endianness (hypervisor must byteswap)

I know of no existing way when endianness is not known in advance.
Perhaps a transport bit could be added to mark the endianness of the
guest/driver side.  This can be negotiated because virtio-pci has a
known endianness.  After negotiation, the host knows whether or not
byteswapping is necessary for structures in guest memory.
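
A rough sketch of what such negotiated-endianness accessors could look
like on the host side; the feature bit and these helpers are
hypothetical, not part of any existing virtio transport:

#include <endian.h>
#include <stdbool.h>
#include <stdint.h>

/* set once during feature negotiation over virtio-pci, whose own
 * registers have a known byte order */
static bool guest_is_big_endian;

static uint16_t guest16_to_host(uint16_t v)
{
	return guest_is_big_endian ? be16toh(v) : le16toh(v);
}

static uint32_t guest32_to_host(uint32_t v)
{
	return guest_is_big_endian ? be32toh(v) : le32toh(v);
}

/* vring fields and in-memory structures would then be read and written
 * through helpers like these instead of assuming the host's native order. */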

Stefan

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-20  9:57                                               ` Stefan Hajnoczi
  (?)
@ 2009-08-20 10:08                                               ` Avi Kivity
  -1 siblings, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-20 10:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Hollis Blanchard, Ira W. Snyder, Michael S. Tsirkin,
	Gregory Haskins, kvm, netdev, linux-kernel, alacrityvm-devel,
	Anthony Liguori, Ingo Molnar, Gregory Haskins

On 08/20/2009 12:57 PM, Stefan Hajnoczi wrote:
> Cross-platform virtio works when endianness is known in advance.  For
> a hypervisor and a guest:
> 1. virtio-pci I/O registers use PCI endianness
> 2. vring uses guest endianness (hypervisor must byteswap)
> 3. guest memory buffers use guest endianness (hypervisor must byteswap)
>
> I know of no existing way when endianness is not known in advance.
> Perhaps a transport bit could be added to mark the endianness of the
> guest/driver side.  This can be negotiated because virtio-pci has a
> known endianness.  After negotiation, the host knows whether or not
> byteswapping is necessary for structures in guest memory.
>
>    

Some processors are capable of switching their endianness at runtime, so 
you cannot tell the guest endianness in advance.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19  5:36                   ` Gregory Haskins
  2009-08-19  5:48                     ` Avi Kivity
  2009-08-19 14:33                     ` Michael S. Tsirkin
@ 2009-08-20 12:12                     ` Michael S. Tsirkin
  2 siblings, 0 replies; 132+ messages in thread
From: Michael S. Tsirkin @ 2009-08-20 12:12 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Anthony Liguori, Ingo Molnar, Gregory Haskins, kvm,
	alacrityvm-devel, linux-kernel, netdev

On Wed, Aug 19, 2009 at 01:36:14AM -0400, Gregory Haskins wrote:
> >> So where is the problem here?
> > 
> > If virtio net in guest could be improved instead, everyone would
> > benefit.
> 
> So if I whip up a virtio-net backend for vbus with a PCI compliant
> connector, you are happy?

I'm currently worried about the venet versus virtio-net guest situation;
if you drop it and switch to virtio-net instead, that issue's resolved.

I don't have an opinion on vbus versus pci, and I only speak for myself.

-- 
MST

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-08-19 20:37                           ` Avi Kivity
  2009-08-19 20:53                             ` Ingo Molnar
@ 2009-08-20 17:25                             ` Muli Ben-Yehuda
  2009-08-20 20:58                               ` Caitlin Bestler
  2 siblings, 0 replies; 132+ messages in thread
From: Muli Ben-Yehuda @ 2009-08-20 17:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Patrick Mullaney

On Wed, Aug 19, 2009 at 11:37:16PM +0300, Avi Kivity wrote:

> On 08/19/2009 09:26 PM, Gregory Haskins wrote:

>>>> This is for things like the setup of queue-pairs, and the
>>>> transport of door-bells, and ib-verbs.  I am not on the team
>>>> doing that work, so I am not an expert in this area.  What I do
>>>> know is having a flexible and low-latency signal-path was deemed
>>>> a key requirement.
>>>>
>>>>        
>>> That's not a full bypass, then.  AFAIK kernel bypass has userspace
>>> talking directly to the device.
>>>      
>> Like I said, I am not an expert on the details here.  I only work
>> on the vbus plumbing.  FWIW, the work is derivative from the
>> "Xen-IB" project
>>
>> http://www.openib.org/archives/nov2006sc/xen-ib-presentation.pdf
>>
>> There were issues with getting Xen-IB to map well into the Xen
>> model.  Vbus was specifically designed to address some of those
>> short-comings.
>
> Well I'm not an Infiniband expert.  But from what I understand VMM
> bypass means avoiding the call to the VMM entirely by exposing
> hardware registers directly to the guest.

The original IB VMM bypass work predates SR-IOV (i.e., does not assume
that the adapter has multiple hardware register windows for multiple
devices). The way it worked was to split all device operations into
`privileged' and `non-privileged'. Privileged operations such as
mapping and pinning memory went through the hypervisor. Non-privileged
operations such reading or writing previously mapped memory went
directly to the adpater. Now-days with SR-IOV devices, VMM bypass
usually means bypassing the hypervisor completely.

Cheers,
Muli
-- 
Muli Ben-Yehuda | muli@il.ibm.com | +972-4-8281080
Manager, Virtualization and Systems Architecture
Master Inventor, IBM Haifa Research Laboratory

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver  objects
  2009-08-19 20:37                           ` Avi Kivity
@ 2009-08-20 20:58                               ` Caitlin Bestler
  2009-08-20 17:25                             ` Muli Ben-Yehuda
  2009-08-20 20:58                               ` Caitlin Bestler
  2 siblings, 0 replies; 132+ messages in thread
From: Caitlin Bestler @ 2009-08-20 20:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Ingo Molnar, kvm, alacrityvm-devel,
	linux-kernel, netdev, Michael S. Tsirkin, Patrick Mullaney

On Wed, Aug 19, 2009 at 1:37 PM, Avi Kivity<avi@redhat.com> wrote:

>
> Well I'm not an Infiniband expert.  But from what I understand VMM bypass
> means avoiding the call to the VMM entirely by exposing hardware registers
> directly to the guest.
>

It enables clients to talk directly to the hardware. Whether or not
that involves registers would be model specific. But frequently the
queues being written were in the client's memory, and only a "doorbell
ring" involved actual device resources.

But whatever the mechanism, it enables the client to provide buffer addresses
directly to the hardware in a manner that cannot damage another client. The two
key requirements are a) client cannot enable access to pages that it does
not already have access to, and b) client can delegate that authority to the
Adapter without needing to invoke OS or Hypervisor on a per message
basis.


Traditionally that meant that memory maps ("Regions") were created on the
privileged path to enable fast/non-privileged references by the client.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-20  6:09                                 ` configfs/sysfs Avi Kivity
  (?)
@ 2009-08-20 22:48                                 ` Joel Becker
  2009-08-21  4:14                                     ` configfs/sysfs Avi Kivity
  2009-08-21  4:14                                   ` configfs/sysfs Avi Kivity
  -1 siblings, 2 replies; 132+ messages in thread
From: Joel Becker @ 2009-08-20 22:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder

On Thu, Aug 20, 2009 at 09:09:21AM +0300, Avi Kivity wrote:
> On 08/20/2009 01:16 AM, Joel Becker wrote:
> >	With an ioctl() that isn't (well) documented, you have to go
> >read the structure and probably even read the code that uses the
> >structure to be sure what you are doing.
> 
> An ioctl structure and a configfs/sysfs readdir provide similar
> information (the structure also provides the types of fields and
> isn't able to hide some of these fields).

	With an ioctl structure, I can't take a look at what the values
look like unless I read the code or write up a C program.  With a
configfs file, I can just cat the thing.
 
> "Looking at the values" is what I meant by discouraging
> documentation.  That implies looking at a self-documenting live
> system.  But that tells you nothing about which fields were added in
> which versions, or fields which are hidden because your hardware
> doesn't support them or because you didn't echo 1 > somewhere.

	Most ioctls don't tell you that either.  It certainly won't let
you know that field foo_arg1 is ignored unless foo_arg2 is set to 2, or
things like that.
	The problem of versioning requires discipline either way.  It's
not obvious from many ioctls.  Conversely, you can create versioned
configfs items via attributes or directories (same for sysfs, etc).

> The maintainer of the subsystem should provide a library that talks
> to the binary interface and a CLI program that talks to the library.
> Boring nonkernely work.  Alternatively a fuse filesystem to talk to
> the library, or an IDL can replace the library.

	Again, that helps the user nothing.  I don't know it exists.  I
don't have it installed.  Unless it ships with the kernel, I have no
idea about it.

> Many things start oriented at people and then, if they're useful,
> cross the lines to machines.  You can convert a machine interface to
> a human interface at the cost of some work, but it's difficult to
> undo the deficiencies of a human oriented interface so it can be
> used by a program.

	It's work to convert either way.  Outside of fast-path things,
the time it takes to strtoll() is unimportant.  Don't use configfs/sysfs
for fast-path things.

> I disagree.  If it's useful for a human, it's useful for a machine.

	And if it's useful for a machine, a human might want to peek at
it by hand someday to debug it.

> Moreover, *fs+bash is a user interface.  It happens that bash is
> good at processing files, and filesystems are easily discoverable,
> so we code to that.  But we make it more difficult to provide other
> interfaces to the same controls.

	Not really.  Writing a sane CLI to a binary interface takes
about as much work as writing a sane API library to a text interface.
The hard part is not the conversion, in either direction.  The hard part
is defining the interface.

> >Configfs, as its name implies,
> >really does exist for that second case.  It turns out that it's quite
> >nice to use for the first case too, but if folks wanted to go the
> >syscall route, no worries.
> 
> Eventually everything is used in the first case.  For example in the
> virtualization space it is common to have a zillion nodes running
> virtual machines that are only accessed by a management node.

	Everything is eventually used in the second case: an admin or a
developer debugging why the daemon is going wrong.  Much easier from a
shell or other generic accessor.  Much faster than having to download
your library's source, learn how to build it, add some printfs, discover
you have the wrong printfs...

> __u64 says everything about the type and space requirements of a
> field.  It doesn't describe everything (like the name of the field
> or what it means) but it does provide a bunch of boring information
> that people rarely document in other ways.
> 
> If my program reads a *fs field into a u32 and it later turns out
> the field was a u64, I'll get an overflow.  It's a lot harder to get
> that wrong with a typed interface.

	And if you send the wrong thing to configfs or sysfs you'll get
an EINVAL or the like.
	It doesn't look like configfs and sysfs will work for you.
Don't use 'em!  Write your interfaces with ioctls and syscalls.  Write
your libraries and CLIs.  In the end, you're the one who has to maintain
them.  I don't ever want anyone thinking I want to force configfs on
them.  I wrote it because it solves its class of problem well, and many
people find it fits them too.  So I'll use configfs, you'll use ioctl,
and our users will be happy either way because we make it work!

Joel

-- 

Life's Little Instruction Book #396

	"Never give anyone a fruitcake."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: configfs/sysfs
  2009-08-20 22:48                                 ` configfs/sysfs Joel Becker
@ 2009-08-21  4:14                                     ` Avi Kivity
  2009-08-21  4:14                                   ` configfs/sysfs Avi Kivity
  1 sibling, 0 replies; 132+ messages in thread
From: Avi Kivity @ 2009-08-21  4:14 UTC (permalink / raw)
  To: Nicholas A. Bellinger, Ingo Molnar, Anthony Liguori, kvm,
	alacrityvm-devel, linux-kernel, netdev, Michael S. Tsirkin,
	Ira W. Snyder

On 08/21/2009 01:48 AM, Joel Becker wrote:
> On Thu, Aug 20, 2009 at 09:09:21AM +0300, Avi Kivity wrote:
>    
>> On 08/20/2009 01:16 AM, Joel Becker wrote:
>>      
>>> 	With an ioctl() that isn't (well) documented, you have to go
>>> read the structure and probably even read the code that uses the
>>> structure to be sure what you are doing.
>>>        
>> An ioctl structure and a configfs/sysfs readdir provide similar
>> information (the structure also provides the types of fields and
>> isn't able to hide some of these fields).
>>      
> 	With an ioctl structure, I can't take a look at what the values
> look like unless I read the code or write up a C program.  With a
> configfs file, I can just cat the thing.
>    

Unless it's system dependent like many sysfs files.  If you're coding 
something that's supposed to run on several boxes, coding by example is 
not a good idea.  Look up the documentation to find out what the values 
look like (unfortunately often there is no documentation).

Looking at the value on your box does not indicate the range of values 
on other boxes or even if the value will be present on other boxes (due 
to having older kernels or different configurations).

>
>    
>> "Looking at the values" is what I meant by discouraging
>> documentation.  That implies looking at a self-documenting live
>> system.  But that tells you nothing about which fields were added in
>> which versions, or fields which are hidden because your hardware
>> doesn't support them or because you didn't echo 1>  somewhere.
>>      
> 	Most ioctls don't tell you that either.  It certainly won't let
> you know that field foo_arg1 is ignored unless foo_arg2 is set to 2, or
> things like that.
>    

Correct.  What I mean is that discoverability is great for a sysadmin or 
kernel developers exploring the system, but pretty useless for a 
programmer writing code that will run on other systems.  The majority of 
lkml users will find *fs easy to use and useful, but that's not the 
majority of our users.

> 	The problem of versioning requires discipline either way.  It's
> not obvious from many ioctls.  Conversely, you can create versioned
> configfs items via attributes or directories (same for sysfs, etc).
>    

Sure.

>> The maintainer of the subsystem should provide a library that talks
>> to the binary interface and a CLI program that talks to the library.
>> Boring nonkernely work.  Alternatively a fuse filesystem to talk to
>> the library, or an IDL can replace the library.
>>      
> 	Again, that helps the user nothing.  I don't know it exists.  I
> don't have it installed.  Unless it ships with the kernel, I have no
> idea about it.
>    

That's true for the lkml reader downloading a kernel from kernel.org 
(use git already) and running it on a random system.  But again the majority 
of users will run a distro which is supposed to integrate the kernel and 
userspace.  The short term gratification of early adopters harms the 
integration that more mainstream users expect.

>> Many things start oriented at people and then, if they're useful,
>> cross the lines to machines.  You can convert a machine interface to
>> a human interface at the cost of some work, but it's difficult to
>> undo the deficiencies of a human oriented interface so it can be
>> used by a program.
>>      
> 	It's work to convert either way.  Outside of fast-path things,
> the time it takes to strtoll() is unimportant.  Don't use configfs/sysfs
> for fast-path things.
>    

Infrastructure must be careful not to code itself into a corner.  
Already udev takes quite a bit of time to run and I have some memories 
of problems on thousand-disk configurations.  What works reasonably well 
with one disk may not work as well with 1000.

No doubt some of the problem is with udev, but I'm sure sysfs 
contributes.  As a software development exercise reading a table of 1000 
objects each with a couple dozen attributes should take less than a 
millisecond.

>> I disagree.  If it's useful for a human, it's useful for a machine.
>>      
> 	And if it's useful for a machine, a human might want to peek at
> it by hand someday to debug it.
>    

We have strace and wireshark to decode binary syscall and wire streams.

>> Moreover, *fs+bash is a user interface.  It happens that bash is
>> good at processing files, and filesystems are easily discoverable,
>> so we code to that.  But we make it more difficult to provide other
>> interfaces to the same controls.
>>      
> 	Not really.  Writing a sane CLI to a binary interface takes
> about as much work as writing a sane API library to a text interface.
> The hard part is not the conversion, in either direction.  The hard part
> is defining the interface.
>    

A *fs interface limits what you can do, so it makes writing the API 
library harder.  I'm talking about the issues with atomicity and 
notifications.

>>> Configfs, as its name implies,
>>> really does exist for that second case.  It turns out that it's quite
>>> nice to use for the first case too, but if folks wanted to go the
>>> syscall route, no worries.
>>>        
>> Eventually everything is used in the first case.  For example in the
>> virtualization space it is common to have a zillion nodes running
>> virtual machines that are only accessed by a management node.
>>      
> 	Everything is eventually used in the second case: an admin or a
> developer debugging why the daemon is going wrong.  Much easier from a
> shell or other generic accessor.  Much faster than having to download
> your library's source, learn how to build it, add some printfs, discover
> you have the wrong printfs...
>    

As a kernel/user interface, any syscall replacement for *fs is exposed 
via strace.  It's true that debugging C code is harder than a bit of bash.

>> __u64 says everything about the type and space requirements of a
>> field.  It doesn't describe everything (like the name of the field
>> or what it means) but it does provide a bunch of boring information
>> that people rarely document in other ways.
>>
>> If my program reads a *fs field into a u32 and it later turns out
>> the field was a u64, I'll get an overflow.  It's a lot harder to get
>> that wrong with a typed interface.
>>      
> 	And if you send the wrong thing to configfs or sysfs you'll get
> an EINVAL or the like.
> 	It doesn't look like configfs and sysfs will work for you.
> Don't use 'em!  Write your interfaces with ioctls and syscalls.  Write
> your libraries and CLIs.  In the end, you're the one who has to maintain
> them.  I don't ever want anyone thinking I want to force configfs on
> them.  I wrote it because it solves its class of problem well, and many
> people find it fits them too.  So I'll use configfs, you'll use ioctl,
> and our users will be happy either way because we make it work!
>    

No, I have to use *fs (at least sysfs) since that's the current blessed 
interface.  Fragmenting the kernel/userspace interface is the wrong thing to do; I 
value a consistent interface more than fixing the *fs problems (which 
are all fixable or tolerable).

This is not a call to deprecate *fs and switch over to a yet another new 
thing.  Users (and programmers) need some ABI stability.  It just arose 
because I remarked that I'm not in love with *fs interfaces in an 
unrelated flamewar and someone asked me why.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* vbus design points: shm and shm-signals
  2009-08-19  4:27                   ` Gregory Haskins
  2009-08-19  5:22                     ` Avi Kivity
@ 2009-08-21 10:55                     ` Gregory Haskins
  2009-08-24 19:02                       ` Anthony Liguori
  1 sibling, 1 reply; 132+ messages in thread
From: Gregory Haskins @ 2009-08-21 10:55 UTC (permalink / raw)
  To: alacrityvm-devel; +Cc: Ingo Molnar, Gregory Haskins, kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4744 bytes --]

Gregory Haskins wrote:
> Ingo Molnar wrote:
>>
>> We all love faster code and better management interfaces and tons 
>> of your prior patches got accepted by Avi. This time you didnt even 
>> _try_ to improve virtio.
> 
> Im sorry, but you are mistaken:
> 
> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html
> 

BTW: One point that I forgot to point out in this most recent thread
that I am particularly proud of here is the design of the vbus
shared-memory model.  Despite some claims to the contrary, not only is
it possible to improve virtio with vbus (as evidenced by the patch
referenced above)...I specifically designed vbus with virtio
considerations in mind from the start!  In fact, the design is conducive
to accelerating a variety of other models as well.  Read on for details.

Vbus was designed to be _agnostic_ to the shm algorithm in general.
This allows you to, of course, run ring algorithms (such as virtqueues,
or IOQs), but really any other designs as well, such as shared-tables, etc.

A guest driver sees the following interface:

struct vbus_device_proxy_ops {
	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
	int (*close)(struct vbus_device_proxy *dev, int flags);
	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
		   void *ptr, size_t len,
		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
		   int flags);
	int (*call)(struct vbus_device_proxy *dev, u32 func,
		    void *data, size_t len, int flags);
	void (*release)(struct vbus_device_proxy *dev);
};

note the ops->shm() method.  This allows the driver to register some
arbitrary pointer (ptr, len) with the host, optionally embedding a
shm_signal_desc object in the memory.  If "sigdesc" is non-null, the
connector will allocate and return a fully formed shm_signal object in
**signal.
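
As a purely illustrative example of how a driver might use that method
(the ring layout, my_probe(), and the header locations are made up; only
the shm() prototype above is real):

#include <linux/types.h>
#include <linux/slab.h>
#include <linux/vbus_driver.h>   /* assumed location of the proxy definitions */
#include <linux/shm_signal.h>    /* assumed location of shm_signal_desc */

struct my_ring {
	struct shm_signal_desc sigdesc;  /* embedded signal descriptor */
	u32 head;
	u32 tail;
	u8  data[4096];
};

static int my_probe(struct vbus_device_proxy *dev)
{
	struct my_ring *ring = kzalloc(sizeof(*ring), GFP_KERNEL);
	struct shm_signal *signal;
	int ret;

	if (!ring)
		return -ENOMEM;

	/* register the region with the host; because sigdesc is non-NULL
	 * the connector also hands back a fully formed shm_signal */
	ret = dev->ops->shm(dev, 0 /* id */, 0 /* prio */,
			    ring, sizeof(*ring),
			    &ring->sigdesc, &signal, 0);
	if (ret < 0)
		kfree(ring);

	return ret;
}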

shm-signal (posted in patch 1/6, I believe) is a generalization of the
basic mechanism you need for emitting and catching events via any
shared-memory based design.  It includes interrupt/hypercall mitigation
support which is independent of the actual shm algorithm (e.g. ring,
table, etc).  This means that we can get the event mitigation code (e.g.
disable unnecessary interrupts/hypercalls, prevent spurious re-calls,
etc) in one place (and, ideally, correct), and things like the ring
algorithm (or table design, etc) can focus on their value-add, instead
of re-inventing the mitigation code each time.
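
The general shape of that mitigation, independent of whatever ring or
table sits on top, is the familiar "only kick when enabled and not
already pending" pattern; a rough sketch of the idea (this is NOT the
actual shm_signal layout from patch 1/6):

#include <stdatomic.h>
#include <stdbool.h>

/* one small state block shared between producer and consumer */
struct sig_sketch {
	atomic_uint enabled;   /* consumer currently wants notifications */
	atomic_uint pending;   /* a notification is already outstanding */
};

static bool should_kick(struct sig_sketch *s)
{
	if (!atomic_load(&s->enabled))
		return false;                       /* consumer is polling */
	/* only the producer that flips pending 0->1 issues the real
	 * interrupt/hypercall; the rest are coalesced into it */
	return atomic_exchange(&s->pending, 1) == 0;
}

/* the consumer clears pending before processing and re-checks the shared
 * memory afterwards so that late producers are not lost */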

You can then build your higher-layer algorithms (rings, tables, etc) in
terms of the shm-signal library.  As a matter of fact, if you look at
the patch referenced above, it implements the virtqueue->kick() method
on top of shm-signal.  IOQs follow a similar pattern.  And the RT
enhancements will, for instance, be using a table design for the
scheduler/interrupt control.

In short, vbus is the result of my experience in dealing with issues
like IO in virt.  I thought long and hard about where the issues were
for high-performance, low-latency, software-to-software IO for a
wide-variety of applications and environments.  I then tried to
systematically solve those problems at various key points in the stack
to promote maximum flexibility and reuse of those enhancements.  So we
see things like this generalization of async event mitigation with the
shm/shm-signal design, low-latency "hypercalls", reusable backend models
(which support a variety of virt environments as well as physical system use), etc.

Part of this flexibility means that we do not want to rely on something
like PCI because it's not necessarily available/logical in all
platforms/environments.  So rather than force such environments to fit
into a PCI ABI, we start over again and offer an ABI specific to the
actual goals (high-performance, low-latency, software-to-software IO).

Yes, that means we will possibly take some lumps in the short-term
before these concepts and support for them are ubiquitous (like PCI, USB
are today).  But, like all things, you have to start somewhere.  The PCI
guys didn't try to make PCI look like ISA, and the USB guys didn't try
to make USB look like PCI.  It took a little while for support for such
notions to catch on, but eventually all relevant platforms supported
them.  I am going for the same thing here.  Sometimes, the existing
model just doesn't fit well and you have to re-evaluate.

I hope that this helps clarify some of the design of vbus and why I
believe it to be worth considering.  I will plan on taking this email as
the first entry on the wiki as a "design series" (or something like
that) and update it regularly with other aspects of the design.

Kind Regards,
-Greg






[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: vbus design points: shm and shm-signals
  2009-08-21 10:55                     ` vbus design points: shm and shm-signals Gregory Haskins
@ 2009-08-24 19:02                       ` Anthony Liguori
  2009-08-24 20:00                         ` Gregory Haskins
  0 siblings, 1 reply; 132+ messages in thread
From: Anthony Liguori @ 2009-08-24 19:02 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: alacrityvm-devel, Ingo Molnar, Gregory Haskins, kvm, linux-kernel

Gregory Haskins wrote:
> Gregory Haskins wrote:
>   
>> Ingo Molnar wrote:
>>     
>>> We all love faster code and better management interfaces and tons 
>>> of your prior patches got accepted by Avi. This time you didnt even 
>>> _try_ to improve virtio.
>>>       
>> Im sorry, but you are mistaken:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html
>>
>>     
>
> BTW: One point that I forgot to point out in this most recent thread
> that I am particularly proud of here is the design of the vbus
> shared-memory model.  Despite some claims to the contrary, not only is
> it possible to improve virtio with vbus (as evident by the patch
> referenced above)...I specifically designed vbus with virtio
> considerations in mind from the start!  In fact, the design is conducive
> to accelerating a variety of other models as well.  Read on for details.
>
> Vbus was designed to be _agnostic_ to the shm algorithm in general.
> This allows you to, of course, run ring algorithms (such as virtqueues,
> or IOQs), but really any other designs as well, such as shared-tables, etc.
>
> A guest driver sees the following interface:
>
> struct vbus_device_proxy_ops {
> 	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
> 	int (*close)(struct vbus_device_proxy *dev, int flags);
> 	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
> 		   void *ptr, size_t len,
> 		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
> 		   int flags);
> 	int (*call)(struct vbus_device_proxy *dev, u32 func,
> 		    void *data, size_t len, int flags);
> 	void (*release)(struct vbus_device_proxy *dev);
> };
>
> note the ops->shm() method.  This allows the driver to register some
> arbitrary pointer (ptr, len) with the host, optionally embedding a
> shm_signal_desc object in the memory.  If "sigdesc" is non-null, the
> connector will allocate and return a fully formed shm_signal object in
> **signal.
>   

Fundamentally, how is this different than the virtio->add_buf concept?

virtio provides a mechanism to register scatter/gather lists, associate 
a handle with them, and provides a mechanism for retrieving notification 
that the buffer has been processed.

vbus provides a mechanism to register a single buffer with an integer 
handle, priority, and a signaling mechanism.

So virtio provides builtin support for scatter/gathers whereas vbus 
models priority.  But fundamentally, they seem like almost identical 
concepts.

If we added priority to virtio->add_buf, would it be equivalent in your 
mind functionally speaking?

What does one do with priority, btw?

Is there something I'm overlooking?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: vbus design points: shm and shm-signals
  2009-08-24 19:02                       ` Anthony Liguori
@ 2009-08-24 20:00                         ` Gregory Haskins
  2009-08-24 21:28                           ` Gregory Haskins
  2009-08-24 23:57                           ` Anthony Liguori
  0 siblings, 2 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-24 20:00 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: alacrityvm-devel, Ingo Molnar, Gregory Haskins, kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6811 bytes --]

Hi Anthony,

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Gregory Haskins wrote:
>>  
>>> Ingo Molnar wrote:
>>>    
>>>> We all love faster code and better management interfaces and tons of
>>>> your prior patches got accepted by Avi. This time you didnt even
>>>> _try_ to improve virtio.
>>>>       
>>> I'm sorry, but you are mistaken:
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/0904.2/02443.html
>>>
>>>     
>>
>> BTW: One point that I forgot to point out in this most recent thread
>> that I am particularly proud of here is the design of the vbus
>> shared-memory model.  Despite some claims to the contrary, not only is
>> it possible to improve virtio with vbus (as evident by the patch
>> referenced above)...I specifically designed vbus with virtio
>> considerations in mind from the start!  In fact, the design is conducive
>> to accelerating a variety of other models as well.  Read on for details.
>>
>> Vbus was designed to be _agnostic_ to the shm algorithm in general.
>> This allows you to, of course, run ring algorithms (such as virtqueues,
>> or IOQs), but really any other designs as well, such as shared-tables,
>> etc.
>>
>> A guest driver sees the following interface:
>>
>> struct vbus_device_proxy_ops {
>>     int (*open)(struct vbus_device_proxy *dev, int version, int flags);
>>     int (*close)(struct vbus_device_proxy *dev, int flags);
>>     int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
>>            void *ptr, size_t len,
>>            struct shm_signal_desc *sigdesc, struct shm_signal **signal,
>>            int flags);
>>     int (*call)(struct vbus_device_proxy *dev, u32 func,
>>             void *data, size_t len, int flags);
>>     void (*release)(struct vbus_device_proxy *dev);
>> };
>>
>> note the ops->shm() method.  This allows the driver to register some
>> arbitrary pointer (ptr, len) with the host, optionally embedding a
>> shm_signal_desc object in the memory.  If "sigdesc" is non-null, the
>> connector will allocate and return a fully formed shm_signal object in
>> **signal.
>>   
> 
> Fundamentally, how is this different than the virtio->add_buf concept?

From my POV, they are at different levels.  Calling vbus->shm() is for
establishing a shared-memory region including routing the memory and
signal-path contexts.  You do this once at device init time, and then
run some algorithm on top (such as a virtqueue design).

virtio->add_buf() OTOH, is a run-time function.  You do this to modify
the shared-memory region that is already established at init time by
something like vbus->shm().  You would do this to queue a network
packet, for instance.

That said, shm-signal's closest analogy to virtio would be vq->kick(),
vq->callback(), vq->enable_cb(), and vq->disable_cb().  The difference
is that the notification mechanism isn't associated with a particular
type of shared-memory construct (such as a virt-queue), but instead can
be used with any shared-mem algorithm (at least, if I designed it properly).

The closest analogy for vbus->shm() to virtio would be
vdev->config->find_vqs().  Again, the difference is that the algorithm
(ring, etc) is not dictated by the call.  You then overlay something
like virtqueue on top.
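
To illustrate the split with a rough sketch (the ring layout, the ->ops
member name, and queue id 0 are assumptions for the example; only the
shm() signature comes from the interface quoted above):

struct myring {
        struct shm_signal_desc sigdesc; /* embedded signal descriptor */
        u32 head;
        u32 tail;
        u32 slot[256];
};

/* init time, done once: share the region and its signal path */
static int mydrv_setup_ring(struct vbus_device_proxy *dev,
                            struct myring *ring,
                            struct shm_signal **signal)
{
        return dev->ops->shm(dev, 0 /* id */, 0 /* prio */,
                             ring, sizeof(*ring),
                             &ring->sigdesc, signal, 0);
}

/*
 * run time: whatever algorithm sits on top (virtqueue, IOQ, table)
 * just reads/writes the already-shared memory -- that is the layer
 * where something like virtio->add_buf() lives.
 */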

> 
> virtio provides a mechanism to register scatter/gather lists, associate
> a handle with them, and provides a mechanism for retrieving notification
> that the buffer has been processed.

Yes, and I agree this is very useful for many/most algorithms...but not
all.  Sometimes you don't want ring-like semantics, but instead want
something like an idempotent table.  (Think of things like interrupt
controllers, timers, etc).

Rings, of course, have a trait that all updates are retained in fifo
order.  For many things (e.g. network, block io, etc), this is exactly
what you want.  If I say "send packet X" now, and "send packet Y" later,
I want the system to do both (and perhaps in that order), so a ring
scheme works well.

However, sometimes you may want to say "time is now X", and later "time
is now Y".  The update value of 'X' is technically superseded by Y and
is stale.  But a ring may allow both to exist in-flight within the shm
simultaneously if the recipient (guest or host) is lagging, and the X
may be processed even though its data is now irrelevant.  What we really
want is the transform of X->Y to invalidate anything else in flight so
that only Y is visible.

So in a case like this, we may want a different algorithm.  Something
like a table which always contains the current/valid value, and a way to
signal in both directions when something interesting happens to that data.

If you think about it, a ring is a superset of this construct...the ring
meta-data is the "shared-table" (e.g. HEAD ptr, TAIL ptr, COUNT, etc).
So we start by introducing the basic shm concept, and allow the next
layer (virtio/virtqueue) in the stack to refine it for its needs.
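
A hypothetical sketch of such a table (the structure and helper are
invented purely to show the "latest value wins" semantic; a real
version would need seqcount-style publication and barriers, and the
shm_signal_inject() name is again an assumption):

struct time_table {
        struct shm_signal_desc sigdesc;
        u64 seq;        /* bumped on every update */
        u64 now_ns;     /* only the most recent value matters */
};

static void table_publish(struct time_table *tbl, u64 now_ns,
                          struct shm_signal *signal)
{
        /* overwrite in place: a stale 'X' can never be seen after 'Y' */
        tbl->now_ns = now_ns;
        tbl->seq++;

        shm_signal_inject(signal);
}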


> 
> vbus provides a mechanism to register a single buffer with an integer
> handle, priority, and a signaling mechanism.

Again, I think we are talking about two different layers.  You would
never put entries into a virtio-ring of different priority.  This
doesn't make sense, as they would just get linearized by the fifo.

What you *would* do is possibly make multiple virtqueues, each with a
different priority (for instance, say 8-rx queues for virtio-net).
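
For instance (a sketch; RX_QUEUE_BASE, rxring[] and RXRING_LEN are
made-up names), the prio argument of the shm() call shown earlier would
let the driver register one rx ring per priority class:

        for (i = 0; i < 8; i++) {
                ret = dev->ops->shm(dev, RX_QUEUE_BASE + i, /* prio */ i,
                                    rxring[i], RXRING_LEN,
                                    &rxring[i]->sigdesc, &signal[i], 0);
                if (ret < 0)
                        break;
        }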

> 
> So virtio provides builtin support for scatter/gathers whereas vbus
> models priority.  But fundamentally, they seem like almost identical
> concepts.

I would say that virtqueue and IOQ are a much closer analogy in terms of
comparison at the scatter-gather level.  The virtio device model itself
is similar to a vbus device-model, except it's oriented towards the
virtqueue ring design.  In addition, a big part of vbus is also what
happens _behind_ the device model.

> 
> If we added priority to virtio->add_buf, would it be equivalent in your
> mind functionally speaking?

As indicated above, this wouldn't be sane.  A better design (IMO) is to
use a ring per priority.

> 
> What does one do with priority, btw?

There are, of course, many answers to that question.  One particularly
trivial example is 802.1p networking.  So, for instance, you can
classify and prioritize network traffic so that things like
control/timing packets are higher priority than best-effort HTTP.  Doing
this "right" means you have end-to-end priority within the system (e.g.
your switch/fabric, NICs, interrupt controllers, etc).  Today, virt is
fairly far removed from being fully integrated in this sense, but the
vbus project is addressing this shortcoming.

HTH,

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: vbus design points: shm and shm-signals
  2009-08-24 20:00                         ` Gregory Haskins
@ 2009-08-24 21:28                           ` Gregory Haskins
  2009-08-24 23:57                           ` Anthony Liguori
  1 sibling, 0 replies; 132+ messages in thread
From: Gregory Haskins @ 2009-08-24 21:28 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: alacrityvm-devel, Ingo Molnar, kvm, linux-kernel, Gregory Haskins

[-- Attachment #1: Type: text/plain, Size: 3167 bytes --]

Gregory Haskins wrote:
> Anthony Liguori wrote:
>> Fundamentally, how is this different than the virtio->add_buf concept?
> 
> From my POV, they are at different levels.  Calling vbus->shm() is for
> establishing a shared-memory region including routing the memory and
> signal-path contexts.  You do this once at device init time, and then
> run some algorithm on top (such as a virtqueue design).
> 
> virtio->add_buf() OTOH, is a run-time function.  You do this to modify
> the shared-memory region that is already established at init time by
> something like vbus->shm().  You would do this to queue a network
> packet, for instance.
> 
> That said, shm-signal's closest analogy to virtio would be vq->kick(),
> vq->callback(), vq->enable_cb(), and vq->disable_cb().  The difference
> is that the notification mechanism isn't associated with a particular
> type of shared-memory construct (such as a virt-queue), but instead can
> be used with any shared-mem algorithm (at least, if I designed it properly).
> 
> The closest analogy for vbus->shm() to virtio would be
> vdev->config->find_vqs().  Again, the difference is that the algorithm
> (ring, etc) is not dictated by the call.  You then overlay something
> like virtqueue on top.

BTW: Another way to think of this is that virtio->add_buf() is really
"buffer assignment", whereas "vbus->shm()" is "buffer sharing".  The
former is meant to follow an "assign, consume, re-assign, reclaim"
model, where the changing pointer ownership implicitly serializes the
writability of the buffer.  It's used (quite effectively) for things like
passing a network-packet around.

Conversely, the latter case ("buffer sharing") is designed for
concurrent writers.  It's used for things like ring-metadata,
shared-table designs, etc.  Anything that generally is designed for a
longer-term, parallel update model, instead of a consume/reclaim model.
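
A toy layout (field names invented) showing how the two models
typically coexist within one shared region:

struct ring_shm {
        /*
         * buffer-sharing: both sides write these concurrently for the
         * life of the connection (producer moves head, consumer tail).
         */
        u32 head;
        u32 tail;

        /*
         * buffer-assignment: each slot is owned by exactly one side at
         * a time; ownership flips as buffers are consumed/reclaimed.
         */
        struct {
                u64 addr;
                u32 len;
                u32 owner;      /* e.g. guest-owned vs host-owned */
        } desc[256];
};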

Whether we realize it or not, we generally build buffer-assignment
algorithms on top of buffer-sharing algorithms.  Therefore, while virtio
technically has both of these components, it only exposes the former
(buffer-assignment) as a user-extensible ABI (vq->add_buf).  The latter
(buffer-sharing) is inextricably linked to the underlying virtqueue ABI
(vdev->find_vqs) (or, at least it is today).

This is why I keep emphasizing that they are different layers of the
same stack.  From a device point of view, virtio adds a robust ring
model with buffer-assignment capabilities, support for scatter-gather,
etc.  Vbus underneath it provides a robust buffer-sharing design with
considerations for things like end-to-end prioritization, mitigation of
various virt-like inefficiencies (hypercalls, interrupts, EOIs, spurious
re-signals), etc.

The idea is you can then join the two together to do something like
build 8-rx virtqueues for your virtio-net to support prio.  If you take
these things into consideration on the backend design as well, you can
actually tie it in end-to-end to gain performance and capabilities not
previously available in KVM (or possibly any virt platform).

HTH,

Kind Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: vbus design points: shm and shm-signals
  2009-08-24 20:00                         ` Gregory Haskins
  2009-08-24 21:28                           ` Gregory Haskins
@ 2009-08-24 23:57                           ` Anthony Liguori
  2009-08-25  0:10                             ` Anthony Liguori
  1 sibling, 1 reply; 132+ messages in thread
From: Anthony Liguori @ 2009-08-24 23:57 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: alacrityvm-devel, Ingo Molnar, Gregory Haskins, kvm, linux-kernel

Gregory Haskins wrote:
> Hi Anthony,
>
>   
>> Fundamentally, how is this different than the virtio->add_buf concept?
>>     
>
> From my POV, they are at different levels.  Calling vbus->shm() is for
> establishing a shared-memory region including routing the memory and
> signal-path contexts.  You do this once at device init time, and then
> run some algorithm on top (such as a virtqueue design).
>   

virtio explicitly avoids having a single setup-memory-region call 
because it was designed to accommodate things like Xen grant tables,
where you have a fixed number of sharable buffers that need to be set
up and torn down as you use them.

You can certainly use add_buf() to set up a persistent mapping, but it's 
not the common usage.  For KVM, since all memory is accessible by the 
host without special setup, add_buf() never results in an exit (it's 
essentially a nop).

So I think from that perspective, add_buf() is a functional superset of 
vbus->shm().
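
A rough sketch of that persistent-mapping usage (based on the virtqueue
ops of this kernel generation; 'region' and 'region_len' are invented
and the exact call details may differ):

        struct scatterlist sg;

        /* one long-lived, host-writable region; 'region' doubles as
         * the cookie we would get back if it were ever completed */
        sg_init_one(&sg, region, region_len);

        /* out=0, in=1: on KVM this is essentially a nop -- no exit */
        err = vq->vq_ops->add_buf(vq, &sg, 0, 1, region);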

> virtio->add_buf() OTOH, is a run-time function.  You do this to modify
> the shared-memory region that is already established at init time by
> something like vbus->shm().  You would do this to queue a network
> packet, for instance.
>
> That said, shm-signal's closest analogy to virtio would be vq->kick(),
> vq->callback(), vq->enable_cb(), and vq->disable_cb().  The difference
> is that the notification mechanism isn't associated with a particular
> type of shared-memory construct (such as a virt-queue), but instead can
> be used with any shared-mem algorithm (at least, if I designed it properly).
>   

Obviously, virtio allows multiple ring implementations based on how it does 
layering.  The key point is that it doesn't expose that to the consumer 
of the device.

Do you see a compelling reason to have an interface at this layer?

>> virtio provides a mechanism to register scatter/gather lists, associate
>> a handle with them, and provides a mechanism for retrieving notification
>> that the buffer has been processed.
>>     
>
> Yes, and I agree this is very useful for many/most algorithms...but not
> all.  Sometimes you don't want ring-like semantics, but instead want
> something like an idempotent table.  (Think of things like interrupt
> controllers, timers, etc).
>   

We haven't crossed this bridge yet because we haven't implemented one of 
these devices.  One approach would be to use add_buf() to register fixed 
shared memory regions.  Because our rings are fixed-size, this implies 
a fixed number of shared memory mappings.

You could also extend virtio to provide a mechanism to register 
unlimited numbers of shared memory regions.  The problem with this is 
that it doesn't work well for hypervisors with fixed shared-memory 
regions (like Xen).

> However, sometimes you may want to say "time is now X", and later "time
> is now Y".  The update value of 'X' is technically superseded by Y and
> is stale.  But a ring may allow both to exist in-flight within the shm
> simultaneously if the recipient (guest or host) is lagging, and the X
> may be processed even though its data is now irrelevant.  What we really
> want is the transform of X->Y to invalidate anything else in flight so
> that only Y is visible.
>   

We actually do this today but we just don't use virtio.  I'm not sure we 
need a single bus that can serve both of these purposes.  What does this 
abstraction buy us?

> If you think about it, a ring is a superset of this construct...the ring
> meta-data is the "shared-table" (e.g. HEAD ptr, TAIL ptr, COUNT, etc).
> So we start by introducing the basic shm concept, and allow the next
> layer (virtio/virtqueue) in the stack to refine it for its needs.
>   

I think there's a trade off between practicality and theoretical 
abstractions.  Surely, a system can be constructed simply with 
notification and shared memory primitives.   This is what Xen does via 
event channels and grant tables.  In practice, this ends up being 
cumbersome and results in complex drivers.  Compare netfront to 
virtio-net, for instance.

We choose to abstract at the ring level precisely because it simplifies 
driver implementations.  I think we've been very successful here.

virtio does not accommodate devices that don't fit into a ring model 
very well today.  There's certainly room to discuss how to do this.  If 
there is to be a layer below virtio's ring semantics, I don't think that 
vbus is this because it mandates much higher levels of the stack 
(namely, device enumeration).

IOW, I can envision a model that looked like PCI -> virtio-pci -> 
virtio-shm -> virtio-ring -> virtio-net

Whereas generic-shm-mechanism provided a non-ring interface for non-ring 
devices.  That doesn't preclude non virtio-pci transports, it just 
suggests how we would do the layering.

So maybe there's a future for vbus as virtio-shm?  How attached are you 
to your device discovery infrastructure?

If you introduced a virtio-shm layer to the virtio API that looked a bit 
like vbus' device API, and then decoupled the device discovery bits into 
a virtio-vbus transport, I think you'd end up with something that was 
quite agreeable.
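
Purely as a strawman (none of this exists in virtio today; it simply
transplants the vbus shm() idea into virtio naming), such a layer might
expose something like:

struct virtio_shm_ops {
        /* register a long-lived shared region plus its signal path */
        int (*shm)(struct virtio_device *vdev, int id,
                   void *ptr, size_t len,
                   struct shm_signal_desc *sigdesc,
                   struct shm_signal **signal);

        /* tear the region down again */
        void (*shm_release)(struct virtio_device *vdev, int id);
};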

As a transport, PCI has significant limitations.  The biggest is the 
maximum number of devices we can support.  Its biggest advantage, though, 
is portability, so it's something I think we would always want to 
support.  However, having a virtio transport optimized for Linux 
guests is something I would certainly support.

>> vbus provides a mechanism to register a single buffer with an integer
>> handle, priority, and a signaling mechanism.
>>     
>
> Again, I think we are talking about two different layers.  You would
> never put entries into a virtio-ring of different priority.  This
> doesn't make sense, as they would just get linearized by the fifo.
>
> What you *would* do is possibly make multiple virtqueues, each with a
> different priority (for instance, say 8-rx queues for virtio-net).
>   

I think priority is an overloaded concept.  I'm not sure it belongs in a 
generic memory sharing API.

>> What does one do with priority, btw?
>>     
>
> There are, of course, many answers to that question.  One particularly
> trivial example is 802.1p networking.  So, for instance, you can
> classify and prioritize network traffic so that things like
> control/timing packets are higher priority than best-effort HTTP.

Wouldn't you do this at a config-space level though?  I agree you would 
want to have multiple rings with individual priority, but I think 
priority is a ring configuration just as programmable triplet filtering 
would be a per-ring configuration.  I also think how priority gets 
interpreted really depends on the device, so it belongs in the device's 
ABI instead of the shared memory or ring ABI.

> HTH,
>   

It does, thanks.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: vbus design points: shm and shm-signals
  2009-08-24 23:57                           ` Anthony Liguori
@ 2009-08-25  0:10                             ` Anthony Liguori
  0 siblings, 0 replies; 132+ messages in thread
From: Anthony Liguori @ 2009-08-25  0:10 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: alacrityvm-devel, Ingo Molnar, Gregory Haskins, kvm, linux-kernel

Anthony Liguori wrote:
> IOW, I can envision a model that looked like PCI -> virtio-pci -> 
> virtio-shm -> virtio-ring -> virtio-net

Let me stress that what's important here is that devices target either 
virtio-ring or virtio-shm.  If we had another transport, those drivers 
would be agnostic toward it.  We really want to preserve the ability to 
use all devices over a PCI transport.  That's a critical requirement for us.

The problem with vbus as it stands today is that it presents vbus -> 
virtio-ring -> virtio-net and allows drivers to target either 
virtio-ring or vbus directly.  If a driver targets vbus directly, then 
the driver is no longer transport agnostic and we could not support that 
driver over PCI.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 132+ messages in thread

end of thread, other threads:[~2009-08-25  0:10 UTC | newest]

Thread overview: 132+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-14 15:42 [PATCH v3 0/6] AlacrityVM guest drivers Gregory Haskins
2009-08-14 15:42 ` [PATCH v3 1/6] shm-signal: shared-memory signals Gregory Haskins
2009-08-14 15:43 ` [PATCH v3 2/6] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
2009-08-14 15:43 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
2009-08-15 10:32   ` Ingo Molnar
2009-08-15 19:15     ` Anthony Liguori
2009-08-16  7:16       ` Ingo Molnar
2009-08-17 13:54         ` Anthony Liguori
2009-08-17 14:23           ` Ingo Molnar
2009-08-17 14:14       ` Gregory Haskins
2009-08-17 14:58         ` Avi Kivity
2009-08-17 15:05           ` Ingo Molnar
2009-08-17 17:41         ` Michael S. Tsirkin
2009-08-17 20:17           ` Gregory Haskins
2009-08-18  8:46             ` Michael S. Tsirkin
2009-08-18 15:19               ` Gregory Haskins
2009-08-18 16:25                 ` Michael S. Tsirkin
2009-08-18 15:53               ` [Alacrityvm-devel] " Ira W. Snyder
2009-08-18 16:51                 ` Avi Kivity
2009-08-18 17:27                   ` Ira W. Snyder
2009-08-18 17:47                     ` Avi Kivity
2009-08-18 18:27                       ` Ira W. Snyder
2009-08-18 18:52                         ` Avi Kivity
2009-08-18 20:59                           ` Ira W. Snyder
2009-08-18 21:26                             ` Avi Kivity
2009-08-18 22:06                               ` Avi Kivity
2009-08-19  0:44                                 ` Ira W. Snyder
2009-08-19  5:26                                   ` Avi Kivity
2009-08-19  0:38                               ` Ira W. Snyder
2009-08-19  5:40                                 ` Avi Kivity
2009-08-19 15:28                                   ` Ira W. Snyder
2009-08-19 15:37                                     ` Avi Kivity
2009-08-19 16:29                                       ` Ira W. Snyder
2009-08-19 16:38                                         ` Avi Kivity
2009-08-19 21:05                                           ` Hollis Blanchard
2009-08-20  9:57                                             ` Stefan Hajnoczi
2009-08-20  9:57                                               ` Stefan Hajnoczi
2009-08-20 10:08                                               ` Avi Kivity
2009-08-18 20:35                         ` Michael S. Tsirkin
2009-08-18 21:04                           ` Arnd Bergmann
2009-08-18 20:39                     ` Michael S. Tsirkin
2009-08-18 20:57                 ` Michael S. Tsirkin
2009-08-18 23:24                   ` Ira W. Snyder
2009-08-18  1:08         ` Anthony Liguori
2009-08-18  7:38           ` Avi Kivity
2009-08-18  8:54           ` Michael S. Tsirkin
2009-08-18 13:16           ` Gregory Haskins
2009-08-18 13:45             ` Avi Kivity
2009-08-18 15:51               ` Gregory Haskins
2009-08-18 16:14                 ` Ingo Molnar
2009-08-19  4:27                   ` Gregory Haskins
2009-08-19  5:22                     ` Avi Kivity
2009-08-19 13:27                       ` Gregory Haskins
2009-08-19 14:35                         ` Avi Kivity
2009-08-21 10:55                     ` vbus design points: shm and shm-signals Gregory Haskins
2009-08-24 19:02                       ` Anthony Liguori
2009-08-24 20:00                         ` Gregory Haskins
2009-08-24 21:28                           ` Gregory Haskins
2009-08-24 23:57                           ` Anthony Liguori
2009-08-25  0:10                             ` Anthony Liguori
2009-08-18 16:47                 ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Avi Kivity
2009-08-18 16:51                 ` Michael S. Tsirkin
2009-08-19  5:36                   ` Gregory Haskins
2009-08-19  5:48                     ` Avi Kivity
2009-08-19  6:40                       ` Gregory Haskins
2009-08-19  7:13                         ` Avi Kivity
2009-08-19 11:40                           ` Gregory Haskins
2009-08-19 11:49                             ` Avi Kivity
2009-08-19 11:52                               ` Gregory Haskins
2009-08-19 14:33                     ` Michael S. Tsirkin
2009-08-20 12:12                     ` Michael S. Tsirkin
2009-08-16  8:30     ` Avi Kivity
2009-08-17 14:16       ` Gregory Haskins
2009-08-17 14:59         ` Avi Kivity
2009-08-17 15:09           ` Gregory Haskins
2009-08-17 15:14             ` Ingo Molnar
2009-08-17 19:35               ` Gregory Haskins
2009-08-17 15:18             ` Avi Kivity
2009-08-17 13:02     ` Gregory Haskins
2009-08-17 14:25       ` Ingo Molnar
2009-08-17 15:05         ` Gregory Haskins
2009-08-17 15:08           ` Ingo Molnar
2009-08-17 19:33             ` Gregory Haskins
2009-08-18  8:33               ` Avi Kivity
2009-08-18 14:46                 ` Gregory Haskins
2009-08-18 16:27                   ` Avi Kivity
2009-08-19  6:28                     ` Gregory Haskins
2009-08-19  7:11                       ` Avi Kivity
2009-08-19 18:23                         ` Nicholas A. Bellinger
2009-08-19 18:39                           ` Gregory Haskins
2009-08-19 19:19                             ` Nicholas A. Bellinger
2009-08-19 19:34                               ` Nicholas A. Bellinger
2009-08-19 20:12                           ` configfs/sysfs Avi Kivity
2009-08-19 20:48                             ` configfs/sysfs Ingo Molnar
2009-08-19 20:53                               ` configfs/sysfs Avi Kivity
2009-08-19 21:19                             ` configfs/sysfs Nicholas A. Bellinger
2009-08-19 22:15                             ` configfs/sysfs Gregory Haskins
2009-08-19 22:16                             ` configfs/sysfs Joel Becker
2009-08-19 23:48                               ` [Alacrityvm-devel] configfs/sysfs Alex Tsariounov
2009-08-19 23:54                               ` configfs/sysfs Nicholas A. Bellinger
2009-08-20  6:09                               ` configfs/sysfs Avi Kivity
2009-08-20  6:09                                 ` configfs/sysfs Avi Kivity
2009-08-20 22:48                                 ` configfs/sysfs Joel Becker
2009-08-21  4:14                                   ` configfs/sysfs Avi Kivity
2009-08-21  4:14                                     ` configfs/sysfs Avi Kivity
2009-08-21  4:14                                   ` configfs/sysfs Avi Kivity
2009-08-20  6:09                               ` configfs/sysfs Avi Kivity
2009-08-19 18:26                         ` [PATCH v3 3/6] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
2009-08-19 20:37                           ` Avi Kivity
2009-08-19 20:53                             ` Ingo Molnar
2009-08-20 17:25                             ` Muli Ben-Yehuda
2009-08-20 20:58                             ` Caitlin Bestler
2009-08-20 20:58                               ` Caitlin Bestler
2009-08-18 18:20                   ` Arnd Bergmann
2009-08-18 19:08                     ` Avi Kivity
2009-08-19  5:36                     ` Gregory Haskins
2009-08-18  9:53               ` Michael S. Tsirkin
2009-08-18 10:00                 ` Avi Kivity
2009-08-18 10:09                   ` Michael S. Tsirkin
2009-08-18 10:13                     ` Avi Kivity
2009-08-18 10:28                       ` Michael S. Tsirkin
2009-08-18 10:45                         ` Avi Kivity
2009-08-18 11:07                           ` Michael S. Tsirkin
2009-08-18 11:15                             ` Avi Kivity
2009-08-18 11:49                               ` Michael S. Tsirkin
2009-08-18 11:54                                 ` Avi Kivity
2009-08-18 15:39                 ` Gregory Haskins
2009-08-18 16:39                   ` Michael S. Tsirkin
2009-08-17 15:13           ` Avi Kivity
2009-08-14 15:43 ` [PATCH v3 4/6] vbus-proxy: add a pci-to-vbus bridge Gregory Haskins
2009-08-14 15:43 ` [PATCH v3 5/6] ioq: add driver-side vbus helpers Gregory Haskins
2009-08-14 15:43 ` [PATCH v3 6/6] net: Add vbus_enet driver Gregory Haskins
