* [RFC PATCH 00/17] virtual-bus
@ 2009-03-31 18:42 Gregory Haskins
  2009-03-31 18:42 ` [RFC PATCH 01/17] shm-signal: shared-memory signals Gregory Haskins
                   ` (18 more replies)
  0 siblings, 19 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

applies to v2.6.29 (will port to git HEAD soon)

FIRST OFF: Let me state that this is not a KVM or networking specific
technology.  Virtual-Bus is a mechanism for defining and deploying
software “devices” directly in a Linux kernel.  The example use-case we
have provided supports a “virtual-ethernet” device being utilized in a
KVM guest environment, so comparisons to virtio-net will be natural.
However, please note that this is but one use-case of many we have
planned for the future (such as userspace bypass and RT guest support).
The goal for right now is to describe what a virtual-bus is and why we
believe it is useful.

We are intent on getting this core technology merged, even if the networking
components are not accepted as-is.  It should be noted that, in many ways,
virtio could be considered complementary to the technology.  We could,
in fact, have implemented the virtual-ethernet using a virtio-ring, but
it would have required ABI changes that we did not yet want to propose
without having the concept in general vetted and accepted by the community.

To cut to the chase, we recently measured our virtual-ethernet on
v2.6.29 on two 8-core x86_64 boxes with Chelsio T3 10GE adapters connected
back-to-back via a crossover cable.  We measured bare-metal performance, as
well as a KVM guest (running the same kernel) connected to the T3 via
a linux-bridge+tap configuration with a 1500 byte MTU.  The results are as
follows:

Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)

As you can see, all three technologies can achieve (MTU-limited) line-rate,
but the virtio-net solution is severely limited on the latency front (by a
factor of 48:1 versus venet).

Note that the 320pps figure is artificially low for virtio-net, caused by a
known design limitation: the use of a timer for tx-mitigation.  However, note
that even when removing the timer from the path, the best we could achieve was
350us-450us of latency, and doing so causes the tput to drop to 1300Mb/s.
So even in this case, I think the in-kernel results present a compelling
argument for the new model.

When we jump to a 9000 byte MTU, the situation looks similar:

Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet: tput = 5802Mb/s, round-trip = 15127pps (66us rtt)


Note that venet's throughput was also better in this test, though neither
venet nor virtio-net could achieve line-rate.  I suspect some tuning may
allow these numbers to improve; TBD.

So with that said, let's jump into the description:

Virtual-Bus: What is it?
--------------------

Virtual-Bus is a kernel based IO resource container technology.  It is modeled
on a concept similar to the Linux Device-Model (LDM), where we have buses,
devices, and drivers as the primary actors.  However, VBUS has several
distinctions when contrasted with LDM:

  1) "Busses" in LDM are relatively static and global to the kernel (e.g.
     "PCI", "USB", etc).  VBUS buses are arbitrarily created and destroyed
     dynamically, and are not globally visible.  Instead they are defined as
     visible only to a specific subset of the system (the contained context).
  2) "Devices" in LDM are typically tangible physical (or sometimes logical)
     devices.  VBUS devices are purely software abstractions (which may or
     may not have one or more physical devices behind them).  Devices may
     also be arbitrarily created or destroyed by software/administrative action
     as opposed to by a hardware discovery mechanism.
  3) "Drivers" in LDM sit within the same kernel context as the busses and
     devices they interact with.  VBUS drivers live in a foreign
     context (such as userspace, or a virtual-machine guest).

The idea is that a vbus is created to contain access to some IO services.
Virtual devices are then instantiated and linked to a bus to grant access to
drivers actively present on the bus.  Drivers will only have visibility to
devices present on their respective bus, and nothing else.

Virtual devices are defined by modules which register a deviceclass with the
system.  A deviceclass simply represents a type of device that _may_ be
instantiated into a device, should an administrator wish to do so.  Once
this has happened, the device may be associated with one or more buses where
it will become visible to all clients of those respective buses.
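
To make this concrete, here is a rough sketch of what a minimal deviceclass
module might look like.  This is illustration only and not part of the series:
the "venet-loopback" name and all foo_* symbols are hypothetical, while the
structures and the vbus_devclass_register() call are the ones introduced by
the vbus core patches (see include/linux/vbus_device.h later in this series).
A real device would also implement bus_connect()/bus_disconnect() and register
one or more interfaces from there.

  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/vbus_device.h>

  struct foo_device {
          struct vbus_device dev;            /* embedded vbus device */
  };

  static void foo_dev_release(struct vbus_device *dev)
  {
          kfree(container_of(dev, struct foo_device, dev));
  }

  static struct vbus_device_ops foo_device_ops = {
          /* .bus_connect/.bus_disconnect omitted in this sketch */
          .release = foo_dev_release,
  };

  /* invoked when an administrator instantiates a device of this class */
  static int foo_create(struct vbus_devclass *dc, struct vbus_device **dev)
  {
          struct foo_device *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

          if (!priv)
                  return -ENOMEM;

          priv->dev.type = "venet-loopback"; /* type string for this device */
          priv->dev.ops  = &foo_device_ops;

          *dev = &priv->dev;
          return 0;
  }

  static void foo_class_release(struct vbus_devclass *dc)
  {
  }

  static struct vbus_devclass_ops foo_devclass_ops = {
          .create  = foo_create,
          .release = foo_class_release,
  };

  static struct vbus_devclass foo_devclass = {
          .name  = "venet-loopback",    /* appears in /sys/vbus/deviceclass */
          .ops   = &foo_devclass_ops,
          .owner = THIS_MODULE,
  };

  static int __init foo_init(void)
  {
          return vbus_devclass_register(&foo_devclass);
  }
  module_init(foo_init);

  MODULE_LICENSE("GPL");

Once loaded, such a module would be enumerated under /sys/vbus/deviceclass,
just like the venet-tap example walked through in Documentation/vbus.txt.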

Why do we need this?
----------------------

There are various reasons why such a construct may be useful.  One of the
most interesting use cases is for virtualization, such as KVM.  Hypervisors
today provide virtualized IO resources to a guest, but this is often at a cost
in both latency and throughput compared to bare metal performance.  Utilizing
para-virtual resources instead of emulated devices helps to mitigate this
penalty, but even these techniques to date have not fully realized the
potential of the underlying bare-metal hardware.

Some of the performance differential is unavoidable just given the extra
processing that occurs due to the deeper stack (guest+host).  However, some of
this overhead is a direct result of the rather indirect path most hypervisors
use to route IO.  For instance, KVM uses PIO faults from the guest to trigger
a guest->host-kernel->host-userspace->host-kernel sequence of events.
Contrast this to a typical userspace application on the host which must only
traverse app->kernel for most IO.

The fact is that the linux kernel is already great at managing access to IO
resources.  Therefore, if you have a hypervisor that is based on the linux
kernel, is there some way that we can allow the hypervisor to manage IO
directly instead of forcing this convoluted path?

The short answer is: "not yet" ;)

In order to use such a concept, we need some new facilities.  For one, we
need to be able to define containers with their corresponding access-control so
that guests do not have unmitigated access to anything they wish.  Second,
we also need to define some form of memory access that is uniform in the face
of various clients (e.g. "copy_to_user()" cannot be assumed to work for, say,
a KVM vcpu context).  Lastly, we need to provide access to these resources in
a way that makes sense for the application, such as asynchronous communication
paths and minimizing context switches.
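
As an aside on the second point: the series introduces a vbus_memctx object
(it is referenced by the vbus_device_interface_ops::open() callback) which is
aimed at this role.  Its definition is not reproduced in this mail, but
conceptually it is an ops-based accessor along the lines of the hypothetical
sketch below (the foo_* names are invented here and are not the actual API):

  struct foo_memctx;

  /*
   * Hypothetical illustration: an ops-based memory context lets the same
   * device code copy data to and from whatever kind of client sits on the
   * other side (a userspace task, a KVM guest, ...), instead of hard-coding
   * copy_to_user()/copy_from_user() semantics.
   */
  struct foo_memctx_ops {
          unsigned long (*copy_to)(struct foo_memctx *ctx, void *dst,
                                   const void *src, unsigned long len);
          unsigned long (*copy_from)(struct foo_memctx *ctx, void *dst,
                                     const void *src, unsigned long len);
          void (*release)(struct foo_memctx *ctx);
  };

  struct foo_memctx {
          struct foo_memctx_ops *ops;
  };

A task-backed implementation would simply wrap copy_to_user()/copy_from_user(),
while a kvm-backed implementation would resolve and copy through the guest
physical address space.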

So we introduce VBUS as a framework to provide such facilities.  The net
result is a *substantial* reduction in IO overhead, even when compared to
state of the art para-virtualization techniques (such as virtio-net).

For more details, please visit our wiki at:

http://developer.novell.com/wiki/index.php/Virtual-bus

Regards,
-Greg

---

Gregory Haskins (17):
      kvm: Add guest-side support for VBUS
      kvm: Add VBUS support to the host
      kvm: add dynamic IRQ support
      kvm: add a reset capability
      x86: allow the irq->vector translation to be determined outside of ioapic
      venettap: add scatter-gather support
      venet: add scatter-gather support
      venet-tap: Adds a "venet" compatible "tap" device to VBUS
      net: Add vbus_enet driver
      venet: add the ABI definitions for an 802.x packet interface
      ioq: add vbus helpers
      ioq: Add basic definitions for a shared-memory, lockless queue
      vbus: add a "vbus-proxy" bus model for vbus_driver objects
      vbus: add bus-registration notifiers
      vbus: add connection-client helper infrastructure
      vbus: add virtual-bus definitions
      shm-signal: shared-memory signals


 Documentation/vbus.txt           |  386 +++++++++
 arch/x86/Kconfig                 |   16 
 arch/x86/Makefile                |    3 
 arch/x86/include/asm/irq.h       |    6 
 arch/x86/include/asm/kvm_host.h  |    9 
 arch/x86/include/asm/kvm_para.h  |   12 
 arch/x86/kernel/io_apic.c        |   25 +
 arch/x86/kvm/Kconfig             |    9 
 arch/x86/kvm/Makefile            |    6 
 arch/x86/kvm/dynirq.c            |  329 ++++++++
 arch/x86/kvm/guest/Makefile      |    2 
 arch/x86/kvm/guest/dynirq.c      |   95 ++
 arch/x86/kvm/x86.c               |   13 
 arch/x86/kvm/x86.h               |   12 
 drivers/Makefile                 |    2 
 drivers/net/Kconfig              |   13 
 drivers/net/Makefile             |    1 
 drivers/net/vbus-enet.c          |  933 ++++++++++++++++++++++
 drivers/vbus/devices/Kconfig     |   17 
 drivers/vbus/devices/Makefile    |    1 
 drivers/vbus/devices/venet-tap.c | 1587 ++++++++++++++++++++++++++++++++++++++
 drivers/vbus/proxy/Makefile      |    2 
 drivers/vbus/proxy/kvm.c         |  726 +++++++++++++++++
 fs/proc/base.c                   |   96 ++
 include/linux/ioq.h              |  410 ++++++++++
 include/linux/kvm.h              |    4 
 include/linux/kvm_guest.h        |    7 
 include/linux/kvm_host.h         |   27 +
 include/linux/kvm_para.h         |   60 +
 include/linux/sched.h            |    4 
 include/linux/shm_signal.h       |  188 +++++
 include/linux/vbus.h             |  162 ++++
 include/linux/vbus_client.h      |  115 +++
 include/linux/vbus_device.h      |  423 ++++++++++
 include/linux/vbus_driver.h      |   80 ++
 include/linux/venet.h            |   82 ++
 kernel/Makefile                  |    1 
 kernel/exit.c                    |    2 
 kernel/fork.c                    |    2 
 kernel/vbus/Kconfig              |   38 +
 kernel/vbus/Makefile             |    6 
 kernel/vbus/attribute.c          |   52 +
 kernel/vbus/client.c             |  527 +++++++++++++
 kernel/vbus/config.c             |  275 +++++++
 kernel/vbus/core.c               |  626 +++++++++++++++
 kernel/vbus/devclass.c           |  124 +++
 kernel/vbus/map.c                |   72 ++
 kernel/vbus/map.h                |   41 +
 kernel/vbus/proxy.c              |  216 +++++
 kernel/vbus/shm-ioq.c            |   89 ++
 kernel/vbus/vbus.h               |  117 +++
 lib/Kconfig                      |   22 +
 lib/Makefile                     |    2 
 lib/ioq.c                        |  298 +++++++
 lib/shm_signal.c                 |  186 ++++
 virt/kvm/kvm_main.c              |   37 +
 virt/kvm/vbus.c                  | 1307 +++++++++++++++++++++++++++++++
 57 files changed, 9902 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/vbus.txt
 create mode 100644 arch/x86/kvm/dynirq.c
 create mode 100644 arch/x86/kvm/guest/Makefile
 create mode 100644 arch/x86/kvm/guest/dynirq.c
 create mode 100644 drivers/net/vbus-enet.c
 create mode 100644 drivers/vbus/devices/Kconfig
 create mode 100644 drivers/vbus/devices/Makefile
 create mode 100644 drivers/vbus/devices/venet-tap.c
 create mode 100644 drivers/vbus/proxy/Makefile
 create mode 100644 drivers/vbus/proxy/kvm.c
 create mode 100644 include/linux/ioq.h
 create mode 100644 include/linux/kvm_guest.h
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 include/linux/vbus.h
 create mode 100644 include/linux/vbus_client.h
 create mode 100644 include/linux/vbus_device.h
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 include/linux/venet.h
 create mode 100644 kernel/vbus/Kconfig
 create mode 100644 kernel/vbus/Makefile
 create mode 100644 kernel/vbus/attribute.c
 create mode 100644 kernel/vbus/client.c
 create mode 100644 kernel/vbus/config.c
 create mode 100644 kernel/vbus/core.c
 create mode 100644 kernel/vbus/devclass.c
 create mode 100644 kernel/vbus/map.c
 create mode 100644 kernel/vbus/map.h
 create mode 100644 kernel/vbus/proxy.c
 create mode 100644 kernel/vbus/shm-ioq.c
 create mode 100644 kernel/vbus/vbus.h
 create mode 100644 lib/ioq.c
 create mode 100644 lib/shm_signal.c
 create mode 100644 virt/kvm/vbus.c

-- 
Signature


* [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
@ 2009-03-31 18:42 ` Gregory Haskins
  2009-03-31 20:44   ` Avi Kivity
  2009-03-31 18:42 ` [RFC PATCH 02/17] vbus: add virtual-bus definitions Gregory Haskins
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This interface provides a bidirectional shared-memory based signaling
mechanism.  It can be used by any entities which desire efficient
communication via shared memory.  The implementation details of the
signaling are abstracted so that they may transcend a wide variety
of locale boundaries (e.g. userspace/kernel, guest/host, etc).

The shm_signal mechanism supports event masking as well as spurious
event delivery mitigation.
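
As a usage illustration (not part of the patch), a host-kernel consumer might
wire an shm_signal into a shared-memory region roughly as sketched below.  The
foo_* names are invented; the types and functions are the ones added by this
patch, and the "south" locale is assumed to be the host/kernel side per the
north/south convention described in the header.

  #include <linux/shm_signal.h>

  struct foo_endpoint {
          struct shm_signal          signal;
          struct shm_signal_notifier notifier;
  };

  /* called back from shm_signal_inject() when the remote side needs a kick */
  static int foo_inject(struct shm_signal *s)
  {
          /* e.g. raise a guest interrupt, signal an eventfd, etc. */
          return 0;
  }

  /* called from _shm_signal_release() when the last reference is dropped */
  static void foo_release(struct shm_signal *s)
  {
          /* tear down anything tied to 's' */
  }

  static struct shm_signal_ops foo_signal_ops = {
          .inject  = foo_inject,
          .release = foo_release,
  };

  /* invoked from _shm_signal_wakeup() once the remote side dirties the shm */
  static void foo_notify(struct shm_signal_notifier *n)
  {
          /* process the shared-memory state here */
  }

  static void foo_endpoint_setup(struct foo_endpoint *ep,
                                 struct shm_signal_desc *desc)
  {
          desc->magic = SHM_SIGNAL_MAGIC;
          desc->ver   = SHM_SIGNAL_VER;

          shm_signal_init(&ep->signal);
          ep->signal.locale   = shm_locality_south; /* host side (assumption) */
          ep->signal.ops      = &foo_signal_ops;
          ep->signal.desc     = desc;               /* lives in the shared region */
          ep->notifier.signal = foo_notify;
          ep->signal.notifier = &ep->notifier;

          shm_signal_enable(&ep->signal, 0);        /* unmask local notifications */
  }

After updating the shared memory, a sender calls shm_signal_inject(&ep->signal,
0).  On the receive side, the transport invokes _shm_signal_wakeup() when a
remote kick arrives, which in turn fires the registered notifier, and
shm_signal_put() drops the reference taken at init time.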

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/shm_signal.h |  188 ++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                |   10 ++
 lib/Makefile               |    1 
 lib/shm_signal.c           |  186 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 385 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/shm_signal.h
 create mode 100644 lib/shm_signal.c

diff --git a/include/linux/shm_signal.h b/include/linux/shm_signal.h
new file mode 100644
index 0000000..a65e54e
--- /dev/null
+++ b/include/linux/shm_signal.h
@@ -0,0 +1,188 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_SHM_SIGNAL_H
+#define _LINUX_SHM_SIGNAL_H
+
+#include <asm/types.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+
+#define SHM_SIGNAL_MAGIC 0x58fa39df
+#define SHM_SIGNAL_VER   1
+
+struct shm_signal_irq {
+	__u8                  enabled;
+	__u8                  pending;
+	__u8                  dirty;
+};
+
+enum shm_signal_locality {
+	shm_locality_north,
+	shm_locality_south,
+};
+
+struct shm_signal_desc {
+	__u32                 magic;
+	__u32                 ver;
+	struct shm_signal_irq irq[2];
+};
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/interrupt.h>
+
+struct shm_signal_notifier {
+	void (*signal)(struct shm_signal_notifier *);
+};
+
+struct shm_signal;
+
+struct shm_signal_ops {
+	int      (*inject)(struct shm_signal *s);
+	void     (*fault)(struct shm_signal *s, const char *fmt, ...);
+	void     (*release)(struct shm_signal *s);
+};
+
+enum {
+	shm_signal_in_wakeup,
+};
+
+struct shm_signal {
+	atomic_t                    refs;
+	spinlock_t                  lock;
+	enum shm_signal_locality    locale;
+	unsigned long               flags;
+	struct shm_signal_ops      *ops;
+	struct shm_signal_desc     *desc;
+	struct shm_signal_notifier *notifier;
+	struct tasklet_struct       deferred_notify;
+};
+
+#define SHM_SIGNAL_FAULT(s, fmt, args...)  \
+  ((s)->ops->fault ? (s)->ops->fault((s), fmt, ## args) : panic(fmt, ## args))
+
+ /*
+  * These functions should only be used internally
+  */
+void _shm_signal_release(struct shm_signal *s);
+void _shm_signal_wakeup(struct shm_signal *s);
+
+/**
+ * shm_signal_init() - initialize an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ *
+ * Initializes SHM_SIGNAL context before first use
+ *
+ **/
+void shm_signal_init(struct shm_signal *s);
+
+/**
+ * shm_signal_get() - acquire an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline struct shm_signal *shm_signal_get(struct shm_signal *s)
+{
+	atomic_inc(&s->refs);
+
+	return s;
+}
+
+/**
+ * shm_signal_put() - release an SHM_SIGNAL context reference
+ * @s:        SHM_SIGNAL context
+ *
+ **/
+static inline void shm_signal_put(struct shm_signal *s)
+{
+	if (atomic_dec_and_test(&s->refs))
+		_shm_signal_release(s);
+}
+
+/**
+ * shm_signal_enable() - enables local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered notifier (if applicable) to receive wakeups
+ * whenever the remote side performs a shm_signal_inject() operation.  A
+ * notification will be dispatched immediately if any pending signals have
+ * already been issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_enable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_disable() - disable local notifications on an SHM_SIGNAL
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered shm_signal_notifier (if applicable) from
+ * receiving any further notifications.  Any subsequent shm_signal_inject()
+ * calls by the remote side will update the shm as dirty, but will not traverse
+ * the locale boundary and will not invoke the notifier callback.  Signals
+ * delivered while masked will be deferred until shm_signal_enable() is
+ * invoked.
+ *
+ * This is synonymous with masking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_disable(struct shm_signal *s, int flags);
+
+/**
+ * shm_signal_inject() - notify the remote side about shm changes
+ * @s:        SHM_SIGNAL context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the shm state as "dirty" and, if enabled, will traverse
+ * a locale boundary to inject a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the shm_signal_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the shm_signal_ops->inject() interface and provided by a particular
+ * implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int shm_signal_inject(struct shm_signal *s, int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_SHM_SIGNAL_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 03c2c24..32d82fe 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -174,4 +174,14 @@ config DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
        bool "Disable obsolete cpumask functions" if DEBUG_PER_CPU_MAPS
        depends on EXPERIMENTAL && BROKEN
 
+config SHM_SIGNAL
+	boolean "SHM Signal - Generic shared-memory signaling mechanism"
+	default n
+	help
+	 Provides a shared-memory based signaling mechanism to indicate
+	 memory-dirty notifications between two end-points.
+
+	 If unsure, say N
+
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 32b0e64..bc36327 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
+obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/shm_signal.c b/lib/shm_signal.c
new file mode 100644
index 0000000..fa1770c
--- /dev/null
+++ b/lib/shm_signal.c
@@ -0,0 +1,186 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * See include/linux/shm_signal.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+
+int shm_signal_enable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	unsigned long iflags;
+
+	spin_lock_irqsave(&s->lock, iflags);
+
+	irq->enabled = 1;
+	wmb();
+
+	if ((irq->dirty || irq->pending)
+	    && !test_bit(shm_signal_in_wakeup, &s->flags)) {
+		rmb();
+		tasklet_schedule(&s->deferred_notify);
+	}
+
+	spin_unlock_irqrestore(&s->lock, iflags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_enable);
+
+int shm_signal_disable(struct shm_signal *s, int flags)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+
+	irq->enabled = 0;
+	wmb();
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_disable);
+
+/*
+ * signaling protocol:
+ *
+ * each side of the shm_signal has an "irq" structure with the following
+ * fields:
+ *
+ *    - enabled: controlled by shm_signal_enable/disable() to mask/unmask
+ *               the notification locally
+ *    - dirty:   indicates if the shared-memory is dirty or clean.  This
+ *               is updated regardless of the enabled/pending state so that
+ *               the state is always accurately tracked.
+ *    - pending: indicates if a signal is pending to the remote locale.
+ *               This allows us to determine if a remote-notification is
+ *               already in flight to optimize spurious notifications away.
+ */
+int shm_signal_inject(struct shm_signal *s, int flags)
+{
+	/* Load the irq structure from the other locale */
+	struct shm_signal_irq *irq = &s->desc->irq[!s->locale];
+
+	/*
+	 * We always mark the remote side as dirty regardless of whether
+	 * they need to be notified.
+	 */
+	irq->dirty = 1;
+	wmb();   /* dirty must be visible before we test the pending state */
+
+	if (irq->enabled && !irq->pending) {
+		rmb();
+
+		/*
+		 * If the remote side has enabled notifications, and we do
+		 * not see a notification pending, we must inject a new one.
+		 */
+		irq->pending = 1;
+		wmb(); /* make it visible before we do the injection */
+
+		s->ops->inject(s);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(shm_signal_inject);
+
+void _shm_signal_wakeup(struct shm_signal *s)
+{
+	struct shm_signal_irq *irq = &s->desc->irq[s->locale];
+	int dirty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s->lock, flags);
+
+	__set_bit(shm_signal_in_wakeup, &s->flags);
+
+	/*
+	 * The outer loop protects against race conditions between
+	 * irq->dirty and irq->pending updates
+	 */
+	while (irq->enabled && (irq->dirty || irq->pending)) {
+
+		/*
+		 * Run until we completely exhaust irq->dirty (it may
+		 * be re-dirtied by the remote side while we are in the
+		 * callback).  We let "pending" remain untouched until we have
+		 * processed them all so that the remote side knows we do not
+		 * need a new notification (yet).
+		 */
+		do {
+			irq->dirty = 0;
+			/* the unlock is an implicit wmb() for dirty = 0 */
+			spin_unlock_irqrestore(&s->lock, flags);
+
+			if (s->notifier)
+				s->notifier->signal(s->notifier);
+
+			spin_lock_irqsave(&s->lock, flags);
+			dirty = irq->dirty;
+			rmb();
+
+		} while (irq->enabled && dirty);
+
+		barrier();
+
+		/*
+		 * We can finally acknowledge the notification by clearing
+		 * "pending" after all of the dirty memory has been processed
+		 * Races against this clearing are handled by the outer loop.
+		 * Subsequent iterations of this loop will execute with
+		 * pending=0 potentially leading to future spurious
+		 * notifications, but this is an acceptable tradeoff as this
+		 * will be rare and harmless.
+		 */
+		irq->pending = 0;
+		wmb();
+
+	}
+
+	__clear_bit(shm_signal_in_wakeup, &s->flags);
+	spin_unlock_irqrestore(&s->lock, flags);
+
+}
+EXPORT_SYMBOL_GPL(_shm_signal_wakeup);
+
+void _shm_signal_release(struct shm_signal *s)
+{
+	s->ops->release(s);
+}
+EXPORT_SYMBOL_GPL(_shm_signal_release);
+
+static void
+deferred_notify(unsigned long data)
+{
+	struct shm_signal *s = (struct shm_signal *)data;
+
+	_shm_signal_wakeup(s);
+}
+
+void shm_signal_init(struct shm_signal *s)
+{
+	memset(s, 0, sizeof(*s));
+	atomic_set(&s->refs, 1);
+	spin_lock_init(&s->lock);
+	tasklet_init(&s->deferred_notify,
+		     deferred_notify,
+		     (unsigned long)s);
+}
+EXPORT_SYMBOL_GPL(shm_signal_init);



* [RFC PATCH 02/17] vbus: add virtual-bus definitions
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
  2009-03-31 18:42 ` [RFC PATCH 01/17] shm-signal: shared-memory signals Gregory Haskins
@ 2009-03-31 18:42 ` Gregory Haskins
  2009-04-02 16:06   ` Ben Hutchings
  2009-03-31 18:43 ` [RFC PATCH 03/17] vbus: add connection-client helper infrastructure Gregory Haskins
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

See Documentation/vbus.txt for details

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 Documentation/vbus.txt      |  386 +++++++++++++++++++++++++++++
 arch/x86/Kconfig            |    2 
 fs/proc/base.c              |   96 +++++++
 include/linux/sched.h       |    4 
 include/linux/vbus.h        |  147 +++++++++++
 include/linux/vbus_device.h |  416 ++++++++++++++++++++++++++++++++
 kernel/Makefile             |    1 
 kernel/exit.c               |    2 
 kernel/fork.c               |    2 
 kernel/vbus/Kconfig         |   14 +
 kernel/vbus/Makefile        |    1 
 kernel/vbus/attribute.c     |   52 ++++
 kernel/vbus/config.c        |  275 +++++++++++++++++++++
 kernel/vbus/core.c          |  567 +++++++++++++++++++++++++++++++++++++++++++
 kernel/vbus/devclass.c      |  124 +++++++++
 kernel/vbus/map.c           |   72 +++++
 kernel/vbus/map.h           |   41 +++
 kernel/vbus/vbus.h          |  116 +++++++++
 18 files changed, 2318 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vbus.txt
 create mode 100644 include/linux/vbus.h
 create mode 100644 include/linux/vbus_device.h
 create mode 100644 kernel/vbus/Kconfig
 create mode 100644 kernel/vbus/Makefile
 create mode 100644 kernel/vbus/attribute.c
 create mode 100644 kernel/vbus/config.c
 create mode 100644 kernel/vbus/core.c
 create mode 100644 kernel/vbus/devclass.c
 create mode 100644 kernel/vbus/map.c
 create mode 100644 kernel/vbus/map.h
 create mode 100644 kernel/vbus/vbus.h

diff --git a/Documentation/vbus.txt b/Documentation/vbus.txt
new file mode 100644
index 0000000..e8a05da
--- /dev/null
+++ b/Documentation/vbus.txt
@@ -0,0 +1,386 @@
+
+Virtual-Bus:
+======================
+Author: Gregory Haskins <ghaskins@novell.com>
+
+
+
+
+What is it?
+--------------------
+
+Virtual-Bus is a kernel based IO resource container technology.  It is modeled
+on a concept similar to the Linux Device-Model (LDM), where we have buses,
+devices, and drivers as the primary actors.  However, VBUS has several
+distinctions when contrasted with LDM:
+
+  1) "Busses" in LDM are relatively static and global to the kernel (e.g.
+     "PCI", "USB", etc).  VBUS buses are arbitrarily created and destroyed
+     dynamically, and are not globally visible.  Instead they are defined as
+     visible only to a specific subset of the system (the contained context).
+  2) "Devices" in LDM are typically tangible physical (or sometimes logical)
+     devices.  VBUS devices are purely software abstractions (which may or
+     may not have one or more physical devices behind them).  Devices may
+     also be arbitrarily created or destroyed by software/administrative action
+     as opposed to by a hardware discovery mechanism.
+  3) "Drivers" in LDM sit within the same kernel context as the busses and
+     devices they interact with.  VBUS drivers live in a foreign
+     context (such as userspace, or a virtual-machine guest).
+
+The idea is that a vbus is created to contain access to some IO services.
+Virtual devices are then instantiated and linked to a bus to grant access to
+drivers actively present on the bus.  Drivers will only have visibility to
+devices present on their respective bus, and nothing else.
+
+Virtual devices are defined by modules which register a deviceclass with the
+system.  A deviceclass simply represents a type of device that _may_ be
+instantiated into a device, should an administrator wish to do so.  Once
+this has happened, the device may be associated with one or more buses where
+it will become visible to all clients of those respective buses.
+
+Why do we need this?
+----------------------
+
+There are various reasons why such a construct may be useful.  One of the
+most interesting use cases is for virtualization, such as KVM.  Hypervisors
+today provide virtualized IO resources to a guest, but this is often at a cost
+in both latency and throughput compared to bare metal performance.  Utilizing
+para-virtual resources instead of emulated devices helps to mitigate this
+penalty, but even these techniques to date have not fully realized the
+potential of the underlying bare-metal hardware.
+
+Some of the performance differential is unavoidable just given the extra
+processing that occurs due to the deeper stack (guest+host).  However, some of
+this overhead is a direct result of the rather indirect path most hypervisors
+use to route IO.  For instance, KVM uses PIO faults from the guest to trigger
+a guest->host-kernel->host-userspace->host-kernel sequence of events.
+Contrast this to a typical userspace application on the host which must only
+traverse app->kernel for most IO.
+
+The fact is that the linux kernel is already great at managing access to IO
+resources.  Therefore, if you have a hypervisor that is based on the linux
+kernel, is there some way that we can allow the hypervisor to manage IO
+directly instead of forcing this convoluted path?
+
+The short answer is: "not yet" ;)
+
+In order to use such a concept, we need some new facilities.  For one, we
+need to be able to define containers with their corresponding access-control so
+that guests do not have unmitigated access to anything they wish.  Second,
+we also need to define some form of memory access that is uniform in the face
+of various clients (e.g. "copy_to_user()" cannot be assumed to work for, say,
+a KVM vcpu context).  Lastly, we need to provide access to these resources in
+a way that makes sense for the application, such as asynchronous communication
+paths and minimizing context switches.
+
+So we introduce VBUS as a framework to provide such facilities.  The net
+result is a *substantial* reduction in IO overhead, even when compared to
+state of the art para-virtualization techniques (such as virtio-net).
+
+How do I use it?
+------------------------
+
+There are two components to utilizing a virtual-bus.  One is the
+administrative function (creating and configuring a bus and its devices).  The
+other is the consumption of the resources on the bus by a client (e.g. a
+virtual machine, or a userspace application).  The former occurs on the host
+kernel by means of interacting with various special filesystems (e.g. sysfs,
+configfs, etc).  The latter occurs by means of a "vbus connector" which must
+be developed specifically to bridge a particular environment.  To date, we
+have developed such connectors for host-userspace and kvm-guests.  Conceivably
+we could develop other connectors as needs arise (e.g. lguest, xen,
+guest-userspace, etc).  This document deals with the administrative interface.
+Details about developing a connector are out of scope for this document.
+
+Interacting with vbus
+------------------------
+
+The first step is to enable virtual-bus support (CONFIG_VBUS) as well as any
+desired vbus-device modules (e.g. CONFIG_VBUS_VENETTAP), and ensure that your
+environment mounts both sysfs and configfs somewhere in the filesystem.  This
+document will assume they are mounted to /sys and /config, respectively.
+
+VBUS will create a top-level directory "vbus" in each of the two respective
+filesystems.  At boot-up, they will look like the following:
+
+/sys/vbus/
+|-- deviceclass
+|-- devices
+|-- instances
+`-- version
+
+/config/vbus/
+|-- devices
+`-- instances
+
+Following their respective roles, /config/vbus is for userspace to manage the
+lifetime of some number of objects/attributes.  This is in contrast to
+/sys/vbus which is a reflection of objects managed by the kernel.  It is
+assumed the reader is already familiar with these two facilities, so we will
+not go into depth about their general operation.  Suffice to say that vbus
+consists of objects that are managed both by userspace and the kernel.
+Modification of objects via /config/vbus will typically be reflected in the
+/sys/vbus area.
+
+It all starts with a deviceclass
+--------------------------------
+
+Before you can do anything useful with vbus, you need some registered
+deviceclasses.  A deviceclass provides the implementation of a specific type
+of virtual device.  A deviceclass will typically be registered by loading a
+kernel-module.  Once loaded, the available device types are enumerated under
+/sys/vbus/deviceclass.  For example, we will load our "venet-tap" module,
+which provides network services:
+
+# modprobe venet-tap
+# tree /sys/vbus
+/sys/vbus
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|-- instances
+`-- version
+
+An administrative agent should be able to enumerate /sys/vbus/deviceclass to
+determine what services are available on a given platform.
+
+Create the container
+-------------------
+
+The next step is to create a new container.  In vbus, this comes in the form
+of a vbus-instance and it is created by a simple "mkdir" in the
+/config/vbus/instances area.  The only requirement is that the instance is
+given a host-wide unique name.  This may be some kind of association to the
+application (e.g. the unique VM GUID) or it can be arbitrary.  For the
+purposes of example, we will let $(uuidgen) generate a random UUID for us.
+
+# mkdir /config/vbus/instances/$(uuidgen)
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|-- instances
+|   `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+|       |-- devices
+|       `-- members
+`-- version
+
+So we can see that we have now created a vbus called
+
+               "beb4df8f-7483-4028-b3f7-767512e2a18c"
+
+in the /config area, and it was immediately reflected in the
+/sys/vbus/instances area as well (with a few subobjects of its own: "devices"
+and "members").  The "devices" object denotes any devices that are present on
+the bus (in this case: none).  Likewise, "members" denotes the pids of any
+tasks that are members of the bus (in this case: none).  We will come back to
+this later.  For now, we move on to the next step.
+
+Create a device instance
+------------------------
+
+Devices are instantiated by again utilizing the /config/vbus configfs area.
+At first you may suspect that devices are created as subordinate objects of a
+bus/container instance, but you would be mistaken.  Devices are actually
+root-level objects in vbus specifically to allow greater flexibility in the
+association of a device.  For instance, it may be desirable to have a single
+device that spans multiple VMs (consider an ethernet switch, or a shared disk
+for a cluster).  Therefore, device lifecycles are managed by creating/deleting
+objects in /config/vbus/devices.
+
+Note: Creating a device instance is actually a two step process:  We need to
+give the device instance a unique name, and we also need to give it a specific
+device type.  It is hard to express both parameters using standard filesystem
+operations like mkdir, so the design decision was made to require performing
+the operation in two steps.
+
+Our first step is to create a unique instance.  We will again utilize
+$(uuidgen) to yield an arbitrary name.  Any name will suffice as long as it is
+unique on this particular host.
+
+# mkdir /config/vbus/devices/$(uuidgen)
+# tree /sys/vbus
+/sys/vbus
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|   `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       `-- interfaces
+|-- instances
+|   `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+|       |-- devices
+|       `-- members
+`-- version
+
+At this point we have created a partial instance, since we have not yet
+assigned a type to the device.  Even so, we can see that some state has
+changed under /sys/vbus/devices.  We now have an instance named
+
+	      	 6a1aff24-5dc0-4aea-9c35-435daef90e55
+
+and it has a single subordinate object: "interfaces".  This object in
+particular is provided by the infrastructure, though do note that a
+deviceclass may also provide its own attributes/objects once it is created.
+
+We will go ahead and give this device a type to complete its construction.  We
+do this by setting the /config/vbus/devices/$devname/type attribute with a
+valid deviceclass type:
+
+# echo foo > /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/type
+bash: echo: write error: No such file or directory
+
+Oops!  What happened?  "foo" is not a valid deviceclass.  We need to consult
+the /sys/vbus/deviceclass area to find out what our options are:
+
+# tree /sys/vbus/deviceclass/
+/sys/vbus/deviceclass/
+`-- venet-tap
+
+Let's try again:
+
+# echo venet-tap > /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/type
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|   `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       |-- class -> ../../deviceclass/venet-tap
+|       |-- client_mac
+|       |-- enabled
+|       |-- host_mac
+|       |-- ifname
+|       `-- interfaces
+|-- instances
+|   `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+|       |-- devices
+|       `-- members
+`-- version
+
+Ok, that looks better.  And note that /sys/vbus/devices now has some more
+subordinate objects.  Most of those were registered when the venet-tap
+deviceclass was given a chance to create an instance of itself.  Those
+attributes are a property of the venet-tap and therefore are out of scope
+for this document.  Please see the documentation that accompanies a particular
+module for more details.
+
+Put the device on the bus
+-------------------------
+
+The next administrative step is to associate our new device with our bus.
+This is accomplished using a symbolic link from the bus instance to our device
+instance.
+
+# ln -s /config/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/ /config/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|   `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       |-- class -> ../../deviceclass/venet-tap
+|       |-- client_mac
+|       |-- enabled
+|       |-- host_mac
+|       |-- ifname
+|       `-- interfaces
+|           `-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+|-- instances
+|   `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+|       |-- devices
+|       |   `-- 0
+|       |       |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       |       `-- type
+|       `-- members
+`-- version
+
+We can now see that the device indicates that it has an interface registered
+to a bus:
+
+/sys/vbus/devices/6a1aff24-5dc0-4aea-9c35-435daef90e55/interfaces/
+`-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+
+Likewise, we can see that the bus has a device listed (id = "0"):
+
+/sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/
+`-- 0
+    |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+    `-- type
+
+At this point, our container is ready for use.  However, it currently has 0
+members, so let's fix that.
+
+Add some members
+--------------------
+
+Membership is controlled by an attribute: /proc/$pid/vbus.  A pid can only be
+a member of one (or zero) busses at a time.  To establish membership, we set
+the name of the bus, like so:
+
+# echo beb4df8f-7483-4028-b3f7-767512e2a18c > /proc/self/vbus
+# tree /sys/vbus/
+/sys/vbus/
+|-- deviceclass
+|   `-- venet-tap
+|-- devices
+|   `-- 6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       |-- class -> ../../deviceclass/venet-tap
+|       |-- client_mac
+|       |-- enabled
+|       |-- host_mac
+|       |-- ifname
+|       `-- interfaces
+|           `-- 0 -> ../../../instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0
+|-- instances
+|   `-- beb4df8f-7483-4028-b3f7-767512e2a18c
+|       |-- devices
+|       |   `-- 0
+|       |       |-- device -> ../../../../devices/6a1aff24-5dc0-4aea-9c35-435daef90e55
+|       |       `-- type
+|       `-- members
+|           |-- 4382
+|           `-- 4588
+`-- version
+
+Woah!  Why are there two members?  VBUS membership is inherited by forked
+tasks.  Therefore 4382 is the pid of our shell (which we set via /proc/self),
+and 4588 is the pid of the forked/exec'ed "tree" process.  This property can
+be useful for having things like qemu set up the bus and then forking each
+vcpu which will inherit access.
+
+At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
+namespace with one device, id=0.  Its type is:
+
+# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
+virtual-ethernet
+
+"virtual-ethernet"?  Why is it not "venet-tap"?  Device-classes are allowed to
+register their interfaces under an id that is not required to be the same as
+their deviceclass.  This supports device polymorphism.   For instance,
+consider that an interface "virtual-ethernet" may provide basic 802.x packet
+exchange.  However, we could have various devices that support this
+802.x interface, each with a completely different implementation
+behind it.
+
+For instance, "venet-tap" might act like a tuntap module, while
+"venet-loopback" would loop packets back and "venet-switch" would form a
+layer-2 domain among the participating guests.  All three modules would
+presumably support the same basic 802.x interface, yet all three have
+completely different implementations.
+
+Drivers on this particular bus would see this instance id=0 as a type
+"virtual-ethernet" even though the underlying implementation happens to be a
+tap device.  This means a single driver that supports the protocol advertised
+by the "virtual-ethernet" type would be able to support the plethora of
+available device types that we may wish to create.
+
+Teardown:
+---------------
+
+We can deconstruct a vbus container by doing roughly the opposite of what we
+did to create it.  Echo "0" into /proc/self/vbus, rm the symlink between the
+bus and device, and rmdir the bus and device objects.  Once that is done, we
+can even rmmod the venet-tap module.  Note that the infrastructure will
+maintain a module-ref while it is configured in a container, so be sure to
+completely tear down the vbus/device before trying this.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bc2fbad..3fca247 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1939,6 +1939,8 @@ source "drivers/pcmcia/Kconfig"
 
 source "drivers/pci/hotplug/Kconfig"
 
+source "kernel/vbus/Kconfig"
+
 endmenu
 
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index beaa0ce..03993fb 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -80,6 +80,7 @@
 #include <linux/oom.h>
 #include <linux/elf.h>
 #include <linux/pid_namespace.h>
+#include <linux/vbus.h>
 #include "internal.h"
 
 /* NOTE:
@@ -1065,6 +1066,98 @@ static const struct file_operations proc_oom_adjust_operations = {
 	.write		= oom_adjust_write,
 };
 
+#ifdef CONFIG_VBUS
+
+static ssize_t vbus_read(struct file *file, char __user *buf,
+			 size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+	struct vbus *vbus;
+	const char *name;
+	char buffer[256];
+	size_t len;
+
+	if (!task)
+		return -ESRCH;
+
+	vbus = task_vbus_get(task);
+
+	put_task_struct(task);
+
+	name = vbus_name(vbus);
+
+	len = snprintf(buffer, sizeof(buffer), "%s\n", name ? name : "<none>");
+
+	vbus_put(vbus);
+
+	return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t vbus_write(struct file *file, const char __user *buf,
+			  size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	struct vbus *vbus = NULL;
+	char buffer[256];
+	int disable = 0;
+
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	if (buffer[count-1] == '\n')
+		buffer[count-1] = 0;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		put_task_struct(task);
+		return -EACCES;
+	}
+
+	if (strcmp(buffer, "0") == 0)
+		disable = 1;
+	else
+		vbus = vbus_find(buffer);
+
+	if (disable || vbus)
+		task_vbus_disassociate(task);
+
+	if (vbus) {
+		int ret = vbus_associate(vbus, task);
+
+		if (ret < 0)
+			printk(KERN_ERR \
+			       "vbus: could not associate %s/%d with bus %s",
+			       task->comm, task->pid, vbus_name(vbus));
+		else
+			rcu_assign_pointer(task->vbus, vbus);
+
+		vbus_put(vbus); /* Counter the vbus_find() */
+	} else if (!disable) {
+		put_task_struct(task);
+		return -ENOENT;
+	}
+
+	put_task_struct(task);
+
+	if (count == sizeof(buffer)-1)
+		return -EIO;
+
+	return count;
+}
+
+static const struct file_operations proc_vbus_operations = {
+	.read		= vbus_read,
+	.write		= vbus_write,
+};
+
+#endif /* CONFIG_VBUS */
+
 #ifdef CONFIG_AUDITSYSCALL
 #define TMPBUFLEN 21
 static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2556,6 +2649,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	INF("io",	S_IRUGO, proc_tgid_io_accounting),
 #endif
+#ifdef CONFIG_VBUS
+	REG("vbus", S_IRUGO|S_IWUSR, proc_vbus_operations),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file * filp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 011db2f..cd2f9b1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -97,6 +97,7 @@ struct futex_pi_state;
 struct robust_list_head;
 struct bio;
 struct bts_tracer;
+struct vbus;
 
 /*
  * List of flags we want to share for kernel threads,
@@ -1329,6 +1330,9 @@ struct task_struct {
 	unsigned int lockdep_recursion;
 	struct held_lock held_locks[MAX_LOCK_DEPTH];
 #endif
+#ifdef CONFIG_VBUS
+	struct vbus *vbus;
+#endif
 
 /* journalling filesystem info */
 	void *journal_info;
diff --git a/include/linux/vbus.h b/include/linux/vbus.h
new file mode 100644
index 0000000..5f0566c
--- /dev/null
+++ b/include/linux/vbus.h
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Bus
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_H
+#define _LINUX_VBUS_H
+
+#ifdef CONFIG_VBUS
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
+#include <linux/vbus_device.h>
+
+struct vbus;
+struct task_struct;
+
+/**
+ * vbus_associate() - associate a task with a vbus
+ * @vbus:      The bus context to associate with
+ * @p:         The task to associate
+ *
+ * This function adds a task as a member of a vbus.  Tasks must be members
+ * of a bus before they are allowed to use its resources.  Tasks may only
+ * associate with a single bus at a time.
+ *
+ * Note: children inherit any association present at fork().
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_associate(struct vbus *vbus, struct task_struct *p);
+
+/**
+ * vbus_disassociate() - disassociate a task from a vbus
+ * @vbus:      The bus context to disassociate from
+ * @p:         The task to disassociate
+ *
+ * This function removes a task as a member of a vbus.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_disassociate(struct vbus *vbus, struct task_struct *p);
+
+struct vbus *vbus_get(struct vbus *);
+void vbus_put(struct vbus *);
+
+/**
+ * vbus_name() - returns the name of a bus
+ * @vbus:      The bus context
+ *
+ * Returns: (char *) name of bus
+ *
+ **/
+const char *vbus_name(struct vbus *vbus);
+
+/**
+ * vbus_find() - retrieves a vbus pointer from its name
+ * @name:      The name of the bus to find
+ *
+ * Returns: NULL = failure, non-null = (vbus *)bus-pointer
+ *
+ **/
+struct vbus *vbus_find(const char *name);
+
+/**
+ * task_vbus_get() - retrieves an associated vbus pointer from a task
+ * @p:         The task context
+ *
+ * Safely retrieves a pointer to an associated (if any) vbus from a task
+ *
+ * Returns: NULL = no association, non-null = (vbus *)bus-pointer
+ *
+ **/
+static inline struct vbus *task_vbus_get(struct task_struct *p)
+{
+	struct vbus *vbus;
+
+	rcu_read_lock();
+	vbus = rcu_dereference(p->vbus);
+	if (vbus)
+		vbus_get(vbus);
+	rcu_read_unlock();
+
+	return vbus;
+}
+
+/**
+ * fork_vbus() - Helper function to handle associated task forking
+ * @p:         The task context
+ *
+ **/
+static inline void fork_vbus(struct task_struct *p)
+{
+	struct vbus *vbus = task_vbus_get(p);
+
+	if (vbus) {
+		BUG_ON(vbus_associate(vbus, p) < 0);
+		vbus_put(vbus);
+	}
+}
+
+/**
+ * task_vbus_disassociate() - Helper function to handle disassociating tasks
+ * @p:         The task context
+ *
+ **/
+static inline void task_vbus_disassociate(struct task_struct *p)
+{
+	struct vbus *vbus = task_vbus_get(p);
+
+	if (vbus) {
+		rcu_assign_pointer(p->vbus, NULL);
+		synchronize_rcu();
+
+		vbus_disassociate(vbus, p);
+		vbus_put(vbus);
+	}
+}
+
+#else /* CONFIG_VBUS */
+
+#define fork_vbus(p) do { } while (0)
+#define task_vbus_disassociate(p) do { } while (0)
+
+#endif /* CONFIG_VBUS */
+
+#endif /* _LINUX_VBUS_H */
diff --git a/include/linux/vbus_device.h b/include/linux/vbus_device.h
new file mode 100644
index 0000000..705d92e
--- /dev/null
+++ b/include/linux/vbus_device.h
@@ -0,0 +1,416 @@
+/*
+ * VBUS device models
+ *
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file deals primarily with the definitions for interfacing a virtual
+ * device model to a virtual bus.  In a nutshell, a devclass begets a device,
+ * which begets a device_interface, which begets a connection.
+ *
+ * devclass
+ * -------
+ *
+ * To develop a vbus device, it all starts with a devclass.  You must register
+ * a devclass using vbus_devclass_register().  Each registered devclass is
+ * enumerated under /sys/vbus/deviceclass.
+ *
+ * In and of itself, a devclass doesn't do much.  It is just an object factory
+ * a device whose lifetime is managed by userspace.  When userspace decides
+ * it would like to create an instance of a particular devclass, the
+ * devclass::create() callback is invoked (registered as part of the ops
+ * structure during vbus_devclass_register()).  How and when userspace decides
+ * to do this is beyond the scope of this document.  Please see:
+ *
+ *                         Documentation/vbus.txt
+ *
+ * for more details.
+ *
+ * device
+ * -------
+ *
+ * A vbus device is created by a particular devclass during the invocation
+ * of its devclass::create() callback.  A device is initially created without
+ * any association with a bus.  One or more buses may attempt to connect to
+ * a device (controlled, again, by userspace).  When this occurs, a
+ * device::bus_connect() callback is invoked.
+ *
+ * This bus_connect() callback gives the device a chance to decide if it will
+ * accept the connection, and if so, to register its interfaces.  Most devices
+ * will likely only allow a connection to one bus.  Therefore, they may return
+ * -EBUSY if another bus is already connected.
+ *
+ * If the device accepts the connection, it should register one or more
+ * interfaces with the bus using vbus_device_interface_register().  Most
+ * devices will only support one interface, and therefore will only invoke
+ * this method once.  However, some more elaborate devices may have multiple
+ * functions, or abstracted topologies.  Therefore they may opt at their own
+ * discretion to register more than one interface.  The interfaces do not need
+ * to be uniform in type.
+ *
+ * device_interface
+ * -------------------
+ *
+ * The purpose of an interface is twofold: 1) advertise a particular ABI
+ * for communication to a driver, 2) handle the initial connection of a driver.
+ *
+ * As such, a device_interface has a string "type" (which is akin to the
+ * abi type that this interface supports, like a PCI-ID).  It also sports
+ * an interface::open() method.
+ *
+ * The interface::open callback is invoked whenever a driver attempts to
+ * connect to this device.  The device implements its own policy regarding
+ * whether it accepts multiple connections or not.  Most devices will likely
+ * only accept one connection at a time, and therefore will return -EBUSY if
+ * subsequent attempts are made.
+ *
+ * However, if successful, the interface::open() should return a
+ * vbus_connection object
+ *
+ * connections
+ * -----------
+ *
+ * A connection represents an interface that is successfully opened.  It will
+ * remain in an active state as long as the client retains the connection.
+ * The connection::release() method is invoked if the client should die,
+ * restart, or explicitly close the connection.  The device-model should use
+ * this release() callback as the indication to clean up any resources
+ * associated with a particular connection such as allocated queues, etc.
+ *
+ * ---
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DEVICE_H
+#define _LINUX_VBUS_DEVICE_H
+
+#include <linux/module.h>
+#include <linux/configfs.h>
+#include <linux/rbtree.h>
+#include <linux/shm_signal.h>
+#include <linux/vbus.h>
+#include <asm/atomic.h>
+
+struct vbus_device_interface;
+struct vbus_connection;
+struct vbus_device;
+struct vbus_devclass;
+struct vbus_memctx;
+
+/*
+ * ----------------------
+ * devclass
+ * ----------------------
+ */
+struct vbus_devclass_ops {
+	int (*create)(struct vbus_devclass *dc,
+		      struct vbus_device **dev);
+	void (*release)(struct vbus_devclass *dc);
+};
+
+struct vbus_devclass {
+	const char *name;
+	struct vbus_devclass_ops *ops;
+	struct rb_node node;
+	struct kobject kobj;
+	struct module *owner;
+};
+
+/**
+ * vbus_devclass_register() - register a devclass with the system
+ * @devclass:   The devclass context to register
+ *
+ * Establishes a new device-class for consumption.  Registered device-classes
+ * are enumerated under /sys/vbus/deviceclass.  For more details, please see
+ * Documentation/vbus*
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_devclass_register(struct vbus_devclass *devclass);
+
+/**
+ * vbus_devclass_unregister() - unregister a devclass with the system
+ * @devclass:   The devclass context to unregister
+ *
+ * Removes a devclass from the system
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_devclass_unregister(struct vbus_devclass *devclass);
+
+/**
+ * vbus_devclass_get() - acquire a devclass context reference
+ * @devclass:      devclass context
+ *
+ **/
+static inline struct vbus_devclass *
+vbus_devclass_get(struct vbus_devclass *devclass)
+{
+	if (!try_module_get(devclass->owner))
+		return NULL;
+
+	kobject_get(&devclass->kobj);
+	return devclass;
+}
+
+/**
+ * vbus_devclass_put() - release a devclass context reference
+ * @devclass:      devclass context
+ *
+ **/
+static inline void
+vbus_devclass_put(struct vbus_devclass *devclass)
+{
+	kobject_put(&devclass->kobj);
+	module_put(devclass->owner);
+}
+
+/*
+ * ----------------------
+ * device
+ * ----------------------
+ */
+struct vbus_device_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct vbus_device *dev,
+			struct vbus_device_attribute *attr,
+			char *buf);
+	ssize_t (*store)(struct vbus_device *dev,
+			 struct vbus_device_attribute *attr,
+			 const char *buf, size_t count);
+};
+
+struct vbus_device_ops {
+	int (*bus_connect)(struct vbus_device *dev, struct vbus *vbus);
+	int (*bus_disconnect)(struct vbus_device *dev, struct vbus *vbus);
+	void (*release)(struct vbus_device *dev);
+};
+
+struct vbus_device {
+	const char *type;
+	struct vbus_device_ops *ops;
+	struct attribute_group *attrs;
+	struct kobject *kobj;
+};
+
+/*
+ * ----------------------
+ * device_interface
+ * ----------------------
+ */
+struct vbus_device_interface_ops {
+	int (*open)(struct vbus_device_interface *intf,
+		    struct vbus_memctx *ctx,
+		    int version,
+		    struct vbus_connection **conn);
+	void (*release)(struct vbus_device_interface *intf);
+};
+
+struct vbus_device_interface {
+	const char *name;
+	const char *type;
+	struct vbus_device_interface_ops *ops;
+	unsigned long id;
+	struct vbus_device *dev;
+	struct vbus *vbus;
+	struct rb_node node;
+	struct kobject kobj;
+};
+
+/**
+ * vbus_device_interface_register() - register an interface with a bus
+ * @dev:        The device context of the caller
+ * @vbus:       The bus context to register with
+ * @intf:       The interface context to register
+ *
+ * This function is invoked (usually in the context of a device::bus_connect()
+ * callback) to register an interface on a bus.  We make this an explicit
+ * operation instead of implicit on the bus_connect() to facilitate devices
+ * that may present multiple interfaces to a bus.  In those cases, a device
+ * may invoke this function multiple times (once per supported interface).
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_device_interface_register(struct vbus_device *dev,
+				   struct vbus *vbus,
+				   struct vbus_device_interface *intf);
+
+/**
+ * vbus_device_interface_unregister() - unregister an interface with a bus
+ * @intf:       The interface context to unregister
+ *
+ * This function is the converse of interface_register.  It is typically
+ * invoked in the context of a device::bus_disconnect().
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int vbus_device_interface_unregister(struct vbus_device_interface *intf);
+
+/*
+ * ----------------------
+ * memory context
+ * ----------------------
+ */
+struct vbus_memctx_ops {
+	unsigned long (*copy_to)(struct vbus_memctx *ctx,
+				 void *dst,
+				 const void *src,
+				 unsigned long len);
+	unsigned long (*copy_from)(struct vbus_memctx *ctx,
+				   void *dst,
+				   const void *src,
+				   unsigned long len);
+	void (*release)(struct vbus_memctx *ctx);
+};
+
+struct vbus_memctx {
+	atomic_t refs;
+	struct vbus_memctx_ops *ops;
+};
+
+static inline void
+vbus_memctx_init(struct vbus_memctx *ctx, struct vbus_memctx_ops *ops)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	atomic_set(&ctx->refs, 1);
+	ctx->ops = ops;
+}
+
+#define VBUS_MEMCTX_INIT(_ops) {                                   \
+	.refs = ATOMIC_INIT(1),                                    \
+	.ops = _ops,                                               \
+}
+
+static inline void
+vbus_memctx_get(struct vbus_memctx *ctx)
+{
+	atomic_inc(&ctx->refs);
+}
+
+static inline void
+vbus_memctx_put(struct vbus_memctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refs))
+		ctx->ops->release(ctx);
+}
+
+/*
+ * ----------------------
+ * shared memory
+ * ----------------------
+ */
+struct vbus_shm;
+
+struct vbus_shm_ops {
+	void (*release)(struct vbus_shm *shm);
+};
+
+struct vbus_shm {
+	atomic_t refs;
+	struct vbus_shm_ops *ops;
+	void                *ptr;
+	size_t               len;
+};
+
+static inline void
+vbus_shm_init(struct vbus_shm *shm, struct vbus_shm_ops *ops,
+	      void *ptr, size_t len)
+{
+	memset(shm, 0, sizeof(*shm));
+	atomic_set(&shm->refs, 1);
+	shm->ops = ops;
+	shm->ptr = ptr;
+	shm->len = len;
+}
+
+static inline void
+vbus_shm_get(struct vbus_shm *shm)
+{
+	atomic_inc(&shm->refs);
+}
+
+static inline void
+vbus_shm_put(struct vbus_shm *shm)
+{
+	if (atomic_dec_and_test(&shm->refs))
+		shm->ops->release(shm);
+}
+
+/*
+ * ----------------------
+ * connection
+ * ----------------------
+ */
+struct vbus_connection_ops {
+	int (*call)(struct vbus_connection *conn,
+		    unsigned long func,
+		    void *data,
+		    unsigned long len,
+		    unsigned long flags);
+	int (*shm)(struct vbus_connection *conn,
+		   unsigned long id,
+		   struct vbus_shm *shm,
+		   struct shm_signal *signal,
+		   unsigned long flags);
+	void (*release)(struct vbus_connection *conn);
+};
+
+struct vbus_connection {
+	atomic_t refs;
+	struct vbus_connection_ops *ops;
+};
+
+/**
+ * vbus_connection_init() - initialize a vbus_connection
+ * @conn:       connection context
+ * @ops:        ops structure to assign to context
+ *
+ **/
+static inline void vbus_connection_init(struct vbus_connection *conn,
+					struct vbus_connection_ops *ops)
+{
+	memset(conn, 0, sizeof(*conn));
+	atomic_set(&conn->refs, 1);
+	conn->ops = ops;
+}
+
+/**
+ * vbus_connection_get() - acquire a connection context reference
+ * @conn:       connection context
+ *
+ **/
+static inline void vbus_connection_get(struct vbus_connection *conn)
+{
+	atomic_inc(&conn->refs);
+}
+
+/**
+ * vbus_connection_put() - release a connection context reference
+ * @conn:       connection context
+ *
+ **/
+static inline void vbus_connection_put(struct vbus_connection *conn)
+{
+	if (atomic_dec_and_test(&conn->refs))
+		conn->ops->release(conn);
+}
+
+#endif /* _LINUX_VBUS_DEVICE_H */
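
To make the flow described in the header comment concrete, here is a minimal
sketch (not part of the patch) of a single-function backend wired into the ops
above.  The "foodev" type, its fields, and the "virtual-foo" ABI string are
hypothetical:

#include <linux/vbus_device.h>

/* data-path call()/shm()/release() handlers are elided for brevity */
static struct vbus_connection_ops foodev_connection_ops;

struct foodev {
	struct vbus_device           dev;
	struct vbus_device_interface intf;
	struct vbus_connection       conn;  /* we only allow one connection */
	bool                         connected;
	bool                         opened;
};

static int foodev_open(struct vbus_device_interface *intf,
		       struct vbus_memctx *ctx, int version,
		       struct vbus_connection **conn)
{
	struct foodev *priv = container_of(intf, struct foodev, intf);

	if (version != 0)
		return -EINVAL;          /* 0 is our assumed ABI version */

	if (priv->opened)
		return -EBUSY;           /* one driver at a time */

	vbus_connection_init(&priv->conn, &foodev_connection_ops);
	priv->opened = true;
	*conn = &priv->conn;

	return 0;
}

static struct vbus_device_interface_ops foodev_intf_ops = {
	.open = foodev_open,
};

static int foodev_bus_connect(struct vbus_device *dev, struct vbus *vbus)
{
	struct foodev *priv = container_of(dev, struct foodev, dev);
	int ret;

	if (priv->connected)
		return -EBUSY;           /* most devices join only one bus */

	priv->intf.name = "0";
	priv->intf.type = "virtual-foo"; /* ABI type, akin to a PCI-ID */
	priv->intf.ops  = &foodev_intf_ops;

	/* advertise our single interface on the bus we just joined */
	ret = vbus_device_interface_register(dev, vbus, &priv->intf);
	if (!ret)
		priv->connected = true;

	return ret;
}

static struct vbus_device_ops foodev_device_ops = {
	.bus_connect = foodev_bus_connect,
};
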
diff --git a/kernel/Makefile b/kernel/Makefile
index e4791b3..99a98a7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_VBUS) += vbus/
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/exit.c b/kernel/exit.c
index efd30cc..8736de6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/tracehook.h>
 #include <linux/init_task.h>
 #include <trace/sched.h>
+#include <linux/vbus.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -1081,6 +1082,7 @@ NORET_TYPE void do_exit(long code)
 	check_stack_usage();
 	exit_thread();
 	cgroup_exit(tsk, 1);
+	task_vbus_disassociate(tsk);
 
 	if (group_dead && tsk->signal->leader)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4854c2c..5536053 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -61,6 +61,7 @@
 #include <linux/proc_fs.h>
 #include <linux/blkdev.h>
 #include <trace/sched.h>
+#include <linux/vbus.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1274,6 +1275,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
+	fork_vbus(p);
 	return p;
 
 bad_fork_free_graph:
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
new file mode 100644
index 0000000..f2b92f5
--- /dev/null
+++ b/kernel/vbus/Kconfig
@@ -0,0 +1,14 @@
+#
+# Virtual-Bus (VBus) configuration
+#
+
+config VBUS
+       bool "Virtual Bus"
+       select CONFIGFS_FS
+       select SHM_SIGNAL
+       default n
+       help
+        Provides a mechanism for declaring virtual-bus objects and binding
+	various tasks and devices which reside on the bus.
+
+	If unsure, say N
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
new file mode 100644
index 0000000..367f65b
--- /dev/null
+++ b/kernel/vbus/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o
diff --git a/kernel/vbus/attribute.c b/kernel/vbus/attribute.c
new file mode 100644
index 0000000..3928228
--- /dev/null
+++ b/kernel/vbus/attribute.c
@@ -0,0 +1,52 @@
+#include <linux/vbus.h>
+#include <linux/uaccess.h>
+#include <linux/kobject.h>
+#include <linux/kallsyms.h>
+
+#include "vbus.h"
+
+static struct vbus_device_attribute *to_vattr(struct attribute *attr)
+{
+	return container_of(attr, struct vbus_device_attribute, attr);
+}
+
+static struct vbus_devshell *to_devshell(struct kobject *kobj)
+{
+	return container_of(kobj, struct vbus_devshell, kobj);
+}
+
+static ssize_t _dev_attr_show(struct kobject *kobj, struct attribute *attr,
+			     char *buf)
+{
+	struct vbus_devshell *ds = to_devshell(kobj);
+	struct vbus_device_attribute *vattr = to_vattr(attr);
+	ssize_t ret = -EIO;
+
+	if (vattr->show)
+		ret = vattr->show(ds->dev, vattr, buf);
+
+	if (ret >= (ssize_t)PAGE_SIZE) {
+		print_symbol("vbus_attr_show: %s returned bad count\n",
+				(unsigned long)vattr->show);
+	}
+
+	return ret;
+}
+
+static ssize_t _dev_attr_store(struct kobject *kobj, struct attribute *attr,
+			       const char *buf, size_t count)
+{
+	struct vbus_devshell *ds = to_devshell(kobj);
+	struct vbus_device_attribute *vattr = to_vattr(attr);
+	ssize_t ret = -EIO;
+
+	if (vattr->store)
+		ret = vattr->store(ds->dev, vattr, buf, count);
+
+	return ret;
+}
+
+struct sysfs_ops vbus_dev_attr_ops = {
+	.show	= _dev_attr_show,
+	.store	= _dev_attr_store,
+};
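
For illustration (not part of the patch), this dispatch lets a backend expose
per-device sysfs files by pointing vbus_device::attrs at an attribute_group of
vbus_device_attribute entries.  The "foodev" backend below is the same
hypothetical example sketched after the vbus_device.h header above:

static ssize_t foodev_opened_show(struct vbus_device *dev,
				  struct vbus_device_attribute *attr,
				  char *buf)
{
	struct foodev *priv = container_of(dev, struct foodev, dev);

	/* report whether a driver currently holds our interface open */
	return snprintf(buf, PAGE_SIZE, "%d\n", priv->opened);
}

static struct vbus_device_attribute attr_opened =
	__ATTR(opened, S_IRUGO, foodev_opened_show, NULL);

static struct attribute *foodev_dev_attrs[] = {
	&attr_opened.attr,
	NULL,
};

static struct attribute_group foodev_attr_group = {
	.attrs = foodev_dev_attrs,
};

/* assigned when the device is created, e.g. priv->dev.attrs = &foodev_attr_group */
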
diff --git a/kernel/vbus/config.c b/kernel/vbus/config.c
new file mode 100644
index 0000000..a40dbf1
--- /dev/null
+++ b/kernel/vbus/config.c
@@ -0,0 +1,275 @@
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+#include <linux/vbus.h>
+#include <linux/configfs.h>
+
+#include "vbus.h"
+
+static struct config_item_type perms_type = {
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct vbus *to_vbus(struct config_group *group)
+{
+	return group ? container_of(group, struct vbus, ci.group) : NULL;
+}
+
+static struct vbus *item_to_vbus(struct config_item *item)
+{
+	return to_vbus(to_config_group(item));
+}
+
+static struct vbus_devshell *to_devshell(struct config_group *group)
+{
+	return group ? container_of(group, struct vbus_devshell, ci_group)
+		: NULL;
+}
+
+static struct vbus_devshell *to_vbus_devshell(struct config_item *item)
+{
+	return to_devshell(to_config_group(item));
+}
+
+static int
+device_bus_connect(struct config_item *src, struct config_item *target)
+{
+	struct vbus *vbus = item_to_vbus(src);
+	struct vbus_devshell *ds;
+
+	/* We only allow connections to devices */
+	if (target->ci_parent != &vbus_root.devices.ci_group.cg_item)
+		return -EINVAL;
+
+	ds = to_vbus_devshell(target);
+	BUG_ON(!ds);
+
+	if (!ds->dev)
+		return -EINVAL;
+
+	return ds->dev->ops->bus_connect(ds->dev, vbus);
+}
+
+static int
+device_bus_disconnect(struct config_item *src, struct config_item *target)
+{
+	struct vbus *vbus = item_to_vbus(src);
+	struct vbus_devshell *ds;
+
+	ds = to_vbus_devshell(target);
+	BUG_ON(!ds);
+
+	if (!ds->dev)
+		return -EINVAL;
+
+	return ds->dev->ops->bus_disconnect(ds->dev, vbus);
+}
+
+struct configfs_item_operations bus_ops = {
+	.allow_link = device_bus_connect,
+	.drop_link = device_bus_disconnect,
+};
+
+static struct config_item_type bus_type = {
+	.ct_item_ops    = &bus_ops,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct config_group *bus_create(struct config_group *group,
+				       const char *name)
+{
+	struct vbus *bus = NULL;
+	int ret;
+
+	ret = vbus_create(name, &bus);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	config_group_init_type_name(&bus->ci.group, name, &bus_type);
+	bus->ci.group.default_groups = bus->ci.defgroups;
+	bus->ci.group.default_groups[0] = &bus->ci.perms;
+	bus->ci.group.default_groups[1] = NULL;
+
+	config_group_init_type_name(&bus->ci.perms, "perms", &perms_type);
+
+	return &bus->ci.group;
+}
+
+static void bus_destroy(struct config_group *group, struct config_item *item)
+{
+	struct vbus *vbus = item_to_vbus(item);
+
+	vbus_put(vbus);
+}
+
+static struct configfs_group_operations buses_ops = {
+	.make_group	= bus_create,
+	.drop_item      = bus_destroy,
+};
+
+static struct config_item_type buses_type = {
+	.ct_group_ops	= &buses_ops,
+	.ct_owner	= THIS_MODULE,
+};
+
+CONFIGFS_ATTR_STRUCT(vbus_devshell);
+#define DEVSHELL_ATTR(_name, _mode, _show, _store)	\
+struct vbus_devshell_attribute vbus_devshell_attr_##_name = \
+    __CONFIGFS_ATTR(_name, _mode, _show, _store)
+
+static ssize_t devshell_type_read(struct vbus_devshell *ds, char *page)
+{
+	if (ds->dev)
+		return sprintf(page, "%s\n", ds->dev->type);
+	else
+		return sprintf(page, "\n");
+}
+
+static ssize_t devshell_type_write(struct vbus_devshell *ds, const char *page,
+				   size_t count)
+{
+	struct vbus_devclass *dc;
+	struct vbus_device *dev;
+	char name[256];
+	int ret;
+
+	/*
+	 * The device-type can only be set once, and then it is permanent.
+	 * The admin should delete the device-shell if they want to create
+	 * a new type
+	 */
+	if (ds->dev)
+		return -EINVAL;
+
+	if (count > sizeof(name))
+		return -EINVAL;
+
+	strcpy(name, page);
+	if (name[count-1] == '\n')
+		name[count-1] = 0;
+
+	dc = vbus_devclass_find(name);
+	if (!dc)
+		return -ENOENT;
+
+	ret = dc->ops->create(dc, &dev);
+	if (ret < 0) {
+		vbus_devclass_put(dc);
+		return ret;
+	}
+
+	ds->dev = dev;
+	ds->dc = dc;
+	dev->kobj = &ds->kobj;
+
+	ret = vbus_devshell_type_set(ds);
+	if (ret < 0) {
+		vbus_devclass_put(dc);
+		return ret;
+	}
+
+	return count;
+}
+
+DEVSHELL_ATTR(type, S_IRUGO | S_IWUSR, devshell_type_read,
+	    devshell_type_write);
+
+static struct configfs_attribute *devshell_attrs[] = {
+	&vbus_devshell_attr_type.attr,
+	NULL,
+};
+
+CONFIGFS_ATTR_OPS(vbus_devshell);
+static struct configfs_item_operations devshell_item_ops = {
+	.show_attribute		= vbus_devshell_attr_show,
+	.store_attribute	= vbus_devshell_attr_store,
+};
+
+static struct config_item_type devshell_type = {
+	.ct_item_ops	= &devshell_item_ops,
+	.ct_attrs	= devshell_attrs,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct config_group *devshell_create(struct config_group *group,
+					    const char *name)
+{
+	struct vbus_devshell *ds = NULL;
+	int ret;
+
+	ret = vbus_devshell_create(name, &ds);
+	if (ret < 0)
+		return ERR_PTR(ret);
+
+	config_group_init_type_name(&ds->ci_group, name, &devshell_type);
+
+	return &ds->ci_group;
+}
+
+static void devshell_release(struct config_group *group,
+			     struct config_item *item)
+{
+	struct vbus_devshell *ds = to_vbus_devshell(item);
+
+	kobject_put(&ds->kobj);
+
+	if (ds->dc)
+		vbus_devclass_put(ds->dc);
+}
+
+static struct configfs_group_operations devices_ops = {
+	.make_group	= devshell_create,
+	.drop_item      = devshell_release,
+};
+
+static struct config_item_type devices_type = {
+	.ct_group_ops	= &devices_ops,
+	.ct_owner	= THIS_MODULE,
+};
+
+static struct config_item_type root_type = {
+	.ct_owner	= THIS_MODULE,
+};
+
+int __init vbus_config_init(void)
+{
+	int ret;
+	struct configfs_subsystem *subsys = &vbus_root.ci.subsys;
+
+	config_group_init_type_name(&subsys->su_group, "vbus", &root_type);
+	mutex_init(&subsys->su_mutex);
+
+	subsys->su_group.default_groups = vbus_root.ci.defgroups;
+	subsys->su_group.default_groups[0] = &vbus_root.buses.ci_group;
+	subsys->su_group.default_groups[1] = &vbus_root.devices.ci_group;
+	subsys->su_group.default_groups[2] = NULL;
+
+	config_group_init_type_name(&vbus_root.buses.ci_group,
+				    "instances", &buses_type);
+
+	config_group_init_type_name(&vbus_root.devices.ci_group,
+				    "devices", &devices_type);
+
+	ret = configfs_register_subsystem(subsys);
+	if (ret) {
+		printk(KERN_ERR "Error %d while registering subsystem %s\n",
+		       ret,
+		       subsys->su_group.cg_item.ci_namebuf);
+		goto out_unregister;
+	}
+
+	return 0;
+
+out_unregister:
+	configfs_unregister_subsystem(subsys);
+
+	return ret;
+}
+
+void __exit vbus_config_exit(void)
+{
+	configfs_unregister_subsystem(&vbus_root.ci.subsys);
+}
+
+
diff --git a/kernel/vbus/core.c b/kernel/vbus/core.c
new file mode 100644
index 0000000..033999f
--- /dev/null
+++ b/kernel/vbus/core.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/vbus.h>
+#include <linux/uaccess.h>
+
+#include "vbus.h"
+
+static struct vbus_device_interface *kobj_to_intf(struct kobject *kobj)
+{
+	return container_of(kobj, struct vbus_device_interface, kobj);
+}
+
+static struct vbus_devshell *to_devshell(struct kobject *kobj)
+{
+	return container_of(kobj, struct vbus_devshell, kobj);
+}
+
+static void interface_release(struct kobject *kobj)
+{
+	struct vbus_device_interface *intf = kobj_to_intf(kobj);
+
+	if (intf->ops->release)
+		intf->ops->release(intf);
+}
+
+static struct kobj_type interface_ktype = {
+	.release = interface_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t
+type_show(struct kobject *kobj, struct kobj_attribute *attr,
+		  char *buf)
+{
+	struct vbus_device_interface *intf = kobj_to_intf(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", intf->type);
+}
+
+static struct kobj_attribute devattr_type =
+	__ATTR_RO(type);
+
+static struct attribute *attrs[] = {
+	&devattr_type.attr,
+	NULL,
+};
+
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+/*
+ * Assumes dev->bus->lock is held
+ */
+static void _interface_unregister(struct vbus_device_interface *intf)
+{
+	struct vbus *vbus = intf->vbus;
+	struct vbus_devshell *ds = to_devshell(intf->dev->kobj);
+
+	map_del(&vbus->devices.map, &intf->node);
+	sysfs_remove_link(&ds->intfs, intf->name);
+	sysfs_remove_link(&intf->kobj, "device");
+	sysfs_remove_group(&intf->kobj, &attr_group);
+}
+
+int vbus_device_interface_register(struct vbus_device *dev,
+				   struct vbus *vbus,
+				   struct vbus_device_interface *intf)
+{
+	int ret;
+	struct vbus_devshell *ds = to_devshell(dev->kobj);
+
+	mutex_lock(&vbus->lock);
+
+	if (vbus->next_id == -1) {
+		mutex_unlock(&vbus->lock);
+		return -ENOSPC;
+	}
+
+	intf->id = vbus->next_id++;
+	intf->dev = dev;
+	intf->vbus = vbus;
+
+	ret = map_add(&vbus->devices.map, &intf->node);
+	if (ret < 0) {
+		mutex_unlock(&vbus->lock);
+		return ret;
+	}
+
+	kobject_init_and_add(&intf->kobj, &interface_ktype,
+			     &vbus->devices.kobj, "%ld", intf->id);
+
+	/* Create the basic attribute files associated with this kobject */
+	ret = sysfs_create_group(&intf->kobj, &attr_group);
+	if (ret)
+		goto error;
+
+	/* Create cross-referencing links between the device and bus */
+	ret = sysfs_create_link(&intf->kobj, dev->kobj, "device");
+	if (ret)
+		goto error;
+
+	ret = sysfs_create_link(&ds->intfs, &intf->kobj, intf->name);
+	if (ret)
+		goto error;
+
+	mutex_unlock(&vbus->lock);
+
+	return 0;
+
+error:
+	_interface_unregister(intf);
+	mutex_unlock(&vbus->lock);
+
+	kobject_put(&intf->kobj);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_device_interface_register);
+
+int vbus_device_interface_unregister(struct vbus_device_interface *intf)
+{
+	struct vbus *vbus = intf->vbus;
+
+	mutex_lock(&vbus->lock);
+	_interface_unregister(intf);
+	mutex_unlock(&vbus->lock);
+
+	kobject_put(&intf->kobj);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_device_interface_unregister);
+
+static struct vbus_device_interface *node_to_intf(struct rb_node *node)
+{
+	return node ? container_of(node, struct vbus_device_interface, node)
+		: NULL;
+}
+
+static int interface_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+	struct vbus_device_interface *lintf = node_to_intf(lhs);
+	struct vbus_device_interface *rintf = node_to_intf(rhs);
+
+	return lintf->id - rintf->id;
+}
+
+static int interface_key_compare(const void *key, struct rb_node *node)
+{
+	struct vbus_device_interface *intf = node_to_intf(node);
+	unsigned long id = *(unsigned long *)key;
+
+	return id - intf->id;
+}
+
+static struct map_ops interface_map_ops = {
+	.key_compare = &interface_key_compare,
+	.item_compare = &interface_item_compare,
+};
+
+/*
+ *-----------------
+ * member
+ *-----------------
+ */
+
+static struct vbus_member *node_to_member(struct rb_node *node)
+{
+	return node ? container_of(node, struct vbus_member, node) : NULL;
+}
+
+static struct vbus_member *kobj_to_member(struct kobject *kobj)
+{
+	return kobj ? container_of(kobj, struct vbus_member, kobj) : NULL;
+}
+
+static int member_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+	struct vbus_member *lmember = node_to_member(lhs);
+	struct vbus_member *rmember = node_to_member(rhs);
+
+	return lmember->tsk->pid - rmember->tsk->pid;
+}
+
+static int member_key_compare(const void *key, struct rb_node *node)
+{
+	struct vbus_member *member = node_to_member(node);
+	pid_t pid = *(pid_t *)key;
+
+	return pid - member->tsk->pid;
+}
+
+static struct map_ops member_map_ops = {
+	.key_compare = &member_key_compare,
+	.item_compare = &member_item_compare,
+};
+
+static void member_release(struct kobject *kobj)
+{
+	struct vbus_member *member = kobj_to_member(kobj);
+
+	vbus_put(member->vbus);
+	put_task_struct(member->tsk);
+
+	kfree(member);
+}
+
+static struct kobj_type member_ktype = {
+	.release = member_release,
+};
+
+int vbus_associate(struct vbus *vbus, struct task_struct *tsk)
+{
+	struct vbus_member *member;
+	int ret;
+
+	member = kzalloc(sizeof(struct vbus_member), GFP_KERNEL);
+	if (!member)
+		return -ENOMEM;
+
+	mutex_lock(&vbus->lock);
+
+	get_task_struct(tsk);
+	vbus_get(vbus);
+
+	member->vbus = vbus;
+	member->tsk = tsk;
+
+	ret = kobject_init_and_add(&member->kobj, &member_ktype,
+				   &vbus->members.kobj,
+				   "%d", tsk->pid);
+	if (ret < 0)
+		goto error;
+
+	ret = map_add(&vbus->members.map, &member->node);
+	if (ret < 0)
+		goto error;
+
+out:
+	mutex_unlock(&vbus->lock);
+	return ret;
+
+error:
+	kobject_put(&member->kobj);
+	goto out;
+}
+
+int vbus_disassociate(struct vbus *vbus, struct task_struct *tsk)
+{
+	struct vbus_member *member;
+
+	mutex_lock(&vbus->lock);
+
+	member = node_to_member(map_find(&vbus->members.map, &tsk->pid));
+	BUG_ON(!member);
+
+	map_del(&vbus->members.map, &member->node);
+
+	mutex_unlock(&vbus->lock);
+
+	kobject_put(&member->kobj);
+
+	return 0;
+}
+
+/*
+ *-----------------
+ * vbus_subdir
+ *-----------------
+ */
+
+static void vbus_subdir_init(struct vbus_subdir *subdir,
+			     const char *name,
+			     struct kobject *parent,
+			     struct kobj_type *type,
+			     struct map_ops *map_ops)
+{
+	int ret;
+
+	map_init(&subdir->map, map_ops);
+
+	ret = kobject_init_and_add(&subdir->kobj, type, parent, name);
+	BUG_ON(ret < 0);
+}
+
+/*
+ *-----------------
+ * vbus
+ *-----------------
+ */
+
+static void vbus_destroy(struct kobject *kobj)
+{
+	struct vbus *vbus = container_of(kobj, struct vbus, kobj);
+
+	kfree(vbus);
+}
+
+static struct kobj_type vbus_ktype = {
+	.release = vbus_destroy,
+};
+
+static struct kobj_type null_ktype = {
+};
+
+int vbus_create(const char *name, struct vbus **bus)
+{
+	struct vbus *_bus = NULL;
+	int ret;
+
+	_bus = kzalloc(sizeof(struct vbus), GFP_KERNEL);
+	if (!_bus)
+		return -ENOMEM;
+
+	atomic_set(&_bus->refs, 1);
+	mutex_init(&_bus->lock);
+
+	kobject_init_and_add(&_bus->kobj, &vbus_ktype,
+			     vbus_root.buses.kobj, name);
+
+	vbus_subdir_init(&_bus->devices, "devices", &_bus->kobj,
+			 &null_ktype, &interface_map_ops);
+	vbus_subdir_init(&_bus->members, "members", &_bus->kobj,
+			 &null_ktype, &member_map_ops);
+
+	_bus->next_id = 0;
+
+	mutex_lock(&vbus_root.lock);
+
+	ret = map_add(&vbus_root.buses.map, &_bus->node);
+	BUG_ON(ret < 0);
+
+	mutex_unlock(&vbus_root.lock);
+
+	*bus = _bus;
+
+	return 0;
+}
+
+static void devshell_release(struct kobject *kobj)
+{
+	struct vbus_devshell *ds = container_of(kobj,
+						struct vbus_devshell, kobj);
+
+	if (ds->dev) {
+		if (ds->dev->attrs)
+			sysfs_remove_group(&ds->kobj, ds->dev->attrs);
+
+		if (ds->dev->ops->release)
+			ds->dev->ops->release(ds->dev);
+	}
+
+	if (ds->dc)
+		sysfs_remove_link(&ds->kobj, "class");
+
+	kobject_put(&ds->intfs);
+	kfree(ds);
+}
+
+static struct kobj_type devshell_ktype = {
+	.release = devshell_release,
+	.sysfs_ops = &vbus_dev_attr_ops,
+};
+
+static void _interfaces_init(struct vbus_devshell *ds)
+{
+	kobject_init_and_add(&ds->intfs, &null_ktype, &ds->kobj, "interfaces");
+}
+
+int vbus_devshell_create(const char *name, struct vbus_devshell **ds)
+{
+	struct vbus_devshell *_ds = NULL;
+
+	_ds = kzalloc(sizeof(*_ds), GFP_KERNEL);
+	if (!_ds)
+		return -ENOMEM;
+
+	kobject_init_and_add(&_ds->kobj, &devshell_ktype,
+			     vbus_root.devices.kobj, name);
+
+	_interfaces_init(_ds);
+
+	*ds = _ds;
+
+	return 0;
+}
+
+int vbus_devshell_type_set(struct vbus_devshell *ds)
+{
+	int ret;
+
+	if (!ds->dev)
+		return -EINVAL;
+
+	if (!ds->dev->attrs)
+		return 0;
+
+	ret = sysfs_create_link(&ds->kobj, &ds->dc->kobj, "class");
+	if (ret < 0)
+		return ret;
+
+	return sysfs_create_group(&ds->kobj, ds->dev->attrs);
+}
+
+struct vbus *vbus_get(struct vbus *vbus)
+{
+	if (vbus)
+		atomic_inc(&vbus->refs);
+
+	return vbus;
+}
+EXPORT_SYMBOL_GPL(vbus_get);
+
+void vbus_put(struct vbus *vbus)
+{
+	if (vbus && atomic_dec_and_test(&vbus->refs)) {
+		kobject_put(&vbus->devices.kobj);
+		kobject_put(&vbus->members.kobj);
+		kobject_put(&vbus->kobj);
+	}
+}
+EXPORT_SYMBOL_GPL(vbus_put);
+
+long vbus_interface_find(struct vbus *bus,
+			 unsigned long id,
+			 struct vbus_device_interface **intf)
+{
+	struct vbus_device_interface *_intf;
+
+	BUG_ON(!bus);
+
+	mutex_lock(&bus->lock);
+
+	_intf = node_to_intf(map_find(&bus->devices.map, &id));
+	if (likely(_intf))
+		kobject_get(&_intf->kobj);
+
+	mutex_unlock(&bus->lock);
+
+	if (!_intf)
+		return -ENOENT;
+
+	*intf = _intf;
+
+	return 0;
+}
+
+const char *vbus_name(struct vbus *vbus)
+{
+	return vbus ? vbus->kobj.name : NULL;
+}
+
+/*
+ *---------------------
+ * vbus_buses
+ *---------------------
+ */
+
+static struct vbus *node_to_bus(struct rb_node *node)
+{
+	return node ? container_of(node, struct vbus, node) : NULL;
+}
+
+static int bus_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+	struct vbus *lbus = node_to_bus(lhs);
+	struct vbus *rbus = node_to_bus(rhs);
+
+	return strcmp(lbus->kobj.name, rbus->kobj.name);
+}
+
+static int bus_key_compare(const void *key, struct rb_node *node)
+{
+	struct vbus *bus = node_to_bus(node);
+
+	return strcmp(key, bus->kobj.name);
+}
+
+static struct map_ops bus_map_ops = {
+	.key_compare = &bus_key_compare,
+	.item_compare = &bus_item_compare,
+};
+
+struct vbus *vbus_find(const char *name)
+{
+	struct vbus *bus;
+
+	mutex_lock(&vbus_root.lock);
+
+	bus = node_to_bus(map_find(&vbus_root.buses.map, name));
+	if (!bus)
+		goto out;
+
+	vbus_get(bus);
+
+out:
+	mutex_unlock(&vbus_root.lock);
+
+	return bus;
+
+}
+
+struct vbus_root vbus_root;
+
+static ssize_t version_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", VBUS_VERSION);
+}
+
+static struct kobj_attribute version_attr =
+	__ATTR(version, S_IRUGO, version_show, NULL);
+
+static int __init vbus_init(void)
+{
+	int ret;
+
+	mutex_init(&vbus_root.lock);
+
+	ret = vbus_config_init();
+	BUG_ON(ret < 0);
+
+	vbus_root.kobj = kobject_create_and_add("vbus", NULL);
+	BUG_ON(!vbus_root.kobj);
+
+	ret = sysfs_create_file(vbus_root.kobj, &version_attr.attr);
+	BUG_ON(ret);
+
+	ret = vbus_devclass_init();
+	BUG_ON(ret < 0);
+
+	map_init(&vbus_root.buses.map, &bus_map_ops);
+	vbus_root.buses.kobj = kobject_create_and_add("instances",
+						      vbus_root.kobj);
+	BUG_ON(!vbus_root.buses.kobj);
+
+	vbus_root.devices.kobj = kobject_create_and_add("devices",
+							vbus_root.kobj);
+	BUG_ON(!vbus_root.devices.kobj);
+
+	return 0;
+}
+
+late_initcall(vbus_init);
+
+
diff --git a/kernel/vbus/devclass.c b/kernel/vbus/devclass.c
new file mode 100644
index 0000000..3f5ef0d
--- /dev/null
+++ b/kernel/vbus/devclass.c
@@ -0,0 +1,124 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus.h>
+
+#include "vbus.h"
+
+static struct vbus_devclass *node_to_devclass(struct rb_node *node)
+{
+	return node ? container_of(node, struct vbus_devclass, node) : NULL;
+}
+
+static int devclass_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+	struct vbus_devclass *ldc = node_to_devclass(lhs);
+	struct vbus_devclass *rdc = node_to_devclass(rhs);
+
+	return strcmp(ldc->name, rdc->name);
+}
+
+static int devclass_key_compare(const void *key, struct rb_node *node)
+{
+	struct vbus_devclass *dc = node_to_devclass(node);
+
+	return strcmp((const char *)key, dc->name);
+}
+
+static struct map_ops devclass_map_ops = {
+	.key_compare = &devclass_key_compare,
+	.item_compare = &devclass_item_compare,
+};
+
+int __init vbus_devclass_init(void)
+{
+	struct vbus_devclasses *c = &vbus_root.devclasses;
+
+	map_init(&c->map, &devclass_map_ops);
+
+	c->kobj = kobject_create_and_add("deviceclass", vbus_root.kobj);
+	BUG_ON(!c->kobj);
+
+	return 0;
+}
+
+static void devclass_release(struct kobject *kobj)
+{
+	struct vbus_devclass *dc = container_of(kobj,
+						struct vbus_devclass,
+						kobj);
+
+	if (dc->ops->release)
+		dc->ops->release(dc);
+}
+
+static struct kobj_type devclass_ktype = {
+	.release = devclass_release,
+};
+
+int vbus_devclass_register(struct vbus_devclass *dc)
+{
+	int ret;
+
+	mutex_lock(&vbus_root.lock);
+
+	ret = map_add(&vbus_root.devclasses.map, &dc->node);
+	if (ret < 0)
+		goto out;
+
+	ret = kobject_init_and_add(&dc->kobj, &devclass_ktype,
+				   vbus_root.devclasses.kobj, dc->name);
+	if (ret < 0) {
+		map_del(&vbus_root.devclasses.map, &dc->node);
+		goto out;
+	}
+
+out:
+	mutex_unlock(&vbus_root.lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_devclass_register);
+
+int vbus_devclass_unregister(struct vbus_devclass *dc)
+{
+	mutex_lock(&vbus_root.lock);
+	map_del(&vbus_root.devclasses.map, &dc->node);
+	mutex_unlock(&vbus_root.lock);
+
+	kobject_put(&dc->kobj);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_devclass_unregister);
+
+struct vbus_devclass *vbus_devclass_find(const char *name)
+{
+	struct vbus_devclass *dev;
+
+	mutex_lock(&vbus_root.lock);
+	dev = node_to_devclass(map_find(&vbus_root.devclasses.map, name));
+	if (dev)
+		dev = vbus_devclass_get(dev);
+	mutex_unlock(&vbus_root.lock);
+
+	return dev;
+}
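
A usage sketch (again not part of the patch, and again reusing the
hypothetical "foodev" backend from the earlier sketches) of how a backend
module publishes a device-class so that the admin can instantiate it via the
configfs "type" attribute:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vbus_device.h>

static int foodev_create(struct vbus_devclass *dc, struct vbus_device **dev)
{
	struct foodev *priv = kzalloc(sizeof(*priv), GFP_KERNEL);

	if (!priv)
		return -ENOMEM;

	priv->dev.type  = "virtual-foo";
	priv->dev.ops   = &foodev_device_ops;  /* from the earlier sketch */
	priv->dev.attrs = &foodev_attr_group;  /* from the earlier sketch */

	*dev = &priv->dev;

	return 0;
}

static struct vbus_devclass_ops foodev_devclass_ops = {
	.create = foodev_create,
};

static struct vbus_devclass foodev_devclass = {
	.name  = "virtual-foo",
	.ops   = &foodev_devclass_ops,
	.owner = THIS_MODULE,
};

static int __init foodev_module_init(void)
{
	return vbus_devclass_register(&foodev_devclass);
}
module_init(foodev_module_init);

static void __exit foodev_module_exit(void)
{
	vbus_devclass_unregister(&foodev_devclass);
}
module_exit(foodev_module_exit);
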
diff --git a/kernel/vbus/map.c b/kernel/vbus/map.c
new file mode 100644
index 0000000..a3bd841
--- /dev/null
+++ b/kernel/vbus/map.c
@@ -0,0 +1,72 @@
+
+#include <linux/errno.h>
+
+#include "map.h"
+
+void map_init(struct map *map, struct map_ops *ops)
+{
+	map->root = RB_ROOT;
+	map->ops = ops;
+}
+
+int map_add(struct map *map, struct rb_node *node)
+{
+	int		ret = 0;
+	struct rb_root *root;
+	struct rb_node **new, *parent = NULL;
+
+	root = &map->root;
+	new  = &(root->rb_node);
+
+	/* Figure out where to put new node */
+	while (*new) {
+		int val;
+
+		parent = *new;
+
+		val = map->ops->item_compare(node, *new);
+		if (val < 0)
+			new = &((*new)->rb_left);
+		else if (val > 0)
+			new = &((*new)->rb_right);
+		else {
+			ret = -EEXIST;
+			break;
+		}
+	}
+
+	if (!ret) {
+		/* Add new node and rebalance tree. */
+		rb_link_node(node, parent, new);
+		rb_insert_color(node, root);
+	}
+
+	return ret;
+}
+
+struct rb_node *map_find(struct map *map, const void *key)
+{
+	struct rb_node *node;
+
+	node = map->root.rb_node;
+
+	while (node) {
+		int val;
+
+		val = map->ops->key_compare(key, node);
+		if (val < 0)
+			node = node->rb_left;
+		else if (val > 0)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	return node;
+}
+
+void map_del(struct map *map, struct rb_node *node)
+{
+	rb_erase(node, &map->root);
+}
+
diff --git a/kernel/vbus/map.h b/kernel/vbus/map.h
new file mode 100644
index 0000000..7fb5164
--- /dev/null
+++ b/kernel/vbus/map.h
@@ -0,0 +1,41 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __VBUS_MAP_H__
+#define __VBUS_MAP_H__
+
+#include <linux/rbtree.h>
+
+struct map_ops {
+	int (*item_compare)(struct rb_node *lhs, struct rb_node *rhs);
+	int (*key_compare)(const void *key, struct rb_node *item);
+};
+
+struct map {
+	struct rb_root root;
+	struct map_ops *ops;
+};
+
+void map_init(struct map *map, struct map_ops *ops);
+int map_add(struct map *map, struct rb_node *node);
+struct rb_node *map_find(struct map *map, const void *key);
+void map_del(struct map *map, struct rb_node *node);
+
+#endif /* __VBUS_MAP_H__ */
diff --git a/kernel/vbus/vbus.h b/kernel/vbus/vbus.h
new file mode 100644
index 0000000..1266d69
--- /dev/null
+++ b/kernel/vbus/vbus.h
@@ -0,0 +1,116 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef __VBUS_H__
+#define __VBUS_H__
+
+#include <linux/configfs.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/kobject.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+
+#include "map.h"
+
+#define VBUS_VERSION 1
+
+struct vbus_subdir {
+	struct map     map;
+	struct kobject kobj;
+};
+
+struct vbus {
+	struct {
+		struct config_group group;
+		struct config_group perms;
+		struct config_group *defgroups[2];
+	} ci;
+
+	atomic_t refs;
+	struct mutex lock;
+	struct kobject kobj;
+	struct vbus_subdir devices;
+	struct vbus_subdir members;
+	unsigned long next_id;
+	struct rb_node node;
+};
+
+struct vbus_member {
+	struct rb_node      node;
+	struct task_struct *tsk;
+	struct vbus        *vbus;
+	struct kobject      kobj;
+};
+
+struct vbus_devclasses {
+	struct kobject *kobj;
+	struct map map;
+};
+
+struct vbus_buses {
+	struct config_group ci_group;
+	struct map map;
+	struct kobject *kobj;
+};
+
+struct vbus_devshell {
+	struct config_group ci_group;
+	struct vbus_device *dev;
+	struct vbus_devclass *dc;
+	struct kobject kobj;
+	struct kobject intfs;
+};
+
+struct vbus_devices {
+	struct config_group ci_group;
+	struct kobject *kobj;
+};
+
+struct vbus_root {
+	struct {
+		struct configfs_subsystem subsys;
+		struct config_group      *defgroups[3];
+	} ci;
+
+	struct mutex            lock;
+	struct kobject         *kobj;
+	struct vbus_devclasses  devclasses;
+	struct vbus_buses       buses;
+	struct vbus_devices     devices;
+};
+
+extern struct vbus_root vbus_root;
+extern struct sysfs_ops vbus_dev_attr_ops;
+
+int vbus_config_init(void);
+int vbus_devclass_init(void);
+
+int vbus_create(const char *name, struct vbus **bus);
+
+int vbus_devshell_create(const char *name, struct vbus_devshell **ds);
+struct vbus_devclass *vbus_devclass_find(const char *name);
+int vbus_devshell_type_set(struct vbus_devshell *ds);
+
+long vbus_interface_find(struct vbus *vbus,
+			 unsigned long id,
+			 struct vbus_device_interface **intf);
+
+#endif /* __VBUS_H__ */



* [RFC PATCH 03/17] vbus: add connection-client helper infrastructure
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
  2009-03-31 18:42 ` [RFC PATCH 01/17] shm-signal: shared-memory signals Gregory Haskins
  2009-03-31 18:42 ` [RFC PATCH 02/17] vbus: add virtual-bus definitions Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 04/17] vbus: add bus-registration notifiers Gregory Haskins
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

We expect to have various types of connection-clients (e.g. userspace,
kvm, etc), each of which is likely to have common access patterns and
marshalling duties.  Therefore we create a "client" API to simplify
client development by helping with mundane tasks such as handle-to-pointer
translation.

Special thanks to Pat Mullaney for suggesting the optimization to pass
a cookie object down during DEVICESHM operations to save lookup overhead
on the event channel.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_client.h |  115 +++++++++
 kernel/vbus/Makefile        |    2 
 kernel/vbus/client.c        |  527 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 643 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/vbus_client.h
 create mode 100644 kernel/vbus/client.c

diff --git a/include/linux/vbus_client.h b/include/linux/vbus_client.h
new file mode 100644
index 0000000..62dab78
--- /dev/null
+++ b/include/linux/vbus_client.h
@@ -0,0 +1,115 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Virtual-Bus - Client interface
+ *
+ * We expect to have various types of connection-clients (e.g. userspace,
+ * kvm, etc).  Each client will be connecting from some environment outside
+ * of the kernel, and therefore will not have direct access to the API as
+ * presented in ./linux/vbus.h.  There will undoubtedly be some parameter
+ * marshalling that must occur, as well as common patterns for the handling
+ * of those marshalled parameters (e.g. translating a handle into a pointer,
+ * etc).
+ *
+ * Therefore this "client" API is provided to simplify the development
+ * of any clients.  Of course, a client is free to bypass this API entirely
+ * and communicate with the direct VBUS API if desired.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_CLIENT_H
+#define _LINUX_VBUS_CLIENT_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+struct vbus_deviceopen {
+	__u32 devid;
+	__u32 version; /* device ABI version */
+	__u64 handle; /* return value for devh */
+};
+
+struct vbus_devicecall {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 func;
+	__u32 len;
+	__u32 flags;
+	__u64 datap;
+};
+
+struct vbus_deviceshm {
+	__u64 devh;   /* device-handle (returned from DEVICEOPEN) */
+	__u32 id;
+	__u32 len;
+	__u32 flags;
+	struct {
+		__u32 offset;
+		__u32 prio;
+		__u64 cookie; /* token to pass back when signaling client */
+	} signal;
+	__u64 datap;
+	__u64 handle; /* return value for signaling from client to kernel */
+};
+
+#ifdef __KERNEL__
+
+#include <linux/ioq.h>
+#include <linux/module.h>
+#include <asm/atomic.h>
+
+struct vbus_client;
+
+struct vbus_client_ops {
+	int (*deviceopen)(struct vbus_client *client,  struct vbus_memctx *ctx,
+			  __u32 devid, __u32 version, __u64 *devh);
+	int (*deviceclose)(struct vbus_client *client, __u64 devh);
+	int (*devicecall)(struct vbus_client *client,
+			  __u64 devh, __u32 func,
+			  void *data, __u32 len, __u32 flags);
+	int (*deviceshm)(struct vbus_client *client,
+			 __u64 devh, __u32 id,
+			 struct vbus_shm *shm, struct shm_signal *signal,
+			 __u32 flags, __u64 *handle);
+	int (*shmsignal)(struct vbus_client *client, __u64 handle);
+	void (*release)(struct vbus_client *client);
+};
+
+struct vbus_client {
+	atomic_t refs;
+	struct vbus_client_ops *ops;
+};
+
+static inline void vbus_client_get(struct vbus_client *client)
+{
+	atomic_inc(&client->refs);
+}
+
+static inline void vbus_client_put(struct vbus_client *client)
+{
+	if (atomic_dec_and_test(&client->refs))
+		client->ops->release(client);
+}
+
+struct vbus_client *vbus_client_attach(struct vbus *bus);
+
+extern struct vbus_memctx *current_memctx;
+struct vbus_memctx *task_memctx_alloc(struct task_struct *task);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_VBUS_CLIENT_H */
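
As a rough sketch of how a connector might drive this client API (not part of
the patch; the device-id, function number, and caller-supplied pointer are
hypothetical, and how the device interprets the data is defined by its own ABI):

#include <linux/vbus.h>
#include <linux/vbus_client.h>

static int example_issue_call(struct vbus *vbus, void *data, __u32 len)
{
	struct vbus_client *client;
	__u64 devh;
	int ret;

	client = vbus_client_attach(vbus);    /* takes its own bus reference */
	if (!client)
		return -ENOMEM;

	/* open hypothetical device-id 1, speaking ABI version 0 */
	ret = client->ops->deviceopen(client, current_memctx, 1, 0, &devh);
	if (ret < 0)
		goto out;

	/* issue hypothetical function 0 against the new connection */
	ret = client->ops->devicecall(client, devh, 0, data, len, 0);

	client->ops->deviceclose(client, devh);
out:
	vbus_client_put(client);

	return ret;
}
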
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 367f65b..4d440e5 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1 +1 @@
-obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o
+obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
diff --git a/kernel/vbus/client.c b/kernel/vbus/client.c
new file mode 100644
index 0000000..ff8d6df
--- /dev/null
+++ b/kernel/vbus/client.c
@@ -0,0 +1,527 @@
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/uaccess.h>
+#include <linux/vbus.h>
+#include <linux/vbus_client.h>
+#include "vbus.h"
+
+static int
+nodeptr_item_compare(struct rb_node *lhs, struct rb_node *rhs)
+{
+	unsigned long l = (unsigned long)lhs;
+	unsigned long r = (unsigned long)rhs;
+
+	return l - r;
+}
+
+static int
+nodeptr_key_compare(const void *key, struct rb_node *node)
+{
+	unsigned long item = (unsigned long)node;
+	unsigned long _key = *(unsigned long *)key;
+
+	return _key - item;
+}
+
+static struct map_ops nodeptr_map_ops = {
+	.key_compare = &nodeptr_key_compare,
+	.item_compare = &nodeptr_item_compare,
+};
+
+struct _signal {
+	atomic_t refs;
+	struct rb_node node;
+	struct list_head list;
+	struct shm_signal *signal;
+};
+
+struct _connection {
+	atomic_t refs;
+	struct rb_node node;
+	struct list_head signals;
+	struct vbus_connection *conn;
+};
+
+static inline void _signal_get(struct _signal *_signal)
+{
+	atomic_inc(&_signal->refs);
+}
+
+static inline void _signal_put(struct _signal *_signal)
+{
+	if (atomic_dec_and_test(&_signal->refs)) {
+		shm_signal_put(_signal->signal);
+		kfree(_signal);
+	}
+}
+
+static inline void conn_get(struct _connection *_conn)
+{
+	atomic_inc(&_conn->refs);
+}
+
+static inline void conn_put(struct _connection *_conn)
+{
+	if (atomic_dec_and_test(&_conn->refs)) {
+		struct _signal *_signal, *tmp;
+
+		list_for_each_entry_safe(_signal, tmp, &_conn->signals,
+					 list) {
+			list_del(&_signal->list);
+			_signal_put(_signal);
+		}
+
+		vbus_connection_put(_conn->conn);
+		kfree(_conn);
+	}
+}
+
+struct _client {
+	struct mutex lock;
+	struct map conn_map;
+	struct map signal_map;
+	struct vbus *vbus;
+	struct vbus_client client;
+};
+
+struct _connection *to_conn(struct rb_node *node)
+{
+	return node ? container_of(node, struct _connection, node) : NULL;
+}
+
+static struct _signal *to_signal(struct rb_node *node)
+{
+	return node ? container_of(node, struct _signal, node) : NULL;
+}
+
+static struct _client *to_client(struct vbus_client *client)
+{
+	return container_of(client, struct _client, client);
+}
+
+static struct _connection *
+connection_find(struct _client *c, unsigned long devid)
+{
+	struct _connection *_conn;
+
+	/*
+	 * We could, in theory, cast devid to _conn->node, but this would
+	 * be pretty stupid to trust.  Therefore, we must validate that
+	 * the pointer is legit by seeing if it exists in our conn_map
+	 */
+
+	mutex_lock(&c->lock);
+
+	_conn = to_conn(map_find(&c->conn_map, &devid));
+	if (likely(_conn))
+		conn_get(_conn);
+
+	mutex_unlock(&c->lock);
+
+	return _conn;
+}
+
+static int
+_deviceopen(struct vbus_client *client, struct vbus_memctx *ctx,
+	    __u32 devid, __u32 version, __u64 *devh)
+{
+	struct _client *c = to_client(client);
+	struct vbus_connection *conn;
+	struct _connection *_conn;
+	struct vbus_device_interface *intf = NULL;
+	int ret;
+
+	/*
+	 * We only get here if the device has never been opened before,
+	 * so we need to create a new connection
+	 */
+	ret = vbus_interface_find(c->vbus, devid, &intf);
+	if (ret < 0)
+		return ret;
+
+	ret = intf->ops->open(intf, ctx, version, &conn);
+	kobject_put(&intf->kobj);
+	if (ret < 0)
+		return ret;
+
+	_conn = kzalloc(sizeof(*_conn), GFP_KERNEL);
+	if (!_conn) {
+		vbus_connection_put(conn);
+		return -ENOMEM;
+	}
+
+	atomic_set(&_conn->refs, 1);
+	_conn->conn = conn;
+
+	INIT_LIST_HEAD(&_conn->signals);
+
+	mutex_lock(&c->lock);
+	ret = map_add(&c->conn_map, &_conn->node);
+	mutex_unlock(&c->lock);
+
+	if (ret < 0) {
+		conn_put(_conn);
+		return ret;
+	}
+
+	/* in theory, &_conn->node should be unique */
+	*devh = (__u64)&_conn->node;
+
+	return 0;
+
+}
+
+/*
+ * Assumes client->lock is held (or we are releasing and don't need to lock)
+ */
+static void
+conn_del(struct _client *c, struct _connection *_conn)
+{
+	struct _signal *_signal, *tmp;
+
+	/* Delete and release each opened queue */
+	list_for_each_entry_safe(_signal, tmp, &_conn->signals, list) {
+		map_del(&c->signal_map, &_signal->node);
+		_signal_put(_signal);
+	}
+
+	map_del(&c->conn_map, &_conn->node);
+}
+
+static int
+_deviceclose(struct vbus_client *client, __u64 devh)
+{
+	struct _client *c = to_client(client);
+	struct _connection *_conn;
+
+	mutex_lock(&c->lock);
+
+	_conn = to_conn(map_find(&c->conn_map, &devh));
+	if (likely(_conn))
+		conn_del(c, _conn);
+
+	mutex_unlock(&c->lock);
+
+	if (unlikely(!_conn))
+		return -ENOENT;
+
+	/* this _put is the complement to the _get performed at _deviceopen */
+	conn_put(_conn);
+
+	return 0;
+}
+
+static int
+_devicecall(struct vbus_client *client,
+	    __u64 devh, __u32 func, void *data, __u32 len, __u32 flags)
+{
+	struct _client *c = to_client(client);
+	struct _connection *_conn;
+	struct vbus_connection *conn;
+	int ret;
+
+	_conn = connection_find(c, devh);
+	if (!_conn)
+		return -ENOENT;
+
+	conn = _conn->conn;
+
+	ret = conn->ops->call(conn, func, data, len, flags);
+
+	conn_put(_conn);
+
+	return ret;
+}
+
+static int
+_deviceshm(struct vbus_client *client,
+	   __u64 devh,
+	   __u32 id,
+	   struct vbus_shm *shm,
+	   struct shm_signal *signal,
+	   __u32 flags,
+	   __u64 *handle)
+{
+	struct _client *c = to_client(client);
+	struct _signal *_signal = NULL;
+	struct _connection *_conn;
+	struct vbus_connection *conn;
+	int ret;
+
+	*handle = 0;
+
+	_conn = connection_find(c, devh);
+	if (!_conn)
+		return -ENOENT;
+
+	conn = _conn->conn;
+
+	ret = conn->ops->shm(conn, id, shm, signal, flags);
+	if (ret < 0) {
+		conn_put(_conn);
+		return ret;
+	}
+
+	if (signal) {
+		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+		if (!_signal) {
+			conn_put(_conn);
+			return -ENOMEM;
+		}
+
+		 /* one for map-ref, one for list-ref */
+		atomic_set(&_signal->refs, 2);
+		_signal->signal = signal;
+		shm_signal_get(signal);
+
+		mutex_lock(&c->lock);
+		ret = map_add(&c->signal_map, &_signal->node);
+		list_add_tail(&_signal->list, &_conn->signals);
+		mutex_unlock(&c->lock);
+
+		if (!ret)
+			*handle = (__u64)&_signal->node;
+	}
+
+	conn_put(_conn);
+
+	return 0;
+}
+
+static int
+_shmsignal(struct vbus_client *client, __u64 handle)
+{
+	struct _client *c = to_client(client);
+	struct _signal *_signal;
+
+	mutex_lock(&c->lock);
+
+	_signal = to_signal(map_find(&c->signal_map, &handle));
+	if (likely(_signal))
+		_signal_get(_signal);
+
+	mutex_unlock(&c->lock);
+
+	if (!_signal)
+		return -ENOENT;
+
+	_shm_signal_wakeup(_signal->signal);
+
+	_signal_put(_signal);
+
+	return 0;
+}
+
+static void
+_release(struct vbus_client *client)
+{
+	struct _client *c = to_client(client);
+	struct rb_node *node;
+
+	/* Drop all of our open connections */
+	while ((node = rb_first(&c->conn_map.root))) {
+		struct _connection *_conn = to_conn(node);
+
+		conn_del(c, _conn);
+		conn_put(_conn);
+	}
+
+	vbus_put(c->vbus);
+	kfree(c);
+}
+
+struct vbus_client_ops _client_ops = {
+	.deviceopen  = _deviceopen,
+	.deviceclose = _deviceclose,
+	.devicecall  = _devicecall,
+	.deviceshm   = _deviceshm,
+	.shmsignal   = _shmsignal,
+	.release     = _release,
+};
+
+struct vbus_client *vbus_client_attach(struct vbus *vbus)
+{
+	struct _client *c;
+
+	BUG_ON(!vbus);
+
+	c = kzalloc(sizeof(*c), GFP_KERNEL);
+	if (!c)
+		return NULL;
+
+	atomic_set(&c->client.refs, 1);
+	c->client.ops = &_client_ops;
+
+	mutex_init(&c->lock);
+	map_init(&c->conn_map, &nodeptr_map_ops);
+	map_init(&c->signal_map, &nodeptr_map_ops);
+	c->vbus = vbus_get(vbus);
+
+	return &c->client;
+}
+EXPORT_SYMBOL_GPL(vbus_client_attach);
+
+/*
+ * memory context helpers
+ */
+
+static unsigned long
+current_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+		       unsigned long len)
+{
+	return copy_to_user(dst, src, len);
+}
+
+static unsigned long
+current_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+			 unsigned long len)
+{
+	return copy_from_user(dst, src, len);
+}
+
+static void
+current_memctx_release(struct vbus_memctx *ctx)
+{
+	panic("dropped last reference to current_memctx");
+}
+
+static struct vbus_memctx_ops current_memctx_ops = {
+	.copy_to   = &current_memctx_copy_to,
+	.copy_from = &current_memctx_copy_from,
+	.release   = &current_memctx_release,
+};
+
+static struct vbus_memctx _current_memctx =
+	VBUS_MEMCTX_INIT((&current_memctx_ops));
+
+struct vbus_memctx *current_memctx = &_current_memctx;
+
+/*
+ * task_memctx gives you a copy_from_user()/copy_to_user()-like
+ * environment, except that it supports copying to and from tasks other
+ * than "current", which is all that ctu/cfu() can address
+ */
+struct task_memctx {
+	struct task_struct *task;
+	struct vbus_memctx ctx;
+};
+
+static struct task_memctx *to_task_memctx(struct vbus_memctx *ctx)
+{
+	return container_of(ctx, struct task_memctx, ctx);
+}
+
+static unsigned long
+task_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+		    unsigned long n)
+{
+	struct task_memctx *tm = to_task_memctx(ctx);
+	struct task_struct *p = tm->task;
+
+	while (n) {
+		unsigned long offset = ((unsigned long)dst)%PAGE_SIZE;
+		unsigned long len = PAGE_SIZE - offset;
+		int ret;
+		struct page *pg;
+		void *maddr;
+
+		if (len > n)
+			len = n;
+
+		down_read(&p->mm->mmap_sem);
+		ret = get_user_pages(p, p->mm,
+				     (unsigned long)dst, 1, 1, 0, &pg, NULL);
+
+		if (ret != 1) {
+			up_read(&p->mm->mmap_sem);
+			break;
+		}
+
+		maddr = kmap_atomic(pg, KM_USER0);
+		memcpy(maddr + offset, src, len);
+		kunmap_atomic(maddr, KM_USER0);
+		set_page_dirty_lock(pg);
+		put_page(pg);
+		up_read(&p->mm->mmap_sem);
+
+		src += len;
+		dst += len;
+		n -= len;
+	}
+
+	return n;
+}
+
+static unsigned long
+task_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+		      unsigned long n)
+{
+	struct task_memctx *tm = to_task_memctx(ctx);
+	struct task_struct *p = tm->task;
+
+	while (n) {
+		unsigned long offset = ((unsigned long)src)%PAGE_SIZE;
+		unsigned long len = PAGE_SIZE - offset;
+		int ret;
+		struct page *pg;
+		void *maddr;
+
+		if (len > n)
+			len = n;
+
+		down_read(&p->mm->mmap_sem);
+		ret = get_user_pages(p, p->mm,
+				     (unsigned long)src, 1, 1, 0, &pg, NULL);
+
+		if (ret != 1) {
+			up_read(&p->mm->mmap_sem);
+			break;
+		}
+
+		maddr = kmap_atomic(pg, KM_USER0);
+		memcpy(dst, maddr + offset, len);
+		kunmap_atomic(maddr, KM_USER0);
+		put_page(pg);
+		up_read(&p->mm->mmap_sem);
+
+		src += len;
+		dst += len;
+		n -= len;
+	}
+
+	return n;
+}
+
+static void
+task_memctx_release(struct vbus_memctx *ctx)
+{
+	struct task_memctx *tm = to_task_memctx(ctx);
+
+	put_task_struct(tm->task);
+	kfree(tm);
+}
+
+static struct vbus_memctx_ops task_memctx_ops = {
+	.copy_to   = &task_memctx_copy_to,
+	.copy_from = &task_memctx_copy_from,
+	.release   = &task_memctx_release,
+};
+
+struct vbus_memctx *task_memctx_alloc(struct task_struct *task)
+{
+	struct task_memctx *tm;
+
+	tm = kzalloc(sizeof(*tm), GFP_KERNEL);
+	if (!tm)
+		return NULL;
+
+	get_task_struct(task);
+
+	tm->task = task;
+	vbus_memctx_init(&tm->ctx, &task_memctx_ops);
+
+	return &tm->ctx;
+}
+EXPORT_SYMBOL_GPL(task_memctx_alloc);
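
A small usage sketch (not part of the patch): a connector servicing requests
outside the context of the owning task can capture that task's memory context
once and copy into its address space later.  "target_task", "dst", and the
payload are hypothetical:

#include <linux/sched.h>
#include <linux/vbus_device.h>
#include <linux/vbus_client.h>

static int example_push_to_task(struct task_struct *target_task,
				void *dst, /* user address in target_task */
				const void *payload, unsigned long len)
{
	struct vbus_memctx *ctx;
	unsigned long left;

	ctx = task_memctx_alloc(target_task);  /* pins a task reference */
	if (!ctx)
		return -ENOMEM;

	/* copy into target_task's address space, like copy_to_user() does
	 * for "current"; returns the number of bytes NOT copied */
	left = ctx->ops->copy_to(ctx, dst, payload, len);

	vbus_memctx_put(ctx);                  /* releases the task again */

	return left ? -EFAULT : 0;
}
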



* [RFC PATCH 04/17] vbus: add bus-registration notifiers
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 03/17] vbus: add connection-client helper infrastructure Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 05/17] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

We need to get hotswap events in environments which cannot use existing
facilities (e.g. inotify).  So we add a notifier-chain to allow client
callbacks whenever an interface is {un}registered.
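
A minimal sketch (not part of this patch) of how a client might consume these
notifications; the callback and the printk messages are purely illustrative:

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/vbus.h>

static int example_vbus_event(struct notifier_block *nb,
			      unsigned long action, void *data)
{
	switch (action) {
	case VBUS_EVENT_DEVADD: {
		struct vbus_event_devadd *ev = data;

		printk(KERN_INFO "vbus: new device %lu (type %s)\n",
		       ev->id, ev->type);
		break;
	}
	case VBUS_EVENT_DEVDROP: {
		unsigned long id = *(unsigned long *)data;

		printk(KERN_INFO "vbus: device %lu removed\n", id);
		break;
	}
	}

	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call = example_vbus_event,
};

/* existing devices are replayed as DEVADD events during registration:
 *	ret = vbus_notifier_register(vbus, &example_nb);
 */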

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus.h |   15 +++++++++++++
 kernel/vbus/core.c   |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/vbus/vbus.h   |    1 +
 3 files changed, 75 insertions(+), 0 deletions(-)

diff --git a/include/linux/vbus.h b/include/linux/vbus.h
index 5f0566c..04db4ff 100644
--- a/include/linux/vbus.h
+++ b/include/linux/vbus.h
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/rcupdate.h>
 #include <linux/vbus_device.h>
+#include <linux/notifier.h>
 
 struct vbus;
 struct task_struct;
@@ -137,6 +138,20 @@ static inline void task_vbus_disassociate(struct task_struct *p)
 	}
 }
 
+enum {
+	VBUS_EVENT_DEVADD,
+	VBUS_EVENT_DEVDROP,
+};
+
+struct vbus_event_devadd {
+	const char   *type;
+	unsigned long id;
+};
+
+int vbus_notifier_register(struct vbus *vbus, struct notifier_block *nb);
+int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb);
+
+
 #else /* CONFIG_VBUS */
 
 #define fork_vbus(p) do { } while (0)
diff --git a/kernel/vbus/core.c b/kernel/vbus/core.c
index 033999f..b6df487 100644
--- a/kernel/vbus/core.c
+++ b/kernel/vbus/core.c
@@ -89,6 +89,7 @@ int vbus_device_interface_register(struct vbus_device *dev,
 {
 	int ret;
 	struct vbus_devshell *ds = to_devshell(dev->kobj);
+	struct vbus_event_devadd ev;
 
 	mutex_lock(&vbus->lock);
 
@@ -124,6 +125,14 @@ int vbus_device_interface_register(struct vbus_device *dev,
 	if (ret)
 		goto error;
 
+	ev.type = intf->type;
+	ev.id   = intf->id;
+
+	/* and let any clients know about the new device */
+	ret = raw_notifier_call_chain(&vbus->notifier, VBUS_EVENT_DEVADD, &ev);
+	if (ret < 0)
+		goto error;
+
 	mutex_unlock(&vbus->lock);
 
 	return 0;
@@ -144,6 +153,7 @@ int vbus_device_interface_unregister(struct vbus_device_interface *intf)
 
 	mutex_lock(&vbus->lock);
 	_interface_unregister(intf);
+	raw_notifier_call_chain(&vbus->notifier, VBUS_EVENT_DEVDROP, &intf->id);
 	mutex_unlock(&vbus->lock);
 
 	kobject_put(&intf->kobj);
@@ -346,6 +356,8 @@ int vbus_create(const char *name, struct vbus **bus)
 
 	_bus->next_id = 0;
 
+	RAW_INIT_NOTIFIER_HEAD(&_bus->notifier);
+
 	mutex_lock(&vbus_root.lock);
 
 	ret = map_add(&vbus_root.buses.map, &_bus->node);
@@ -358,6 +370,53 @@ int vbus_create(const char *name, struct vbus **bus)
 	return 0;
 }
 
+#define for_each_rbnode(node, root) \
+	for (node = rb_first(root); node != NULL; node = rb_next(node))
+
+int vbus_notifier_register(struct vbus *vbus, struct notifier_block *nb)
+{
+	int ret;
+	struct rb_node *node;
+
+	mutex_lock(&vbus->lock);
+
+	/*
+	 * resync the client for any devices we might already have
+	 */
+	for_each_rbnode(node, &vbus->devices.map.root) {
+		struct vbus_device_interface *intf = node_to_intf(node);
+		struct vbus_event_devadd ev = {
+			.type = intf->type,
+			.id   = intf->id,
+		};
+
+		ret = nb->notifier_call(nb, VBUS_EVENT_DEVADD, &ev);
+		if (ret & NOTIFY_STOP_MASK) {
+			mutex_unlock(&vbus->lock);
+			return -EPERM;
+		}
+	}
+
+	ret = raw_notifier_chain_register(&vbus->notifier, nb);
+
+	mutex_unlock(&vbus->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_notifier_register);
+
+int vbus_notifier_unregister(struct vbus *vbus, struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&vbus->lock);
+	ret = raw_notifier_chain_unregister(&vbus->notifier, nb);
+	mutex_unlock(&vbus->lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_notifier_unregister);
+
 static void devshell_release(struct kobject *kobj)
 {
 	struct vbus_devshell *ds = container_of(kobj,
diff --git a/kernel/vbus/vbus.h b/kernel/vbus/vbus.h
index 1266d69..cd2676b 100644
--- a/kernel/vbus/vbus.h
+++ b/kernel/vbus/vbus.h
@@ -51,6 +51,7 @@ struct vbus {
 	struct vbus_subdir members;
 	unsigned long next_id;
 	struct rb_node node;
+	struct raw_notifier_head notifier;
 };
 
 struct vbus_member {


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 05/17] vbus: add a "vbus-proxy" bus model for vbus_driver objects
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (3 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 04/17] vbus: add bus-registration notifiers Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 06/17] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This will generally be used by hypervisors to publish any host-side
virtual devices up to a guest.  The guest will then have the opportunity
to consume any devices present on the vbus-proxy as if they were
platform devices, similar to existing buses like PCI.
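
For illustration (sketch only, with includes and error handling trimmed), a
guest-side driver for the "virtual-ethernet" device type would bind to this
bus roughly as follows:

static int my_probe(struct vbus_device_proxy *dev)
{
	/* dev->ops->open()/shm()/call() are now usable by the driver */
	return dev->ops->open(dev, 1 /* per-device ABI version */, 0);
}

static int my_remove(struct vbus_device_proxy *dev)
{
	return dev->ops->close(dev, 0);
}

static struct vbus_driver_ops my_ops = {
	.probe  = my_probe,
	.remove = my_remove,
};

static struct vbus_driver my_drv = {
	.type  = "virtual-ethernet",
	.owner = THIS_MODULE,
	.ops   = &my_ops,
};

static int __init my_init(void)
{
	return vbus_driver_register(&my_drv);
}

static void __exit my_exit(void)
{
	vbus_driver_unregister(&my_drv);
}

module_init(my_init);
module_exit(my_exit);

The bus core matches drivers to devices by comparing the ->type strings, so
my_probe() will fire for any registered proxy device of that type.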

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_driver.h |   73 +++++++++++++++++++++
 kernel/vbus/Kconfig         |    9 +++
 kernel/vbus/Makefile        |    4 +
 kernel/vbus/proxy.c         |  152 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 238 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/vbus_driver.h
 create mode 100644 kernel/vbus/proxy.c

diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
new file mode 100644
index 0000000..c53e13f
--- /dev/null
+++ b/include/linux/vbus_driver.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Mediates access to a host VBUS from a guest kernel by providing a
+ * global view of all VBUS devices
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VBUS_DRIVER_H
+#define _LINUX_VBUS_DRIVER_H
+
+#include <linux/device.h>
+#include <linux/shm_signal.h>
+
+struct vbus_device_proxy;
+struct vbus_driver;
+
+struct vbus_device_proxy_ops {
+	int (*open)(struct vbus_device_proxy *dev, int version, int flags);
+	int (*close)(struct vbus_device_proxy *dev, int flags);
+	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
+		   void *ptr, size_t len,
+		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
+		   int flags);
+	int (*call)(struct vbus_device_proxy *dev, u32 func,
+		    void *data, size_t len, int flags);
+	void (*release)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_device_proxy {
+	char                          *type;
+	u64                            id;
+	void                          *priv; /* Used by drivers */
+	struct vbus_device_proxy_ops  *ops;
+	struct device                  dev;
+};
+
+int vbus_device_proxy_register(struct vbus_device_proxy *dev);
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev);
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id);
+
+struct vbus_driver_ops {
+	int (*probe)(struct vbus_device_proxy *dev);
+	int (*remove)(struct vbus_device_proxy *dev);
+};
+
+struct vbus_driver {
+	char                          *type;
+	struct module                 *owner;
+	struct vbus_driver_ops        *ops;
+	struct device_driver           drv;
+};
+
+int vbus_driver_register(struct vbus_driver *drv);
+void vbus_driver_unregister(struct vbus_driver *drv);
+
+#endif /* _LINUX_VBUS_DRIVER_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index f2b92f5..3aaa085 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -12,3 +12,12 @@ config VBUS
 	various tasks and devices which reside on the bus.
 
 	If unsure, say N
+
+config VBUS_DRIVERS
+       tristate "VBUS Driver support"
+       default n
+       help
+        Adds support for a virtual bus model for proxying drivers.
+
+	If unsure, say N
+
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index 4d440e5..d028ece 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1 +1,5 @@
 obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
+
+vbus-proxy-objs += proxy.o
+obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
+
diff --git a/kernel/vbus/proxy.c b/kernel/vbus/proxy.c
new file mode 100644
index 0000000..ea48f00
--- /dev/null
+++ b/kernel/vbus/proxy.c
@@ -0,0 +1,152 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#define VBUS_PROXY_NAME "vbus-proxy"
+
+static struct vbus_device_proxy *to_dev(struct device *_dev)
+{
+	return _dev ? container_of(_dev, struct vbus_device_proxy, dev) : NULL;
+}
+
+static struct vbus_driver *to_drv(struct device_driver *_drv)
+{
+	return container_of(_drv, struct vbus_driver, drv);
+}
+
+/*
+ * This function is invoked whenever a new driver and/or device is added
+ * to check if there is a match
+ */
+static int vbus_dev_proxy_match(struct device *_dev, struct device_driver *_drv)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_drv);
+
+	return !strcmp(dev->type, drv->type);
+}
+
+/*
+ * This function is invoked after the bus infrastructure has already made a
+ * match.  The device will contain a reference to the paired driver which
+ * we will extract.
+ */
+static int vbus_dev_proxy_probe(struct device *_dev)
+{
+	int ret = 0;
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	struct vbus_driver *drv = to_drv(_dev->driver);
+
+	if (drv->ops->probe)
+		ret = drv->ops->probe(dev);
+
+	return ret;
+}
+
+static struct bus_type vbus_proxy = {
+	.name   = VBUS_PROXY_NAME,
+	.match  = vbus_dev_proxy_match,
+};
+
+static struct device vbus_proxy_rootdev = {
+	.parent = NULL,
+	.bus_id = VBUS_PROXY_NAME,
+};
+
+static int __init vbus_init(void)
+{
+	int ret;
+
+	ret = bus_register(&vbus_proxy);
+	BUG_ON(ret < 0);
+
+	ret = device_register(&vbus_proxy_rootdev);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+postcore_initcall(vbus_init);
+
+static void device_release(struct device *dev)
+{
+	struct vbus_device_proxy *_dev;
+
+	_dev = container_of(dev, struct vbus_device_proxy, dev);
+
+	_dev->ops->release(_dev);
+}
+
+int vbus_device_proxy_register(struct vbus_device_proxy *new)
+{
+	new->dev.parent  = &vbus_proxy_rootdev;
+	new->dev.bus     = &vbus_proxy;
+	new->dev.release = &device_release;
+
+	return device_register(&new->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_register);
+
+void vbus_device_proxy_unregister(struct vbus_device_proxy *dev)
+{
+	device_unregister(&dev->dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_unregister);
+
+static int match_device_id(struct device *_dev, void *data)
+{
+	struct vbus_device_proxy *dev = to_dev(_dev);
+	u64 id = *(u64 *)data;
+
+	return dev->id == id;
+}
+
+struct vbus_device_proxy *vbus_device_proxy_find(u64 id)
+{
+	struct device *dev;
+
+	dev = bus_find_device(&vbus_proxy, NULL, &id, &match_device_id);
+
+	return to_dev(dev);
+}
+EXPORT_SYMBOL_GPL(vbus_device_proxy_find);
+
+int vbus_driver_register(struct vbus_driver *new)
+{
+	new->drv.bus   = &vbus_proxy;
+	new->drv.name  = new->type;
+	new->drv.owner = new->owner;
+	new->drv.probe = vbus_dev_proxy_probe;
+
+	return driver_register(&new->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_register);
+
+void vbus_driver_unregister(struct vbus_driver *drv)
+{
+	driver_unregister(&drv->drv);
+}
+EXPORT_SYMBOL_GPL(vbus_driver_unregister);
+


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 06/17] ioq: Add basic definitions for a shared-memory, lockless queue
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (4 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 05/17] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 07/17] ioq: add vbus helpers Gregory Haskins
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

We can map these over VBUS shared memory (or really any shared-memory
architecture if it supports shm-signals) to allow asynchronous
communication between two end-points.  Memory is synchronized using
pure barriers (i.e. lockless), so IOQs are friendly in many contexts,
even if the memory is remote.
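
To give a feel for the iterator idiom (sketch only; this loosely mirrors the
venet transmit path later in the series, with locking, includes and error
handling omitted):

static int send_one(struct ioq *ioq, void *buf, size_t len)
{
	struct ioq_iterator iter;
	int ret;

	/* iterate on the tail of both the "valid" and "inuse" indexes */
	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_both, IOQ_ITER_AUTOUPDATE);
	if (ret < 0)
		return ret;

	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
	if (ret < 0)
		return ret;

	iter.desc->cookie = (u64)buf;       /* private to the north side */
	iter.desc->ptr    = (u64)__pa(buf); /* physical address of payload */
	iter.desc->len    = (u64)len;
	iter.desc->valid  = 1;

	/*
	 * advances the tail, flips ownership to the remote side, and
	 * (because of AUTOUPDATE) signals the remote end
	 */
	return ioq_iter_push(&iter, 0);
}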

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/ioq.h |  410 +++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig         |   12 +
 lib/Makefile        |    1 
 lib/ioq.c           |  298 +++++++++++++++++++++++++++++++++++++
 4 files changed, 721 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ioq.h
 create mode 100644 lib/ioq.c

diff --git a/include/linux/ioq.h b/include/linux/ioq.h
new file mode 100644
index 0000000..d450d9a
--- /dev/null
+++ b/include/linux/ioq.h
@@ -0,0 +1,410 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * IOQ is a generic shared-memory, lockless queue mechanism. It can be used
+ * in a variety of ways, though its intended purpose is to become the
+ * asynchronous communication path for virtual-bus drivers.
+ *
+ * The following are a list of key design points:
+ *
+ * #) All shared memory is always explicitly allocated on one side of the
+ *    link.  This would typically be the guest side in a VM/VMM scenario.
+ * #) Each IOQ has the concept of "north" and "south" locales, where
+ *    north denotes the memory-owner side (e.g. guest).
+ * #) An IOQ is manipulated using an iterator idiom.
+ * #) Provides a bi-directional signaling/notification infrastructure on
+ *    a per-queue basis, which includes an event mitigation strategy
+ *    to reduce boundary switching.
+ * #) The signaling path is abstracted so that various technologies and
+ *    topologies can define their own specific implementation while sharing
+ *    the basic structures and code.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_IOQ_H
+#define _LINUX_IOQ_H
+
+#include <asm/types.h>
+#include <linux/shm_signal.h>
+
+/*
+ *---------
+ * The following structures represent data that is shared across boundaries
+ * which may be quite disparate from one another (e.g. Windows vs Linux,
+ * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
+ * present data in a manner that is independent of the environment.
+ *-----------
+ */
+struct ioq_ring_desc {
+	__u64                 cookie; /* for arbitrary use by north-side */
+	__u64                 ptr;
+	__u64                 len;
+	__u8                  valid;
+	__u8                  sown; /* South owned = 1, North owned = 0 */
+};
+
+#define IOQ_RING_MAGIC 0x47fa2fe4
+#define IOQ_RING_VER   4
+
+struct ioq_ring_idx {
+	__u32                 head;    /* 0 based index to head of ptr array */
+	__u32                 tail;    /* 0 based index to tail of ptr array */
+	__u8                  full;
+};
+
+enum ioq_locality {
+	ioq_locality_north,
+	ioq_locality_south,
+};
+
+struct ioq_ring_head {
+	__u32                  magic;
+	__u32                  ver;
+	struct shm_signal_desc signal;
+	struct ioq_ring_idx    idx[2];
+	__u32                  count;
+	struct ioq_ring_desc   ring[1]; /* "count" elements will be allocated */
+};
+
+#define IOQ_HEAD_DESC_SIZE(count) \
+    (sizeof(struct ioq_ring_head) + sizeof(struct ioq_ring_desc) * (count - 1))
+
+/* --- END SHARED STRUCTURES --- */
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/shm_signal.h>
+#include <asm/atomic.h>
+
+enum ioq_idx_type {
+	ioq_idxtype_valid,
+	ioq_idxtype_inuse,
+	ioq_idxtype_both,
+	ioq_idxtype_invalid,
+};
+
+enum ioq_seek_type {
+	ioq_seek_tail,
+	ioq_seek_next,
+	ioq_seek_head,
+	ioq_seek_set
+};
+
+struct ioq_iterator {
+	struct ioq            *ioq;
+	struct ioq_ring_idx   *idx;
+	u32                    pos;
+	struct ioq_ring_desc  *desc;
+	int                    update:1;
+	int                    dualidx:1;
+	int                    flipowner:1;
+};
+
+struct ioq_notifier {
+	void (*signal)(struct ioq_notifier *);
+};
+
+struct ioq_ops {
+	void     (*release)(struct ioq *ioq);
+};
+
+struct ioq {
+	struct ioq_ops *ops;
+
+	atomic_t               refs;
+	enum ioq_locality      locale;
+	struct ioq_ring_head  *head_desc;
+	struct ioq_ring_desc  *ring;
+	struct shm_signal     *signal;
+	wait_queue_head_t      wq;
+	struct ioq_notifier   *notifier;
+	size_t                 count;
+	struct shm_signal_notifier shm_notifier;
+};
+
+#define IOQ_ITER_AUTOUPDATE  (1 << 0)
+#define IOQ_ITER_NOFLIPOWNER (1 << 1)
+
+/**
+ * ioq_init() - initialize an IOQ
+ * @ioq:        IOQ context
+ *
+ * Initializes IOQ context before first use
+ *
+ **/
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count);
+
+/**
+ * ioq_get() - acquire an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline struct ioq *ioq_get(struct ioq *ioq)
+{
+	atomic_inc(&ioq->refs);
+
+	return ioq;
+}
+
+/**
+ * ioq_put() - release an IOQ context reference
+ * @ioq:        IOQ context
+ *
+ **/
+static inline void ioq_put(struct ioq *ioq)
+{
+	if (atomic_dec_and_test(&ioq->refs)) {
+		shm_signal_put(ioq->signal);
+		ioq->ops->release(ioq);
+	}
+}
+
+/**
+ * ioq_notify_enable() - enables local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Enables/unmasks the registered ioq_notifier (if applicable) and waitq to
+ * receive wakeups whenever the remote side performs an ioq_signal() operation.
+ * A notification will be dispatched immediately if any pending signals have
+ * already been issued prior to invoking this call.
+ *
+ * This is synonymous with unmasking an interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_enable(struct ioq *ioq, int flags)
+{
+	return shm_signal_enable(ioq->signal, 0);
+}
+
+/**
+ * ioq_notify_disable() - disable local notifications on an IOQ
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Disables/masks the registered ioq_notifier (if applicable) and waitq
+ * from receiving any further notifications.  Any subsequent calls to
+ * ioq_signal() by the remote side will update the ring as dirty, but
+ * will not traverse the locale boundary and will not invoke the notifier
+ * callback or wakeup the waitq.  Signals delivered while masked will
+ * be deferred until ioq_notify_enable() is invoked
+ *
+ * This is synonymous with masking an interrupt
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_notify_disable(struct ioq *ioq, int flags)
+{
+	return shm_signal_disable(ioq->signal, 0);
+}
+
+/**
+ * ioq_signal() - notify the remote side about ring changes
+ * @ioq:        IOQ context
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Marks the ring state as "dirty" and, if enabled, will traverse
+ * a locale boundary to invoke a remote notification.  The remote
+ * side controls whether the notification should be delivered via
+ * the ioq_notify_enable/disable() interface.
+ *
+ * The specifics of how to traverse a locale boundary are abstracted
+ * by the ioq_ops->signal() interface and provided by a particular
+ * implementation.  However, typically going north to south would be
+ * something like a syscall/hypercall, and going south to north would be
+ * something like a posix-signal/guest-interrupt.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+static inline int ioq_signal(struct ioq *ioq, int flags)
+{
+	return shm_signal_inject(ioq->signal, 0);
+}
+
+/**
+ * ioq_count() - counts the number of outstanding descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors outstanding in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_remain() - counts the number of remaining descriptors in an index
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * This is the converse of ioq_count().  This function returns the number
+ * of "free" descriptors left in a particular index
+ *
+ * Returns:
+ *  (*) >=0: # of descriptors remaining in the index
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_size() - counts the maximum number of descriptors in a ring
+ * @ioq:        IOQ context
+ *
+ * This function returns the maximum number of descriptors supported in
+ * a ring, regardless of their current state (free or inuse).
+ *
+ * Returns:
+ *  (*) >=0: total # of descriptors in the ring
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_size(struct ioq *ioq);
+
+/**
+ * ioq_full() - determines if a specific index is "full"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not full
+ *  (*) 1: index is full
+ *  (*) <0 = ERRNO
+ *
+ **/
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type);
+
+/**
+ * ioq_empty() - determines if a specific index is "empty"
+ * @ioq:        IOQ context
+ * @type:	Specifies the index type
+ *                 (*) valid: the descriptor is valid.  This is usually
+ *                     used to keep track of descriptors that may not
+ *                     be carrying a useful payload, but still need to
+ *                     be tracked carefully.
+ *                 (*) inuse: Descriptors that carry useful payload
+ *
+ * Returns:
+ *  (*) 0: index is not empty
+ *  (*) 1: index is empty
+ *  (*) <0 = ERRNO
+ *
+ **/
+static inline int ioq_empty(struct ioq *ioq, enum ioq_idx_type type)
+{
+    return !ioq_count(ioq, type);
+}
+
+/**
+ * ioq_iter_init() - initialize an iterator for IOQ descriptor traversal
+ * @ioq:        IOQ context to iterate on
+ * @iter:	Iterator context to init (usually from stack)
+ * @type:	Specifies the index type to iterate against
+ *                 (*) valid: iterate against the "valid" index
+ *                 (*) inuse: iterate against the "inuse" index
+ *                 (*) both: iterate against both indexes simultaneously
+ * @flags:      Bitfield with 0 or more bits set to alter behavior
+ *                 (*) autoupdate: automatically signal the remote side
+ *                     whenever the iterator pushes/pops to a new desc
+ *                 (*) noflipowner: do not flip the ownership bit during
+ *                     a push/pop operation
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags);
+
+/**
+ * ioq_iter_seek() - seek to a specific location in the IOQ ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @type:	Specifies the type of seek operation
+ *                 (*) tail: seek to the absolute tail, offset is ignored
+ *                 (*) next: seek to the relative next, offset is ignored
+ *                 (*) head: seek to the absolute head, offset is ignored
+ *                 (*) set: seek to the absolute offset
+ * @offset:     Offset for ioq_seek_set operations
+ * @flags:      Reserved for future use, must be 0
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		   long offset, int flags);
+
+/**
+ * ioq_iter_push() - push the tail pointer forward
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the tail ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pushes" a new pointer
+ * onto the tail of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_push(struct ioq_iterator *iter, int flags);
+
+/**
+ * ioq_iter_pop() - pop the head pointer from the ring
+ * @iter:	Iterator context (must be initialized with ioq_iter_init)
+ * @flags:      Reserved for future use, must be 0
+ *
+ * This function will simultaneously advance the head ptr in the current
+ * index (valid/inuse, as specified in the ioq_iter_init) as well as
+ * perform a seek(next) operation.  This effectively "pops" a pointer
+ * from the head of the index.
+ *
+ * Returns: success = 0, <0 = ERRNO
+ *
+ **/
+int  ioq_iter_pop(struct ioq_iterator *iter,  int flags);
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_IOQ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 32d82fe..1e66f8e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -183,5 +183,17 @@ config SHM_SIGNAL
 
 	 If unsure, say N
 
+config IOQ
+	boolean "IO-Queue library - Generic shared-memory queue"
+	select SHM_SIGNAL
+	default n
+	help
+	 IOQ is a generic shared-memory-queue mechanism that happens to be
+	 friendly to virtualization boundaries. It can be used in a variety
+	 of ways, though its intended purpose is to become the low-level
+	 communication path for paravirtualized drivers.
+
+	 If unsure, say N
+
 
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index bc36327..98cd332 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_SHM_SIGNAL) += shm_signal.o
+obj-$(CONFIG_IOQ) += ioq.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
 obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
diff --git a/lib/ioq.c b/lib/ioq.c
new file mode 100644
index 0000000..803b5d6
--- /dev/null
+++ b/lib/ioq.c
@@ -0,0 +1,298 @@
+/*
+ * Copyright 2008 Novell.  All Rights Reserved.
+ *
+ * See include/linux/ioq.h for documentation
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/ioq.h>
+#include <asm/bitops.h>
+#include <linux/module.h>
+
+#ifndef NULL
+#define NULL 0
+#endif
+
+static int ioq_iter_setpos(struct ioq_iterator *iter, u32 pos)
+{
+	struct ioq *ioq = iter->ioq;
+
+	BUG_ON(pos >= ioq->count);
+
+	iter->pos  = pos;
+	iter->desc = &ioq->ring[pos];
+
+	return 0;
+}
+
+static inline u32 modulo_inc(u32 val, u32 mod)
+{
+	BUG_ON(val >= mod);
+
+	if (val == (mod - 1))
+		return 0;
+
+	return val + 1;
+}
+
+static inline int idx_full(struct ioq_ring_idx *idx)
+{
+	return idx->full && (idx->head == idx->tail);
+}
+
+int ioq_iter_seek(struct ioq_iterator *iter, enum ioq_seek_type type,
+		  long offset, int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	u32 pos;
+
+	switch (type) {
+	case ioq_seek_next:
+		pos = modulo_inc(iter->pos, iter->ioq->count);
+		break;
+	case ioq_seek_tail:
+		pos = idx->tail;
+		break;
+	case ioq_seek_head:
+		pos = idx->head;
+		break;
+	case ioq_seek_set:
+		if (offset >= iter->ioq->count)
+			return -EINVAL;
+		pos = offset;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return ioq_iter_setpos(iter, pos);
+}
+EXPORT_SYMBOL_GPL(ioq_iter_seek);
+
+static int ioq_ring_count(struct ioq_ring_idx *idx, int count)
+{
+	if (idx->full && (idx->head == idx->tail))
+		return count;
+	else if (idx->tail >= idx->head)
+		return idx->tail - idx->head;
+	else
+		return (idx->tail + count) - idx->head;
+}
+
+static void idx_tail_push(struct ioq_ring_idx *idx, int count)
+{
+	u32 tail = modulo_inc(idx->tail, count);
+
+	if (idx->head == tail) {
+		rmb();
+
+		/*
+		 * Setting full here may look racy, but note that we haven't
+		 * flipped the owner bit yet.  So it is impossible for the
+		 * remote locale to move head in such a way that this operation
+		 * becomes invalid
+		 */
+		idx->full = 1;
+		wmb();
+	}
+
+	idx->tail = tail;
+}
+
+int ioq_iter_push(struct ioq_iterator *iter, int flags)
+{
+	struct ioq_ring_head *head_desc = iter->ioq->head_desc;
+	struct ioq_ring_idx  *idx  = iter->idx;
+	int ret;
+
+	/*
+	 * It's only valid to push if we are currently pointed at the tail
+	 */
+	if (iter->pos != idx->tail || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	idx_tail_push(idx, iter->ioq->count);
+	if (iter->dualidx) {
+		idx_tail_push(&head_desc->idx[ioq_idxtype_inuse],
+			      iter->ioq->count);
+		if (head_desc->idx[ioq_idxtype_inuse].tail !=
+		    head_desc->idx[ioq_idxtype_valid].tail) {
+			SHM_SIGNAL_FAULT(iter->ioq->signal,
+					 "Tails not synchronized");
+			return -EINVAL;
+		}
+	}
+
+	wmb(); /* the index must be visible before the sown, or signal */
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_push);
+
+int ioq_iter_pop(struct ioq_iterator *iter,  int flags)
+{
+	struct ioq_ring_idx *idx = iter->idx;
+	int full;
+	int ret;
+
+	/*
+	 * It's only valid to pop if we are currently pointed at the head
+	 */
+	if (iter->pos != idx->head || iter->desc->sown != iter->ioq->locale)
+		return -EINVAL;
+
+	full = idx_full(idx);
+	rmb();
+
+	idx->head = modulo_inc(idx->head, iter->ioq->count);
+	wmb(); /* head must be visible before full */
+
+	if (full) {
+		idx->full = 0;
+		wmb(); /* full must be visible before sown */
+	}
+
+	if (iter->flipowner) {
+		iter->desc->sown = !iter->ioq->locale;
+		wmb(); /* sown must be visible before we signal */
+	}
+
+	ret = ioq_iter_seek(iter, ioq_seek_next, 0, flags);
+
+	if (iter->update)
+		ioq_signal(iter->ioq, 0);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_pop);
+
+static struct ioq_ring_idx *idxtype_to_idx(struct ioq *ioq,
+					   enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx;
+
+	switch (type) {
+	case ioq_idxtype_valid:
+	case ioq_idxtype_inuse:
+		idx = &ioq->head_desc->idx[type];
+		break;
+	default:
+		panic("IOQ: illegal index type: %d", type);
+		break;
+	}
+
+	return idx;
+}
+
+int ioq_iter_init(struct ioq *ioq, struct ioq_iterator *iter,
+		  enum ioq_idx_type type, int flags)
+{
+	iter->ioq        = ioq;
+	iter->update     = (flags & IOQ_ITER_AUTOUPDATE);
+	iter->flipowner  = !(flags & IOQ_ITER_NOFLIPOWNER);
+	iter->pos        = -1;
+	iter->desc       = NULL;
+	iter->dualidx    = 0;
+
+	if (type == ioq_idxtype_both) {
+		/*
+		 * "both" is a special case, so we set the dualidx flag.
+		 *
+		 * However, we also just want to use the valid-index
+		 * for normal processing, so override that here
+		 */
+		type = ioq_idxtype_valid;
+		iter->dualidx = 1;
+	}
+
+	iter->idx = idxtype_to_idx(ioq, type);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ioq_iter_init);
+
+int ioq_count(struct ioq *ioq, enum ioq_idx_type type)
+{
+	return ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+}
+EXPORT_SYMBOL_GPL(ioq_count);
+
+int ioq_remain(struct ioq *ioq, enum ioq_idx_type type)
+{
+	int count = ioq_ring_count(idxtype_to_idx(ioq, type), ioq->count);
+
+	return ioq->count - count;
+}
+EXPORT_SYMBOL_GPL(ioq_remain);
+
+int ioq_size(struct ioq *ioq)
+{
+	return ioq->count;
+}
+EXPORT_SYMBOL_GPL(ioq_size);
+
+int ioq_full(struct ioq *ioq, enum ioq_idx_type type)
+{
+	struct ioq_ring_idx *idx = idxtype_to_idx(ioq, type);
+
+	return idx_full(idx);
+}
+EXPORT_SYMBOL_GPL(ioq_full);
+
+static void ioq_shm_signal(struct shm_signal_notifier *notifier)
+{
+	struct ioq *ioq = container_of(notifier, struct ioq, shm_notifier);
+
+	wake_up(&ioq->wq);
+	if (ioq->notifier)
+		ioq->notifier->signal(ioq->notifier);
+}
+
+void ioq_init(struct ioq *ioq,
+	      struct ioq_ops *ops,
+	      enum ioq_locality locale,
+	      struct ioq_ring_head *head,
+	      struct shm_signal *signal,
+	      size_t count)
+{
+	memset(ioq, 0, sizeof(*ioq));
+	atomic_set(&ioq->refs, 1);
+	init_waitqueue_head(&ioq->wq);
+
+	ioq->ops         = ops;
+	ioq->locale      = locale;
+	ioq->head_desc   = head;
+	ioq->ring        = &head->ring[0];
+	ioq->count       = count;
+	ioq->signal      = signal;
+
+	ioq->shm_notifier.signal = &ioq_shm_signal;
+	signal->notifier         = &ioq->shm_notifier;
+}
+EXPORT_SYMBOL_GPL(ioq_init);


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 07/17] ioq: add vbus helpers
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (5 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 06/17] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 08/17] venet: add the ABI definitions for an 802.x packet interface Gregory Haskins
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

It will be common to map an IOQ over the VBUS shared-memory interfaces,
so let's generalize their setup and reuse the pattern.
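
For example (sketch only; the queue ids, ring size, and version number are
placeholders), a driver's probe routine could then set up its rings with:

static int my_probe(struct vbus_device_proxy *dev)
{
	struct ioq *rxq, *txq;
	int ret;

	ret = dev->ops->open(dev, 1 /* version */, 0);
	if (ret < 0)
		return ret;

	/* shm-id 0 = rx ring, shm-id 1 = tx ring (device defined) */
	ret = vbus_driver_ioq_alloc(dev, 0, 0, 256, &rxq);
	if (ret < 0)
		return ret;

	ret = vbus_driver_ioq_alloc(dev, 1, 0, 256, &txq);
	if (ret < 0) {
		ioq_put(rxq);
		return ret;
	}

	/* hook rxq->notifier/txq->notifier and bring the device up... */
	return 0;
}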

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/vbus_device.h |    7 +++
 include/linux/vbus_driver.h |    7 +++
 kernel/vbus/Kconfig         |    2 +
 kernel/vbus/Makefile        |    1 
 kernel/vbus/proxy.c         |   64 +++++++++++++++++++++++++++++++
 kernel/vbus/shm-ioq.c       |   89 +++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 170 insertions(+), 0 deletions(-)
 create mode 100644 kernel/vbus/shm-ioq.c

diff --git a/include/linux/vbus_device.h b/include/linux/vbus_device.h
index 705d92e..66990e2 100644
--- a/include/linux/vbus_device.h
+++ b/include/linux/vbus_device.h
@@ -102,6 +102,7 @@
 #include <linux/configfs.h>
 #include <linux/rbtree.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 #include <linux/vbus.h>
 #include <asm/atomic.h>
 
@@ -413,4 +414,10 @@ static inline void vbus_connection_put(struct vbus_connection *conn)
 		conn->ops->release(conn);
 }
 
+/*
+ * device-side IOQ helper - dereferences device-shm as an IOQ
+ */
+int vbus_shm_ioq_attach(struct vbus_shm *shm, struct shm_signal *signal,
+			int maxcount, struct ioq **ioq);
+
 #endif /* _LINUX_VBUS_DEVICE_H */
diff --git a/include/linux/vbus_driver.h b/include/linux/vbus_driver.h
index c53e13f..9cfbf60 100644
--- a/include/linux/vbus_driver.h
+++ b/include/linux/vbus_driver.h
@@ -26,6 +26,7 @@
 
 #include <linux/device.h>
 #include <linux/shm_signal.h>
+#include <linux/ioq.h>
 
 struct vbus_device_proxy;
 struct vbus_driver;
@@ -70,4 +71,10 @@ struct vbus_driver {
 int vbus_driver_register(struct vbus_driver *drv);
 void vbus_driver_unregister(struct vbus_driver *drv);
 
+/*
+ * driver-side IOQ helper - allocates device-shm and maps an IOQ on it
+ */
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t ringsize, struct ioq **ioq);
+
 #endif /* _LINUX_VBUS_DRIVER_H */
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index 3aaa085..71acd6f 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -6,6 +6,7 @@ config VBUS
        bool "Virtual Bus"
        select CONFIGFS_FS
        select SHM_SIGNAL
+       select IOQ
        default n
        help
 	Provides a mechanism for declaring virtual-bus objects and binding
@@ -15,6 +16,7 @@ config VBUS
 
 config VBUS_DRIVERS
        tristate "VBUS Driver support"
+       select IOQ
        default n
        help
         Adds support for a virtual bus model for proxying drivers.
diff --git a/kernel/vbus/Makefile b/kernel/vbus/Makefile
index d028ece..45f6503 100644
--- a/kernel/vbus/Makefile
+++ b/kernel/vbus/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_VBUS) += core.o devclass.o config.o attribute.o map.o client.o
+obj-$(CONFIG_VBUS) += shm-ioq.o
 
 vbus-proxy-objs += proxy.o
 obj-$(CONFIG_VBUS_DRIVERS) += vbus-proxy.o
diff --git a/kernel/vbus/proxy.c b/kernel/vbus/proxy.c
index ea48f00..75b0cb1 100644
--- a/kernel/vbus/proxy.c
+++ b/kernel/vbus/proxy.c
@@ -150,3 +150,67 @@ void vbus_driver_unregister(struct vbus_driver *drv)
 }
 EXPORT_SYMBOL_GPL(vbus_driver_unregister);
 
+/*
+ *---------------------------------
+ * driver-side IOQ helper
+ *---------------------------------
+ */
+static void
+vbus_driver_ioq_release(struct ioq *ioq)
+{
+	kfree(ioq->head_desc);
+	kfree(ioq);
+}
+
+static struct ioq_ops vbus_driver_ioq_ops = {
+	.release = vbus_driver_ioq_release,
+};
+
+
+int vbus_driver_ioq_alloc(struct vbus_device_proxy *dev, int id, int prio,
+			  size_t count, struct ioq **ioq)
+{
+	struct ioq           *_ioq;
+	struct ioq_ring_head *head = NULL;
+	struct shm_signal    *signal = NULL;
+	size_t                len = IOQ_HEAD_DESC_SIZE(count);
+	int                   ret = -ENOMEM;
+
+	_ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+	if (!_ioq)
+		goto error;
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		goto error;
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = count;
+
+	ret = dev->ops->shm(dev, id, prio, head, len,
+			    &head->signal, &signal, 0);
+	if (ret < 0)
+		goto error;
+
+	ioq_init(_ioq,
+		 &vbus_driver_ioq_ops,
+		 ioq_locality_north,
+		 head,
+		 signal,
+		 count);
+
+	*ioq = _ioq;
+
+	return 0;
+
+ error:
+	kfree(_ioq);
+	kfree(head);
+
+	if (signal)
+		shm_signal_put(signal);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vbus_driver_ioq_alloc);
diff --git a/kernel/vbus/shm-ioq.c b/kernel/vbus/shm-ioq.c
new file mode 100644
index 0000000..a627337
--- /dev/null
+++ b/kernel/vbus/shm-ioq.c
@@ -0,0 +1,89 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * IOQ helper for devices - This module implements an IOQ which has
+ * been shared with a device via a vbus_shm segment.
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/ioq.h>
+#include <linux/vbus_device.h>
+
+struct _ioq {
+	struct vbus_shm *shm;
+	struct ioq ioq;
+};
+
+static void
+_shm_ioq_release(struct ioq *ioq)
+{
+	struct _ioq *_ioq = container_of(ioq, struct _ioq, ioq);
+
+	/* the signal is released by the IOQ infrastructure */
+	vbus_shm_put(_ioq->shm);
+	kfree(_ioq);
+}
+
+static struct ioq_ops _shm_ioq_ops = {
+	.release = _shm_ioq_release,
+};
+
+int vbus_shm_ioq_attach(struct vbus_shm *shm, struct shm_signal *signal,
+			int maxcount, struct ioq **ioq)
+{
+	struct _ioq *_ioq;
+	struct ioq_ring_head *head = NULL;
+	size_t ringcount;
+
+	if (!signal)
+		return -EINVAL;
+
+	_ioq = kzalloc(sizeof(*_ioq), GFP_KERNEL);
+	if (!_ioq)
+		return -ENOMEM;
+
+	head = (struct ioq_ring_head *)shm->ptr;
+
+	if (head->magic != IOQ_RING_MAGIC)
+		return -EINVAL;
+
+	if (head->ver != IOQ_RING_VER)
+		return -EINVAL;
+
+	ringcount = head->count;
+
+	if ((maxcount != -1) && (ringcount > maxcount))
+		return -EINVAL;
+
+	/*
+	 * Sanity check the ringcount against the actual length of the segment
+	 */
+	if (IOQ_HEAD_DESC_SIZE(ringcount) != shm->len)
+		return -EINVAL;
+
+	_ioq->shm = shm;
+
+	ioq_init(&_ioq->ioq, &_shm_ioq_ops, ioq_locality_south, head,
+		 signal, ringcount);
+
+	*ioq = &_ioq->ioq;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vbus_shm_ioq_attach);
+


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 08/17] venet: add the ABI definitions for an 802.x packet interface
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (6 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 07/17] ioq: add vbus helpers Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 09/17] net: Add vbus_enet driver Gregory Haskins
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/venet.h |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 47 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/venet.h

diff --git a/include/linux/venet.h b/include/linux/venet.h
new file mode 100644
index 0000000..ef6b199
--- /dev/null
+++ b/include/linux/venet.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright 2008 Novell.  All Rights Reserved.
+ *
+ * Virtual-Ethernet adapter
+ *
+ * Author:
+ *      Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#ifndef _LINUX_VENET_H
+#define _LINUX_VENET_H
+
+#define VENET_VERSION 1
+
+#define VENET_TYPE "virtual-ethernet"
+
+#define VENET_QUEUE_RX 0
+#define VENET_QUEUE_TX 1
+
+struct venet_capabilities {
+	__u32 gid;
+	__u32 bits;
+};
+
+/* CAPABILITIES-GROUP 0 */
+/* #define VENET_CAP_FOO    0   (No capabilities defined yet, for now) */
+
+#define VENET_FUNC_LINKUP   0
+#define VENET_FUNC_LINKDOWN 1
+#define VENET_FUNC_MACQUERY 2
+#define VENET_FUNC_NEGCAP   3 /* negotiate capabilities */
+#define VENET_FUNC_FLUSHRX  4
+
+#endif /* _LINUX_VENET_H */


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 09/17] net: Add vbus_enet driver
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (7 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 08/17] venet: add the ABI definitions for an 802.x packet interface Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 20:39   ` Stephen Hemminger
  2009-03-31 18:43 ` [RFC PATCH 10/17] venet-tap: Adds a "venet" compatible "tap" device to VBUS Gregory Haskins
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/Kconfig     |   13 +
 drivers/net/Makefile    |    1 
 drivers/net/vbus-enet.c |  706 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 720 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/vbus-enet.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 62d732a..ac9dabd 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -3099,4 +3099,17 @@ config VIRTIO_NET
 	  This is the virtual network driver for virtio.  It can be used with
           lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
 
+config VBUS_ENET
+	tristate "Virtual Ethernet Driver"
+	depends on VBUS_DRIVERS
+	help
+	   A virtualized 802.x network device based on the VBUS interface.
+	   It can be used with any hypervisor/kernel that supports the
+	   vbus protocol.
+
+config VBUS_ENET_DEBUG
+        bool "Enable Debugging"
+	depends on VBUS_ENET
+	default n
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 471baaf..61db928 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -264,6 +264,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
 obj-$(CONFIG_NETXEN_NIC) += netxen/
 obj-$(CONFIG_NIU) += niu.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
 obj-$(CONFIG_SFC) += sfc/
 
 obj-$(CONFIG_WIMAX) += wimax/
diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
new file mode 100644
index 0000000..e698b3f
--- /dev/null
+++ b/drivers/net/vbus-enet.c
@@ -0,0 +1,706 @@
+/*
+ * vbus_enet - A virtualized 802.x network device based on the VBUS interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus_driver.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+#include <linux/venet.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+static int napi_weight = 128;
+module_param(napi_weight, int, 0444);
+static int rx_ringlen = 256;
+module_param(rx_ringlen, int, 0444);
+static int tx_ringlen = 256;
+module_param(tx_ringlen, int, 0444);
+
+#undef PDEBUG             /* undef it, just in case */
+#ifdef CONFIG_VBUS_ENET_DEBUG
+#  define PDEBUG(fmt, args...) printk(KERN_DEBUG "vbus_enet: " fmt, ## args)
+#else
+#  define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+struct vbus_enet_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+};
+
+struct vbus_enet_priv {
+	spinlock_t                 lock;
+	struct net_device         *dev;
+	struct vbus_device_proxy  *vdev;
+	struct napi_struct         napi;
+	struct net_device_stats    stats;
+	struct vbus_enet_queue     rxq;
+	struct vbus_enet_queue     txq;
+	struct tasklet_struct      txtask;
+};
+
+static struct vbus_enet_priv *
+napi_to_priv(struct napi_struct *napi)
+{
+	return container_of(napi, struct vbus_enet_priv, napi);
+}
+
+static int
+queue_init(struct vbus_enet_priv *priv,
+	   struct vbus_enet_queue *q,
+	   int qid,
+	   size_t ringsize,
+	   void (*func)(struct ioq_notifier *))
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+	int ret;
+
+	ret = vbus_driver_ioq_alloc(dev, qid, 0, ringsize, &q->queue);
+	if (ret < 0)
+		panic("ioq_alloc failed: %d\n", ret);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	return 0;
+}
+
+static int
+devcall(struct vbus_enet_priv *priv, u32 func, void *data, size_t len)
+{
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	return dev->ops->call(dev, func, data, len, 0);
+}
+
+/*
+ * ---------------
+ * rx descriptors
+ * ---------------
+ */
+
+static void
+rxdesc_alloc(struct ioq_ring_desc *desc, size_t len)
+{
+	struct sk_buff *skb;
+
+	len += ETH_HLEN;
+
+	skb = dev_alloc_skb(len + 2);
+	BUG_ON(!skb);
+
+	skb_reserve(skb, 2); /* align IP on 16B boundary */
+
+	desc->cookie = (u64)skb;
+	desc->ptr    = (u64)__pa(skb->data);
+	desc->len    = len; /* total length  */
+	desc->valid  = 1;
+}
+
+static void
+rx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item, since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SKB and mark it valid
+	 */
+	while (!iter.desc->valid) {
+		rxdesc_alloc(iter.desc, priv->dev->mtu);
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-head index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+}
+
+static void
+rx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->rxq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->valid) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dev_kfree_skb(skb);
+	}
+}
+
+/*
+ * Open and close
+ */
+
+static int
+vbus_enet_open(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	ret = devcall(priv, VENET_FUNC_LINKUP, NULL, 0);
+	BUG_ON(ret < 0);
+
+	napi_enable(&priv->napi);
+
+	return 0;
+}
+
+static int
+vbus_enet_stop(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	napi_disable(&priv->napi);
+
+	ret = devcall(priv, VENET_FUNC_LINKDOWN, NULL, 0);
+	BUG_ON(ret < 0);
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+vbus_enet_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		printk(KERN_WARNING "vbus_enet: Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static void
+vbus_enet_schedule_rx(struct vbus_enet_priv *priv)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (netif_rx_schedule_prep(&priv->napi)) {
+		/* Disable further interrupts */
+		ioq_notify_disable(priv->rxq.queue, 0);
+		__netif_rx_schedule(&priv->napi);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+vbus_enet_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	int ret;
+
+	dev->mtu = new_mtu;
+
+	/*
+	 * FLUSHRX will cause the device to flush any outstanding
+	 * RX buffers.  They will appear to come in as 0 length
+	 * packets which we can simply discard and replace with new_mtu
+	 * buffers for the future.
+	 */
+	ret = devcall(priv, VENET_FUNC_FLUSHRX, NULL, 0);
+	BUG_ON(ret < 0);
+
+	vbus_enet_schedule_rx(priv);
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+vbus_enet_poll(struct napi_struct *napi, int budget)
+{
+	struct vbus_enet_priv *priv = napi_to_priv(napi);
+	int npackets = 0;
+	struct ioq_iterator iter;
+	int ret;
+
+	PDEBUG("%lld: polling...\n", priv->vdev->id);
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(priv->rxq.queue, &iter, ioq_idxtype_inuse,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We stop if we have met the quota or there are no more packets.
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side
+	 */
+	while ((npackets < budget) && (!iter.desc->sown)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		if (iter.desc->len) {
+			skb_put(skb, iter.desc->len);
+
+			/* Maintain stats */
+			npackets++;
+			priv->stats.rx_packets++;
+			priv->stats.rx_bytes += iter.desc->len;
+
+			/* Pass the buffer up to the stack */
+			skb->dev      = priv->dev;
+			skb->protocol = eth_type_trans(skb, priv->dev);
+			netif_receive_skb(skb);
+
+			mb();
+		} else
+			/*
+			 * the device may send a zero-length packet when it is
+			 * flushing references on the ring.  We can just drop
+			 * these on the floor
+			 */
+			dev_kfree_skb(skb);
+
+		/* Grab a new buffer to put in the ring */
+		rxdesc_alloc(iter.desc, priv->dev->mtu);
+
+		/* Advance the in-use tail */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	PDEBUG("%lld poll: %d packets received\n", priv->vdev->id, npackets);
+
+	/*
+	 * If we processed all packets, we're done; tell the kernel and
+	 * reenable ints
+	 */
+	if (ioq_empty(priv->rxq.queue, ioq_idxtype_inuse)) {
+		netif_rx_complete(napi);
+		ioq_notify_enable(priv->rxq.queue, 0);
+		ret = 0;
+	} else
+		/* We couldn't process everything. */
+		ret = 1;
+
+	return ret;
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ */
+static int
+vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	struct ioq_iterator    iter;
+	int ret;
+	unsigned long flags;
+
+	PDEBUG("%lld: sending %d bytes\n", priv->vdev->id, skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * We must flow-control the kernel by disabling the
+		 * queue
+		 */
+		spin_unlock_irqrestore(&priv->lock, flags);
+		netif_stop_queue(dev);
+		printk(KERN_ERR "VBUS_ENET: tx on full queue bug "	\
+		       "on device %lld\n", priv->vdev->id);
+		return NETDEV_TX_BUSY;
+	}
+
+	/*
+	 * We want to iterate on the tail of both the "inuse" and "valid" index
+	 * so we specify the "both" index
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_both,
+			    IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+	BUG_ON(iter.desc->sown);
+
+	/*
+	 * We simply put the skb right onto the ring.  We will get an interrupt
+	 * later when the data has been consumed and we can reap the pointers
+	 * at that time
+	 */
+	iter.desc->cookie = (u64)skb;
+	iter.desc->len = (u64)skb->len;
+	iter.desc->ptr = (u64)__pa(skb->data);
+	iter.desc->valid  = 1;
+
+	priv->stats.tx_packets++;
+	priv->stats.tx_bytes += skb->len;
+
+	/*
+	 * This advances both indexes together implicitly, and then
+	 * signals the south side to consume the packet
+	 */
+	ret = ioq_iter_push(&iter, 0);
+	BUG_ON(ret < 0);
+
+	dev->trans_start = jiffies; /* save the timestamp */
+
+	if (ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		/*
+		 * If the queue is congested, we must flow-control the kernel
+		 */
+		PDEBUG("%lld: backpressure tx queue\n", priv->vdev->id);
+		netif_stop_queue(dev);
+	}
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return NETDEV_TX_OK;
+}
+
+/*
+ * reclaim any outstanding completed tx packets
+ *
+ * assumes priv->lock held
+ */
+static void
+vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/*
+	 * We want to iterate on the head of the valid index, but we
+	 * do not want the iter_pop (below) to flip the ownership, so
+	 * we set the NOFLIPOWNER option
+	 */
+	ret = ioq_iter_init(priv->txq.queue, &iter, ioq_idxtype_valid,
+			    IOQ_ITER_NOFLIPOWNER);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * We are done once we find the first packet either invalid or still
+	 * owned by the south-side
+	 */
+	while (iter.desc->valid && (!iter.desc->sown || force)) {
+		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+
+		PDEBUG("%lld: completed sending %d bytes\n",
+		       priv->vdev->id, skb->len);
+
+		/* Reset the descriptor */
+		iter.desc->valid  = 0;
+
+		dev_kfree_skb(skb);
+
+		/* Advance the valid-index head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/*
+	 * If we were previously stopped due to flow control, restart the
+	 * processing
+	 */
+	if (netif_queue_stopped(priv->dev)
+	    && !ioq_full(priv->txq.queue, ioq_idxtype_valid)) {
+		PDEBUG("%lld: re-enabling tx queue\n", priv->vdev->id);
+		netif_wake_queue(priv->dev);
+	}
+}
+
+static void
+vbus_enet_timeout(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	printk(KERN_DEBUG "VBUS_ENET %lld: Transmit timeout\n", priv->vdev->id);
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * Ioctl commands
+ */
+static int
+vbus_enet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+	PDEBUG("ioctl\n");
+	return 0;
+}
+
+/*
+ * Return statistics to the caller
+ */
+static struct net_device_stats *
+vbus_enet_stats(struct net_device *dev)
+{
+	struct vbus_enet_priv *priv = netdev_priv(dev);
+	return &priv->stats;
+}
+
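+/*
+ * receive interrupt-service-routine - called whenever the device signals
+ * our rx IOQ to indicate that more inbound packets are ready.
+ */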
+static void
+rx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+	struct net_device  *dev;
+
+	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
+	dev = priv->dev;
+
+	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
+		vbus_enet_schedule_rx(priv);
+}
+
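+/*
+ * deferred (tasklet) portion of the tx interrupt - reaps completed tx
+ * descriptors and then re-enables tx notifications.
+ */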
+static void
+deferred_tx_isr(unsigned long data)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
+	unsigned long flags;
+
+	PDEBUG("deferred_tx_isr for %lld\n", priv->vdev->id);
+
+	spin_lock_irqsave(&priv->lock, flags);
+	vbus_enet_tx_reap(priv, 0);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	ioq_notify_enable(priv->txq.queue, 0);
+}
+
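+/*
+ * transmit interrupt-service-routine - called whenever the device signals
+ * our tx IOQ to indicate that previously transmitted packets have been
+ * consumed and may be reclaimed.
+ */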
+static void
+tx_isr(struct ioq_notifier *notifier)
+{
+	struct vbus_enet_priv *priv;
+
+	priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
+
+	PDEBUG("tx_isr for %lld\n", priv->vdev->id);
+
+	ioq_notify_disable(priv->txq.queue, 0);
+	tasklet_schedule(&priv->txtask);
+}
+
+static struct net_device_ops vbus_enet_netdev_ops = {
+	.ndo_open          = vbus_enet_open,
+	.ndo_stop          = vbus_enet_stop,
+	.ndo_set_config    = vbus_enet_config,
+	.ndo_start_xmit    = vbus_enet_tx_start,
+	.ndo_change_mtu	   = vbus_enet_change_mtu,
+	.ndo_do_ioctl      = vbus_enet_ioctl,
+	.ndo_get_stats     = vbus_enet_stats,
+	.ndo_tx_timeout    = vbus_enet_timeout,
+};
+
+/*
+ * This is called whenever a new vbus_device_proxy is added to the vbus
+ * with the matching VENET_ID
+ */
+static int
+vbus_enet_probe(struct vbus_device_proxy *vdev)
+{
+	struct net_device  *dev;
+	struct vbus_enet_priv *priv;
+	int ret;
+
+	printk(KERN_INFO "VBUS_ENET: Found new device at %lld\n", vdev->id);
+
+	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
+	if (ret < 0)
+		return ret;
+
+	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+	memset(priv, 0, sizeof(*priv));
+
+	spin_lock_init(&priv->lock);
+	priv->dev  = dev;
+	priv->vdev = vdev;
+
+	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
+
+	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
+	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
+
+	rx_setup(priv);
+
+	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
+	ioq_notify_enable(priv->txq.queue, 0);
+
+	ether_setup(dev); /* assign some of the fields */
+
+	dev->netdev_ops     = &vbus_enet_netdev_ops;
+	dev->watchdog_timeo = 5 * HZ;
+
+	netif_napi_add(dev, &priv->napi, vbus_enet_poll, napi_weight);
+
+	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error obtaining MAC address for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	ret = register_netdev(dev);
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
+		       ret, dev->name);
+		goto out_free;
+	}
+
+	vdev->priv = priv;
+
+	return 0;
+
+ out_free:
+	free_netdev(dev);
+
+	return ret;
+}
+
+static int
+vbus_enet_remove(struct vbus_device_proxy *vdev)
+{
+	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)vdev->priv;
+	struct vbus_device_proxy *dev = priv->vdev;
+
+	unregister_netdev(priv->dev);
+	napi_disable(&priv->napi);
+
+	rx_teardown(priv);
+	vbus_enet_tx_reap(priv, 1);
+
+	ioq_put(priv->rxq.queue);
+	ioq_put(priv->txq.queue);
+
+	dev->ops->close(dev, 0);
+
+	free_netdev(priv->dev);
+
+	return 0;
+}
+
+/*
+ * Finally, the module stuff
+ */
+
+static struct vbus_driver_ops vbus_enet_driver_ops = {
+	.probe  = vbus_enet_probe,
+	.remove = vbus_enet_remove,
+};
+
+static struct vbus_driver vbus_enet_driver = {
+	.type   = VENET_TYPE,
+	.owner  = THIS_MODULE,
+	.ops    = &vbus_enet_driver_ops,
+};
+
+static __init int
+vbus_enet_init_module(void)
+{
+	printk(KERN_INFO "Virtual Ethernet: Copyright (C) 2009 Novell, Gregory Haskins\n");
+	printk(KERN_DEBUG "VBUSENET: Using %d/%d queue depth\n",
+	       rx_ringlen, tx_ringlen);
+	return vbus_driver_register(&vbus_enet_driver);
+}
+
+static __exit void
+vbus_enet_cleanup(void)
+{
+	vbus_driver_unregister(&vbus_enet_driver);
+}
+
+module_init(vbus_enet_init_module);
+module_exit(vbus_enet_cleanup);



* [RFC PATCH 10/17] venet-tap: Adds a "venet" compatible "tap" device to VBUS
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (8 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 09/17] net: Add vbus_enet driver Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 11/17] venet: add scatter-gather support Gregory Haskins
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This module is similar in concept to a "tuntap".  A tuntap module provides
a netif() interface on one side and a char-dev interface on the other.
Packets that ingress on one interface egress on the other (and vice versa).

This module offers a similar concept, except that it substitutes the
char-dev for a VBUS/IOQ interface.  This allows a VBUS compatible entity
(e.g. userspace or a guest) to directly inject and receive packets
from the host/kernel stack.
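
For context, the driver side (the vbus-enet driver from patch 09 of this
series) binds to this device roughly as follows.  This is a simplified
sketch with error handling omitted; queue_init() and devcall() are local
helpers in that driver:

	/* open a connection to the device, negotiating the ABI version */
	ret = vdev->ops->open(vdev, VENET_VERSION, 0);

	/* attach shared-memory IOQs for the rx and tx rings */
	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);

	/* synchronous "function call" into the device, e.g. to fetch the MAC */
	ret = devcall(priv, VENET_FUNC_MACQUERY, priv->dev->dev_addr, ETH_ALEN);

The VENET_QUEUE_* and VENET_FUNC_* constants are the ABI that this module
implements on the device side (see venettap_vlink_shm() and
venettap_vlink_call() below).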

Thanks to Pat Mullaney for contributing the maxcount modification

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/Makefile                 |    1 
 drivers/vbus/devices/Kconfig     |   17 
 drivers/vbus/devices/Makefile    |    1 
 drivers/vbus/devices/venet-tap.c | 1365 ++++++++++++++++++++++++++++++++++++++
 kernel/vbus/Kconfig              |   13 
 5 files changed, 1397 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/devices/Kconfig
 create mode 100644 drivers/vbus/devices/Makefile
 create mode 100644 drivers/vbus/devices/venet-tap.c

diff --git a/drivers/Makefile b/drivers/Makefile
index c1bf417..98fab51 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -106,3 +106,4 @@ obj-$(CONFIG_SSB)		+= ssb/
 obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
+obj-$(CONFIG_VBUS_DEVICES)	+= vbus/devices/
diff --git a/drivers/vbus/devices/Kconfig b/drivers/vbus/devices/Kconfig
new file mode 100644
index 0000000..64e4731
--- /dev/null
+++ b/drivers/vbus/devices/Kconfig
@@ -0,0 +1,17 @@
+#
+# Virtual-Bus (VBus) configuration
+#
+
+config VBUS_VENETTAP
+       tristate "Virtual-Bus Ethernet Tap Device"
+       depends on VBUS_DEVICES
+       default n
+       help
+        Provides a virtual ethernet adapter to a vbus, which in turn
+        manifests itself as a standard netif based adapter to the
+        kernel.  It can be used similarly to a "tuntap" device,
+        except that the char-dev transport is replaced with a vbus/ioq
+        interface.
+
+        If unsure, say N
+
diff --git a/drivers/vbus/devices/Makefile b/drivers/vbus/devices/Makefile
new file mode 100644
index 0000000..2ea7d2a
--- /dev/null
+++ b/drivers/vbus/devices/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_VBUS_VENETTAP) += venet-tap.o
diff --git a/drivers/vbus/devices/venet-tap.c b/drivers/vbus/devices/venet-tap.c
new file mode 100644
index 0000000..ccce58e
--- /dev/null
+++ b/drivers/vbus/devices/venet-tap.c
@@ -0,0 +1,1365 @@
+/*
+ * venettap - A 802.x virtual network device based on the VBUS/IOQ interface
+ *
+ * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
+ *
+ * Derived from the SNULL example from the book "Linux Device Drivers" by
+ * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
+ * by O'Reilly & Associates.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/interrupt.h>
+
+#include <linux/in.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/skbuff.h>
+#include <linux/ioq.h>
+#include <linux/vbus.h>
+#include <linux/freezer.h>
+#include <linux/kthread.h>
+
+#include <linux/venet.h>
+
+#include <linux/in6.h>
+#include <asm/checksum.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+
+#undef PDEBUG             /* undef it, just in case */
+#ifdef VENETTAP_DEBUG
+#  define PDEBUG(fmt, args...) printk(KERN_DEBUG "venet-tap: " fmt, ## args)
+#else
+#  define PDEBUG(fmt, args...) /* not debugging: nothing */
+#endif
+
+static int maxcount = 2048;
+module_param(maxcount, int, 0600);
+MODULE_PARM_DESC(maxcount, "maximum size for rx/tx ioq ring");
+
+static void venettap_tx_isr(struct ioq_notifier *notifier);
+static int venettap_rx_thread(void *__priv);
+static int venettap_tx_thread(void *__priv);
+
+struct venettap_queue {
+	struct ioq              *queue;
+	struct ioq_notifier      notifier;
+};
+
+struct venettap;
+
+enum {
+	RX_SCHED,
+	TX_SCHED,
+	TX_NETIF_CONGESTED,
+	TX_IOQ_CONGESTED,
+};
+
+struct venettap {
+	spinlock_t                   lock;
+	unsigned char                hmac[ETH_ALEN]; /* host-mac */
+	unsigned char                cmac[ETH_ALEN]; /* client-mac */
+	struct task_struct          *rxthread;
+	struct task_struct          *txthread;
+	unsigned long                flags;
+
+	struct {
+		struct net_device           *dev;
+		struct net_device_stats      stats;
+		struct {
+			struct sk_buff_head  list;
+			size_t               len;
+			int                  irqdepth;
+		} txq;
+		int                          enabled:1;
+		int                          link:1;
+	} netif;
+
+	struct {
+		struct vbus_device           dev;
+		struct vbus_device_interface intf;
+		struct vbus_connection       conn;
+		struct vbus_memctx          *ctx;
+		struct venettap_queue        rxq;
+		struct venettap_queue        txq;
+		int                          connected:1;
+		int                          opened:1;
+		int                          link:1;
+	} vbus;
+};
+
+static int
+venettap_queue_init(struct venettap_queue *q,
+		    struct vbus_shm *shm,
+		    struct shm_signal *signal,
+		    void (*func)(struct ioq_notifier *))
+{
+	struct ioq *ioq;
+	int ret;
+
+	if (q->queue)
+		return -EEXIST;
+
+	/* FIXME: make maxcount a tunable */
+	ret = vbus_shm_ioq_attach(shm, signal, maxcount, &ioq);
+	if (ret < 0)
+		return ret;
+
+	q->queue = ioq;
+	ioq_get(ioq);
+
+	if (func) {
+		q->notifier.signal = func;
+		q->queue->notifier = &q->notifier;
+	}
+
+	return 0;
+}
+
+static void
+venettap_queue_release(struct venettap_queue *q)
+{
+	if (!q->queue)
+		return;
+
+	ioq_put(q->queue);
+	q->queue = NULL;
+}
+
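+/*
+ * The tx-queue notifier is effectively reference counted:  we only want
+ * tx notifications from the IOQ while at least one congestion condition
+ * (TX_NETIF_CONGESTED or TX_IOQ_CONGESTED) is outstanding.  irqdepth
+ * tracks the number of such conditions, and notifications are enabled
+ * only while it is non-zero.
+ */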
+/* Assumes priv->lock is held */
+static void
+venettap_txq_notify_inc(struct venettap *priv)
+{
+	priv->netif.txq.irqdepth++;
+	if (priv->netif.txq.irqdepth == 1 && priv->vbus.link)
+		ioq_notify_enable(priv->vbus.txq.queue, 0);
+}
+
+/* Assumes priv->lock is held */
+static void
+venettap_txq_notify_dec(struct venettap *priv)
+{
+	BUG_ON(!priv->netif.txq.irqdepth);
+	priv->netif.txq.irqdepth--;
+	if (!priv->netif.txq.irqdepth && priv->vbus.link)
+		ioq_notify_disable(priv->vbus.txq.queue, 0);
+}
+
+/*
+ *----------------------------------------------------------------------
+ * netif link
+ *----------------------------------------------------------------------
+ */
+
+static struct venettap *conn_to_priv(struct vbus_connection *conn)
+{
+	return container_of(conn, struct venettap, vbus.conn);
+}
+
+static struct venettap *intf_to_priv(struct vbus_device_interface *intf)
+{
+	return container_of(intf, struct venettap, vbus.intf);
+}
+
+static struct venettap *vdev_to_priv(struct vbus_device *vdev)
+{
+	return container_of(vdev, struct venettap, vbus.dev);
+}
+
+static int
+venettap_netdev_open(struct net_device *dev)
+{
+	struct venettap *priv = netdev_priv(dev);
+	unsigned long flags;
+
+	BUG_ON(priv->netif.link);
+
+	/*
+	 * We need rx-polling to be done in process context, and we want
+	 * ingress processing to occur independently of the producer thread
+	 * to maximize multi-core distribution.  Since the built-in NAPI uses a
+	 * softirq, we cannot guarantee this won't call us back in interrupt
+	 * context, so we can't use it.  And either a work-queue or a softirq
+	 * solution would tend to process requests on the same CPU as the
+	 * producer.  Therefore, we create a special thread to handle ingress.
+	 *
+	 * The downside to this type of approach is that we may still need to
+	 * ctx-switch to the NAPI polling thread (presumably running on the same
+	 * core as the rx-thread) by virtue of the netif_rx() backlog mechanism.
+	 * However, this can be mitigated by the use of netif_rx_ni().
+	 */
+	priv->rxthread = kthread_create(venettap_rx_thread, priv,
+					"%s-rx", priv->netif.dev->name);
+
+	priv->txthread = kthread_create(venettap_tx_thread, priv,
+					"%s-tx", priv->netif.dev->name);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	priv->netif.link = true;
+
+	if (!priv->vbus.link)
+		netif_carrier_off(dev);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+static int
+venettap_netdev_stop(struct net_device *dev)
+{
+	struct venettap *priv = netdev_priv(dev);
+	unsigned long flags;
+	int needs_stop = false;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (priv->netif.link) {
+		needs_stop = true;
+		priv->netif.link = false;
+	}
+
+	/* FIXME: free priv->netif.txq */
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	if (needs_stop) {
+		kthread_stop(priv->rxthread);
+		priv->rxthread = NULL;
+
+		kthread_stop(priv->txthread);
+		priv->txthread = NULL;
+	}
+
+	return 0;
+}
+
+/*
+ * Configuration changes (passed on by ifconfig)
+ */
+static int
+venettap_netdev_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP) /* can't act on a running interface */
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr) {
+		printk(KERN_WARNING "venettap: Can't change I/O address\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* ignore other fields */
+	return 0;
+}
+
+static int
+venettap_change_mtu(struct net_device *dev, int new_mtu)
+{
+	dev->mtu = new_mtu;
+
+	return 0;
+}
+
+/*
+ * The poll implementation.
+ */
+static int
+venettap_rx(struct venettap *priv)
+{
+	struct ioq                 *ioq;
+	struct vbus_memctx         *ctx;
+	int                         npackets = 0;
+	int                         dirty = 0;
+	struct ioq_iterator         iter;
+	int                         ret;
+	unsigned long               flags;
+	struct vbus_connection     *conn;
+
+	PDEBUG("polling...\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (!priv->vbus.link) {
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return 0;
+	}
+
+	/*
+	 * We take a reference to the connection object to ensure that the
+	 * ioq/ctx references do not disappear out from under us.  We could
+	 * accomplish the same thing more directly by acquiring a reference
+	 * to the ioq and ctx explicitly, but this would require an extra
+	 * atomic_inc+dec pair, for no additional benefit
+	 */
+	conn = &priv->vbus.conn;
+	vbus_connection_get(conn);
+
+	ioq = priv->vbus.rxq.queue;
+	ctx = priv->vbus.ctx;
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the north side
+	 */
+	while (priv->vbus.link && iter.desc->sown) {
+		size_t len = iter.desc->len;
+		size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
+		struct sk_buff *skb = NULL;
+
+		if (unlikely(len > maxlen)) {
+			priv->netif.stats.rx_errors++;
+			priv->netif.stats.rx_length_errors++;
+			goto next;
+		}
+
+		skb = dev_alloc_skb(len+2);
+		if (unlikely(!skb)) {
+			printk(KERN_INFO "VENETTAP: skb alloc failed:"	\
+			       " memory squeeze.\n");
+			priv->netif.stats.rx_errors++;
+			priv->netif.stats.rx_dropped++;
+			goto next;
+		}
+
+		/* align IP on 16B boundary */
+		skb_reserve(skb, 2);
+
+		ret = ctx->ops->copy_from(ctx, skb->data,
+					 (void *)iter.desc->ptr,
+					 len);
+		if (unlikely(ret)) {
+			priv->netif.stats.rx_errors++;
+			dev_kfree_skb(skb);	/* don't leak the skb on a failed copy */
+			goto next;
+		}
+
+		/* Maintain stats */
+		npackets++;
+		priv->netif.stats.rx_packets++;
+		priv->netif.stats.rx_bytes += len;
+
+		/* Pass the buffer up to the stack */
+		skb->dev      = priv->netif.dev;
+		skb->protocol = eth_type_trans(skb, priv->netif.dev);
+
+		netif_rx_ni(skb);
+next:
+		dirty = 1;
+
+		/* Advance the in-use head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+
+		/* batch up to 10 packets before signaling tx-complete to the other side */
+		if (!(npackets % 10)) {
+			ioq_signal(ioq, 0);
+			dirty = 0;
+		}
+
+	}
+
+	PDEBUG("poll: %d packets received\n", npackets);
+
+	if (dirty)
+		ioq_signal(ioq, 0);
+
+	/*
+	 * If we processed all packets we're done, so reenable ints
+	 */
+	if (ioq_empty(ioq, ioq_idxtype_inuse)) {
+		clear_bit(RX_SCHED, &priv->flags);
+		ioq_notify_enable(ioq, 0);
+	}
+
+	vbus_connection_put(conn);
+
+	return 0;
+}
+
+static int venettap_rx_thread(void *__priv)
+{
+	struct venettap *priv = __priv;
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!freezing(current) &&
+		    !kthread_should_stop() &&
+		    !test_bit(RX_SCHED, &priv->flags))
+			schedule();
+		set_current_state(TASK_RUNNING);
+
+		try_to_freeze();
+
+		if (kthread_should_stop())
+			break;
+
+		venettap_rx(priv);
+	}
+
+	return 0;
+}
+
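+/*
+ * Re-open the netif tx queue once the IOQ has enough room to absorb the
+ * packets we have buffered internally.
+ */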
+/* assumes priv->lock is held */
+static void
+venettap_check_netif_congestion(struct venettap *priv)
+{
+	struct ioq *ioq = priv->vbus.txq.queue;
+
+	if (priv->vbus.link
+	    && priv->netif.txq.len < ioq_remain(ioq, ioq_idxtype_inuse)
+	    && test_and_clear_bit(TX_NETIF_CONGESTED, &priv->flags)) {
+		PDEBUG("NETIF congestion cleared\n");
+		venettap_txq_notify_dec(priv);
+
+		if (priv->netif.link)
+			netif_wake_queue(priv->netif.dev);
+	}
+}
+
+static int
+venettap_tx(struct venettap *priv)
+{
+	struct sk_buff             *skb;
+	struct ioq_iterator         iter;
+	struct ioq                 *ioq = NULL;
+	struct vbus_memctx         *ctx;
+	int                         ret;
+	int                         npackets = 0;
+	unsigned long               flags;
+	struct vbus_connection     *conn;
+
+	PDEBUG("tx-thread\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (unlikely(!priv->vbus.link)) {
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return 0;
+	}
+
+	/*
+	 * We take a reference to the connection object to ensure that the
+	 * ioq/ctx references do not disappear out from under us.  We could
+	 * accomplish the same thing more directly by acquiring a reference
+	 * to the ioq and ctx explicitly, but this would require an extra
+	 * atomic_inc+dec pair, for no additional benefit
+	 */
+	conn = &priv->vbus.conn;
+	vbus_connection_get(conn);
+
+	ioq = priv->vbus.txq.queue;
+	ctx = priv->vbus.ctx;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, IOQ_ITER_AUTOUPDATE);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	while (priv->vbus.link && iter.desc->sown && priv->netif.txq.len) {
+
+		skb = __skb_dequeue(&priv->netif.txq.list);
+		if (!skb)
+			break;
+
+		spin_unlock_irqrestore(&priv->lock, flags);
+
+		PDEBUG("tx-thread: sending %d bytes\n", skb->len);
+
+		if (skb->len <= iter.desc->len) {
+			ret = ctx->ops->copy_to(ctx, (void *)iter.desc->ptr,
+					       skb->data, skb->len);
+			BUG_ON(ret);
+
+			iter.desc->len = skb->len;
+
+			npackets++;
+			priv->netif.stats.tx_packets++;
+			priv->netif.stats.tx_bytes += skb->len;
+
+			ret = ioq_iter_push(&iter, 0);
+			BUG_ON(ret < 0);
+		} else {
+			printk(KERN_WARNING				\
+			       "VENETTAP: discarding packet: buf too small " \
+			       "(%d > %lld)\n", skb->len, iter.desc->len);
+			priv->netif.stats.tx_errors++;
+		}
+
+		dev_kfree_skb(skb);
+		priv->netif.dev->trans_start = jiffies; /* save the timestamp */
+
+		spin_lock_irqsave(&priv->lock, flags);
+
+		priv->netif.txq.len--;
+	}
+
+	PDEBUG("send complete\n");
+
+	if (!priv->vbus.link || !priv->netif.txq.len) {
+		PDEBUG("descheduling TX: link=%d, len=%d\n",
+		       priv->vbus.link, priv->netif.txq.len);
+		clear_bit(TX_SCHED, &priv->flags);
+	} else if (!test_and_set_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+		PDEBUG("congested with %d packets still queued\n",
+		       priv->netif.txq.len);
+		venettap_txq_notify_inc(priv);
+	}
+
+	venettap_check_netif_congestion(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	vbus_connection_put(conn);
+
+	return npackets;
+}
+
+static int venettap_tx_thread(void *__priv)
+{
+	struct venettap *priv = __priv;
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!freezing(current) &&
+		    !kthread_should_stop() &&
+		    (test_bit(TX_IOQ_CONGESTED, &priv->flags) ||
+		     !test_bit(TX_SCHED, &priv->flags)))
+			schedule();
+		set_current_state(TASK_RUNNING);
+
+		PDEBUG("tx wakeup: %s%s%s\n",
+		       test_bit(TX_SCHED, &priv->flags) ? "s" : "-",
+		       test_bit(TX_IOQ_CONGESTED, &priv->flags) ? "c" : "-",
+		       test_bit(TX_NETIF_CONGESTED, &priv->flags) ? "b" : "-"
+			);
+
+		try_to_freeze();
+
+		if (kthread_should_stop())
+			break;
+
+		venettap_tx(priv);
+	}
+
+	return 0;
+}
+
+static void
+venettap_deferred_tx(struct venettap *priv)
+{
+	PDEBUG("wake up txthread\n");
+	wake_up_process(priv->txthread);
+}
+
+/* assumes priv->lock is held */
+static void
+venettap_apply_backpressure(struct venettap *priv)
+{
+	PDEBUG("backpressure\n");
+
+	if (!test_and_set_bit(TX_NETIF_CONGESTED, &priv->flags)) {
+		/*
+		 * We must flow-control the kernel by disabling the queue
+		 */
+		netif_stop_queue(priv->netif.dev);
+		venettap_txq_notify_inc(priv);
+	}
+}
+
+/*
+ * Transmit a packet (called by the kernel)
+ *
+ * We want to perform ctx->copy_to() operations from a sleepable process
+ * context, so we defer the actual tx operations to a thread.
+ * However, we want to be careful that we do not double-buffer the
+ * queue, so we create a buffer whose space dynamically grows and
+ * shrinks with the availability of the actual IOQ.  This means that
+ * the netif flow control is still managed by the actual consumer,
+ * thereby avoiding the addition of an extra servo-loop to the system.
+ */
+static int
+venettap_netdev_tx(struct sk_buff *skb, struct net_device *dev)
+{
+	struct venettap *priv = netdev_priv(dev);
+	struct ioq      *ioq = NULL;
+	unsigned long    flags;
+
+	PDEBUG("queuing %d bytes\n", skb->len);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	ioq = priv->vbus.txq.queue;
+
+	BUG_ON(test_bit(TX_NETIF_CONGESTED, &priv->flags));
+
+	if (!priv->vbus.link) {
+		/*
+		 * We have a link-down condition
+		 */
+		printk(KERN_ERR "VENETTAP: tx on link down\n");
+		goto flowcontrol;
+	}
+
+	__skb_queue_tail(&priv->netif.txq.list, skb);
+	priv->netif.txq.len++;
+	set_bit(TX_SCHED, &priv->flags);
+
+	if (priv->netif.txq.len >= ioq_remain(ioq, ioq_idxtype_inuse))
+		venettap_apply_backpressure(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	venettap_deferred_tx(priv);
+
+	return NETDEV_TX_OK;
+
+flowcontrol:
+	venettap_apply_backpressure(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return NETDEV_TX_BUSY;
+}
+
+/*
+ * Ioctl commands
+ */
+static int
+venettap_netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
+{
+	PDEBUG("ioctl\n");
+	return 0;
+}
+
+/*
+ * Return statistics to the caller
+ */
+static struct net_device_stats *
+venettap_netdev_stats(struct net_device *dev)
+{
+	struct venettap *priv = netdev_priv(dev);
+	return &priv->netif.stats;
+}
+
+static void
+venettap_netdev_unregister(struct venettap *priv)
+{
+	if (priv->netif.enabled) {
+		venettap_netdev_stop(priv->netif.dev);
+		unregister_netdev(priv->netif.dev);
+	}
+}
+
+/*
+ * Assumes priv->lock held
+ */
+static void
+venettap_rx_schedule(struct venettap *priv)
+{
+	if (!priv->vbus.link)
+		return;
+
+	if (priv->netif.link
+	    && !ioq_empty(priv->vbus.rxq.queue, ioq_idxtype_inuse)) {
+		ioq_notify_disable(priv->vbus.rxq.queue, 0);
+
+		if (!test_and_set_bit(RX_SCHED, &priv->flags))
+			wake_up_process(priv->rxthread);
+	}
+}
+
+/*
+ * receive interrupt-service-routine - called whenever the vbus-driver signals
+ * our IOQ to indicate more inbound packets are ready.
+ */
+static void
+venettap_rx_isr(struct ioq_notifier *notifier)
+{
+	struct venettap *priv;
+	unsigned long flags;
+
+	priv = container_of(notifier, struct venettap, vbus.rxq.notifier);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	/* Disable future interrupts and schedule our napi-poll */
+	venettap_rx_schedule(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+/*
+ * transmit interrupt-service-routine - called whenever the vbus-driver signals
+ * our IOQ to indicate there is more room in the TX queue
+ */
+static void
+venettap_tx_isr(struct ioq_notifier *notifier)
+{
+	struct venettap *priv;
+	unsigned long flags;
+
+	priv = container_of(notifier, struct venettap, vbus.txq.notifier);
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (priv->vbus.link
+	    && !ioq_full(priv->vbus.txq.queue, ioq_idxtype_inuse)
+	    && test_and_clear_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+		PDEBUG("IOQ congestion cleared\n");
+		venettap_txq_notify_dec(priv);
+
+		if (priv->netif.link)
+			wake_up_process(priv->txthread);
+	}
+
+	venettap_check_netif_congestion(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static int
+venettap_vlink_up(struct venettap *priv)
+{
+	int ret = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (priv->vbus.link) {
+		ret = -EEXIST;
+		goto out;
+	}
+
+	if (!priv->vbus.rxq.queue || !priv->vbus.txq.queue) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	priv->vbus.link = 1;
+
+	if (priv->netif.link)
+		netif_carrier_on(priv->netif.dev);
+
+	venettap_check_netif_congestion(priv);
+
+	ioq_notify_enable(priv->vbus.rxq.queue, 0);
+
+out:
+	spin_unlock_irqrestore(&priv->lock, flags);
+	return ret;
+}
+
+/* Assumes priv->lock held */
+static int
+_venettap_vlink_down(struct venettap *priv)
+{
+	struct sk_buff *skb;
+
+	if (!priv->vbus.link)
+		return -ENOENT;
+
+	priv->vbus.link = 0;
+
+	if (priv->netif.link)
+		netif_carrier_off(priv->netif.dev);
+
+	/* just trash whatever might have been pending */
+	while ((skb = __skb_dequeue(&priv->netif.txq.list)))
+		dev_kfree_skb(skb);
+
+	priv->netif.txq.len = 0;
+
+	/* And deschedule any pending processing */
+	clear_bit(RX_SCHED, &priv->flags);
+	clear_bit(TX_SCHED, &priv->flags);
+
+	ioq_notify_disable(priv->vbus.rxq.queue, 0);
+
+	return 0;
+}
+
+static int
+venettap_vlink_down(struct venettap *priv)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	ret = _venettap_vlink_down(priv);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return ret;
+}
+
+static int
+venettap_macquery(struct venettap *priv, void *data, unsigned long len)
+{
+	struct vbus_memctx *ctx = priv->vbus.ctx;
+	int ret;
+
+	if (len != ETH_ALEN)
+		return -EINVAL;
+
+	ret = ctx->ops->copy_to(ctx, data, priv->cmac, ETH_ALEN);
+	if (ret)
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Negotiate Capabilities - This function is provided so that the
+ * interface may be extended without breaking ABI compatibility
+ *
+ * The caller is expected to send down any capabilities they would like
+ * to enable, and the device will AND them against the capabilities it
+ * supports.  This value is then returned so that both sides may
+ * ascertain the lowest-common-denominator of features to enable
+ */
+static int
+venettap_negcap(struct venettap *priv, void *data, unsigned long len)
+{
+	struct vbus_memctx *ctx = priv->vbus.ctx;
+	struct venet_capabilities caps;
+	int ret;
+
+	if (len != sizeof(caps))
+		return -EINVAL;
+
+	if (priv->vbus.link)
+		return -EINVAL;
+
+	ret = ctx->ops->copy_from(ctx, &caps, data, sizeof(caps));
+	if (ret)
+		return -EFAULT;
+
+	switch (caps.gid) {
+	default:
+		caps.bits = 0;
+		break;
+	}
+
+	ret = ctx->ops->copy_to(ctx, data, &caps, sizeof(caps));
+	if (ret)
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Walk through and flush each remaining descriptor by returning
+ * a zero length packet.
+ *
+ * This is useful, for instance, when the driver is changing the MTU
+ * and wants to reclaim all of the outstanding buffers, which were
+ * sized for the old MTU
+ */
+static int
+venettap_flushrx(struct venettap *priv)
+{
+	struct ioq_iterator         iter;
+	struct ioq                 *ioq = NULL;
+	int                         ret;
+	unsigned long               flags;
+
+	PDEBUG("flushrx\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	if (unlikely(!priv->vbus.link)) {
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return -EINVAL;
+	}
+
+	ioq = priv->vbus.txq.queue;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	while (iter.desc->sown) {
+		iter.desc->len = 0;
+		ret = ioq_iter_push(&iter, 0);
+		if (ret < 0)
+			SHM_SIGNAL_FAULT(ioq->signal, "could not flushrx");
+	}
+
+	PDEBUG("flushrx complete\n");
+
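+	/*
+	 * Flushing handed every posted buffer back to the driver, so our
+	 * tx IOQ has no free descriptors left; mark it congested until
+	 * the driver posts new (re-sized) buffers.
+	 */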
+	if (!test_and_set_bit(TX_IOQ_CONGESTED, &priv->flags)) {
+		PDEBUG("congested with %d packets still queued\n",
+		       priv->netif.txq.len);
+		venettap_txq_notify_inc(priv);
+	}
+
+	/*
+	 * we purposely do not ioq_signal() the other side here.  Since
+	 * this function was invoked by the client, they can take care
+	 * of explicitly calling any reclaim code if they like.  This also
+	 * avoids a potential deadlock in case turning around and injecting
+	 * a signal while we are in a call() is problematic to the
+	 * connector design
+	 */
+
+	venettap_check_netif_congestion(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+/*
+ * This is called whenever a driver wants to perform a synchronous
+ * "function call" to our device.  It is similar to the notion of
+ * an ioctl().  The parameters are part of the ABI between the device
+ * and driver.
+ */
+static int
+venettap_vlink_call(struct vbus_connection *conn,
+		    unsigned long func,
+		    void *data,
+		    unsigned long len,
+		    unsigned long flags)
+{
+	struct venettap *priv = conn_to_priv(conn);
+
+	PDEBUG("call -> %lu with %p/%lu\n", func, data, len);
+
+	switch (func) {
+	case VENET_FUNC_LINKUP:
+		return venettap_vlink_up(priv);
+	case VENET_FUNC_LINKDOWN:
+		return venettap_vlink_down(priv);
+	case VENET_FUNC_MACQUERY:
+		return venettap_macquery(priv, data, len);
+	case VENET_FUNC_NEGCAP:
+		return venettap_negcap(priv, data, len);
+	case VENET_FUNC_FLUSHRX:
+		return venettap_flushrx(priv);
+	default:
+		return -EINVAL;
+	}
+}
+
+/*
+ * This is called whenever a driver wants to open a new IOQ between itself
+ * and our device.  The "id" field is meant to convey meaning to the device
+ * as to what the intended use of this IOQ is.  For instance, for venet "id=0"
+ * means "rx" and "id=1" = "tx".  That namespace is managed by the device
+ * and should be understood by the driver as part of its ABI agreement.
+ *
+ * The device should take a reference to the IOQ via ioq_get() and hold it
+ * until the connection is released.
+ */
+static int
+venettap_vlink_shm(struct vbus_connection *conn,
+		   unsigned long id,
+		   struct vbus_shm *shm,
+		   struct shm_signal *signal,
+		   unsigned long flags)
+{
+	struct venettap *priv = conn_to_priv(conn);
+
+	PDEBUG("queue -> %p/%lu attached\n", shm, id);
+
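+	/*
+	 * Note that the ids are from the driver's perspective:  the
+	 * driver's rx queue is our tx queue, and vice versa.
+	 */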
+	switch (id) {
+	case VENET_QUEUE_RX:
+		return venettap_queue_init(&priv->vbus.txq, shm, signal,
+					   venettap_tx_isr);
+	case VENET_QUEUE_TX:
+		return venettap_queue_init(&priv->vbus.rxq, shm, signal,
+					   venettap_rx_isr);
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * This is called whenever the driver closes all references to our device
+ */
+static void
+venettap_vlink_release(struct vbus_connection *conn)
+{
+	struct venettap *priv = conn_to_priv(conn);
+	unsigned long flags;
+
+	PDEBUG("connection released\n");
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	priv->vbus.opened = false;
+	_venettap_vlink_down(priv);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	venettap_queue_release(&priv->vbus.rxq);
+	venettap_queue_release(&priv->vbus.txq);
+	vbus_memctx_put(priv->vbus.ctx);
+
+	kobject_put(priv->vbus.dev.kobj);
+}
+
+static struct vbus_connection_ops venettap_vbus_link_ops = {
+	.call    = venettap_vlink_call,
+	.shm     = venettap_vlink_shm,
+	.release = venettap_vlink_release,
+};
+
+/*
+ * This is called whenever a driver wants to open our device_interface
+ * for communication.  The connection is represented by a
+ * vbus_connection object.  It is up to the implementation to decide
+ * if it allows more than one connection at a time.  This simple example
+ * does not.
+ */
+static int
+venettap_intf_open(struct vbus_device_interface *intf,
+		   struct vbus_memctx *ctx,
+		   int version,
+		   struct vbus_connection **conn)
+{
+	struct venettap *priv = intf_to_priv(intf);
+	unsigned long flags;
+
+	PDEBUG("open\n");
+
+	if (version != VENET_VERSION)
+		return -EINVAL;
+
+	spin_lock_irqsave(&priv->lock, flags);
+
+	/*
+	 * We only allow one connection to this device
+	 */
+	if (priv->vbus.opened) {
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return -EBUSY;
+	}
+
+	kobject_get(intf->dev->kobj);
+
+	vbus_connection_init(&priv->vbus.conn, &venettap_vbus_link_ops);
+
+	priv->vbus.opened = true;
+	priv->vbus.ctx = ctx;
+
+	vbus_memctx_get(ctx);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	*conn = &priv->vbus.conn;
+
+	return 0;
+}
+
+static void
+venettap_intf_release(struct vbus_device_interface *intf)
+{
+	kobject_put(intf->dev->kobj);
+}
+
+static struct vbus_device_interface_ops venettap_device_interface_ops = {
+	.open = venettap_intf_open,
+	.release = venettap_intf_release,
+};
+
+/*
+ * This is called whenever the admin creates a symbolic link between
+ * a bus in /config/vbus/buses and our device.  It represents a bus
+ * connection.  Your device can choose to allow more than one bus to
+ * connect, or it can restrict it to one bus.  It can also choose to
+ * register one or more device_interfaces on each bus that it
+ * successfully connects to.
+ *
+ * This example device only registers a single interface
+ */
+static int
+venettap_device_bus_connect(struct vbus_device *dev, struct vbus *vbus)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+	struct vbus_device_interface *intf = &priv->vbus.intf;
+
+	/* We only allow one bus to connect */
+	if (priv->vbus.connected)
+		return -EBUSY;
+
+	kobject_get(dev->kobj);
+
+	intf->name = "0";
+	intf->type = VENET_TYPE;
+	intf->ops = &venettap_device_interface_ops;
+
+	priv->vbus.connected = true;
+
+	/*
+	 * Our example only registers one interface.  If you need
+	 * more, simply call vbus_device_interface_register() multiple times
+	 */
+	return vbus_device_interface_register(dev, vbus, intf);
+}
+
+/*
+ * This is called whenever the admin removes the symbolic link between
+ * a bus in /config/vbus/buses and our device.
+ */
+static int
+venettap_device_bus_disconnect(struct vbus_device *dev, struct vbus *vbus)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+	struct vbus_device_interface *intf = &priv->vbus.intf;
+
+	if (!priv->vbus.connected)
+		return -EINVAL;
+
+	vbus_device_interface_unregister(intf);
+
+	priv->vbus.connected = false;
+	kobject_put(dev->kobj);
+
+	return 0;
+}
+
+static void
+venettap_device_release(struct vbus_device *dev)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+
+	venettap_netdev_unregister(priv);
+	free_netdev(priv->netif.dev);
+	module_put(THIS_MODULE);
+}
+
+
+static struct vbus_device_ops venettap_device_ops = {
+	.bus_connect = venettap_device_bus_connect,
+	.bus_disconnect = venettap_device_bus_disconnect,
+	.release = venettap_device_release,
+};
+
+#define VENETTAP_TYPE "venet-tap"
+
+/*
+ * Interface attributes show up as files under
+ * /sys/vbus/devices/$devid
+ */
+static ssize_t
+host_mac_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+	 char *buf)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+
+	return sysfs_format_mac(buf, priv->hmac, ETH_ALEN);
+}
+
+static struct vbus_device_attribute attr_hmac =
+	__ATTR_RO(host_mac);
+
+static ssize_t
+client_mac_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+	 char *buf)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+
+	return sysfs_format_mac(buf, priv->cmac, ETH_ALEN);
+}
+
+static struct vbus_device_attribute attr_cmac =
+	__ATTR_RO(client_mac);
+
+static ssize_t
+enabled_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+	 char *buf)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", priv->netif.enabled);
+}
+
+static ssize_t
+enabled_store(struct vbus_device *dev, struct vbus_device_attribute *attr,
+	      const char *buf, size_t count)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+	int enabled = -1;
+	int ret = 0;
+
+	if (count > 0)
+		sscanf(buf, "%d", &enabled);
+
+	if (enabled != 0 && enabled != 1)
+		return -EINVAL;
+
+	if (enabled && !priv->netif.enabled)
+		ret = register_netdev(priv->netif.dev);
+
+	if (!enabled && priv->netif.enabled)
+		venettap_netdev_unregister(priv);
+
+	if (ret < 0)
+		return ret;
+
+	priv->netif.enabled = enabled;
+
+	return count;
+}
+
+static struct vbus_device_attribute attr_enabled =
+	__ATTR(enabled, S_IRUGO | S_IWUSR, enabled_show, enabled_store);
+
+static ssize_t
+ifname_show(struct vbus_device *dev, struct vbus_device_attribute *attr,
+	   char *buf)
+{
+	struct venettap *priv = vdev_to_priv(dev);
+
+	if (!priv->netif.enabled)
+		return sprintf(buf, "<disabled>\n");
+
+	return snprintf(buf, PAGE_SIZE, "%s\n", priv->netif.dev->name);
+}
+
+static struct vbus_device_attribute attr_ifname =
+	__ATTR_RO(ifname);
+
+static struct attribute *attrs[] = {
+	&attr_hmac.attr,
+	&attr_cmac.attr,
+	&attr_enabled.attr,
+	&attr_ifname.attr,
+	NULL,
+};
+
+static struct attribute_group venettap_attr_group = {
+	.attrs = attrs,
+};
+
+static struct net_device_ops venettap_netdev_ops = {
+	.ndo_open        = venettap_netdev_open,
+	.ndo_stop        = venettap_netdev_stop,
+	.ndo_set_config  = venettap_netdev_config,
+	.ndo_change_mtu  = venettap_change_mtu,
+	.ndo_start_xmit  = venettap_netdev_tx,
+	.ndo_do_ioctl    = venettap_netdev_ioctl,
+	.ndo_get_stats   = venettap_netdev_stats,
+};
+
+/*
+ * This is called whenever the admin instantiates our devclass via
+ * "mkdir /config/vbus/devices/$(inst)/venet-tap"
+ */
+static int
+venettap_device_create(struct vbus_devclass *dc,
+		       struct vbus_device **vdev)
+{
+	struct net_device *dev;
+	struct venettap *priv;
+	struct vbus_device *_vdev;
+
+	dev = alloc_etherdev(sizeof(struct venettap));
+	if (!dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(dev);
+	memset(priv, 0, sizeof(*priv));
+
+	spin_lock_init(&priv->lock);
+	random_ether_addr(priv->hmac);
+	random_ether_addr(priv->cmac);
+
+	/*
+	 * vbus init
+	 */
+	_vdev = &priv->vbus.dev;
+
+	_vdev->type            = VENETTAP_TYPE;
+	_vdev->ops             = &venettap_device_ops;
+	_vdev->attrs           = &venettap_attr_group;
+
+	/*
+	 * netif init
+	 */
+	skb_queue_head_init(&priv->netif.txq.list);
+	priv->netif.txq.len = 0;
+
+	priv->netif.dev = dev;
+
+	ether_setup(dev); /* assign some of the fields */
+
+	dev->netdev_ops = &venettap_netdev_ops;
+	memcpy(dev->dev_addr, priv->hmac, ETH_ALEN);
+
+	dev->features |= NETIF_F_HIGHDMA;
+
+	*vdev = _vdev;
+
+	/*
+	 * We don't need a try_get because the reference is held by the
+	 * infrastructure during a create() operation
+	 */
+	__module_get(THIS_MODULE);
+
+	return 0;
+}
+
+static struct vbus_devclass_ops venettap_devclass_ops = {
+	.create = venettap_device_create,
+};
+
+static struct vbus_devclass venettap_devclass = {
+	.name = VENETTAP_TYPE,
+	.ops = &venettap_devclass_ops,
+	.owner = THIS_MODULE,
+};
+
+static int __init venettap_init(void)
+{
+	return vbus_devclass_register(&venettap_devclass);
+}
+
+static void __exit venettap_cleanup(void)
+{
+	vbus_devclass_unregister(&venettap_devclass);
+}
+
+module_init(venettap_init);
+module_exit(venettap_cleanup);
diff --git a/kernel/vbus/Kconfig b/kernel/vbus/Kconfig
index 71acd6f..3ce0adc 100644
--- a/kernel/vbus/Kconfig
+++ b/kernel/vbus/Kconfig
@@ -14,6 +14,17 @@ config VBUS
 
 	If unsure, say N
 
+config VBUS_DEVICES
+       bool "Virtual-Bus Devices"
+       depends on VBUS
+       default n
+       help
+         Provides device-class modules for instantiation on a virtual-bus
+
+	 If unsure, say N
+
+source "drivers/vbus/devices/Kconfig"
+
 config VBUS_DRIVERS
        tristate "VBUS Driver support"
        select IOQ
@@ -23,3 +34,5 @@ config VBUS_DRIVERS
 
 	If unsure, say N
 
+
+



* [RFC PATCH 11/17] venet: add scatter-gather support
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (9 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 10/17] venet-tap: Adds a "venet" compatible "tap" device to VBUS Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 12/17] venettap: " Gregory Haskins
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/net/vbus-enet.c |  249 +++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/venet.h   |   39 +++++++
 2 files changed, 275 insertions(+), 13 deletions(-)

diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
index e698b3f..8e96c9c 100644
--- a/drivers/net/vbus-enet.c
+++ b/drivers/net/vbus-enet.c
@@ -42,6 +42,8 @@ static int rx_ringlen = 256;
 module_param(rx_ringlen, int, 0444);
 static int tx_ringlen = 256;
 module_param(tx_ringlen, int, 0444);
+static int sg_enabled = 1;
+module_param(sg_enabled, int, 0444);
 
 #undef PDEBUG             /* undef it, just in case */
 #ifdef VBUS_ENET_DEBUG
@@ -64,8 +66,17 @@ struct vbus_enet_priv {
 	struct vbus_enet_queue     rxq;
 	struct vbus_enet_queue     txq;
 	struct tasklet_struct      txtask;
+	struct {
+		int                sg:1;
+		int                tso:1;
+		int                ufo:1;
+		int                tso6:1;
+		int                ecn:1;
+	} flags;
 };
 
+static void vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force);
+
 static struct vbus_enet_priv *
 napi_to_priv(struct napi_struct *napi)
 {
@@ -199,6 +210,93 @@ rx_teardown(struct vbus_enet_priv *priv)
 	}
 }
 
+static int
+tx_setup(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int i;
+	int ret;
+
+	if (!priv->flags.sg)
+		/*
+		 * There is nothing to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return 0;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty SG descriptor
+	 */
+	for (i = 0; i < tx_ringlen; i++) {
+		struct venet_sg *vsg;
+		size_t iovlen = sizeof(struct venet_iov) * (MAX_SKB_FRAGS-1);
+		size_t len = sizeof(*vsg) + iovlen;
+
+		vsg = kzalloc(len, GFP_KERNEL);
+		if (!vsg)
+			return -ENOMEM;
+
+		iter.desc->cookie = (u64)vsg;
+		iter.desc->len    = len;
+		iter.desc->ptr    = (u64)__pa(vsg);
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+	}
+
+	return 0;
+}
+
+static void
+tx_teardown(struct vbus_enet_priv *priv)
+{
+	struct ioq *ioq = priv->txq.queue;
+	struct ioq_iterator iter;
+	int ret;
+
+	/* forcefully free all outstanding transmissions */
+	vbus_enet_tx_reap(priv, 1);
+
+	if (!priv->flags.sg)
+		/*
+		 * There is nothing else to do for a ring that is not using
+		 * scatter-gather
+		 */
+		return;
+
+	ret = ioq_iter_init(ioq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/* seek to position 0 */
+	ret = ioq_iter_seek(&iter, ioq_seek_set, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * free each valid descriptor
+	 */
+	while (iter.desc->cookie) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+
+		iter.desc->valid = 0;
+		wmb();
+
+		iter.desc->ptr = 0;
+		iter.desc->cookie = 0;
+
+		ret = ioq_iter_seek(&iter, ioq_seek_next, 0, 0);
+		BUG_ON(ret < 0);
+
+		kfree(vsg);
+	}
+}
+
 /*
  * Open and close
  */
@@ -403,14 +501,67 @@ vbus_enet_tx_start(struct sk_buff *skb, struct net_device *dev)
 	BUG_ON(ret < 0);
 	BUG_ON(iter.desc->sown);
 
-	/*
-	 * We simply put the skb right onto the ring.  We will get an interrupt
-	 * later when the data has been consumed and we can reap the pointers
-	 * at that time
-	 */
-	iter.desc->cookie = (u64)skb;
-	iter.desc->len = (u64)skb->len;
-	iter.desc->ptr = (u64)__pa(skb->data);
+	if (priv->flags.sg) {
+		struct venet_sg *vsg = (struct venet_sg *)iter.desc->cookie;
+		struct scatterlist sgl[MAX_SKB_FRAGS+1];
+		struct scatterlist *sg;
+		int count, maxcount = ARRAY_SIZE(sgl);
+
+		sg_init_table(sgl, maxcount);
+
+		memset(vsg, 0, sizeof(*vsg));
+
+		vsg->cookie = (u64)skb;
+		vsg->len    = skb->len;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			vsg->flags      |= VENET_SG_FLAG_NEEDS_CSUM;
+			vsg->csum.start  = skb->csum_start - skb_headroom(skb);
+			vsg->csum.offset = skb->csum_offset;
+		}
+
+		if (skb_is_gso(skb)) {
+			struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+			vsg->flags |= VENET_SG_FLAG_GSO;
+
+			vsg->gso.hdrlen = skb_transport_header(skb) - skb->data;
+			vsg->gso.size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCPV4)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				vsg->gso.type = VENET_GSO_TYPE_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				vsg->gso.type = VENET_GSO_TYPE_UDP;
+			else
+				panic("Virtual-Ethernet: unknown GSO type " \
+				      "0x%x\n", sinfo->gso_type);
+
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				vsg->flags |= VENET_SG_FLAG_ECN;
+		}
+
+		count = skb_to_sgvec(skb, sgl, 0, skb->len);
+
+		BUG_ON(count > maxcount);
+
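+		/*
+		 * Translate each scatterlist entry into a venet_iov
+		 * (physical address + length) that the device side can
+		 * copy from directly.
+		 */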
+		for (sg = &sgl[0]; sg; sg = sg_next(sg)) {
+			struct venet_iov *iov = &vsg->iov[vsg->count++];
+
+			iov->len = sg->length;
+			iov->ptr = (u64)sg_phys(sg);
+		}
+
+	} else {
+		/*
+		 * non scatter-gather mode: simply put the skb right onto the
+		 * ring.
+		 */
+		iter.desc->cookie = (u64)skb;
+		iter.desc->len = (u64)skb->len;
+		iter.desc->ptr = (u64)__pa(skb->data);
+	}
+
 	iter.desc->valid  = 1;
 
 	priv->stats.tx_packets++;
@@ -466,7 +617,17 @@ vbus_enet_tx_reap(struct vbus_enet_priv *priv, int force)
 	 * owned by the south-side
 	 */
 	while (iter.desc->valid && (!iter.desc->sown || force)) {
-		struct sk_buff *skb = (struct sk_buff *)iter.desc->cookie;
+		struct sk_buff *skb;
+
+		if (priv->flags.sg) {
+			struct venet_sg *vsg;
+
+			vsg = (struct venet_sg *)iter.desc->cookie;
+			skb = (struct sk_buff *)vsg->cookie;
+
+		} else {
+			skb = (struct sk_buff *)iter.desc->cookie;
+		}
 
 		PDEBUG("%lld: completed sending %d bytes\n",
 		       priv->vdev->id, skb->len);
@@ -567,6 +728,47 @@ tx_isr(struct ioq_notifier *notifier)
 	tasklet_schedule(&priv->txtask);
 }
 
+static int
+vbus_enet_negcap(struct vbus_enet_priv *priv)
+{
+	int ret;
+	struct venet_capabilities caps;
+
+	memset(&caps, 0, sizeof(caps));
+
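+	/*
+	 * Advertise every feature we are willing to use; the device
+	 * returns the subset that it actually supports.
+	 */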
+	if (sg_enabled) {
+		caps.gid = VENET_CAP_GROUP_SG;
+		caps.bits |= (VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+			      |VENET_CAP_ECN|VENET_CAP_UFO);
+	}
+
+	ret = devcall(priv, VENET_FUNC_NEGCAP, &caps, sizeof(caps));
+	if (ret < 0)
+		return ret;
+
+	if (caps.bits & VENET_CAP_SG) {
+		priv->flags.sg = true;
+
+		if (caps.bits & VENET_CAP_TSO4)
+			priv->flags.tso = true;
+		if (caps.bits & VENET_CAP_TSO6)
+			priv->flags.tso6 = true;
+		if (caps.bits & VENET_CAP_UFO)
+			priv->flags.ufo = true;
+		if (caps.bits & VENET_CAP_ECN)
+			priv->flags.ecn = true;
+
+		printk(KERN_INFO "VBUSENET %lld: " \
+		       "Detected GSO features %s%s%s%s\n", priv->vdev->id,
+		       priv->flags.tso  ? "t" : "-",
+		       priv->flags.tso6 ? "T" : "-",
+		       priv->flags.ufo  ? "u" : "-",
+		       priv->flags.ecn  ? "e" : "-");
+	}
+
+	return 0;
+}
+
 static struct net_device_ops vbus_enet_netdev_ops = {
 	.ndo_open          = vbus_enet_open,
 	.ndo_stop          = vbus_enet_stop,
@@ -606,12 +808,21 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
 	priv->dev  = dev;
 	priv->vdev = vdev;
 
+	ret = vbus_enet_negcap(priv);
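+	/*
+	 * Negotiate features with the device up front:  tx_setup() below
+	 * keys off of the negotiated scatter-gather support.
+	 */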
+	if (ret < 0) {
+		printk(KERN_INFO "VENET: Error negotiating capabilities for " \
+		       "%lld\n",
+		       priv->vdev->id);
+		goto out_free;
+	}
+
 	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
 
 	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
 	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
 
 	rx_setup(priv);
+	tx_setup(priv);
 
 	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
 	ioq_notify_enable(priv->txq.queue, 0);
@@ -633,6 +844,22 @@ vbus_enet_probe(struct vbus_device_proxy *vdev)
 
 	dev->features |= NETIF_F_HIGHDMA;
 
+	if (priv->flags.sg) {
+		dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM|NETIF_F_FRAGLIST;
+
+		if (priv->flags.tso)
+			dev->features |= NETIF_F_TSO;
+
+		if (priv->flags.ufo)
+			dev->features |= NETIF_F_UFO;
+
+		if (priv->flags.tso6)
+			dev->features |= NETIF_F_TSO6;
+
+		if (priv->flags.ecn)
+			dev->features |= NETIF_F_TSO_ECN;
+	}
+
 	ret = register_netdev(dev);
 	if (ret < 0) {
 		printk(KERN_INFO "VENET: error %i registering device \"%s\"\n",
@@ -660,9 +887,9 @@ vbus_enet_remove(struct vbus_device_proxy *vdev)
 	napi_disable(&priv->napi);
 
 	rx_teardown(priv);
-	vbus_enet_tx_reap(priv, 1);
-
 	ioq_put(priv->rxq.queue);
+
+	tx_teardown(priv);
 	ioq_put(priv->txq.queue);
 
 	dev->ops->close(dev, 0);
diff --git a/include/linux/venet.h b/include/linux/venet.h
index ef6b199..1c96b90 100644
--- a/include/linux/venet.h
+++ b/include/linux/venet.h
@@ -35,8 +35,43 @@ struct venet_capabilities {
 	__u32 bits;
 };
 
-/* CAPABILITIES-GROUP 0 */
-/* #define VENET_CAP_FOO    0   (No capabilities defined yet, for now) */
+#define VENET_CAP_GROUP_SG 0
+
+/* CAPABILITIES-GROUP SG */
+#define VENET_CAP_SG     (1 << 0)
+#define VENET_CAP_TSO4   (1 << 1)
+#define VENET_CAP_TSO6   (1 << 2)
+#define VENET_CAP_ECN    (1 << 3)
+#define VENET_CAP_UFO    (1 << 4)
+
+struct venet_iov {
+	__u32 len;
+	__u64 ptr;
+};
+
+#define VENET_SG_FLAG_NEEDS_CSUM (1 << 0)
+#define VENET_SG_FLAG_GSO        (1 << 1)
+#define VENET_SG_FLAG_ECN        (1 << 2)
+
+struct venet_sg {
+	__u64            cookie;
+	__u32            flags;
+	__u32            len;     /* total length of all iovs */
+	struct {
+		__u16    start;	  /* csum starting position */
+		__u16    offset;  /* offset to place csum */
+	} csum;
+	struct {
+#define VENET_GSO_TYPE_TCPV4	0	/* IPv4 TCP (TSO) */
+#define VENET_GSO_TYPE_UDP	1	/* IPv4 UDP (UFO) */
+#define VENET_GSO_TYPE_TCPV6	2	/* IPv6 TCP */
+		__u8     type;
+		__u16    hdrlen;
+		__u16    size;
+	} gso;
+	__u32            count;   /* nr of iovs */
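+	/* must be last: allocated with enough trailing room for 'count' entries */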
+	struct venet_iov iov[1];
+};
 
 #define VENET_FUNC_LINKUP   0
 #define VENET_FUNC_LINKDOWN 1



* [RFC PATCH 12/17] venettap: add scatter-gather support
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (10 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 11/17] venet: add scatter-gather support Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 18:43 ` [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic Gregory Haskins
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/vbus/devices/venet-tap.c |  236 +++++++++++++++++++++++++++++++++++++-
 1 files changed, 229 insertions(+), 7 deletions(-)

diff --git a/drivers/vbus/devices/venet-tap.c b/drivers/vbus/devices/venet-tap.c
index ccce58e..0ccb7ed 100644
--- a/drivers/vbus/devices/venet-tap.c
+++ b/drivers/vbus/devices/venet-tap.c
@@ -80,6 +80,13 @@ enum {
 	TX_IOQ_CONGESTED,
 };
 
+struct venettap;
+
+struct venettap_rx_ops {
+	int (*decode)(struct venettap *priv, void *ptr, int len);
+	int (*import)(struct venettap *, struct sk_buff *, void *, int);
+};
+
 struct venettap {
 	spinlock_t                   lock;
 	unsigned char                hmac[ETH_ALEN]; /* host-mac */
@@ -107,6 +114,12 @@ struct venettap {
 		struct vbus_memctx          *ctx;
 		struct venettap_queue        rxq;
 		struct venettap_queue        txq;
+		struct venettap_rx_ops      *rx_ops;
+		struct {
+			struct venet_sg     *desc;
+			size_t               len;
+			int                  enabled:1;
+		} sg;
 		int                          connected:1;
 		int                          opened:1;
 		int                          link:1;
@@ -288,6 +301,183 @@ venettap_change_mtu(struct net_device *dev, int new_mtu)
 }
 
 /*
+ * ---------------------------
+ * Scatter-Gather support
+ * ---------------------------
+ */
+
+/* assumes reference to priv->vbus.conn held */
+static int
+venettap_sg_decode(struct venettap *priv, void *ptr, int len)
+{
+	struct venet_sg *vsg;
+	struct vbus_memctx *ctx;
+	int ret;
+
+	/*
+	 * SG is enabled, so we need to pull in the venet_sg
+	 * header before we can interpret the rest of the
+	 * packet
+	 *
+	 * FIXME: Make sure this is not too big
+	 */
+	if (unlikely(len > priv->vbus.sg.len)) {
+		kfree(priv->vbus.sg.desc);
+		priv->vbus.sg.desc = kzalloc(len, GFP_KERNEL);
+		if (!priv->vbus.sg.desc)
+			return -ENOMEM;
+		priv->vbus.sg.len = len;
+	}
+
+	vsg = priv->vbus.sg.desc;
+	ctx = priv->vbus.ctx;
+
+	ret = ctx->ops->copy_from(ctx, vsg, ptr, len);
+	BUG_ON(ret);
+
+	/*
+	 * Non GSO type packets should be constrained by the MTU setting
+	 * on the host
+	 */
+	if (!(vsg->flags & VENET_SG_FLAG_GSO)
+	    && (vsg->len > (priv->netif.dev->mtu + ETH_HLEN)))
+		return -1;
+
+	return vsg->len;
+}
+
+/*
+ * venettap_sg_import - import an skb in scatter-gather mode
+ *
+ * assumes reference to priv->vbus.conn held
+ */
+static int
+venettap_sg_import(struct venettap *priv, struct sk_buff *skb,
+		   void *ptr, int len)
+{
+	struct venet_sg *vsg = priv->vbus.sg.desc;
+	struct vbus_memctx *ctx = priv->vbus.ctx;
+	int remain = len;
+	int ret;
+	int i;
+
+	PDEBUG("Importing %d bytes in %d segments\n", len, vsg->count);
+
+	for (i = 0; i < vsg->count; i++) {
+		struct venet_iov *iov = &vsg->iov[i];
+
+		if (remain < iov->len)
+			return -EINVAL;
+
+		PDEBUG("Segment %d: %p/%d\n", i, iov->ptr, iov->len);
+
+		ret = ctx->ops->copy_from(ctx, skb_tail_pointer(skb),
+					 (void *)iov->ptr,
+					 iov->len);
+		if (ret)
+			return -EFAULT;
+
+		skb_put(skb, iov->len);
+		remain -= iov->len;
+	}
+
+	if (vsg->flags & VENET_SG_FLAG_NEEDS_CSUM
+	    && !skb_partial_csum_set(skb, vsg->csum.start, vsg->csum.offset))
+		return -EINVAL;
+
+	if (vsg->flags & VENET_SG_FLAG_GSO) {
+		struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+		PDEBUG("GSO packet detected\n");
+
+		switch (vsg->gso.type) {
+		case VENET_GSO_TYPE_TCPV4:
+			sinfo->gso_type = SKB_GSO_TCPV4;
+			break;
+		case VENET_GSO_TYPE_TCPV6:
+			sinfo->gso_type = SKB_GSO_TCPV6;
+			break;
+		case VENET_GSO_TYPE_UDP:
+			sinfo->gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			PDEBUG("Illegal GSO type: %d\n", vsg->gso.type);
+			priv->netif.stats.rx_frame_errors++;
+			kfree_skb(skb);
+			return -EINVAL;
+		}
+
+		if (vsg->flags & VENET_SG_FLAG_ECN)
+			sinfo->gso_type |= SKB_GSO_TCP_ECN;
+
+		sinfo->gso_size = vsg->gso.size;
+		if (skb_shinfo(skb)->gso_size == 0) {
+			PDEBUG("Illegal GSO size: %d\n", vsg->gso.size);
+			priv->netif.stats.rx_frame_errors++;
+			kfree_skb(skb);
+			return -EINVAL;
+		}
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+
+	return 0;
+}
+
+static struct venettap_rx_ops venettap_sg_rx_ops = {
+	.decode = venettap_sg_decode,
+	.import = venettap_sg_import,
+};
+
+/*
+ * ---------------------------
+ * Flat (non Scatter-Gather) support
+ * ---------------------------
+ */
+
+/* assumes reference to priv->vbus.conn held */
+static int
+venettap_flat_decode(struct venettap *priv, void *ptr, int len)
+{
+	size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
+
+	if (len > maxlen)
+		return -1;
+
+	/*
+	 * If SG is *not* enabled, the length is simply the
+	 * descriptor length
+	 */
+
+	return len;
+}
+
+/*
+ * venettap_rx_flat - import an skb in non scatter-gather mode
+ *
+ * assumes reference to priv->vbus.conn held
+ */
+static int
+venettap_flat_import(struct venettap *priv, struct sk_buff *skb,
+		     void *ptr, int len)
+{
+	struct vbus_memctx *ctx = priv->vbus.ctx;
+	int ret;
+
+	ret = ctx->ops->copy_from(ctx, skb_tail_pointer(skb), ptr, len);
+	if (ret)
+		return -EFAULT;
+
+	skb_put(skb, len);
+
+	return 0;
+}
+
+static struct venettap_rx_ops venettap_flat_rx_ops = {
+	.decode = venettap_flat_decode,
+	.import = venettap_flat_import,
+};
+
+/*
  * The poll implementation.
  */
 static int
@@ -301,6 +491,7 @@ venettap_rx(struct venettap *priv)
 	int                         ret;
 	unsigned long               flags;
 	struct vbus_connection     *conn;
+	struct venettap_rx_ops     *rx_ops;
 
 	PDEBUG("polling...\n");
 
@@ -324,6 +515,8 @@ venettap_rx(struct venettap *priv)
 	ioq = priv->vbus.rxq.queue;
 	ctx = priv->vbus.ctx;
 
+	rx_ops = priv->vbus.rx_ops;
+
 	spin_unlock_irqrestore(&priv->lock, flags);
 
 	/* We want to iterate on the head of the in-use index */
@@ -338,11 +531,14 @@ venettap_rx(struct venettap *priv)
 	 * the north side
 	 */
 	while (priv->vbus.link && iter.desc->sown) {
-		size_t len = iter.desc->len;
-		size_t maxlen = priv->netif.dev->mtu + ETH_HLEN;
 		struct sk_buff *skb = NULL;
+		ssize_t len;
 
-		if (unlikely(len > maxlen)) {
+		len = rx_ops->decode(priv,
+				     (void *)iter.desc->ptr,
+				     iter.desc->len);
+
+		if (unlikely(len < 0)) {
 			priv->netif.stats.rx_errors++;
 			priv->netif.stats.rx_length_errors++;
 			goto next;
@@ -360,10 +556,8 @@ venettap_rx(struct venettap *priv)
 		/* align IP on 16B boundary */
 		skb_reserve(skb, 2);
 
-		ret = ctx->ops->copy_from(ctx, skb->data,
-					 (void *)iter.desc->ptr,
-					 len);
-		if (unlikely(ret)) {
+		ret = rx_ops->import(priv, skb, (void *)iter.desc->ptr, len);
+		if (unlikely(ret < 0)) {
 			priv->netif.stats.rx_errors++;
 			goto next;
 		}
@@ -843,6 +1037,23 @@ venettap_macquery(struct venettap *priv, void *data, unsigned long len)
 	return 0;
 }
 
+static u32
+venettap_negcap_sg(struct venettap *priv, u32 requested)
+{
+	u32 available = VENET_CAP_SG|VENET_CAP_TSO4|VENET_CAP_TSO6
+		|VENET_CAP_ECN;
+	u32 ret;
+
+	ret = available & requested;
+
+	if (ret & VENET_CAP_SG) {
+		priv->vbus.sg.enabled = true;
+		priv->vbus.rx_ops = &venettap_sg_rx_ops;
+	}
+
+	return ret;
+}
+
 /*
  * Negotiate Capabilities - This function is provided so that the
  * interface may be extended without breaking ABI compatability
@@ -870,6 +1081,9 @@ venettap_negcap(struct venettap *priv, void *data, unsigned long len)
 		return -EFAULT;
 
 	switch (caps.gid) {
+	case VENET_CAP_GROUP_SG:
+		caps.bits = venettap_negcap_sg(priv, caps.bits);
+		break;
 	default:
 		caps.bits = 0;
 		break;
@@ -1037,6 +1251,12 @@ venettap_vlink_release(struct vbus_connection *conn)
 	vbus_memctx_put(priv->vbus.ctx);
 
 	kobject_put(priv->vbus.dev.kobj);
+
+	priv->vbus.sg.enabled = false;
+	priv->vbus.rx_ops = &venettap_flat_rx_ops;
+	kfree(priv->vbus.sg.desc);
+	priv->vbus.sg.desc = NULL;
+	priv->vbus.sg.len = 0;
 }
 
 static struct vbus_connection_ops venettap_vbus_link_ops = {
@@ -1315,6 +1535,8 @@ venettap_device_create(struct vbus_devclass *dc,
 	_vdev->ops             = &venettap_device_ops;
 	_vdev->attrs           = &venettap_attr_group;
 
+	priv->vbus.rx_ops      = &venettap_flat_rx_ops;
+
 	/*
 	 * netif init
 	 */



* [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (11 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 12/17] venettap: " Gregory Haskins
@ 2009-03-31 18:43 ` Gregory Haskins
  2009-03-31 19:16   ` Alan Cox
  2009-03-31 18:44 ` [RFC PATCH 14/17] kvm: add a reset capability Gregory Haskins
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

The ioapic code currently manages the mapping between irq and vector
privately.  This results in some layering violations, as support for
certain MSI operations needs this information; consequently, the MSI
code itself was moved into the ioapic module, which is not really
optimal.

We now have another need to gain access to the vector assignment on
x86.  However, rather than put yet another inappropriately placed
function into io-apic, let's create a way to export this simple data
and thereby allow the logic to sit closer to where it belongs.

Ideally we should abstract the entire notion of irq->vector management
out of io-apic, but we leave that as an exercise for another day.
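
As a rough illustration, here is a minimal sketch of how a consumer of
these exports might look.  The caller (example_irq_to_routing) is
hypothetical; only irq_to_vector() and set_irq_affinity() come from this
patch, and the cpumask handling mirrors the guest-side code later in the
series:

#include <linux/cpumask.h>
#include <asm/irq.h>

/*
 * Hypothetical consumer: bind an irq to a given cpu and look up the IDT
 * vector the ioapic layer assigned to it, so the (irq, vector, cpu)
 * tuple can be handed to an interested party (e.g. a hypervisor-side
 * router).  Only the two exports added by this patch are assumed.
 */
static int example_irq_to_routing(int irq, int cpu)
{
	int vector;

#ifdef CONFIG_SMP
	if (set_irq_affinity(irq, *get_cpu_mask(cpu)) < 0)
		return -EINVAL;
#endif

	vector = irq_to_vector(irq);
	if (vector < 0)
		return vector;	/* -ENOENT: no vector assigned yet */

	return vector;
}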

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/include/asm/irq.h |    6 ++++++
 arch/x86/kernel/io_apic.c  |   25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 592688e..b1726d8 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -40,6 +40,12 @@ extern unsigned int do_IRQ(struct pt_regs *regs);
 extern void init_IRQ(void);
 extern void native_init_IRQ(void);
 
+#ifdef CONFIG_SMP
+extern int set_irq_affinity(int irq, cpumask_t mask);
+#endif
+
+extern int irq_to_vector(int irq);
+
 /* Interrupt vector management */
 extern DECLARE_BITMAP(used_vectors, NR_VECTORS);
 extern int vector_used_by_percpu_irq(unsigned int vector);
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index bc7ac4d..86a2c36 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -614,6 +614,14 @@ set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
 
 	set_ioapic_affinity_irq_desc(desc, mask);
 }
+
+int set_irq_affinity(int irq, cpumask_t mask)
+{
+	set_ioapic_affinity_irq(irq, &mask);
+
+	return 0;
+}
+
 #endif /* CONFIG_SMP */
 
 /*
@@ -3249,6 +3257,23 @@ void destroy_irq(unsigned int irq)
 	spin_unlock_irqrestore(&vector_lock, flags);
 }
 
+int irq_to_vector(int irq)
+{
+	struct irq_cfg *cfg;
+	unsigned long flags;
+	int ret = -ENOENT;
+
+	spin_lock_irqsave(&vector_lock, flags);
+
+	cfg = irq_cfg(irq);
+	if (cfg && cfg->vector != 0)
+		ret = cfg->vector;
+
+	spin_unlock_irqrestore(&vector_lock, flags);
+
+	return ret;
+}
+
 /*
  * MSI message composition
  */



* [RFC PATCH 14/17] kvm: add a reset capability
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (12 preceding siblings ...)
  2009-03-31 18:43 ` [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic Gregory Haskins
@ 2009-03-31 18:44 ` Gregory Haskins
  2009-03-31 19:22   ` Avi Kivity
  2009-03-31 18:44 ` [RFC PATCH 15/17] kvm: add dynamic IRQ support Gregory Haskins
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Later in the series we will need a way to detect when a VM is reset,
so let's add a capability that allows userspace to signal a VM reset
down to the kernel.
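
For reference, the intended userspace usage is a single ioctl on the VM
fd after the VMM performs its reset.  The helper below is a hypothetical
sketch (vmm_signal_reset, sys_fd and vm_fd are illustrative names for
the usual /dev/kvm and VM descriptors):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Tell the kernel that the VMM has just reset the guest, if supported. */
static int vmm_signal_reset(int sys_fd, int vm_fd)
{
	if (ioctl(sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_RESET) <= 0)
		return 0;	/* capability absent: nothing to notify */

	return ioctl(vm_fd, KVM_RESET, 0);
}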

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/kvm/x86.c       |    1 +
 include/linux/kvm.h      |    2 ++
 include/linux/kvm_host.h |    6 ++++++
 virt/kvm/kvm_main.c      |   36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 758b7a1..9b0a649 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -971,6 +971,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_NOP_IO_DELAY:
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_SYNC_MMU:
+	case KVM_CAP_RESET:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 0424326..7ffd8f5 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -396,6 +396,7 @@ struct kvm_trace_rec {
 #ifdef __KVM_HAVE_USER_NMI
 #define KVM_CAP_USER_NMI 22
 #endif
+#define KVM_CAP_RESET 23
 
 /*
  * ioctls for VM fds
@@ -429,6 +430,7 @@ struct kvm_trace_rec {
 				   struct kvm_assigned_pci_dev)
 #define KVM_ASSIGN_IRQ _IOR(KVMIO, 0x70, \
 			    struct kvm_assigned_irq)
+#define KVM_RESET	          _IO(KVMIO,  0x67)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bf6f703..506eca1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -17,6 +17,7 @@
 #include <linux/preempt.h>
 #include <linux/marker.h>
 #include <linux/msi.h>
+#include <linux/notifier.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -132,6 +133,8 @@ struct kvm {
 	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
 #endif
+
+	struct raw_notifier_head reset_notifier; /* triggers when VM reboots */
 };
 
 /* The guest did something we don't support. */
@@ -158,6 +161,9 @@ void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
 
+int kvm_reset_notifier_register(struct kvm *kvm, struct notifier_block *nb);
+int kvm_reset_notifier_unregister(struct kvm *kvm, struct notifier_block *nb);
+
 #define HPA_MSB ((sizeof(hpa_t) * 8) - 1)
 #define HPA_ERR_MASK ((hpa_t)1 << HPA_MSB)
 static inline int is_error_hpa(hpa_t hpa) { return hpa >> HPA_MSB; }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 29a667c..fca2d25 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -868,6 +868,8 @@ static struct kvm *kvm_create_vm(void)
 #ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
 	kvm_coalesced_mmio_init(kvm);
 #endif
+	RAW_INIT_NOTIFIER_HEAD(&kvm->reset_notifier);
+
 out:
 	return kvm;
 }
@@ -1485,6 +1487,35 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	}
 }
 
+static void kvm_notify_reset(struct kvm *kvm)
+{
+	mutex_lock(&kvm->lock);
+	raw_notifier_call_chain(&kvm->reset_notifier, 0, kvm);
+	mutex_unlock(&kvm->lock);
+}
+
+int kvm_reset_notifier_register(struct kvm *kvm, struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&kvm->lock);
+	ret = raw_notifier_chain_register(&kvm->reset_notifier, nb);
+	mutex_unlock(&kvm->lock);
+
+	return ret;
+}
+
+int kvm_reset_notifier_unregister(struct kvm *kvm, struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&kvm->lock);
+	ret = raw_notifier_chain_unregister(&kvm->reset_notifier, nb);
+	mutex_unlock(&kvm->lock);
+
+	return ret;
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
@@ -1929,6 +1960,11 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_RESET: {
+		kvm_notify_reset(kvm);
+		r = 0;
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}



* [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (13 preceding siblings ...)
  2009-03-31 18:44 ` [RFC PATCH 14/17] kvm: add a reset capability Gregory Haskins
@ 2009-03-31 18:44 ` Gregory Haskins
  2009-03-31 19:20   ` Avi Kivity
  2009-03-31 18:44 ` [RFC PATCH 16/17] kvm: Add VBUS support to the host Gregory Haskins
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This patch provides the ability to dynamically declare and map an
interrupt-request handle to an x86 8-bit vector.

Problem Statement: Emulated devices (such as PCI, ISA, etc.) have
their interrupt routing done via standard PC mechanisms (MP-table,
ACPI, etc.).  However, we also want to support a new class of devices
which exist in a new virtualized namespace and therefore should not
try to piggyback on these emulated mechanisms.  Rather, we create a
way to dynamically register interrupt resources that act
independently of their emulated counterparts.

On x86, a simplistic view of the interrupt model is that each core
has a local-APIC which can receive messages from APIC-compliant
routing devices (such as the IO-APIC and MSI) regarding details about
an interrupt (such as which vector to raise).  These routing devices
are controlled by the OS so they may translate a physical event
(such as "e1000: raise an RX interrupt") to a logical destination
(such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
implementation of such a router (think of it as a virtual-MSI, but
without the coupling to an existing standard, such as PCI).

The model is simple: a guest OS can allocate the mapping of "IRQ"
handle to "vector/core" in any way it sees fit, and provide this
information to the dynirq module running in the host.  The assigned
IRQ then becomes the sole handle needed to inject an IDT vector
into the guest from the host.  A host entity that wishes to raise an
interrupt simply needs to call kvm_inject_dynirq(kvm, irq) and the
routing is performed transparently.
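
As a rough guest-side sketch (the driver code below is hypothetical;
create_kvm_dynirq()/destroy_kvm_dynirq() are the interfaces added by
this patch, everything else is stock kernel API):

#include <linux/interrupt.h>
#include <linux/kvm_guest.h>

static irqreturn_t example_isr(int irq, void *dev_id)
{
	/* ... service the virtual device that raised this dynirq ... */
	return IRQ_HANDLED;
}

/* Allocate a dynirq routed to vcpu 'cpu' and attach a handler to it. */
static int example_attach(int cpu, void *priv)
{
	int irq = create_kvm_dynirq(cpu);

	if (irq < 0)
		return irq;

	if (request_irq(irq, example_isr, 0, "example-dynirq", priv)) {
		destroy_kvm_dynirq(irq);
		return -EBUSY;
	}

	/* the host side may now inject via kvm_inject_dynirq(kvm, irq) */
	return irq;
}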

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig                |    5 +
 arch/x86/Makefile               |    3 
 arch/x86/include/asm/kvm_host.h |    9 +
 arch/x86/include/asm/kvm_para.h |   11 +
 arch/x86/kvm/Makefile           |    3 
 arch/x86/kvm/dynirq.c           |  329 +++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/guest/Makefile     |    2 
 arch/x86/kvm/guest/dynirq.c     |   95 +++++++++++
 arch/x86/kvm/x86.c              |    6 +
 include/linux/kvm.h             |    1 
 include/linux/kvm_guest.h       |    7 +
 include/linux/kvm_host.h        |    1 
 include/linux/kvm_para.h        |    1 
 13 files changed, 472 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/kvm/dynirq.c
 create mode 100644 arch/x86/kvm/guest/Makefile
 create mode 100644 arch/x86/kvm/guest/dynirq.c
 create mode 100644 include/linux/kvm_guest.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3fca247..91fefd5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -446,6 +446,11 @@ config KVM_GUEST
 	 This option enables various optimizations for running under the KVM
 	 hypervisor.
 
+config KVM_GUEST_DYNIRQ
+       bool "KVM Dynamic IRQ support"
+       depends on KVM_GUEST
+       default y
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index d1a47ad..d788815 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -147,6 +147,9 @@ core-$(CONFIG_XEN) += arch/x86/xen/
 # lguest paravirtualization support
 core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/
 
+# kvm paravirtualization support
+core-$(CONFIG_KVM_GUEST) += arch/x86/kvm/guest/
+
 core-y += arch/x86/kernel/
 core-y += arch/x86/mm/
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 730843d..9ae398a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -346,6 +346,12 @@ struct kvm_mem_alias {
 	gfn_t target_gfn;
 };
 
+struct kvm_dynirq {
+	spinlock_t lock;
+	struct rb_root map;
+	struct kvm *kvm;
+};
+
 struct kvm_arch{
 	int naliases;
 	struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -363,6 +369,7 @@ struct kvm_arch{
 	struct iommu_domain *iommu_domain;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
+	struct kvm_dynirq *dynirq;
 	struct kvm_pit *vpit;
 	struct hlist_head irq_ack_notifier_list;
 	int vapics_in_nmi_mode;
@@ -519,6 +526,8 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 			  const void *val, int bytes);
 int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
 		  gpa_t addr, unsigned long *ret);
+int kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len);
+void kvm_free_dynirq(struct kvm *kvm);
 
 extern bool tdp_enabled;
 
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index b8a3305..fba210e 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -13,6 +13,7 @@
 #define KVM_FEATURE_CLOCKSOURCE		0
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
+#define KVM_FEATURE_DYNIRQ		3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
@@ -45,6 +46,16 @@ struct kvm_mmu_op_release_pt {
 	__u64 pt_phys;
 };
 
+/* Operations for KVM_HC_DYNIRQ */
+#define KVM_DYNIRQ_OP_SET   1
+#define KVM_DYNIRQ_OP_CLEAR 2
+
+struct kvm_dynirq_set {
+	__u32 irq;
+	__u32 vec;  /* x86 IDT vector */
+	__u32 dest; /* 0-based vcpu id */
+};
+
 #ifdef __KERNEL__
 #include <asm/processor.h>
 
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d3ec292..d5676f5 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -14,9 +14,10 @@ endif
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
-	i8254.o
+	i8254.o dynirq.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
 kvm-amd-objs = svm.o
 obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+
diff --git a/arch/x86/kvm/dynirq.c b/arch/x86/kvm/dynirq.c
new file mode 100644
index 0000000..54162dd
--- /dev/null
+++ b/arch/x86/kvm/dynirq.c
@@ -0,0 +1,329 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Dynamic-Interrupt-Request (dynirq): This module provides the ability
+ * to dynamically declare and map an interrupt-request handle to an
+ * x86 8-bit vector.
+ *
+ * Problem Statement: Emulated devices (such as PCI, ISA, etc) have
+ * interrupt routing done via standard PC mechanisms (MP-table, ACPI,
+ * etc).  However, we also want to support a new class of devices
+ * which exist in a new virtualized namespace and therefore should
+ * not try to piggyback on these emulated mechanisms.  Rather, we
+ * create a way to dynamically register interrupt resources that
+ * act independently of their emulated counterparts.
+ *
+ * On x86, a simplistic view of the interrupt model is that each core
+ * has a local-APIC which can receive messages from APIC-compliant
+ * routing devices (such as IO-APIC and MSI) regarding details about
+ * an interrupt (such as which vector to raise).  These routing devices
+ * are controlled by the OS so they may translate a physical event
+ * (such as "e1000: raise an RX interrupt") to a logical destination
+ * (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
+ * implementation of such a router (think of it as a virtual-MSI, but
+ * without the coupling to an existing standard, such as PCI).
+ *
+ * The model is simple: A guest OS can allocate the mapping of "IRQ"
+ * handle to "vector/core" in any way it sees fit, and provide this
+ * information to the dynirq module running in the host.  The assigned
+ * IRQ then becomes the sole handle needed to inject an IDT vector
+ * to the guest from a host.  A host entity that wishes to raise an
+ * interrupt simply needs to call kvm_inject_dynirq(kvm, irq) and the routing
+ * is performed transparently.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/mutex.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/workqueue.h>
+#include <linux/hardirq.h>
+
+#include "lapic.h"
+
+struct dynirq {
+	struct kvm_dynirq *parent;
+	unsigned int       irq;
+	unsigned short     vec;
+	unsigned int       dest;
+	struct rb_node     node;
+	struct work_struct work;
+};
+
+static inline struct dynirq *
+to_dynirq(struct rb_node *node)
+{
+	return node ? container_of(node, struct dynirq, node) : NULL;
+}
+
+static int
+map_add(struct rb_root *root, struct dynirq *entry)
+{
+	int		ret = 0;
+	struct rb_node **new, *parent = NULL;
+	struct rb_node *node = &entry->node;
+
+	new  = &(root->rb_node);
+
+	/* Figure out where to put new node */
+	while (*new) {
+		int val;
+
+		parent = *new;
+
+		val = to_dynirq(node)->irq - to_dynirq(*new)->irq;
+		if (val < 0)
+			new = &((*new)->rb_left);
+		else if (val > 0)
+			new = &((*new)->rb_right);
+		else {
+			ret = -EEXIST;
+			break;
+		}
+	}
+
+	if (!ret) {
+		/* Add new node and rebalance tree. */
+		rb_link_node(node, parent, new);
+		rb_insert_color(node, root);
+	}
+
+	return ret;
+}
+
+static struct dynirq *
+map_find(struct rb_root *root, unsigned int key)
+{
+	struct rb_node *node;
+
+	node = root->rb_node;
+
+	while (node) {
+		int val;
+
+		val = key - to_dynirq(node)->irq;
+		if (val < 0)
+			node = node->rb_left;
+		else if (val > 0)
+			node = node->rb_right;
+		else
+			break;
+	}
+
+	return to_dynirq(node);
+}
+
+static void
+dynirq_add(struct kvm_dynirq *dynirq, struct dynirq *entry)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+	ret = map_add(&dynirq->map, entry);
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+}
+
+static struct dynirq *
+dynirq_find(struct kvm_dynirq *dynirq, int irq)
+{
+	struct dynirq *entry;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+	entry = map_find(&dynirq->map, irq);
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+
+	return entry;
+}
+
+static int
+_kvm_inject_dynirq(struct kvm *kvm, struct dynirq *entry)
+{
+	struct kvm_vcpu *vcpu;
+	int ret;
+
+	mutex_lock(&kvm->lock);
+
+	vcpu = kvm->vcpus[entry->dest];
+	if (!vcpu) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = kvm_apic_set_irq(vcpu, entry->vec, 1);
+
+out:
+	mutex_unlock(&kvm->lock);
+
+	return ret;
+}
+
+static void
+deferred_inject_dynirq(struct work_struct *work)
+{
+	struct dynirq *entry = container_of(work, struct dynirq, work);
+	struct kvm_dynirq *dynirq = entry->parent;
+	struct kvm *kvm = dynirq->kvm;
+
+	_kvm_inject_dynirq(kvm, entry);
+}
+
+int
+kvm_inject_dynirq(struct kvm *kvm, int irq)
+{
+	struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+	struct dynirq *entry;
+
+	entry = dynirq_find(dynirq, irq);
+	if (!entry)
+		return -EINVAL;
+
+	if (preemptible())
+		return _kvm_inject_dynirq(kvm, entry);
+
+	schedule_work(&entry->work);
+	return 0;
+}
+
+static int
+hc_set(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+	struct kvm_dynirq_set args;
+	struct kvm_dynirq    *dynirq = vcpu->kvm->arch.dynirq;
+	struct dynirq        *entry;
+	int                   ret;
+
+	if (len != sizeof(args))
+		return -EINVAL;
+
+	ret = kvm_read_guest(vcpu->kvm, gpa, &args, len);
+	if (ret < 0)
+		return ret;
+
+	if (args.dest >= KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	entry = dynirq_find(dynirq, args.irq);
+	if (!entry) {
+		entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+		if (!entry)
+			return -ENOMEM;
+		entry->parent = dynirq;
+		INIT_WORK(&entry->work, deferred_inject_dynirq);
+	} else
+		rb_erase(&entry->node, &dynirq->map);
+
+	entry->irq  = args.irq;
+	entry->vec  = args.vec;
+	entry->dest = args.dest;
+
+	dynirq_add(dynirq, entry);
+
+	return 0;
+}
+
+static int
+hc_clear(struct kvm_vcpu *vcpu, gpa_t gpa, size_t len)
+{
+	struct kvm_dynirq *dynirq = vcpu->kvm->arch.dynirq;
+	struct dynirq *entry;
+	unsigned long flags;
+	u32 irq;
+	int ret;
+
+	if (len != sizeof(irq))
+		return -EINVAL;
+
+	ret = kvm_read_guest(vcpu->kvm, gpa, &irq, len);
+	if (ret < 0)
+		return ret;
+
+	spin_lock_irqsave(&dynirq->lock, flags);
+
+	entry = map_find(&dynirq->map, irq);
+	if (entry)
+		rb_erase(&entry->node, &dynirq->map);
+
+	spin_unlock_irqrestore(&dynirq->lock, flags);
+
+	if (!entry)
+		return -ENOENT;
+
+	kfree(entry);
+	return 0;
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
+ */
+int
+kvm_dynirq_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+	int ret = -EINVAL;
+
+	mutex_lock(&vcpu->kvm->lock);
+
+	if (unlikely(!vcpu->kvm->arch.dynirq)) {
+		struct kvm_dynirq *dynirq;
+
+		dynirq = kzalloc(sizeof(*dynirq), GFP_KERNEL);
+		if (!dynirq) {
+			mutex_unlock(&vcpu->kvm->lock);
+			return -ENOMEM;
+		}
+
+		spin_lock_init(&dynirq->lock);
+		dynirq->map = RB_ROOT;
+		dynirq->kvm = vcpu->kvm;
+		vcpu->kvm->arch.dynirq = dynirq;
+	}
+
+	switch (nr) {
+	case KVM_DYNIRQ_OP_SET:
+		ret = hc_set(vcpu, gpa, len);
+		break;
+	case KVM_DYNIRQ_OP_CLEAR:
+		ret = hc_clear(vcpu, gpa, len);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	mutex_unlock(&vcpu->kvm->lock);
+
+	return ret;
+}
+
+void
+kvm_free_dynirq(struct kvm *kvm)
+{
+	struct kvm_dynirq *dynirq = kvm->arch.dynirq;
+	struct rb_node *node;
+
+	while ((node = rb_first(&dynirq->map))) {
+		struct dynirq *entry = to_dynirq(node);
+
+		rb_erase(node, &dynirq->map);
+		kfree(entry);
+	}
+
+	kfree(dynirq);
+}
diff --git a/arch/x86/kvm/guest/Makefile b/arch/x86/kvm/guest/Makefile
new file mode 100644
index 0000000..de8f824
--- /dev/null
+++ b/arch/x86/kvm/guest/Makefile
@@ -0,0 +1,2 @@
+
+obj-$(CONFIG_KVM_GUEST_DYNIRQ) += dynirq.o
\ No newline at end of file
diff --git a/arch/x86/kvm/guest/dynirq.c b/arch/x86/kvm/guest/dynirq.c
new file mode 100644
index 0000000..a5cf55e
--- /dev/null
+++ b/arch/x86/kvm/guest/dynirq.c
@@ -0,0 +1,95 @@
+#include <linux/module.h>
+#include <linux/irq.h>
+#include <linux/kvm.h>
+#include <linux/kvm_para.h>
+
+#include <asm/irq.h>
+#include <asm/apic.h>
+
+/*
+ * -----------------------
+ * Dynamic-IRQ support
+ * -----------------------
+ */
+
+static int dynirq_set(int irq, int dest)
+{
+	struct kvm_dynirq_set op = {
+		.irq  = irq,
+		.vec  = irq_to_vector(irq),
+		.dest = dest,
+	};
+
+	return kvm_hypercall3(KVM_HC_DYNIRQ, KVM_DYNIRQ_OP_SET,
+			      __pa(&op), sizeof(op));
+}
+
+static void dynirq_chip_noop(unsigned int irq)
+{
+}
+
+static void dynirq_chip_eoi(unsigned int irq)
+{
+	ack_APIC_irq();
+}
+
+struct irq_chip kvm_irq_chip = {
+	.name		= "KVM-DYNIRQ",
+	.mask		= dynirq_chip_noop,
+	.unmask		= dynirq_chip_noop,
+	.eoi		= dynirq_chip_eoi,
+};
+
+int create_kvm_dynirq(int cpu)
+{
+	const cpumask_t *mask = get_cpu_mask(cpu);
+	int irq;
+	int ret;
+
+	ret = kvm_para_has_feature(KVM_FEATURE_DYNIRQ);
+	if (!ret)
+		return -ENOENT;
+
+	irq = create_irq();
+	if (irq < 0)
+		return -ENOSPC;
+
+#ifdef CONFIG_SMP
+	ret = set_irq_affinity(irq, *mask);
+	if (ret < 0)
+		goto error;
+#endif
+
+	set_irq_chip_and_handler_name(irq,
+				      &kvm_irq_chip,
+				      handle_percpu_irq,
+				      "apiceoi");
+
+	ret = dynirq_set(irq, cpu);
+	if (ret < 0)
+		goto error;
+
+	return irq;
+
+error:
+	destroy_irq(irq);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(create_kvm_dynirq);
+
+int destroy_kvm_dynirq(int irq)
+{
+	__u32 _irq = irq;
+
+	if (kvm_para_has_feature(KVM_FEATURE_DYNIRQ))
+		kvm_hypercall3(KVM_HC_DYNIRQ,
+			       KVM_DYNIRQ_OP_CLEAR,
+			       __pa(&_irq),
+			       sizeof(_irq));
+
+	destroy_irq(irq);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(destroy_kvm_dynirq);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b0a649..e24f0a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -972,6 +972,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_MP_STATE:
 	case KVM_CAP_SYNC_MMU:
 	case KVM_CAP_RESET:
+	case KVM_CAP_DYNIRQ:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2684,6 +2685,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_MMU_OP:
 		r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
 		break;
+	case KVM_HC_DYNIRQ:
+		ret = kvm_dynirq_hc(vcpu, a0, a1, a2);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
@@ -4141,6 +4145,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_free_pit(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
+	if (kvm->arch.dynirq)
+		kvm_free_dynirq(kvm);
 	kvm_free_vcpus(kvm);
 	kvm_free_physmem(kvm);
 	if (kvm->arch.apic_access_page)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 7ffd8f5..349d273 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -397,6 +397,7 @@ struct kvm_trace_rec {
 #define KVM_CAP_USER_NMI 22
 #endif
 #define KVM_CAP_RESET 23
+#define KVM_CAP_DYNIRQ 24
 
 /*
  * ioctls for VM fds
diff --git a/include/linux/kvm_guest.h b/include/linux/kvm_guest.h
new file mode 100644
index 0000000..7dd7930
--- /dev/null
+++ b/include/linux/kvm_guest.h
@@ -0,0 +1,7 @@
+#ifndef __LINUX_KVM_GUEST_H
+#define __LINUX_KVM_GUEST_H
+
+extern int create_kvm_dynirq(int cpu);
+extern int destroy_kvm_dynirq(int irq);
+
+#endif /* __LINUX_KVM_GUEST_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 506eca1..bec9b35 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -297,6 +297,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *v);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
+int kvm_inject_dynirq(struct kvm *kvm, int irq);
 
 int kvm_is_mmio_pfn(pfn_t pfn);
 
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 3ddce03..a2de904 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -16,6 +16,7 @@
 
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
+#define KVM_HC_DYNIRQ			3
 
 /*
  * hypercalls use architecture specific



* [RFC PATCH 16/17] kvm: Add VBUS support to the host
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (14 preceding siblings ...)
  2009-03-31 18:44 ` [RFC PATCH 15/17] kvm: add dynamic IRQ support Gregory Haskins
@ 2009-03-31 18:44 ` Gregory Haskins
  2009-03-31 18:44 ` [RFC PATCH 17/17] kvm: Add guest-side support for VBUS Gregory Haskins
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This patch adds support for guest access to a VBUS assigned to the same
context as the VM.  It utilizes an IOQ+IRQ to move events from host to
guest, and provides a hypercall interface to move events from guest to
host.
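
For orientation, here is a sketch of the guest->host direction (the
guest code is hypothetical; the real guest-side driver follows in the
next patch).  The hypercall envelope matches kvm_vbus_hc() below:
operation in a0, guest-physical pointer to the arguments in a1, and the
argument length in a2:

#include <linux/kvm_para.h>
#include <asm/page.h>

/* Open the bus by handing the host our magic/version/capabilities. */
static int example_vbus_open(void)
{
	struct kvm_vbus_busopen args = {
		.magic        = KVM_VBUS_MAGIC,
		.version      = KVM_VBUS_VERSION,
		.capabilities = 0,
	};

	if (!kvm_para_has_feature(KVM_FEATURE_VBUS))
		return -ENOENT;

	return kvm_hypercall3(KVM_HC_VBUS, KVM_VBUS_OP_BUSOPEN,
			      __pa(&args), sizeof(args));
}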

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/include/asm/kvm_para.h |    1 
 arch/x86/kvm/Kconfig            |    9 
 arch/x86/kvm/Makefile           |    3 
 arch/x86/kvm/x86.c              |    6 
 arch/x86/kvm/x86.h              |   12 
 include/linux/kvm.h             |    1 
 include/linux/kvm_host.h        |   20 +
 include/linux/kvm_para.h        |   59 ++
 virt/kvm/kvm_main.c             |    1 
 virt/kvm/vbus.c                 | 1307 +++++++++++++++++++++++++++++++++++++++
 10 files changed, 1419 insertions(+), 0 deletions(-)
 create mode 100644 virt/kvm/vbus.c

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index fba210e..19d81e0 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -14,6 +14,7 @@
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
 #define KVM_FEATURE_DYNIRQ		3
+#define KVM_FEATURE_VBUS                4
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b81125f..875e96e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,15 @@ config KVM_TRACE
 	  relayfs.  Note the ABI is not considered stable and will be
 	  modified in future updates.
 
+config KVM_HOST_VBUS
+       bool "KVM virtual-bus (VBUS) host-side support"
+       depends on KVM
+       select VBUS
+       default n
+       ---help---
+          This option enables host-side support for accessing virtual-bus
+	  devices.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/lguest/Kconfig
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d5676f5..f749ec9 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -15,6 +15,9 @@ EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
 	i8254.o dynirq.o
+ifeq ($(CONFIG_KVM_HOST_VBUS),y)
+kvm-objs += $(addprefix ../../../virt/kvm/, vbus.o)
+endif
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e24f0a5..2369d84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -996,6 +996,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_CLOCKSOURCE:
 		r = boot_cpu_has(X86_FEATURE_CONSTANT_TSC);
 		break;
+	case KVM_CAP_VBUS:
+		r = kvm_vbus_support();
+		break;
 	default:
 		r = 0;
 		break;
@@ -2688,6 +2691,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	case KVM_HC_DYNIRQ:
 		ret = kvm_dynirq_hc(vcpu, a0, a1, a2);
 		break;
+	case KVM_HC_VBUS:
+		ret = kvm_vbus_hc(vcpu, a0, a1, a2);
+		break;
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 6a4be78..b6c682b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -3,6 +3,18 @@
 
 #include <linux/kvm_host.h>
 
+#ifdef CONFIG_KVM_HOST_VBUS
+static inline int kvm_vbus_support(void)
+{
+    return 1;
+}
+#else
+static inline int kvm_vbus_support(void)
+{
+    return 0;
+}
+#endif
+
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.exception.pending = false;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 349d273..077daac 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -398,6 +398,7 @@ struct kvm_trace_rec {
 #endif
 #define KVM_CAP_RESET 23
 #define KVM_CAP_DYNIRQ 24
+#define KVM_CAP_VBUS 25
 
 /*
  * ioctls for VM fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bec9b35..757f998 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -120,6 +120,9 @@ struct kvm {
 	struct list_head vm_list;
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
+#ifdef CONFIG_KVM_HOST_VBUS
+	struct kvm_vbus *kvbus;
+#endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
 	atomic_t users_count;
@@ -471,4 +474,21 @@ static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_se
 }
 #endif
 
+#ifdef CONFIG_KVM_HOST_VBUS
+
+int kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len);
+void kvm_vbus_release(struct kvm_vbus *kvbus);
+
+#else /* CONFIG_KVM_HOST_VBUS */
+
+static inline int
+kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+	return -EINVAL;
+}
+
+#define kvm_vbus_release(kvbus) do {} while (0)
+
+#endif /* CONFIG_KVM_HOST_VBUS */
+
 #endif
diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index a2de904..ca5203c 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -17,6 +17,65 @@
 #define KVM_HC_VAPIC_POLL_IRQ		1
 #define KVM_HC_MMU_OP			2
 #define KVM_HC_DYNIRQ			3
+#define KVM_HC_VBUS			4
+
+/* Payload of KVM_HC_VBUS */
+#define KVM_VBUS_MAGIC   0x27fdab45
+#define KVM_VBUS_VERSION 1
+
+enum kvm_vbus_op{
+	KVM_VBUS_OP_BUSOPEN,
+	KVM_VBUS_OP_BUSREG,
+	KVM_VBUS_OP_DEVOPEN,
+	KVM_VBUS_OP_DEVCLOSE,
+	KVM_VBUS_OP_DEVCALL,
+	KVM_VBUS_OP_DEVSHM,
+	KVM_VBUS_OP_SHMSIGNAL,
+};
+
+struct kvm_vbus_busopen {
+	__u32 magic;
+	__u32 version;
+	__u64 capabilities;
+};
+
+struct kvm_vbus_eventqreg {
+	__u32 irq;
+	__u32 count;
+	__u64 ring;
+	__u64 data;
+};
+
+struct kvm_vbus_busreg {
+	__u32 count;  /* supporting multiple queues allows for prio, etc */
+	struct kvm_vbus_eventqreg eventq[1];
+};
+
+enum kvm_vbus_eventid {
+	KVM_VBUS_EVENT_DEVADD,
+	KVM_VBUS_EVENT_DEVDROP,
+	KVM_VBUS_EVENT_SHMSIGNAL,
+	KVM_VBUS_EVENT_SHMCLOSE,
+};
+
+#define VBUS_MAX_DEVTYPE_LEN 128
+
+struct kvm_vbus_add_event {
+	__u64  id;
+	char type[VBUS_MAX_DEVTYPE_LEN];
+};
+
+struct kvm_vbus_handle_event {
+	__u64 handle;
+};
+
+struct kvm_vbus_event {
+	__u32 eventid;
+	union {
+		struct kvm_vbus_add_event    add;
+		struct kvm_vbus_handle_event handle;
+	} data;
+};
 
 /*
  * hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fca2d25..2e4ba8b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -942,6 +942,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 {
 	struct kvm *kvm = filp->private_data;
 
+	kvm_vbus_release(kvm->kvbus);
 	kvm_put_kvm(kvm);
 	return 0;
 }
diff --git a/virt/kvm/vbus.c b/virt/kvm/vbus.c
new file mode 100644
index 0000000..17b3392
--- /dev/null
+++ b/virt/kvm/vbus.c
@@ -0,0 +1,1307 @@
+/*
+ * Copyright 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/highmem.h>
+#include <linux/workqueue.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/ioq.h>
+
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_para.h>
+#include <linux/vbus.h>
+#include <linux/vbus_client.h>
+
+#undef PDEBUG
+#ifdef KVMVBUS_DEBUG
+#include <linux/ftrace.h>
+#  define PDEBUG(fmt, args...) ftrace_printk(fmt, ## args)
+#else
+#  define PDEBUG(fmt, args...)
+#endif
+
+struct kvm_vbus_eventq {
+	spinlock_t          lock;
+	struct ioq         *ioq;
+	struct ioq_notifier notifier;
+	struct list_head    backlog;
+	struct {
+		u64         gpa;
+		size_t      len;
+		void       *ptr;
+	} ringdata;
+	struct work_struct  work;
+	int                 backpressure:1;
+};
+
+enum kvm_vbus_state {
+	kvm_vbus_state_init,
+	kvm_vbus_state_registration,
+	kvm_vbus_state_running,
+};
+
+struct kvm_vbus {
+	struct mutex	        lock;
+	enum kvm_vbus_state     state;
+	struct kvm             *kvm;
+	struct vbus            *vbus;
+	struct vbus_client     *client;
+	struct kvm_vbus_eventq  eventq;
+	struct work_struct      destruct;
+	struct vbus_memctx     *ctx;
+	struct {
+		struct notifier_block vbus;
+		struct notifier_block reset;
+	} notify;
+};
+
+struct vbus_client *to_client(struct kvm_vcpu *vcpu)
+{
+	return vcpu ? vcpu->kvm->kvbus->client : NULL;
+}
+
+static void*
+kvm_vmap(struct kvm *kvm, gpa_t gpa, size_t len)
+{
+	struct page **page_list;
+	void *ptr = NULL;
+	unsigned long addr;
+	off_t offset;
+	size_t npages;
+	int ret;
+
+	addr = gfn_to_hva(kvm, gpa >> PAGE_SHIFT);
+
+	offset = offset_in_page(gpa);
+	npages = PAGE_ALIGN(len + offset) >> PAGE_SHIFT;
+
+	if (npages > (PAGE_SIZE / sizeof(struct page *)))
+		return NULL;
+
+	page_list = (struct page **) __get_free_page(GFP_KERNEL);
+	if (!page_list)
+		return NULL;
+
+	ret = get_user_pages_fast(addr, npages, 1, page_list);
+	if (ret < 0)
+		goto out;
+
+	down_write(&current->mm->mmap_sem);
+
+	ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
+	if (ptr)
+		current->mm->locked_vm += npages;
+
+	up_write(&current->mm->mmap_sem);
+
+	if (ptr)
+		ptr += offset;
+
+out:
+	free_page((unsigned long)page_list);
+
+	return ptr;
+}
+
+static void
+kvm_vunmap(void *ptr)
+{
+	/* FIXME: do we need to adjust current->mm->locked_vm? */
+	vunmap((void *)((unsigned long)ptr & PAGE_MASK));
+}
+
+/*
+ * -----------------
+ * kvm_shm routines
+ * -----------------
+ */
+
+struct kvm_shm {
+	struct kvm_vbus   *kvbus;
+	struct vbus_shm    shm;
+};
+
+static void
+kvm_shm_release(struct vbus_shm *shm)
+{
+	struct kvm_shm *_shm = container_of(shm, struct kvm_shm, shm);
+
+	kvm_vunmap(_shm->shm.ptr);
+	kfree(_shm);
+}
+
+static struct vbus_shm_ops kvm_shm_ops = {
+	.release = kvm_shm_release,
+};
+
+static int
+kvm_shm_map(struct kvm_vbus *kvbus, __u64 ptr, __u32 len, struct kvm_shm **kshm)
+{
+	struct kvm_shm *_shm;
+	void *vmap;
+
+	if (!can_do_mlock())
+		return -EPERM;
+
+	_shm = kzalloc(sizeof(*_shm), GFP_KERNEL);
+	if (!_shm)
+		return -ENOMEM;
+
+	_shm->kvbus = kvbus;
+
+	vmap = kvm_vmap(kvbus->kvm, ptr, len);
+	if (!vmap) {
+		kfree(_shm);
+		return -EFAULT;
+	}
+
+	vbus_shm_init(&_shm->shm, &kvm_shm_ops, vmap, len);
+
+	*kshm = _shm;
+
+	return 0;
+}
+
+/*
+ * -----------------
+ * vbus_memctx routines
+ * -----------------
+ */
+
+struct kvm_memctx {
+	struct kvm *kvm;
+	struct vbus_memctx *taskmem;
+	struct vbus_memctx ctx;
+};
+
+static struct kvm_memctx *to_kvm_memctx(struct vbus_memctx *ctx)
+{
+	return container_of(ctx, struct kvm_memctx, ctx);
+}
+
+
+static unsigned long
+kvm_memctx_copy_to(struct vbus_memctx *ctx, void *dst, const void *src,
+	       unsigned long n)
+{
+	struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+	struct vbus_memctx *tm = kvm_memctx->taskmem;
+	gpa_t gpa = (gpa_t)dst;
+	unsigned long addr;
+	int offset;
+
+	addr = gfn_to_hva(kvm_memctx->kvm, gpa >> PAGE_SHIFT);
+	offset = offset_in_page(gpa);
+
+	return tm->ops->copy_to(tm, (void *)(addr + offset), src, n);
+}
+
+static unsigned long
+kvm_memctx_copy_from(struct vbus_memctx *ctx, void *dst, const void *src,
+		  unsigned long n)
+{
+	struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+	struct vbus_memctx *tm = kvm_memctx->taskmem;
+	gpa_t gpa = (gpa_t)src;
+	unsigned long addr;
+	int offset;
+
+	addr = gfn_to_hva(kvm_memctx->kvm, gpa >> PAGE_SHIFT);
+	offset = offset_in_page(gpa);
+
+	return tm->ops->copy_from(tm, dst, (void *)(addr + offset), n);
+}
+
+static void
+kvm_memctx_release(struct vbus_memctx *ctx)
+{
+	struct kvm_memctx *kvm_memctx = to_kvm_memctx(ctx);
+
+	vbus_memctx_put(kvm_memctx->taskmem);
+	kvm_put_kvm(kvm_memctx->kvm);
+
+	kfree(kvm_memctx);
+}
+
+static struct vbus_memctx_ops kvm_memctx_ops = {
+	.copy_to   = &kvm_memctx_copy_to,
+	.copy_from = &kvm_memctx_copy_from,
+	.release   = &kvm_memctx_release,
+};
+
+struct vbus_memctx *kvm_memctx_alloc(struct kvm *kvm)
+{
+	struct kvm_memctx *kvm_memctx;
+
+	kvm_memctx = kzalloc(sizeof(*kvm_memctx), GFP_KERNEL);
+	if (!kvm_memctx)
+		return NULL;
+
+	kvm_get_kvm(kvm);
+	kvm_memctx->kvm = kvm;
+
+	kvm_memctx->taskmem = task_memctx_alloc(current);
+	vbus_memctx_init(&kvm_memctx->ctx, &kvm_memctx_ops);
+
+	return &kvm_memctx->ctx;
+}
+
+/*
+ * -----------------
+ * general routines
+ * -----------------
+ */
+
+static int
+_signal_init(struct kvm *kvm, struct shm_signal_desc *desc,
+	     struct shm_signal *signal, struct shm_signal_ops *ops)
+{
+	if (desc->magic != SHM_SIGNAL_MAGIC)
+		return -EINVAL;
+
+	if (desc->ver != SHM_SIGNAL_VER)
+		return -EINVAL;
+
+	shm_signal_init(signal);
+
+	signal->locale    = shm_locality_south;
+	signal->ops       = ops;
+	signal->desc      = desc;
+
+	return 0;
+}
+
+static struct kvm_vbus_event *
+event_ptr_translate(struct kvm_vbus_eventq *eventq, u64 ptr)
+{
+	u64 off = ptr - eventq->ringdata.gpa;
+
+	if ((ptr < eventq->ringdata.gpa)
+	    || (off > (eventq->ringdata.len - sizeof(struct kvm_vbus_event))))
+		return NULL;
+
+	return eventq->ringdata.ptr + off;
+}
+
+/*
+ * ------------------
+ * event-object code
+ * ------------------
+ */
+
+struct _event {
+	atomic_t              refs;
+	struct list_head      list;
+	struct kvm_vbus_event data;
+};
+
+static void
+_event_init(struct _event *event)
+{
+	memset(event, 0, sizeof(*event));
+	atomic_set(&event->refs, 1);
+	INIT_LIST_HEAD(&event->list);
+}
+
+static void
+_event_get(struct _event *event)
+{
+	atomic_inc(&event->refs);
+}
+
+static inline void
+_event_put(struct _event *event)
+{
+	if (atomic_dec_and_test(&event->refs))
+		kfree(event);
+}
+
+/*
+ * ------------------
+ * event-inject code
+ * ------------------
+ */
+
+static struct kvm_vbus_eventq *notify_to_eventq(struct ioq_notifier *notifier)
+{
+	return container_of(notifier, struct kvm_vbus_eventq, notifier);
+}
+
+static struct kvm_vbus_eventq *work_to_eventq(struct work_struct *work)
+{
+	return container_of(work, struct kvm_vbus_eventq, work);
+}
+
+/*
+ * This is invoked by the guest whenever they signal our eventq when
+ * we have notifications enabled
+ */
+static void
+eventq_notify(struct ioq_notifier *notifier)
+{
+	struct kvm_vbus_eventq *eventq = notify_to_eventq(notifier);
+	unsigned long           flags;
+
+	spin_lock_irqsave(&eventq->lock, flags);
+
+	if (!ioq_full(eventq->ioq, ioq_idxtype_inuse)) {
+		eventq->backpressure = false;
+		ioq_notify_disable(eventq->ioq, 0);
+		schedule_work(&eventq->work);
+	}
+
+	spin_unlock_irqrestore(&eventq->lock, flags);
+}
+
+static void
+events_flush(struct kvm_vbus_eventq *eventq)
+{
+	struct ioq_iterator     iter;
+	int                     ret;
+	unsigned long           flags;
+	struct _event          *_event, *tmp;
+	int                     dirty = 0;
+
+	spin_lock_irqsave(&eventq->lock, flags);
+
+	/* We want to iterate on the tail of the in-use index */
+	ret = ioq_iter_init(eventq->ioq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	list_for_each_entry_safe(_event, tmp, &eventq->backlog, list) {
+		struct kvm_vbus_event *ev;
+
+		if (!iter.desc->sown) {
+			eventq->backpressure = true;
+			ioq_notify_enable(eventq->ioq, 0);
+			break;
+		}
+
+		if (iter.desc->len < sizeof(*ev)) {
+			SHM_SIGNAL_FAULT(eventq->ioq->signal,
+					 "Desc too small on eventq: %p: %d<%d",
+					 iter.desc->ptr,
+					 iter.desc->len, sizeof(*ev));
+			break;
+		}
+
+		ev = event_ptr_translate(eventq, iter.desc->ptr);
+		if (!ev) {
+			SHM_SIGNAL_FAULT(eventq->ioq->signal,
+					 "Invalid address on eventq: %p",
+					 iter.desc->ptr);
+			break;
+		}
+
+		memcpy(ev, &_event->data, sizeof(*ev));
+
+		list_del_init(&_event->list);
+		_event_put(_event);
+
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+
+		dirty = 1;
+	}
+
+	spin_unlock_irqrestore(&eventq->lock, flags);
+
+	/*
+	 * Signal the IOQ outside of the spinlock so that we can potentially
+	 * directly inject this interrupt instead of deferring it
+	 */
+	if (dirty)
+		ioq_signal(eventq->ioq, 0);
+}
+
+static int
+event_inject(struct kvm_vbus_eventq *eventq, struct _event *_event)
+{
+	unsigned long flags;
+
+	if (!list_empty(&_event->list))
+		return -EBUSY;
+
+	spin_lock_irqsave(&eventq->lock, flags);
+	list_add_tail(&_event->list, &eventq->backlog);
+	spin_unlock_irqrestore(&eventq->lock, flags);
+
+	events_flush(eventq);
+
+	return 0;
+}
+
+static void
+eventq_reinject(struct work_struct *work)
+{
+	struct kvm_vbus_eventq *eventq = work_to_eventq(work);
+
+	events_flush(eventq);
+}
+
+/*
+ * devadd/drop are in the slow path and are rare enough that we will
+ * simply allocate memory for the event from the heap
+ */
+static int
+devadd_inject(struct kvm_vbus_eventq *eventq, const char *type, u64 id)
+{
+	struct _event *_event;
+	struct kvm_vbus_add_event *ae;
+	int ret;
+
+	_event = kmalloc(sizeof(*_event), GFP_KERNEL);
+	if (!_event)
+		return -ENOMEM;
+
+	_event_init(_event);
+
+	_event->data.eventid = KVM_VBUS_EVENT_DEVADD;
+	ae = (struct kvm_vbus_add_event *)&_event->data.data;
+	ae->id = id;
+	strncpy(ae->type, type, VBUS_MAX_DEVTYPE_LEN);
+
+	ret = event_inject(eventq, _event);
+	if (ret < 0)
+		_event_put(_event);
+
+	return ret;
+}
+
+/*
+ * "handle" events are used to send any kind of event that simply
+ * uses a handle as a parameter.  This includes things like DEVDROP
+ * and SHMSIGNAL, etc.
+ */
+static struct _event *
+handle_event_alloc(u64 id, u64 handle)
+{
+	struct _event *_event;
+	struct kvm_vbus_handle_event *he;
+
+	_event = kmalloc(sizeof(*_event), GFP_KERNEL);
+	if (!_event)
+		return NULL;
+
+	_event_init(_event);
+	_event->data.eventid = id;
+
+	he = (struct kvm_vbus_handle_event *)&_event->data.data;
+	he->handle = handle;
+
+	return _event;
+}
+
+static int
+devdrop_inject(struct kvm_vbus_eventq *eventq, u64 id)
+{
+	struct _event *_event;
+	int ret;
+
+	_event = handle_event_alloc(KVM_VBUS_EVENT_DEVDROP, id);
+	if (!_event)
+		return -ENOMEM;
+
+	ret = event_inject(eventq, _event);
+	if (ret < 0)
+		_event_put(_event);
+
+	return ret;
+}
+
+static struct kvm_vbus_eventq *
+prio_to_eventq(struct kvm_vbus *kvbus, int prio)
+{
+	/*
+	 * NOTE: priority is ignored for now...all events aggregate onto a
+	 * single queue
+	 */
+
+	return &kvbus->eventq;
+}
+
+/*
+ * -----------------
+ * event ioq
+ *
+ * This queue is used by the infrastructure to transmit events (such as
+ * "new device", or "signal an ioq") to the guest.  We do this so that
+ * we minimize the number of hypercalls required to inject an event.
+ * In theory, the guest only needs to process a single interrupt vector
+ * and it doesn't require switching back to host context since the state
+ * is placed within the ring
+ * -----------------
+ */
+
+struct eventq_signal {
+	struct kvm_vbus   *kvbus;
+	struct vbus_shm   *shm;
+	struct shm_signal  signal;
+	int                irq;
+};
+
+static struct eventq_signal *signal_to_eventq(struct shm_signal *signal)
+{
+       return container_of(signal, struct eventq_signal, signal);
+}
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+	struct eventq_signal *_signal = signal_to_eventq(signal);
+	struct kvm *kvm = _signal->kvbus->kvm;
+
+	/* Inject an interrupt to the guest */
+	kvm_inject_dynirq(kvm, _signal->irq);
+
+	return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+	struct eventq_signal *_signal = signal_to_eventq(signal);
+
+	vbus_shm_put(_signal->shm);
+	kfree(_signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+	.inject  = eventq_signal_inject,
+	.release = eventq_signal_release,
+};
+
+static int
+_eventq_attach(struct kvm_vbus *kvbus, __u32 count, __u64 ptr, int irq,
+	       struct ioq **ioq)
+{
+	struct ioq_ring_head *desc;
+	struct eventq_signal *_signal = NULL;
+	struct kvm_shm *_shm = NULL;
+	size_t len = IOQ_HEAD_DESC_SIZE(count);
+	int ret;
+
+	ret = kvm_shm_map(kvbus, ptr, len, &_shm);
+	if (ret < 0)
+		return ret;
+
+	_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+	if (!_signal) {
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	desc = _shm->shm.ptr;
+
+	ret = _signal_init(kvbus->kvm,
+			   &desc->signal,
+			   &_signal->signal,
+			   &eventq_signal_ops);
+	if (ret < 0) {
+		kfree(_signal);
+		_signal = NULL;
+		goto error;
+	}
+
+	_signal->kvbus = kvbus;
+	_signal->irq   = irq;
+	_signal->shm   = &_shm->shm;
+	vbus_shm_get(&_shm->shm); /* dropped when the signal releases */
+
+	/* FIXME: we should make maxcount configurable */
+	ret = vbus_shm_ioq_attach(&_shm->shm, &_signal->signal, 2048, ioq);
+	if (ret < 0)
+		goto error;
+
+	return 0;
+
+error:
+	if (_signal)
+		shm_signal_put(&_signal->signal);
+
+	if (_shm)
+		vbus_shm_put(&_shm->shm);
+
+	return ret;
+}
+
+/*
+ * -----------------
+ * device_signal routines
+ *
+ * This is the more standard signal that is allocated to communicate
+ * with a specific device's shm region
+ * -----------------
+ */
+
+struct device_signal {
+	struct kvm_vbus   *kvbus;
+	struct vbus_shm   *shm;
+	struct shm_signal  signal;
+	struct _event     *inject;
+	int                prio;
+	u64                handle;
+};
+
+static struct device_signal *to_dsig(struct shm_signal *signal)
+{
+       return container_of(signal, struct device_signal, signal);
+}
+
+static void
+_device_signal_inject(struct device_signal *_signal)
+{
+	struct kvm_vbus_eventq *eventq;
+	int ret;
+
+	eventq = prio_to_eventq(_signal->kvbus, _signal->prio);
+
+	ret = event_inject(eventq, _signal->inject);
+	if (ret < 0)
+		_event_put(_signal->inject);
+}
+
+static int
+device_signal_inject(struct shm_signal *signal)
+{
+	struct device_signal *_signal = to_dsig(signal);
+
+	_event_get(_signal->inject); /* will be dropped by injection code */
+	_device_signal_inject(_signal);
+
+	return 0;
+}
+
+static void
+device_signal_release(struct shm_signal *signal)
+{
+	struct device_signal *_signal = to_dsig(signal);
+	struct kvm_vbus_eventq *eventq;
+	unsigned long flags;
+
+	eventq = prio_to_eventq(_signal->kvbus, _signal->prio);
+
+	/*
+	 * Change the event-type while holding the lock so we do not race
+	 * with any potential threads already processing the queue
+	 */
+	spin_lock_irqsave(&eventq->lock, flags);
+	_signal->inject->data.eventid = KVM_VBUS_EVENT_SHMCLOSE;
+	spin_unlock_irqrestore(&eventq->lock, flags);
+
+	/*
+	 * Do not take a reference to the event; the last reference will be
+	 * dropped once it has been transmitted.
+	 */
+	_device_signal_inject(_signal);
+
+	vbus_shm_put(_signal->shm);
+	kfree(_signal);
+}
+
+static struct shm_signal_ops device_signal_ops = {
+	.inject  = device_signal_inject,
+	.release = device_signal_release,
+};
+
+static int
+device_signal_alloc(struct kvm_vbus *kvbus, struct vbus_shm *shm,
+		    u32 offset, u32 prio, u64 cookie,
+		    struct device_signal **dsignal)
+{
+	struct device_signal *_signal;
+	int ret;
+
+	_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+	if (!_signal)
+		return -ENOMEM;
+
+	ret = _signal_init(kvbus->kvm, shm->ptr + offset,
+			   &_signal->signal,
+			   &device_signal_ops);
+	if (ret < 0) {
+		kfree(_signal);
+		return ret;
+	}
+
+	_signal->inject = handle_event_alloc(KVM_VBUS_EVENT_SHMSIGNAL, cookie);
+	if (!_signal->inject) {
+		shm_signal_put(&_signal->signal);
+		return -ENOMEM;
+	}
+
+	_signal->kvbus  = kvbus;
+	_signal->shm    = shm;
+	_signal->prio   = prio;
+	vbus_shm_get(shm); /* dropped when the signal is released */
+
+	*dsignal = _signal;
+
+	return 0;
+}
+
+/*
+ * ------------------
+ * notifiers
+ * ------------------
+ */
+
+/*
+ * This is called whenever our associated vbus emits an event.  We inject
+ * these events at the highest logical priority
+ */
+static int
+vbus_notifier(struct notifier_block *nb, unsigned long nr, void *data)
+{
+	struct kvm_vbus *kvbus = container_of(nb, struct kvm_vbus, notify.vbus);
+	struct kvm_vbus_eventq *eventq = prio_to_eventq(kvbus, 0);
+
+	switch (nr) {
+	case VBUS_EVENT_DEVADD: {
+		struct vbus_event_devadd *ev = data;
+
+		devadd_inject(eventq, ev->type, ev->id);
+		break;
+	}
+	case VBUS_EVENT_DEVDROP: {
+		unsigned long id = *(unsigned long *)data;
+
+		devdrop_inject(eventq, id);
+		break;
+	}
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static void
+deferred_destruct(struct work_struct *work)
+{
+	struct kvm_vbus *kvbus = container_of(work, struct kvm_vbus, destruct);
+
+	kvm_vbus_release(kvbus);
+}
+
+/*
+ * This is called if the guest reboots...we should release our association
+ * with the vbus (if any)
+ */
+static int
+reset_notifier(struct notifier_block *nb, unsigned long nr, void *data)
+{
+	struct kvm_vbus *kvbus = container_of(nb, struct kvm_vbus,
+					      notify.reset);
+
+	schedule_work(&kvbus->destruct);
+	kvbus->kvm->kvbus = NULL;
+
+	return NOTIFY_DONE;
+}
+
+static int
+kvm_vbus_eventq_attach(struct kvm_vbus *kvbus, struct kvm_vbus_eventq *eventq,
+		      u32 count, u64 ring, u64 data, int irq)
+{
+	struct ioq *ioq;
+	size_t len;
+	void *ptr;
+	int ret;
+
+	if (eventq->ioq)
+		return -EINVAL;
+
+	ret = _eventq_attach(kvbus, count, ring, irq, &ioq);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * We are going to pre-vmap the eventq data for performance reasons
+	 */
+	len = count * sizeof(struct kvm_vbus_event);
+	ptr =  kvm_vmap(kvbus->kvm, data, len);
+	if (!ptr) {
+		ioq_put(ioq);
+		return -EFAULT;
+	}
+
+	spin_lock_init(&eventq->lock);
+	eventq->ioq = ioq;
+	INIT_WORK(&eventq->work, eventq_reinject);
+
+	eventq->notifier.signal = eventq_notify;
+	ioq->notifier = &eventq->notifier;
+
+	INIT_LIST_HEAD(&eventq->backlog);
+
+	eventq->ringdata.len = len;
+	eventq->ringdata.gpa = data;
+	eventq->ringdata.ptr = ptr;
+
+	return 0;
+}
+
+static void
+kvm_vbus_eventq_detach(struct kvm_vbus_eventq *eventq)
+{
+	if (eventq->ioq)
+		ioq_put(eventq->ioq);
+
+	if (eventq->ringdata.ptr)
+		kvm_vunmap(eventq->ringdata.ptr);
+}
+
+static int
+kvm_vbus_alloc(struct kvm_vcpu *vcpu)
+{
+	struct vbus *vbus = task_vbus_get(current);
+	struct vbus_client *client;
+	struct kvm_vbus *kvbus;
+	int ret;
+
+	if (!vbus)
+		return -EPERM;
+
+	client = vbus_client_attach(vbus);
+	if (!client) {
+		vbus_put(vbus);
+		return -ENOMEM;
+	}
+
+	kvbus = kzalloc(sizeof(*kvbus), GFP_KERNEL);
+	if (!kvbus) {
+		vbus_put(vbus);
+		vbus_client_put(client);
+		return -ENOMEM;
+	}
+
+	mutex_init(&kvbus->lock);
+	kvbus->state = kvm_vbus_state_registration;
+	kvbus->kvm = vcpu->kvm;
+	kvbus->vbus = vbus;
+	kvbus->client = client;
+
+	vcpu->kvm->kvbus = kvbus;
+
+	INIT_WORK(&kvbus->destruct, deferred_destruct);
+	kvbus->ctx = kvm_memctx_alloc(vcpu->kvm);
+
+	kvbus->notify.vbus.notifier_call = vbus_notifier;
+	kvbus->notify.vbus.priority = 0;
+
+	kvbus->notify.reset.notifier_call = reset_notifier;
+	kvbus->notify.reset.priority = 0;
+
+	ret = kvm_reset_notifier_register(vcpu->kvm, &kvbus->notify.reset);
+	if (ret < 0) {
+		kvm_vbus_release(kvbus);
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+kvm_vbus_release(struct kvm_vbus *kvbus)
+{
+	if (!kvbus)
+		return;
+
+	if (kvbus->ctx)
+		vbus_memctx_put(kvbus->ctx);
+
+	kvm_vbus_eventq_detach(&kvbus->eventq);
+
+	if (kvbus->client)
+		vbus_client_put(kvbus->client);
+
+	if (kvbus->vbus) {
+		vbus_notifier_unregister(kvbus->vbus, &kvbus->notify.vbus);
+		vbus_put(kvbus->vbus);
+	}
+
+	kvm_reset_notifier_unregister(kvbus->kvm, &kvbus->notify.reset);
+
+	flush_scheduled_work();
+
+	kvbus->kvm->kvbus = NULL;
+
+	kfree(kvbus);
+}
+
+/*
+ * ------------------
+ * hypercall implementation
+ * ------------------
+ */
+
+static int
+hc_busopen(struct kvm_vcpu *vcpu, void *data)
+{
+	struct kvm_vbus_busopen *args = data;
+
+	if (vcpu->kvm->kvbus)
+		return -EEXIST;
+
+	if (args->magic != KVM_VBUS_MAGIC)
+		return -EINVAL;
+
+	if (args->version != KVM_VBUS_VERSION)
+		return -EINVAL;
+
+	args->capabilities = 0;
+
+	return kvm_vbus_alloc(vcpu);
+}
+
+static int
+hc_busreg(struct kvm_vcpu *vcpu, void *data)
+{
+	struct kvm_vbus_busreg *args = data;
+	struct kvm_vbus_eventqreg *qreg = &args->eventq[0];
+	struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+	int ret;
+
+	if (args->count != 1)
+		return -EINVAL;
+
+	ret = kvm_vbus_eventq_attach(kvbus,
+				     &kvbus->eventq,
+				     qreg->count,
+				     qreg->ring,
+				     qreg->data,
+				     qreg->irq);
+	if (ret < 0)
+		return ret;
+
+	ret = vbus_notifier_register(kvbus->vbus, &kvbus->notify.vbus);
+	if (ret < 0)
+		return ret;
+
+	kvbus->state = kvm_vbus_state_running;
+
+	return 0;
+}
+
+static int
+hc_deviceopen(struct kvm_vcpu *vcpu, void *data)
+{
+	struct vbus_deviceopen *args = data;
+	struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+	struct vbus_client *c = kvbus->client;
+
+	return c->ops->deviceopen(c, kvbus->ctx,
+				  args->devid, args->version, &args->handle);
+}
+
+static int
+hc_deviceclose(struct kvm_vcpu *vcpu, void *data)
+{
+	__u64 devh = *(__u64 *)data;
+	struct vbus_client *c = to_client(vcpu);
+
+	return c->ops->deviceclose(c, devh);
+}
+
+static int
+hc_devicecall(struct kvm_vcpu *vcpu, void *data)
+{
+	struct vbus_devicecall *args = data;
+	struct vbus_client *c = to_client(vcpu);
+
+	return c->ops->devicecall(c, args->devh, args->func,
+				  (void *)args->datap, args->len, args->flags);
+}
+
+static int
+hc_deviceshm(struct kvm_vcpu *vcpu, void *data)
+{
+	struct vbus_deviceshm *args = data;
+	struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+	struct vbus_client *c = to_client(vcpu);
+	struct device_signal *_signal = NULL;
+	struct shm_signal *signal = NULL;
+	struct kvm_shm *_shm;
+	u64 handle;
+	int ret;
+
+	ret = kvm_shm_map(kvbus, args->datap, args->len, &_shm);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Establishing a signal is optional
+	 */
+	if (args->signal.offset != -1) {
+		ret = device_signal_alloc(kvbus, &_shm->shm,
+					  args->signal.offset,
+					  args->signal.prio,
+					  args->signal.cookie,
+					  &_signal);
+		if (ret < 0)
+			goto out;
+
+		signal = &_signal->signal;
+	}
+
+	ret = c->ops->deviceshm(c, args->devh, args->id,
+				&_shm->shm, signal,
+				args->flags, &handle);
+	if (ret < 0)
+		goto out;
+
+	args->handle = handle;
+	if (_signal)
+		_signal->handle = handle;
+
+	return 0;
+
+out:
+	if (signal)
+		shm_signal_put(signal);
+
+	vbus_shm_put(&_shm->shm);
+	return ret;
+}
+
+static int
+hc_shmsignal(struct kvm_vcpu *vcpu, void *data)
+{
+	__u64 handle = *(__u64 *)data;
+	struct kvm_vbus *kvbus;
+	struct vbus_client *c = to_client(vcpu);
+
+	/* A non-zero handle is targeted at a device's shm */
+	if (handle)
+		return c->ops->shmsignal(c, handle);
+
+	kvbus = vcpu->kvm->kvbus;
+
+	/* A null handle is signaling our eventq */
+	_shm_signal_wakeup(kvbus->eventq.ioq->signal);
+
+	return 0;
+}
+
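+/*
+ * Each hypercall verb is described by an hc_op: .nr is the call-id, .len is
+ * the expected size of the guest argument block, and .dirty indicates that
+ * the handler may modify the arguments, so they are copied back to the
+ * guest (or the mapped page marked dirty) when the call succeeds.
+ */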
+struct hc_op {
+	int nr;
+	int len;
+	int dirty;
+	int (*func)(struct kvm_vcpu *vcpu, void *args);
+};
+
+static struct hc_op _hc_busopen = {
+	.nr = KVM_VBUS_OP_BUSOPEN,
+	.len = sizeof(struct kvm_vbus_busopen),
+	.dirty = 1,
+	.func = &hc_busopen,
+};
+
+static struct hc_op _hc_busreg = {
+	.nr = KVM_VBUS_OP_BUSREG,
+	.len = sizeof(struct kvm_vbus_busreg),
+	.func = &hc_busreg,
+};
+
+static struct hc_op _hc_devopen = {
+	.nr = KVM_VBUS_OP_DEVOPEN,
+	.len = sizeof(struct vbus_deviceopen),
+	.dirty = 1,
+	.func = &hc_deviceopen,
+};
+
+static struct hc_op _hc_devclose = {
+	.nr = KVM_VBUS_OP_DEVCLOSE,
+	.len = sizeof(u64),
+	.func = &hc_deviceclose,
+};
+
+static struct hc_op _hc_devcall = {
+	.nr = KVM_VBUS_OP_DEVCALL,
+	.len = sizeof(struct vbus_devicecall),
+	.func = &hc_devicecall,
+};
+
+static struct hc_op _hc_devshm = {
+	.nr = KVM_VBUS_OP_DEVSHM,
+	.len = sizeof(struct vbus_deviceshm),
+	.dirty = 1,
+	.func = &hc_deviceshm,
+};
+
+static struct hc_op _hc_shmsignal = {
+	.nr = KVM_VBUS_OP_SHMSIGNAL,
+	.len = sizeof(u64),
+	.func = &hc_shmsignal,
+};
+
+static struct hc_op *hc_ops[] = {
+	&_hc_busopen,
+	&_hc_busreg,
+	&_hc_devopen,
+	&_hc_devclose,
+	&_hc_devcall,
+	&_hc_devshm,
+	&_hc_shmsignal,
+	NULL,
+};
+
+static int
+hc_execute_indirect(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa)
+{
+	struct kvm *kvm  = vcpu->kvm;
+	char       *args = NULL;
+	int         ret;
+
+	BUG_ON(!op->len);
+
+	args = kmalloc(op->len, GFP_KERNEL);
+	if (!args)
+		return -ENOMEM;
+
+	ret = kvm_read_guest(kvm, gpa, args, op->len);
+	if (ret < 0)
+		goto out;
+
+	ret = op->func(vcpu, args);
+
+	if (ret >= 0 && op->dirty)
+		ret = kvm_write_guest(kvm, gpa, args, op->len);
+
+out:
+	kfree(args);
+
+	return ret;
+}
+
+static int
+hc_execute_direct(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa)
+{
+	struct kvm  *kvm   = vcpu->kvm;
+	void        *args;
+	char        *kaddr = NULL;
+	struct page *page;
+	int          ret;
+
+	page = gfn_to_page(kvm, gpa >> PAGE_SHIFT);
+	if (page == bad_page) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	kaddr = kmap(page);
+	if (!kaddr) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	args = kaddr + offset_in_page(gpa);
+
+	ret = op->func(vcpu, args);
+
+out:
+	if (kaddr)
+		kunmap(page);
+
+	if (ret >= 0 && op->dirty)
+		kvm_release_page_dirty(page);
+	else
+		kvm_release_page_clean(page);
+
+	return ret;
+}
+
+static int
+hc_execute(struct kvm_vcpu *vcpu, struct hc_op *op, gpa_t gpa, size_t len)
+{
+	if (len != op->len)
+		return -EINVAL;
+
+	/*
+	 * Execute-immediate if there is no data
+	 */
+	if (!len)
+		return op->func(vcpu, NULL);
+
+	/*
+	 * We will need to copy the arguments in the unlikely case that the
+	 * gpa pointer crosses a page boundary
+	 *
+	 * FIXME: Is it safe to assume PAGE_SIZE is relevant to gpa?
+	 */
+	if (unlikely(len && (offset_in_page(gpa) + len) > PAGE_SIZE))
+		return hc_execute_indirect(vcpu, op, gpa);
+
+	/*
+	 * Otherwise just execute with zero-copy by mapping the arguments
+	 */
+	return hc_execute_direct(vcpu, op, gpa);
+}
+
+/*
+ * Our hypercall format will always follow with the call-id in arg[0],
+ * a pointer to the arguments in arg[1], and the argument length in arg[2]
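+ * (the guest side wraps this up as
+ *  kvm_hypercall3(KVM_HC_VBUS, nr, __pa(data), len))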
+ */
+int
+kvm_vbus_hc(struct kvm_vcpu *vcpu, int nr, gpa_t gpa, size_t len)
+{
+	struct kvm_vbus *kvbus = vcpu->kvm->kvbus;
+	enum kvm_vbus_state state = kvbus ? kvbus->state : kvm_vbus_state_init;
+	int i;
+
+	PDEBUG("nr=%d, state=%d\n", nr, state);
+
+	switch (state) {
+	case kvm_vbus_state_init:
+		if (nr != KVM_VBUS_OP_BUSOPEN) {
+			PDEBUG("expected BUSOPEN\n");
+			return -EINVAL;
+		}
+		break;
+	case kvm_vbus_state_registration:
+		if (nr != KVM_VBUS_OP_BUSREG) {
+			PDEBUG("expected BUSREG\n");
+			return -EINVAL;
+		}
+		break;
+	default:
+		break;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(hc_ops); i++) {
+		struct hc_op *op = hc_ops[i];
+
+		if (!op) /* hc_ops[] is NULL terminated */
+			break;
+
+		if (op->nr != nr)
+			continue;
+
+		return hc_execute(vcpu, op, gpa, len);
+	}
+
+	PDEBUG("error: no matching function for nr=%d\n", nr);
+
+	return -EINVAL;
+}


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [RFC PATCH 17/17] kvm: Add guest-side support for VBUS
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (15 preceding siblings ...)
  2009-03-31 18:44 ` [RFC PATCH 16/17] kvm: Add VBUS support to the host Gregory Haskins
@ 2009-03-31 18:44 ` Gregory Haskins
  2009-03-31 20:18 ` [RFC PATCH 00/17] virtual-bus Andi Kleen
  2009-04-01  6:08 ` Rusty Russell
  18 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 18:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

This adds a driver to interface between the host VBUS support and the
guest-side vbus bus model.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 arch/x86/Kconfig            |    9 +
 drivers/Makefile            |    1 
 drivers/vbus/proxy/Makefile |    2 
 drivers/vbus/proxy/kvm.c    |  726 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 738 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vbus/proxy/Makefile
 create mode 100644 drivers/vbus/proxy/kvm.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 91fefd5..8661495 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -451,6 +451,15 @@ config KVM_GUEST_DYNIRQ
        depends on KVM_GUEST
        default y
 
+config KVM_GUEST_VBUS
+       tristate "KVM virtual-bus (VBUS) guest-side support"
+       depends on KVM_GUEST
+       select VBUS_DRIVERS
+       default y
+       ---help---
+          This option enables guest-side support for accessing virtual-bus
+          devices.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT
diff --git a/drivers/Makefile b/drivers/Makefile
index 98fab51..4f2cb93 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -107,3 +107,4 @@ obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_STAGING)		+= staging/
 obj-y				+= platform/
 obj-$(CONFIG_VBUS_DEVICES)	+= vbus/devices/
+obj-$(CONFIG_VBUS_DRIVERS)	+= vbus/proxy/
diff --git a/drivers/vbus/proxy/Makefile b/drivers/vbus/proxy/Makefile
new file mode 100644
index 0000000..c18d58d
--- /dev/null
+++ b/drivers/vbus/proxy/Makefile
@@ -0,0 +1,2 @@
+kvm-guest-vbus-objs += kvm.o
+obj-$(CONFIG_KVM_GUEST_VBUS) += kvm-guest-vbus.o
diff --git a/drivers/vbus/proxy/kvm.c b/drivers/vbus/proxy/kvm.c
new file mode 100644
index 0000000..82e28b4
--- /dev/null
+++ b/drivers/vbus/proxy/kvm.c
@@ -0,0 +1,726 @@
+/*
+ * Copyright (C) 2009 Novell.  All Rights Reserved.
+ *
+ * Author:
+ *	Gregory Haskins <ghaskins@novell.com>
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/vbus.h>
+#include <linux/kvm_para.h>
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/ioq.h>
+#include <linux/interrupt.h>
+#include <linux/kvm_para.h>
+#include <linux/kvm_guest.h>
+#include <linux/vbus_client.h>
+#include <linux/vbus_driver.h>
+
+MODULE_AUTHOR("Gregory Haskins");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
+
+static int kvm_vbus_hypercall(unsigned long nr, void *data, unsigned long len)
+{
+	return kvm_hypercall3(KVM_HC_VBUS, nr, __pa(data), len);
+}
+
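+/*
+ * Per-guest singleton state for the proxy: the host->guest eventq ring and
+ * the dynamically assigned IRQ it is wired to.
+ */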
+struct kvm_vbus {
+	spinlock_t                lock;
+	struct ioq                eventq;
+	struct kvm_vbus_event    *ring;
+	int                       irq;
+};
+
+static struct kvm_vbus kvm_vbus;
+
+struct kvm_vbus_device {
+	char                     type[VBUS_MAX_DEVTYPE_LEN];
+	u64                      handle;
+	struct list_head         shms;
+	struct vbus_device_proxy vdev;
+};
+
+/*
+ * -------------------
+ * common routines
+ * -------------------
+ */
+
+static struct kvm_vbus_device *
+to_dev(struct vbus_device_proxy *vdev)
+{
+	return container_of(vdev, struct kvm_vbus_device, vdev);
+}
+
+static void
+_signal_init(struct shm_signal *signal, struct shm_signal_desc *desc,
+	     struct shm_signal_ops *ops)
+{
+	desc->magic = SHM_SIGNAL_MAGIC;
+	desc->ver   = SHM_SIGNAL_VER;
+
+	shm_signal_init(signal);
+
+	signal->locale = shm_locality_north;
+	signal->ops    = ops;
+	signal->desc   = desc;
+}
+
+/*
+ * -------------------
+ * _signal
+ * -------------------
+ */
+
+struct _signal {
+	struct kvm_vbus   *kvbus;
+	struct shm_signal  signal;
+	u64                handle;
+	struct rb_node     node;
+	struct list_head   list;
+};
+
+static struct _signal *
+to_signal(struct shm_signal *signal)
+{
+	return container_of(signal, struct _signal, signal);
+}
+
+static struct _signal *
+node_to_signal(struct rb_node *node)
+{
+	return container_of(node, struct _signal, node);
+}
+
+static int
+_signal_inject(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	kvm_vbus_hypercall(KVM_VBUS_OP_SHMSIGNAL,
+			   &_signal->handle, sizeof(_signal->handle));
+
+	return 0;
+}
+
+static void
+_signal_release(struct shm_signal *signal)
+{
+	struct _signal *_signal = to_signal(signal);
+
+	kfree(_signal);
+}
+
+static struct shm_signal_ops _signal_ops = {
+	.inject  = _signal_inject,
+	.release = _signal_release,
+};
+
+/*
+ * -------------------
+ * vbus_device_proxy routines
+ * -------------------
+ */
+
+static int
+kvm_vbus_device_open(struct vbus_device_proxy *vdev, int version, int flags)
+{
+	struct kvm_vbus_device *dev = to_dev(vdev);
+	struct vbus_deviceopen params;
+	int ret;
+
+	if (dev->handle)
+		return -EINVAL;
+
+	params.devid   = vdev->id;
+	params.version = version;
+
+	ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVOPEN,
+				 &params, sizeof(params));
+	if (ret < 0)
+		return ret;
+
+	dev->handle = params.handle;
+
+	return 0;
+}
+
+static int
+kvm_vbus_device_close(struct vbus_device_proxy *vdev, int flags)
+{
+	struct kvm_vbus_device *dev = to_dev(vdev);
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	spin_lock_irqsave(&kvm_vbus.lock, iflags);
+
+	while (!list_empty(&dev->shms)) {
+		struct _signal *_signal;
+
+		_signal = list_first_entry(&dev->shms, struct _signal, list);
+
+		list_del(&_signal->list);
+
+		spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+		shm_signal_put(&_signal->signal);
+		spin_lock_irqsave(&kvm_vbus.lock, iflags);
+	}
+
+	spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+
+	/*
+	 * The DEVICECLOSE will implicitly close all of the shm on the
+	 * host-side, so there is no need to do an explicit per-shm
+	 * hypercall
+	 */
+	ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVCLOSE,
+				 &dev->handle, sizeof(dev->handle));
+
+	if (ret < 0)
+		printk(KERN_ERR "KVM-VBUS: Error closing device %s/%lld: %d\n",
+		       vdev->type, vdev->id, ret);
+
+	dev->handle = 0;
+
+	return 0;
+}
+
+static int
+kvm_vbus_device_shm(struct vbus_device_proxy *vdev, int id, int prio,
+		    void *ptr, size_t len,
+		    struct shm_signal_desc *sdesc, struct shm_signal **signal,
+		    int flags)
+{
+	struct kvm_vbus_device *dev = to_dev(vdev);
+	struct _signal *_signal = NULL;
+	struct vbus_deviceshm params;
+	unsigned long iflags;
+	int ret;
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	params.devh   = dev->handle;
+	params.id     = id;
+	params.flags  = flags;
+	params.datap  = (u64)__pa(ptr);
+	params.len    = len;
+
+	if (signal) {
+		/*
+		 * The signal descriptor must be embedded within the
+		 * provided ptr
+		 */
+		if (!sdesc
+		    || (len < sizeof(*sdesc))
+		    || ((void *)sdesc < ptr)
+		    || ((void *)sdesc > (ptr + len - sizeof(*sdesc))))
+			return -EINVAL;
+
+		_signal = kzalloc(sizeof(*_signal), GFP_KERNEL);
+		if (!_signal)
+			return -ENOMEM;
+
+		_signal_init(&_signal->signal, sdesc, &_signal_ops);
+
+		/*
+		 * take another reference for the host.  This is dropped
+		 * by a SHMCLOSE event
+		 */
+		shm_signal_get(&_signal->signal);
+
+		params.signal.offset = (u64)sdesc - (u64)ptr;
+		params.signal.prio   = prio;
+		params.signal.cookie = (u64)_signal;
+
+	} else
+		params.signal.offset = -1; /* yes, this is a u32, but it's ok */
+
+	ret = kvm_vbus_hypercall(KVM_VBUS_OP_DEVSHM,
+				 &params, sizeof(params));
+	if (ret < 0) {
+		if (_signal) {
+			/*
+			 * We held two references above, so we need to drop
+			 * both of them
+			 */
+			shm_signal_put(&_signal->signal);
+			shm_signal_put(&_signal->signal);
+		}
+
+		return ret;
+	}
+
+	if (signal) {
+		_signal->handle = params.handle;
+
+		spin_lock_irqsave(&kvm_vbus.lock, iflags);
+
+		list_add_tail(&_signal->list, &dev->shms);
+
+		spin_unlock_irqrestore(&kvm_vbus.lock, iflags);
+
+		shm_signal_get(&_signal->signal);
+		*signal = &_signal->signal;
+	}
+
+	return 0;
+}
+
+static int
+kvm_vbus_device_call(struct vbus_device_proxy *vdev, u32 func, void *data,
+		     size_t len, int flags)
+{
+	struct kvm_vbus_device *dev = to_dev(vdev);
+	struct vbus_devicecall params = {
+		.devh  = dev->handle,
+		.func  = func,
+		.datap = (u64)__pa(data),
+		.len   = len,
+		.flags = flags,
+	};
+
+	if (!dev->handle)
+		return -EINVAL;
+
+	return kvm_vbus_hypercall(KVM_VBUS_OP_DEVCALL, &params, sizeof(params));
+}
+
+static void
+kvm_vbus_device_release(struct vbus_device_proxy *vdev)
+{
+	struct kvm_vbus_device *_dev = to_dev(vdev);
+
+	kvm_vbus_device_close(vdev, 0);
+
+	kfree(_dev);
+}
+
+struct vbus_device_proxy_ops kvm_vbus_device_ops = {
+	.open    = kvm_vbus_device_open,
+	.close   = kvm_vbus_device_close,
+	.shm     = kvm_vbus_device_shm,
+	.call    = kvm_vbus_device_call,
+	.release = kvm_vbus_device_release,
+};
+
+/*
+ * -------------------
+ * vbus events
+ * -------------------
+ */
+
+static void
+event_devadd(struct kvm_vbus_add_event *event)
+{
+	int ret;
+	struct kvm_vbus_device *new = kzalloc(sizeof(*new), GFP_KERNEL);
+	if (!new) {
+		printk(KERN_ERR "KVM_VBUS: Out of memory on add_event\n");
+		return;
+	}
+
+	INIT_LIST_HEAD(&new->shms);
+
+	memcpy(new->type, event->type, VBUS_MAX_DEVTYPE_LEN);
+	new->vdev.type        = new->type;
+	new->vdev.id          = event->id;
+	new->vdev.ops         = &kvm_vbus_device_ops;
+
+	sprintf(new->vdev.dev.bus_id, "%lld", event->id);
+
+	ret = vbus_device_proxy_register(&new->vdev);
+	if (ret < 0)
+		panic("failed to register device %lld(%s): %d\n",
+		      event->id, event->type, ret);
+}
+
+static void
+event_devdrop(struct kvm_vbus_handle_event *event)
+{
+	struct vbus_device_proxy *dev = vbus_device_proxy_find(event->handle);
+
+	if (!dev) {
+		printk(KERN_WARNING "KVM-VBUS: devdrop failed: %lld\n",
+		       event->handle);
+		return;
+	}
+
+	vbus_device_proxy_unregister(dev);
+}
+
+static void
+event_shmsignal(struct kvm_vbus_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)event->handle;
+
+	_shm_signal_wakeup(&_signal->signal);
+}
+
+static void
+event_shmclose(struct kvm_vbus_handle_event *event)
+{
+	struct _signal *_signal = (struct _signal *)event->handle;
+
+	/*
+	 * This reference was taken during the DEVICESHM call
+	 */
+	shm_signal_put(&_signal->signal);
+}
+
+/*
+ * -------------------
+ * eventq routines
+ * -------------------
+ */
+
+static struct ioq_notifier eventq_notifier;
+
+static int __init
+eventq_init(int qlen)
+{
+	struct ioq_iterator iter;
+	int ret;
+	int i;
+
+	kvm_vbus.ring = kzalloc(sizeof(struct kvm_vbus_event) * qlen,
+				GFP_KERNEL);
+	if (!kvm_vbus.ring)
+		return -ENOMEM;
+
+	/*
+	 * We want to iterate on the "valid" index.  By default the iterator
+	 * will not "autoupdate" which means it will not hypercall the host
+	 * with our changes.  This is good, because we are really just
+	 * initializing stuff here anyway.  Note that you can always manually
+	 * signal the host with ioq_signal() if the autoupdate feature is not
+	 * used.
+	 */
+	ret = ioq_iter_init(&kvm_vbus.eventq, &iter, ioq_idxtype_valid, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Seek to the tail of the valid index (which should be our first
+	 * item since the queue is brand-new)
+	 */
+	ret = ioq_iter_seek(&iter, ioq_seek_tail, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * Now populate each descriptor with an empty vbus_event and mark it
+	 * valid
+	 */
+	for (i = 0; i < qlen; i++) {
+		struct kvm_vbus_event *event = &kvm_vbus.ring[i];
+		size_t                 len   = sizeof(*event);
+		struct ioq_ring_desc  *desc  = iter.desc;
+
+		BUG_ON(iter.desc->valid);
+
+		desc->cookie = (u64)event;
+		desc->ptr    = (u64)__pa(event);
+		desc->len    = len; /* total length  */
+		desc->valid  = 1;
+
+		/*
+		 * This push operation will simultaneously advance the
+		 * valid-tail index and increment our position in the queue
+		 * by one.
+		 */
+		ret = ioq_iter_push(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	kvm_vbus.eventq.notifier = &eventq_notifier;
+
+	/*
+	 * And finally, ensure that we can receive notification
+	 */
+	ioq_notify_enable(&kvm_vbus.eventq, 0);
+
+	return 0;
+}
+
+/* Invoked whenever the hypervisor ioq_signal()s our eventq */
+static void
+eventq_wakeup(struct ioq_notifier *notifier)
+{
+	struct ioq_iterator iter;
+	int ret;
+
+	/* We want to iterate on the head of the in-use index */
+	ret = ioq_iter_init(&kvm_vbus.eventq, &iter, ioq_idxtype_inuse, 0);
+	BUG_ON(ret < 0);
+
+	ret = ioq_iter_seek(&iter, ioq_seek_head, 0, 0);
+	BUG_ON(ret < 0);
+
+	/*
+	 * The EOM is indicated by finding a packet that is still owned by
+	 * the south side.
+	 *
+	 * FIXME: This in theory could run indefinitely if the host keeps
+	 * feeding us events since there is nothing like a NAPI budget.  We
+	 * might need to address that
+	 */
+	while (!iter.desc->sown) {
+		struct ioq_ring_desc *desc  = iter.desc;
+		struct kvm_vbus_event *event;
+
+		event = (struct kvm_vbus_event *)desc->cookie;
+
+		switch (event->eventid) {
+		case KVM_VBUS_EVENT_DEVADD:
+			event_devadd(&event->data.add);
+			break;
+		case KVM_VBUS_EVENT_DEVDROP:
+			event_devdrop(&event->data.handle);
+			break;
+		case KVM_VBUS_EVENT_SHMSIGNAL:
+			event_shmsignal(&event->data.handle);
+			break;
+		case KVM_VBUS_EVENT_SHMCLOSE:
+			event_shmclose(&event->data.handle);
+			break;
+		default:
+			printk(KERN_WARNING "KVM_VBUS: Unexpected event %d\n",
+			       event->eventid);
+			break;
+		}
+
+		memset(event, 0, sizeof(*event));
+
+		/* Advance the in-use head */
+		ret = ioq_iter_pop(&iter, 0);
+		BUG_ON(ret < 0);
+	}
+
+	/* And let the south side know that we changed the queue */
+	ioq_signal(&kvm_vbus.eventq, 0);
+}
+
+static struct ioq_notifier eventq_notifier = {
+	.signal = &eventq_wakeup,
+};
+
+/* Injected whenever the host issues an ioq_signal() on the eventq */
+static irqreturn_t
+eventq_intr(int irq, void *dev)
+{
+	_shm_signal_wakeup(kvm_vbus.eventq.signal);
+
+	return IRQ_HANDLED;
+}
+
+/*
+ * -------------------
+ */
+
+static int
+eventq_signal_inject(struct shm_signal *signal)
+{
+	u64 handle = 0; /* The eventq uses the special-case handle=0 */
+
+	kvm_vbus_hypercall(KVM_VBUS_OP_SHMSIGNAL, &handle, sizeof(handle));
+
+	return 0;
+}
+
+static void
+eventq_signal_release(struct shm_signal *signal)
+{
+	kfree(signal);
+}
+
+static struct shm_signal_ops eventq_signal_ops = {
+	.inject  = eventq_signal_inject,
+	.release = eventq_signal_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+eventq_ioq_release(struct ioq *ioq)
+{
+	/* released as part of the kvm_vbus object */
+}
+
+static struct ioq_ops eventq_ioq_ops = {
+	.release = eventq_ioq_release,
+};
+
+/*
+ * -------------------
+ */
+
+static void
+kvm_vbus_release(void)
+{
+	if (kvm_vbus.irq > 0) {
+		free_irq(kvm_vbus.irq, NULL);
+		destroy_kvm_dynirq(kvm_vbus.irq);
+	}
+
+	kfree(kvm_vbus.eventq.head_desc);
+	kfree(kvm_vbus.ring);
+}
+
+static int __init
+kvm_vbus_open(void)
+{
+	struct kvm_vbus_busopen params = {
+		.magic        = KVM_VBUS_MAGIC,
+		.version      = KVM_VBUS_VERSION,
+		.capabilities = 0,
+	};
+
+	return kvm_vbus_hypercall(KVM_VBUS_OP_BUSOPEN, &params, sizeof(params));
+}
+
+#define QLEN 1024
+
+static int __init
+kvm_vbus_register(void)
+{
+	struct kvm_vbus_busreg params = {
+		.count = 1,
+		.eventq = {
+			{
+				.irq   = kvm_vbus.irq,
+				.count = QLEN,
+				.ring  = (u64)__pa(kvm_vbus.eventq.head_desc),
+				.data  = (u64)__pa(kvm_vbus.ring),
+			},
+		},
+	};
+
+	return kvm_vbus_hypercall(KVM_VBUS_OP_BUSREG, &params, sizeof(params));
+}
+
+static int __init
+_ioq_init(size_t ringsize, struct ioq *ioq, struct ioq_ops *ops)
+{
+	struct shm_signal    *signal = NULL;
+	struct ioq_ring_head *head = NULL;
+	size_t                len  = IOQ_HEAD_DESC_SIZE(ringsize);
+
+	head = kzalloc(len, GFP_KERNEL | GFP_DMA);
+	if (!head)
+		return -ENOMEM;
+
+	signal = kzalloc(sizeof(*signal), GFP_KERNEL);
+	if (!signal) {
+		kfree(head);
+		return -ENOMEM;
+	}
+
+	head->magic     = IOQ_RING_MAGIC;
+	head->ver	= IOQ_RING_VER;
+	head->count     = ringsize;
+
+	_signal_init(signal, &head->signal, &eventq_signal_ops);
+
+	ioq_init(ioq, ops, ioq_locality_north, head, signal, ringsize);
+
+	return 0;
+}
+
+int __init
+kvm_vbus_init(void)
+{
+	int ret;
+
+	memset(&kvm_vbus, 0, sizeof(kvm_vbus));
+
+	ret = kvm_para_has_feature(KVM_FEATURE_VBUS);
+	if (!ret)
+		return -ENOENT;
+
+	ret = kvm_vbus_open();
+	if (ret < 0) {
+		printk(KERN_ERR "KVM_VBUS: Could not register with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	spin_lock_init(&kvm_vbus.lock);
+
+	/*
+	 * Allocate an IOQ to use for host-2-guest event notification
+	 */
+	ret = _ioq_init(QLEN, &kvm_vbus.eventq, &eventq_ioq_ops);
+	if (ret < 0) {
+		printk(KERN_ERR "KVM_VBUS: Cound not init eventq\n");
+		goto out_fail;
+	}
+
+	ret = eventq_init(QLEN);
+	if (ret < 0) {
+		printk(KERN_ERR "KVM_VBUS: Cound not setup ring\n");
+		goto out_fail;
+	}
+
+	/*
+	 * Dynamically assign a free IRQ to this resource
+	 */
+	kvm_vbus.irq = create_kvm_dynirq(0);
+	if (kvm_vbus.irq < 0) {
+		printk(KERN_ERR "KVM_VBUS: Failed to create IRQ\n");
+		goto out_fail;
+	}
+
+	ret = request_irq(kvm_vbus.irq, eventq_intr, 0, "vbus", NULL);
+	if (ret < 0) {
+		printk(KERN_ERR "KVM_VBUS: Failed to register IRQ %d\n: %d",
+		       kvm_vbus.irq, ret);
+		goto out_fail;
+	}
+
+	/*
+	 * Finally register our queue on the host to start receiving events
+	 */
+	ret = kvm_vbus_register();
+	if (ret < 0) {
+		printk(KERN_ERR "KVM_VBUS: Could not register with host: %d\n",
+		       ret);
+		goto out_fail;
+	}
+
+	return 0;
+
+ out_fail:
+	kvm_vbus_release();
+
+	return ret;
+}
+
+static void __exit
+kvm_vbus_exit(void)
+{
+	kvm_vbus_release();
+}
+
+module_init(kvm_vbus_init);
+module_exit(kvm_vbus_exit);
+


^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic
  2009-03-31 18:43 ` [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic Gregory Haskins
@ 2009-03-31 19:16   ` Alan Cox
  2009-03-31 20:02     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Alan Cox @ 2009-03-31 19:16 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

On Tue, 31 Mar 2009 14:43:55 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> The ioapic code currently privately manages the mapping between irq
> and vector.  This results in some layering violations as the support
> for certain MSI operations need this info.  As a result, the MSI
> code itself was moved to the ioapic module.  This is not really
> optimal.

This appears to have been muddled in with the vnet patches ?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 18:44 ` [RFC PATCH 15/17] kvm: add dynamic IRQ support Gregory Haskins
@ 2009-03-31 19:20   ` Avi Kivity
  2009-03-31 19:39     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 19:20 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
> This patch provides the ability to dynamically declare and map an
> interrupt-request handle to an x86 8-bit vector.
>
> Problem Statement: Emulated devices (such as PCI, ISA, etc) have
> interrupt routing done via standard PC mechanisms (MP-table, ACPI,
> etc).  However, we also want to support a new class of devices
> which exist in a new virtualized namespace and therefore should
> not try to piggyback on these emulated mechanisms.  Rather, we
> create a way to dynamically register interrupt resources that
> acts indepent of the emulated counterpart.
>
> On x86, a simplistic view of the interrupt model is that each core
> has a local-APIC which can recieve messages from APIC-compliant
> routing devices (such as IO-APIC and MSI) regarding details about
> an interrupt (such as which vector to raise).  These routing devices
> are controlled by the OS so they may translate a physical event
> (such as "e1000: raise an RX interrupt") to a logical destination
> (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
> implementation of such a router (think of it as a virtual-MSI, but
> without the coupling to an existing standard, such as PCI).
>
> The model is simple: A guest OS can allocate the mapping of "IRQ"
> handle to "vector/core" in any way it sees fit, and provide this
> information to the dynirq module running in the host.  The assigned
> IRQ then becomes the sole handle needed to inject an IDT vector
> to the guest from a host.  A host entity that wishes to raise an
> interrupt simple needs to call kvm_inject_dynirq(irq) and the routing
> is performed transparently.
>   

A major disadvantage of dynirq is that it will only work on guests which 
have been ported to it.  So this will only be useful on newer Linux, and 
will likely never work with Windows guests.

Why is having an emulated PCI device so bad?  We found that it has 
several advantages:
 - works with all guests
 - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
 - supported in all OSes
 - someone else maintains it

See also the kvm irq routing work, merged into 2.6.30, which does a 
small part of what you're describing (the "sole handle" part, specifically).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 14/17] kvm: add a reset capability
  2009-03-31 18:44 ` [RFC PATCH 14/17] kvm: add a reset capability Gregory Haskins
@ 2009-03-31 19:22   ` Avi Kivity
  2009-03-31 20:02     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 19:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
> We need a way to detect if a VM is reset later in the series, so lets
> add a capability for userspace to signal a VM reset down to the kernel.
>   

How do you handle the case of a guest calling kexec to load a new 
kernel?  Or is that not important for your use case?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 19:20   ` Avi Kivity
@ 2009-03-31 19:39     ` Gregory Haskins
  2009-03-31 20:13       ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 19:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3614 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> This patch provides the ability to dynamically declare and map an
>> interrupt-request handle to an x86 8-bit vector.
>>
>> Problem Statement: Emulated devices (such as PCI, ISA, etc) have
>> interrupt routing done via standard PC mechanisms (MP-table, ACPI,
>> etc).  However, we also want to support a new class of devices
>> which exist in a new virtualized namespace and therefore should
>> not try to piggyback on these emulated mechanisms.  Rather, we
>> create a way to dynamically register interrupt resources that
>> acts indepent of the emulated counterpart.
>>
>> On x86, a simplistic view of the interrupt model is that each core
>> has a local-APIC which can recieve messages from APIC-compliant
>> routing devices (such as IO-APIC and MSI) regarding details about
>> an interrupt (such as which vector to raise).  These routing devices
>> are controlled by the OS so they may translate a physical event
>> (such as "e1000: raise an RX interrupt") to a logical destination
>> (such as "inject IDT vector 46 on core 3").  A dynirq is a virtual
>> implementation of such a router (think of it as a virtual-MSI, but
>> without the coupling to an existing standard, such as PCI).
>>
>> The model is simple: A guest OS can allocate the mapping of "IRQ"
>> handle to "vector/core" in any way it sees fit, and provide this
>> information to the dynirq module running in the host.  The assigned
>> IRQ then becomes the sole handle needed to inject an IDT vector
>> to the guest from a host.  A host entity that wishes to raise an
>> interrupt simple needs to call kvm_inject_dynirq(irq) and the routing
>> is performed transparently.
>>   
>
> A major disadvantage of dynirq is that it will only work on guests
> which have been ported to it.  So this will only be useful on newer
> Linux, and will likely never work with Windows guests.
>
> Why is having an emulated PCI device so bad?  We found that it has
> several advantages:
> - works with all guests
> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
> - supported in all OSes
> - someone else maintains it
These points are all valid, and I really struggled with this particular
part of the design.  The entire vbus design only requires one IRQ for
the entire guest, so it's conceivable that I could present a simple
"dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
the IRQ routing logic.  Then userspace could simply pass the IRQ routing
info down to the kernel with an ioctl, or something similar.
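
Something like this little shim is probably all it would take (completely
untested sketch; the vendor/device IDs below are made up, and the ioctl
plumbing for the routing info is not shown):

static struct pci_device_id vbus_pci_ids[] = {
	{ PCI_DEVICE(0xffff, 0x0001) },	/* placeholder "VBUS" ID */
	{ 0 },
};
MODULE_DEVICE_TABLE(pci, vbus_pci_ids);

static int __devinit vbus_pci_probe(struct pci_dev *pdev,
				    const struct pci_device_id *id)
{
	int ret = pci_enable_device(pdev);

	if (ret < 0)
		return ret;

	if (pci_enable_msi(pdev))
		dev_warn(&pdev->dev, "MSI unavailable, falling back to INTx\n");

	/* hand pdev->irq to the existing eventq code in place of a dynirq */
	return request_irq(pdev->irq, eventq_intr, 0, "vbus", NULL);
}

static struct pci_driver vbus_pci_driver = {
	.name     = "vbus-pci",
	.id_table = vbus_pci_ids,
	.probe    = vbus_pci_probe,
};

The guest module init would then pci_register_driver() this instead of
calling create_kvm_dynirq().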

Ultimately I wasn't sure whether I wanted all that goo just to get an
IRQ assignment...but on the other hand, we have all this goo to build
one in the first place, and its half on the guest side which has the
disadvantages you mention.  So perhaps this should go in favor of a
PCI-esque type solution, as I think you are suggesting.

I think ultimately I was trying to stay away from PCI in general because
I want to support environments that do not have PCI.  However, for the
kvm-transport case (at least on x86) this isn't really a constraint.

>
> See also the kvm irq routing work, merged into 2.6.30, which does a
> small part of what you're describing (the "sole handle" part,
> specifically).

I will take a look, thanks!

(I wish you had accepted those irq patches I wrote a while back.
It had the foundation for this type of stuff all built in.  But alas, I
think it was before its time, and I didn't do a good job of explaining
my future plans....) ;)

Regards,
-Greg





[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic
  2009-03-31 19:16   ` Alan Cox
@ 2009-03-31 20:02     ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 20:02 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1208 bytes --]

Alan Cox wrote:
> On Tue, 31 Mar 2009 14:43:55 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
>
>   
>> The ioapic code currently privately manages the mapping between irq
>> and vector.  This results in some layering violations as the support
>> for certain MSI operations need this info.  As a result, the MSI
>> code itself was moved to the ioapic module.  This is not really
>> optimal.
>>     
>
> This appears to have been muddled in with the vnet patches ?
>   
It's needed for the kvm-connector patches later in the series, so it was
included intentionally.

On that topic, I probably should have had a TOC of some kind.  Hmm..let
me hack one together now:

Patch 1: Stand-alone "shared-memory signal" construct, used by various
components in vbus/venet
Patches 2-5: Basic vbus infrastructure
Patches 6-7: IOQ construct, similar to virtio-ring.  Used to overlay
ring-like behavior over the shm interface in vbus
Patches 8-12: virtual-ethernet front and backends
Patch 13: io-apic work to expose the irq-vector in x86, needed for
dynirq support
Patches 14-16: KVM host side support
Patch 17: KVM guest side support

Sorry for the confusion :(

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 14/17] kvm: add a reset capability
  2009-03-31 19:22   ` Avi Kivity
@ 2009-03-31 20:02     ` Gregory Haskins
  2009-03-31 20:18       ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 20:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 409 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> We need a way to detect if a VM is reset later in the series, so lets
>> add a capability for userspace to signal a VM reset down to the kernel.
>>   
>
> How do you handle the case of a guest calling kexec to load a new
> kernel?  Or is that not important for your use case?
>

Hmm..I had not considered this.  Any suggestions on ways to detect it?


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 19:39     ` Gregory Haskins
@ 2009-03-31 20:13       ` Avi Kivity
  2009-03-31 20:32         ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 20:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
>> - works with all guests
>> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
>> - supported in all OSes
>> - someone else maintains it
>>     
> These points are all valid, and I really struggled with this particular
> part of the design.  The entire vbus design only requires one IRQ for
> the entire guest,

Won't this have scaling issues?  One IRQ means one target vcpu.  Whereas 
I'd like virtio devices to span multiple queues, each queue with its own 
MSI IRQ.  Also, the single IRQ handler will need to scan for all 
potential IRQ sources.  Even if implemented carefully, this will cause 
many cacheline bounces.

>  so its conceivable that I could present a simple
> "dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
> the IRQ routing logic.  Then userspace could simply pass the IRQ routing
> info down to the kernel with an ioctl, or something similar.
>   

Xen does something similar, I believe.

> I think ultimately I was trying to stay away from PCI in general because
> I want to support environments that do not have PCI.  However, for the
> kvm-transport case (at least on x86) this isnt really a constraint.
>
>   

s/PCI/the native IRQ solution for your platform/. virtio has the same 
problem; on s390 we use the native (if that word ever applies to s390) 
interrupt and device discovery mechanism.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 14/17] kvm: add a reset capability
  2009-03-31 20:02     ` Gregory Haskins
@ 2009-03-31 20:18       ` Avi Kivity
  2009-03-31 20:37         ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 20:18 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> We need a way to detect if a VM is reset later in the series, so lets
>>> add a capability for userspace to signal a VM reset down to the kernel.
>>>   
>>>       
>> How do you handle the case of a guest calling kexec to load a new
>> kernel?  Or is that not important for your use case?
>>
>>     
>
> Hmm..I had not considered this.  Any suggestions on ways to detect it?
>
>   

Best would be not to detect it; it's tying global events into a device.  
Instead, have a reset command for your device and have the driver issue 
it on load and unload.

btw, reset itself would be better controlled from userspace; qemu knows 
about resets and can reset vbus devices directly instead of relying on 
kvm to reset them.  This decouples the two code bases a bit.  This is 
what virtio does.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (16 preceding siblings ...)
  2009-03-31 18:44 ` [RFC PATCH 17/17] kvm: Add guest-side support for VBUS Gregory Haskins
@ 2009-03-31 20:18 ` Andi Kleen
  2009-04-01 12:03   ` Gregory Haskins
  2009-04-01  6:08 ` Rusty Russell
  18 siblings, 1 reply; 146+ messages in thread
From: Andi Kleen @ 2009-03-31 20:18 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins <ghaskins@novell.com> writes:

What might be useful is if you could expand a bit more on what the high-level
use cases for this are.

Questions that come to mind and that would be good to answer:

This seems to be aimed at having multiple VMs talk
to each other, but not talk to the rest of the world, correct? 
Is that a common use case? 

Wouldn't they typically have a default route anyway and be able to talk to each
other this way? 
And why can't any such isolation be done with standard firewalling? (it's known that 
current iptables has some scalability issues, but there's work going on right
now to fix that). 

What would be the use cases for non networking devices?

How would the interfaces to the user look like?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 20:13       ` Avi Kivity
@ 2009-03-31 20:32         ` Gregory Haskins
  2009-03-31 20:59           ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 20:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3348 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>> - works with all guests
>>> - supports hotplug/hotunplug, udev, sysfs, module autoloading, ...
>>> - supported in all OSes
>>> - someone else maintains it
>>>     
>> These points are all valid, and I really struggled with this particular
>> part of the design.  The entire vbus design only requires one IRQ for
>> the entire guest,
>
> Won't this have scaling issues?  One IRQ means one target vcpu. 
> Whereas I'd like virtio devices to span multiple queues, each queue
> with its own MSI IRQ.
Hmm..you know I hadn't really thought of it that way, but you have a
point.  To clarify, my design actually uses one IRQ per "eventq", where
we can have an arbitrary number of eventq's defined (note: today I only
define one eventq, however).  An eventq is actually a shm-ring construct
where I can pass events up to the host like "device added" or "ring X
signaled".  Each individual device based virtio-ring would then
aggregates "signal" events onto this eventq mechanism to actually inject
events to the host.  Only the eventq itself injects an actual IRQ to the
assigned vcpu.

My intended use of multiple eventqs was for prioritization of different
rings.  For instance, we could define 8 priority levels, each with its
own ring/irq.  That way, a virtio-net that supports something like
802.1p could define 8 virtio-rings, one for each priority level.
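
For the record, registering more than one eventq would only touch the BUSREG
path.  Assuming the eventq[] array in kvm_vbus_busreg were sized for it and
the host dropped its current "count != 1" check, the guest-side registration
might look roughly like this (prio_irq/prio_ioq/prio_ring are made-up names):

	struct kvm_vbus_busreg params = {
		.count = 8,	/* hypothetically, one eventq per priority */
	};
	int i;

	for (i = 0; i < 8; i++) {
		struct kvm_vbus_eventqreg *qreg = &params.eventq[i];

		qreg->irq   = prio_irq[i];
		qreg->count = QLEN;
		qreg->ring  = (u64)__pa(prio_ioq[i].head_desc);
		qreg->data  = (u64)__pa(prio_ring[i]);
	}

	return kvm_vbus_hypercall(KVM_VBUS_OP_BUSREG, &params, sizeof(params));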

But this scheme is more targeted at prioritization than per vcpu
irq-balancing.  I suppose the eventq construct I proposed could still be
used in this fashion since each has its own routable IRQ.  However, I
would have to think about that some more because it is beyond the design
spec.

The good news is that the decision to use the "eventq+irq" approach is
completely contained in the kvm-host+guest.patch.  We could easily
switch to a 1:1 irq:shm-signal if we wanted to, and the device/drivers
would work exactly the same without modification.

>   Also, the single IRQ handler will need to scan for all potential IRQ
> sources.  Even if implemented carefully, this will cause many
> cacheline bounces.
Well, no, I think this part is covered.  As mentioned above, we use a
queuing technique so there is no scanning needed.  Ultimately I would
love to adapt a similar technique to optionally replace the LAPIC.  That
way we can avoid the EOI trap and just consume the next interrupt (if
applicable) from the shm-ring.

>
>>  so its conceivable that I could present a simple
>> "dummy" PCI device with some "VBUS" type PCI-ID, just to piggy back on
>> the IRQ routing logic.  Then userspace could simply pass the IRQ routing
>> info down to the kernel with an ioctl, or something similar.
>>   
>
> Xen does something similar, I believe.
>
>> I think ultimately I was trying to stay away from PCI in general because
>> I want to support environments that do not have PCI.  However, for the
>> kvm-transport case (at least on x86) this isnt really a constraint.
>>
>>   
>
> s/PCI/the native IRQ solution for your platform/. virtio has the same
> problem; on s390 we use the native (if that word ever applies to s390)
> interrupt and device discovery mechanism.

yeah, I agree.  We can contain the "exposure" of PCI to just platforms
within KVM that care about it.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 14/17] kvm: add a reset capability
  2009-03-31 20:18       ` Avi Kivity
@ 2009-03-31 20:37         ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 20:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1515 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> We need a way to detect if a VM is reset later in the series, so lets
>>>> add a capability for userspace to signal a VM reset down to the
>>>> kernel.
>>>>         
>>> How do you handle the case of a guest calling kexec to load a new
>>> kernel?  Or is that not important for your use case?
>>>
>>>     
>>
>> Hmm..I had not considered this.  Any suggestions on ways to detect it?
>>
>>   
>
> Best would be not to detect it; it's tying global events into a
> device.  Instead, have a reset command for your device and have the
> driver issue it on load and unload.

Yes, good point.  This is doable within the existing infrastructure, but
it would have to be declared in each device's ABI definition.  I could
make it more formal and add it to the list of low-level bus-verbs, like
DEVICEOPEN, DEVICECLOSE, etc.
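
On the host side that would just be one more hc_op, something like this
(sketch only; KVM_VBUS_OP_DEVRESET and the client ->devicereset() hook do
not exist today):

static int
hc_devicereset(struct kvm_vcpu *vcpu, void *data)
{
	__u64 devh = *(__u64 *)data;
	struct vbus_client *c = to_client(vcpu);

	/* would quiesce all shm/ring state behind this device handle */
	return c->ops->devicereset(c, devh);
}

static struct hc_op _hc_devreset = {
	.nr   = KVM_VBUS_OP_DEVRESET,
	.len  = sizeof(u64),
	.func = &hc_devicereset,
};

...plus an entry in hc_ops[], with the guest proxy issuing the verb from its
open/close paths as you suggest.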

>
> btw, reset itself would be better controlled from userspace; qemu
> knows about resets and can reset vbus devices directly instead of
> relying on kvm to reset them.
In a way, this is what I have done (note to self: post the userspace
patches)

The detection is done by userspace, and it invokes an ioctl.  The kernel
based devices then react if they are interested.  In my case, vbus
registers for reset-notification, and it acts as if the guest exited
when it gets reset (e.g. it issues DEVICECLOSE verbs to all devices the
guest had open).



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 09/17] net: Add vbus_enet driver
  2009-03-31 18:43 ` [RFC PATCH 09/17] net: Add vbus_enet driver Gregory Haskins
@ 2009-03-31 20:39   ` Stephen Hemminger
  2009-04-02 11:43     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Stephen Hemminger @ 2009-03-31 20:39 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

On Tue, 31 Mar 2009 14:43:34 -0400
Gregory Haskins <ghaskins@novell.com> wrote:

> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  drivers/net/Kconfig     |   13 +
>  drivers/net/Makefile    |    1 
>  drivers/net/vbus-enet.c |  706 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 720 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/vbus-enet.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 62d732a..ac9dabd 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -3099,4 +3099,17 @@ config VIRTIO_NET
>  	  This is the virtual network driver for virtio.  It can be used with
>            lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
>  
> +config VBUS_ENET
> +	tristate "Virtual Ethernet Driver"
> +	depends on VBUS_DRIVERS
> +	help
> +	   A virtualized 802.x network device based on the VBUS interface.
> +	   It can be used with any hypervisor/kernel that supports the
> +	   vbus protocol.
> +
> +config VBUS_ENET_DEBUG
> +        bool "Enable Debugging"
> +	depends on VBUS_ENET
> +	default n
> +
>  endif # NETDEVICES
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 471baaf..61db928 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -264,6 +264,7 @@ obj-$(CONFIG_FS_ENET) += fs_enet/
>  obj-$(CONFIG_NETXEN_NIC) += netxen/
>  obj-$(CONFIG_NIU) += niu.o
>  obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> +obj-$(CONFIG_VBUS_ENET) += vbus-enet.o
>  obj-$(CONFIG_SFC) += sfc/
>  
>  obj-$(CONFIG_WIMAX) += wimax/
> diff --git a/drivers/net/vbus-enet.c b/drivers/net/vbus-enet.c
> new file mode 100644
> index 0000000..e698b3f
> --- /dev/null
> +++ b/drivers/net/vbus-enet.c
> @@ -0,0 +1,706 @@
> +/*
> + * vbus_enet - A virtualized 802.x network device based on the VBUS interface
> + *
> + * Copyright (C) 2009 Novell, Gregory Haskins <ghaskins@novell.com>
> + *
> + * Derived from the SNULL example from the book "Linux Device Drivers" by
> + * Alessandro Rubini, Jonathan Corbet, and Greg Kroah-Hartman, published
> + * by O'Reilly & Associates.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/moduleparam.h>
> +
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/interrupt.h>
> +
> +#include <linux/in.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/ip.h>
> +#include <linux/tcp.h>
> +#include <linux/skbuff.h>
> +#include <linux/ioq.h>
> +#include <linux/vbus_driver.h>
> +
> +#include <linux/in6.h>
> +#include <asm/checksum.h>
> +
> +#include <linux/venet.h>
> +
> +MODULE_AUTHOR("Gregory Haskins");
> +MODULE_LICENSE("GPL");
> +
> +static int napi_weight = 128;
> +module_param(napi_weight, int, 0444);
> +static int rx_ringlen = 256;
> +module_param(rx_ringlen, int, 0444);
> +static int tx_ringlen = 256;
> +module_param(tx_ringlen, int, 0444);
> +
> +#undef PDEBUG             /* undef it, just in case */
> +#ifdef VBUS_ENET_DEBUG
> +#  define PDEBUG(fmt, args...) printk(KERN_DEBUG "vbus_enet: " fmt, ## args)
> +#else
> +#  define PDEBUG(fmt, args...) /* not debugging: nothing */
> +#endif
> +
> +struct vbus_enet_queue {
> +	struct ioq              *queue;
> +	struct ioq_notifier      notifier;
> +};
> +
> +struct vbus_enet_priv {
> +	spinlock_t                 lock;
> +	struct net_device         *dev;
> +	struct vbus_device_proxy  *vdev;
> +	struct napi_struct         napi;
> +	struct net_device_stats    stats;

Not needed any more, stats are available in net_device

> +	struct vbus_enet_queue     rxq;
> +	struct vbus_enet_queue     txq;
> +	struct tasklet_struct      txtask;
> +};
> +

> + * Ioctl commands
> + */
> +static int
> +vbus_enet_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
> +{
> +	PDEBUG("ioctl\n");
> +	return 0;
> +}

If it doesn't do ioctl, just leave pointer as NULL

> +/*
> + * Return statistics to the caller
> + */
> +static struct net_device_stats *
> +vbus_enet_stats(struct net_device *dev)
> +{
> +	struct vbus_enet_priv *priv = netdev_priv(dev);
> +	return &priv->stats;
> +}

Not needed if you use internal net_device stats
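
For instance (untested sketch, not part of the posted patch): drop the
private copy and the ndo_get_stats hook entirely, and account straight
into the counters that struct net_device already carries:

	/* e.g. on the receive path, inside vbus-enet.c */
	static void vbus_enet_rx_account(struct net_device *dev,
					 struct sk_buff *skb)
	{
		dev->stats.rx_packets++;
		dev->stats.rx_bytes += skb->len;
	}

With no ndo_get_stats defined, dev_get_stats() should fall back to
&dev->stats on its own.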

> +static void
> +rx_isr(struct ioq_notifier *notifier)
> +{
> +	struct vbus_enet_priv *priv;
> +	struct net_device  *dev;
> +
> +	priv = container_of(notifier, struct vbus_enet_priv, rxq.notifier);
> +	dev = priv->dev;
> +
> +	if (!ioq_empty(priv->rxq.queue, ioq_idxtype_inuse))
> +		vbus_enet_schedule_rx(priv);
> +}
> +
> +static void
> +deferred_tx_isr(unsigned long data)
> +{
> +	struct vbus_enet_priv *priv = (struct vbus_enet_priv *)data;
> +	unsigned long flags;
> +
> +	PDEBUG("deferred_tx_isr for %lld\n", priv->vdev->id);
> +
> +	spin_lock_irqsave(&priv->lock, flags);
> +	vbus_enet_tx_reap(priv, 0);
> +	spin_unlock_irqrestore(&priv->lock, flags);
> +
> +	ioq_notify_enable(priv->txq.queue, 0);
> +}
> +
> +static void
> +tx_isr(struct ioq_notifier *notifier)
> +{
> +       struct vbus_enet_priv *priv;
> +       unsigned long flags;
> +
> +       priv = container_of(notifier, struct vbus_enet_priv, txq.notifier);
> +
> +       PDEBUG("tx_isr for %lld\n", priv->vdev->id);
> +
> +       ioq_notify_disable(priv->txq.queue, 0);
> +       tasklet_schedule(&priv->txtask);
> +}
> +
> +static struct net_device_ops vbus_enet_netdev_ops = {

Should be const.

> +	.ndo_open          = vbus_enet_open,
> +	.ndo_stop          = vbus_enet_stop,
> +	.ndo_set_config    = vbus_enet_config,
> +	.ndo_start_xmit    = vbus_enet_tx_start,
> +	.ndo_change_mtu	   = vbus_enet_change_mtu,
> +	.ndo_do_ioctl      = vbus_enet_ioctl,
> +	.ndo_get_stats     = vbus_enet_stats,
> +	.ndo_tx_timeout    = vbus_enet_timeout,
> +};
> +
> +/*
> + * This is called whenever a new vbus_device_proxy is added to the vbus
> + * with the matching VENET_ID
> + */
> +static int
> +vbus_enet_probe(struct vbus_device_proxy *vdev)
> +{
> +	struct net_device  *dev;
> +	struct vbus_enet_priv *priv;
> +	int ret;
> +
> +	printk(KERN_INFO "VBUS_ENET: Found new device at %lld\n", vdev->id);
> +
> +	ret = vdev->ops->open(vdev, VENET_VERSION, 0);
> +	if (ret < 0)
> +		return ret;
> +
> +	dev = alloc_etherdev(sizeof(struct vbus_enet_priv));
> +	if (!dev)
> +		return -ENOMEM;
> +
> +	priv = netdev_priv(dev);
> +	memset(priv, 0, sizeof(*priv));

Useless; already done by alloc_etherdev

> +
> +	spin_lock_init(&priv->lock);
> +	priv->dev  = dev;
> +	priv->vdev = vdev;
> +
> +	tasklet_init(&priv->txtask, deferred_tx_isr, (unsigned long)priv);
> +
> +	queue_init(priv, &priv->rxq, VENET_QUEUE_RX, rx_ringlen, rx_isr);
> +	queue_init(priv, &priv->txq, VENET_QUEUE_TX, tx_ringlen, tx_isr);
> +
> +	rx_setup(priv);
> +
> +	ioq_notify_enable(priv->rxq.queue, 0);  /* enable interrupts */
> +	ioq_notify_enable(priv->txq.queue, 0);
> +
> +	ether_setup(dev); /* assign some of the fields */

Useless; already done by alloc_etherdev

> +
> +	dev->netdev_ops     = &vbus_enet_netdev_ops;
> +	dev->watchdog_timeo = 5 * HZ;
> +


Please consider adding a basic set of ethtool_ops to allow controlling
offloads, etc.
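
Something along these lines, perhaps (untested sketch; it also folds in
the const and the NULL-ioctl/get_stats comments from above):

	#include <linux/ethtool.h>

	static const struct ethtool_ops vbus_enet_ethtool_ops = {
		.get_link    = ethtool_op_get_link,
		.get_tx_csum = ethtool_op_get_tx_csum,
		.set_tx_csum = ethtool_op_set_tx_csum,
		.get_sg      = ethtool_op_get_sg,
		.set_sg      = ethtool_op_set_sg,
	};

	static const struct net_device_ops vbus_enet_netdev_ops = {
		.ndo_open       = vbus_enet_open,
		.ndo_stop       = vbus_enet_stop,
		.ndo_set_config = vbus_enet_config,
		.ndo_start_xmit = vbus_enet_tx_start,
		.ndo_change_mtu = vbus_enet_change_mtu,
		.ndo_tx_timeout = vbus_enet_timeout,
	};

	/* in vbus_enet_probe() */
	dev->netdev_ops = &vbus_enet_netdev_ops;
	SET_ETHTOOL_OPS(dev, &vbus_enet_ethtool_ops);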

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-03-31 18:42 ` [RFC PATCH 01/17] shm-signal: shared-memory signals Gregory Haskins
@ 2009-03-31 20:44   ` Avi Kivity
  2009-03-31 20:58     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 20:44 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
> This interface provides a bidirectional shared-memory based signaling
> mechanism.  It can be used by any entities which desire efficient
> communication via shared memory.  The implementation details of the
> signaling are abstracted so that they may transcend a wide variety
> of locale boundaries (e.g. userspace/kernel, guest/host, etc).
>
> The shm_signal mechanism supports event masking as well as spurious
> event delivery mitigation.
> +
> +/*
> + *---------
> + * The following structures represent data that is shared across boundaries
> + * which may be quite disparate from one another (e.g. Windows vs Linux,
> + * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure they
> + * present data in a manner that is independent of the environment.
> + *-----------
> + */
> +
> +#define SHM_SIGNAL_MAGIC 0x58fa39df
> +#define SHM_SIGNAL_VER   1
> +
> +struct shm_signal_irq {
> +	__u8                  enabled;
> +	__u8                  pending;
> +	__u8                  dirty;
> +};
>   

Some ABIs may choose to pad this, suggest explicit padding.

> +
> +enum shm_signal_locality {
> +	shm_locality_north,
> +	shm_locality_south,
> +};
> +
> +struct shm_signal_desc {
> +	__u32                 magic;
> +	__u32                 ver;
> +	struct shm_signal_irq irq[2];
> +};
>   

Similarly, this should be padded to 0 (mod 8).

Instead of versions, I prefer feature flags which can be independently 
enabled or disabled.

> +
> +/* --- END SHARED STRUCTURES --- */
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/interrupt.h>
> +
> +struct shm_signal_notifier {
> +	void (*signal)(struct shm_signal_notifier *);
> +};
>   

This means "->inject() has been called from the other side"?

(reading below I see this is so.  not used to reading well commented 
code... :)

> +
> +struct shm_signal;
> +
> +struct shm_signal_ops {
> +	int      (*inject)(struct shm_signal *s);
> +	void     (*fault)(struct shm_signal *s, const char *fmt, ...);
>   

Eww.  Must we involve strings and printf formats?

> +	void     (*release)(struct shm_signal *s);
> +};
> +
> +/*
> + * signaling protocol:
> + *
> + * each side of the shm_signal has an "irq" structure with the following
> + * fields:
> + *
> + *    - enabled: controlled by shm_signal_enable/disable() to mask/unmask
> + *               the notification locally
> + *    - dirty:   indicates if the shared-memory is dirty or clean.  This
> + *               is updated regardless of the enabled/pending state so that
> + *               the state is always accurately tracked.
> + *    - pending: indicates if a signal is pending to the remote locale.
> + *               This allows us to determine if a remote-notification is
> + *               already in flight to optimize spurious notifications away.
> + */
>   

When you overlay a ring on top of this, won't the ring indexes convey 
the same information as ->dirty?


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-03-31 20:44   ` Avi Kivity
@ 2009-03-31 20:58     ` Gregory Haskins
  2009-03-31 21:05       ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-03-31 20:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 4762 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> This interface provides a bidirectional shared-memory based signaling
>> mechanism.  It can be used by any entities which desire efficient
>> communication via shared memory.  The implementation details of the
>> signaling are abstracted so that they may transcend a wide variety
>> of locale boundaries (e.g. userspace/kernel, guest/host, etc).
>>
>> The shm_signal mechanism supports event masking as well as spurious
>> event delivery mitigation.
>> +
>> +/*
>> + *---------
>> + * The following structures represent data that is shared across
>> boundaries
>> + * which may be quite disparate from one another (e.g. Windows vs
>> Linux,
>> + * 32 vs 64 bit, etc).  Therefore, care has been taken to make sure
>> they
>> + * present data in a manner that is independent of the environment.
>> + *-----------
>> + */
>> +
>> +#define SHM_SIGNAL_MAGIC 0x58fa39df
>> +#define SHM_SIGNAL_VER   1
>> +
>> +struct shm_signal_irq {
>> +    __u8                  enabled;
>> +    __u8                  pending;
>> +    __u8                  dirty;
>> +};
>>   
>
> Some ABIs may choose to pad this, suggest explicit padding.

Yeah, good idea.  What is the official way to do this these days?  Are
GCC pragmas allowed?

>
>> +
>> +enum shm_signal_locality {
>> +    shm_locality_north,
>> +    shm_locality_south,
>> +};
>> +
>> +struct shm_signal_desc {
>> +    __u32                 magic;
>> +    __u32                 ver;
>> +    struct shm_signal_irq irq[2];
>> +};
>>   
>
> Similarly, this should be padded to 0 (mod 8).
Ack

>
> Instead of versions, I prefer feature flags which can be independently
> enabled or disabled.

Totally agreed.  If you look, most of the ABI has a type of "NEGCAP"
(negotiate capabilities) feature.  The version number is a contingency
plan in case I still have to break it for whatever reason.   I will
always opt for the feature bits over bumping the version when it's
feasible (especially if/when this is actually in the field).

>
>> +
>> +/* --- END SHARED STRUCTURES --- */
>> +
>> +#ifdef __KERNEL__
>> +
>> +#include <linux/interrupt.h>
>> +
>> +struct shm_signal_notifier {
>> +    void (*signal)(struct shm_signal_notifier *);
>> +};
>>   
>
> This means "->inject() has been called from the other side"?

Yep
>
> (reading below I see this is so.  not used to reading well commented
> code... :)

:)

>
>> +
>> +struct shm_signal;
>> +
>> +struct shm_signal_ops {
>> +    int      (*inject)(struct shm_signal *s);
>> +    void     (*fault)(struct shm_signal *s, const char *fmt, ...);
>>   
>
> Eww.  Must we involve strings and printf formats?

This is still somewhat of an immature part of the design.  It's supposed
to be used so that, by default, it's a panic.  But on the host side, we
can do something like inject a machine-check.  That way malicious/broken
guests cannot (should not? ;) take down the host.  Note that today
I do not map this to anything other than the default panic, so this
needs some love.

But given the asynchronous nature of the fault, I want to be sure we
have decent accounting to avoid bug reports like "silent MCE kills the
guest" ;)  At least this way, we can log the fault string somewhere to
get a clue.

>
>> +    void     (*release)(struct shm_signal *s);
>> +};
>> +
>> +/*
>> + * signaling protocol:
>> + *
>> + * each side of the shm_signal has an "irq" structure with the
>> following
>> + * fields:
>> + *
>> + *    - enabled: controlled by shm_signal_enable/disable() to
>> mask/unmask
>> + *               the notification locally
>> + *    - dirty:   indicates if the shared-memory is dirty or clean. 
>> This
>> + *               is updated regardless of the enabled/pending state
>> so that
>> + *               the state is always accurately tracked.
>> + *    - pending: indicates if a signal is pending to the remote locale.
>> + *               This allows us to determine if a
>> remote-notification is
>> + *               already in flight to optimize spurious
>> notifications away.
>> + */
>>   
>
> When you overlay a ring on top of this, won't the ring indexes convey
> the same information as ->dirty?

I agree that the information may be redundant with components of the
broader shm state.  However, we need this state at this level of scope
in order to function optimally, so I don't think it's a huge deal to have
this here as well.  After all, the shm_signal library can only assess its
internal state.  We would have to teach it how to glean the broader
state through some mechanism otherwise (callback, perhaps), but I don't
think it's worth it.

-Greg

>
>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 15/17] kvm: add dynamic IRQ support
  2009-03-31 20:32         ` Gregory Haskins
@ 2009-03-31 20:59           ` Avi Kivity
  0 siblings, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 20:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
>> Won't this have scaling issues?  One IRQ means one target vcpu. 
>> Whereas I'd like virtio devices to span multiple queues, each queue
>> with its own MSI IRQ.
>>     
> Hmm..you know I hadnt really thought of it that way, but you have a
> point.  To clarify, my design actually uses one IRQ per "eventq", where
> we can have an arbitrary number of eventq's defined (note: today I only
> define one eventq, however).  An eventq is actually a shm-ring construct
> where I can pass events up to the host like "device added" or "ring X
> signaled".  Each individual device based virtio-ring would then
> aggregates "signal" events onto this eventq mechanism to actually inject
> events to the host.  Only the eventq itself injects an actual IRQ to the
> assigned vcpu.
>   

You will get cachelines bounced around when events from different
devices are added to the queue.  On the plus side, a single injection 
can contain interrupts for multiple devices.

I'm not sure how useful this coalescing is; certainly you will never see 
it on microbenchmarks, but that doesn't mean it's not useful.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-03-31 20:58     ` Gregory Haskins
@ 2009-03-31 21:05       ` Avi Kivity
  2009-04-01 12:12         ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-03-31 21:05 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
>>> +struct shm_signal_irq {
>>> +    __u8                  enabled;
>>> +    __u8                  pending;
>>> +    __u8                  dirty;
>>> +};
>>>   
>>>       
>> Some ABIs may choose to pad this, suggest explicit padding.
>>     
>
> Yeah, good idea.  What is the official way to do this these days?  Are
> GCC pragmas allowed?
>
>   

I just add a __u8 pad[5] in such cases.
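
i.e., for the structures above, something like:

	struct shm_signal_irq {
		__u8                  enabled;
		__u8                  pending;
		__u8                  dirty;
		__u8                  pad[5];  /* explicit pad to 8 bytes */
	};

	struct shm_signal_desc {
		__u32                 magic;
		__u32                 ver;
		struct shm_signal_irq irq[2];  /* 4 + 4 + 16 = 24, 0 (mod 8) */
	};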

>>> +
>>> +struct shm_signal;
>>> +
>>> +struct shm_signal_ops {
>>> +    int      (*inject)(struct shm_signal *s);
>>> +    void     (*fault)(struct shm_signal *s, const char *fmt, ...);
>>>   
>>>       
>> Eww.  Must we involve strings and printf formats?
>>     
>
> This is still somewhat of an immature part of the design.  It's supposed
> to be used so that, by default, it's a panic.  But on the host side, we
> can do something like inject a machine-check.  That way malicious/broken
> guests cannot (should not? ;) take down the host.  Note that today
> I do not map this to anything other than the default panic, so this
> needs some love.
>
> But given the asynchronous nature of the fault, I want to be sure we
> have decent accounting to avoid bug reports like "silent MCE kills the
> guest" ;)  At least this way, we can log the fault string somewhere to
> get a clue.
>   

I see.

This raises a point I've been thinking of - the symmetrical nature of 
the API vs the asymmetrical nature of guest/host or user/kernel
interfaces.  This is most pronounced in ->inject(); in the host->guest 
direction this is async (host can continue processing while the guest is 
handling the interrupt), whereas in the guest->host direction it is 
synchronous (the guest is blocked while the host is processing the call, 
unless the host explicitly hands off work to a different thread).


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
                   ` (17 preceding siblings ...)
  2009-03-31 20:18 ` [RFC PATCH 00/17] virtual-bus Andi Kleen
@ 2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
                     ` (2 more replies)
  18 siblings, 3 replies; 146+ messages in thread
From: Rusty Russell @ 2009-04-01  6:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)

That rtt time is awful.  I know the notification suppression heuristic
in qemu sucks.

I could dig through the code, but I'll ask directly: what heuristic do
you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least
partially caused by the exit to userspace and two system calls.  If virtio_net
had a backend in the kernel, we'd be able to compare numbers properly.

> Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
> Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
> Venet: tput = 5802Mb/s, round-trip = 15127 (66us rtt)
> 
> Note that even the throughput was slightly better in this test for venet, though
> neither venet nor virtio-net could achieve line-rate.  I suspect some tuning may
> allow these numbers to improve, TBD.

At some point, the copying will hurt you.  This is fairly easy to avoid on
xmit tho.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
@ 2009-04-01 11:35   ` Gregory Haskins
  2009-04-02  1:24     ` Rusty Russell
  2009-04-01 16:10   ` Anthony Liguori
  2009-04-02  3:15   ` Herbert Xu
  2 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 11:35 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3541 bytes --]

Rusty Russell wrote:
> On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
>   
>> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
>> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
>> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)
>>     
>
> That rtt time is awful.  I know the notification suppression heuristic
> in qemu sucks.
>
> I could dig through the code, but I'll ask directly: what heuristic do
> you use for notification prevention in your venet_tap driver?
>   

I am not 100% sure I know what you mean by "notification prevention",
but let me take a stab at it.

So like most of these kinds of constructs, I have two rings (rx + tx on
the guest is reversed to tx + rx on the host), each of which can signal
in either direction for a total of 4 events, 2 on each side of the
connection.  I utilize what I call "bidirectional napi" so that only the
first packet submitted needs to signal across the guest/host boundary. 
E.g. first ingress packet injects an interrupt, and then does a
napi_schedule and masks future irqs.  Likewise, first egress packet does
a hypercall, and then does a "napi_schedule" (I don't actually use napi
in this path, but it's conceptually identical) and masks future
hypercalls.  So that is my first form of what I would call notification
prevention.

The second form occurs on the "tx-complete" path (that is guest->host
tx).  I only signal back to the guest to reclaim its skbs every 10
packets, or if I drain the queue, whichever comes first (note to self:
make this # configurable).

The nice part about this scheme is that it significantly reduces the number
of guest/host transitions, while still providing the lowest possible latency
for single packets.  E.g. send one packet, and you get
one hypercall, and one tx-complete interrupt as soon as it queues on the
hardware.  Send 100 packets, and you get one hypercall and 10
tx-complete interrupts as frequently as every tenth packet queues on the
hardware.  There is no timer governing the flow, etc.
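
In pseudo-code, the decision on that tx-complete path boils down to
something like this (illustrative only; the names are made up, and the
real code keeps the running count per queue):

	/* how many skbs we have reaped since the last signal back */
	static int txc_should_signal(unsigned int since_last, int ring_empty)
	{
		/* interrupt the guest every 10 packets, or on drain */
		return ring_empty || since_last >= 10;
	}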

Is that what you were asking?

> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.
>   
:)

But that is the whole point, isn't it?  I created vbus specifically as a
framework for putting things in the kernel, and that *is* one of the
major reasons it is faster than virtio-net...it's not the difference in,
say, IOQs vs virtio-ring (though note I also think some of the
innovations we have added such as bi-dir napi are helping too, but these
are not "in-kernel" specific kinds of features and could probably help
the userspace version too).

I would be entirely happy if you guys accepted the general concept and
framework of vbus, and then worked with me to actually convert what I
have as "venet-tap" into essentially an in-kernel virtio-net.  I am not
specifically interested in creating a competing pv-net driver...I just
needed something to showcase the concepts and I didn't want to hack the
virtio-net infrastructure to do it until I had everyone's blessing.
Note to maintainers: I *am* perfectly willing to maintain the venet
drivers if, for some reason, we decide that we want to keep them as
is.   It's just an ideal for me to collapse virtio-net and venet-tap
together, and I suspect our community would prefer this as well.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-03-31 20:18 ` [RFC PATCH 00/17] virtual-bus Andi Kleen
@ 2009-04-01 12:03   ` Gregory Haskins
  2009-04-01 13:23     ` Andi Kleen
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 12:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 4062 bytes --]

Andi Kleen wrote:
> Gregory Haskins <ghaskins@novell.com> writes:
>
> What might be useful is if you could expand a bit more on what the high level
> use cases for this. 
>
> Questions that come to mind and that would be good to answer:
>
> This seems to be aimed at having multiple VMs talk
> to each other, but not talk to the rest of the world, correct? 
> Is that a common use case? 
>   

Actually we didn't design specifically for either type of environment. 
I think it would, in fact, be well suited to either type of
communication model, even concurrently (e.g. an intra-vm ipc channel
resource could live right on the same bus as a virtio-net and a
virtio-disk resource)

> Wouldn't they typically have a default route  anyways and be able to talk to each 
> other this way? 
> And why can't any such isolation be done with standard firewalling? (it's known that 
> current iptables has some scalability issues, but there's work going on right
> now to fix that). 
>   
vbus itself, and even some of the higher level constructs we apply on
top of it (like venet) are at a different scope than I think what you
are getting at above.  Yes, I suppose you could create a private network
using the existing virtio-net + iptables.  But you could also do the
same using virtio-net and a private bridge device as well.  That is not
what we are trying to address.

What we *are* trying to address is making an easy way to declare virtual
resources directly in the kernel so that they can be accessed more
efficiently.  Contrast that to the way it's done today, where the models
live in, say, qemu userspace.

So instead of having
guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
guest->host->[iptables|bridge].  How you make your private network (if
that is what you want to do) is orthogonal...it's the path to get there
that we changed.

> What would be the use cases for non networking devices?
>
> How would the interfaces to the user look like?
>   

I am not sure if you are asking about the guest's perspective or the
host-administrator's perspective.

First, let's look at the low-level device interface from the guest's
perspective.  We can cover the admin perspective in a separate doc, if
need be.

Each device in vbus supports two basic verbs: CALL, and SHM

int (*call)(struct vbus_device_proxy *dev, u32 func,
            void *data, size_t len, int flags);

int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
           void *ptr, size_t len,
           struct shm_signal_desc *sigdesc, struct shm_signal **signal,
           int flags);

CALL provides a synchronous method for invoking some verb on the device
(defined by "func") with some arbitrary data.  The namespace for "func"
is part of the ABI for the device in question.  It is analogous to an
ioctl, with the primary difference being that it's remotable (it invokes
from the guest driver across to the host device).

SHM provides a way to register shared-memory with the device which can
be used for asynchronous communication.  The memory is always owned by
the "north" (the guest), while the "south" (the host) simply maps it
into its address space.  You can optionally establish a shm_signal
object on this memory for signaling in either direction, and I
anticipate most shm regions will use this feature.  Each shm region has
an "id" namespace, which like the "func" namespace from the CALL method
is completely owned by the device ABI.  For example, we might have
ids of "RX-RING" and "TX-RING", etc.

From there, we can (hopefully) build an arbitrary type of IO service to
map on top.  So for instance, for venet-tap, we have CALL verbs for
things like MACQUERY and LINKUP, and we have SHM ids for RX-QUEUE and
TX-QUEUE.  We can write a driver that speaks this ABI on the bottom
edge, and presents a normal netif interface on the top edge.  So the
actual consumption of these resources can look just like any other
resource of a similar type.
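
To make the driver edge concrete, a hypothetical MAC query at probe time
would look roughly like this (VENET_FUNC_MACQUERY is a placeholder name
for illustration, not the actual ABI constant):

	/* inside vbus_enet_probe(), say */
	char mac[ETH_ALEN];
	int ret;

	ret = vdev->ops->call(vdev, VENET_FUNC_MACQUERY, mac, sizeof(mac), 0);
	if (ret < 0)
		return ret;	/* device rejected the call */

	memcpy(dev->dev_addr, mac, ETH_ALEN);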

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-03-31 21:05       ` Avi Kivity
@ 2009-04-01 12:12         ` Gregory Haskins
  2009-04-01 12:24           ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 12:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2426 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>>> +struct shm_signal_irq {
>>>> +    __u8                  enabled;
>>>> +    __u8                  pending;
>>>> +    __u8                  dirty;
>>>> +};
>>>>         
>>> Some ABIs may choose to pad this, suggest explicit padding.
>>>     
>>
>> Yeah, good idea.  What is the official way to do this these days?  Are
>> GCC pragmas allowed?
>>
>>   
>
> I just add a __u8 pad[5] in such cases.

Oh, duh.  Dumb question.  I was getting confused with "pack", not pad.  :)

>
>>>> +
>>>> +struct shm_signal;
>>>> +
>>>> +struct shm_signal_ops {
>>>> +    int      (*inject)(struct shm_signal *s);
>>>> +    void     (*fault)(struct shm_signal *s, const char *fmt, ...);
>>>>         
>>> Eww.  Must we involve strings and printf formats?
>>>     
>>
>> This is still somewhat of an immature part of the design.  It's supposed
>> to be used so that, by default, it's a panic.  But on the host side, we
>> can do something like inject a machine-check.  That way malicious/broken
>> guests cannot (should not? ;) take down the host.  Note that today
>> I do not map this to anything other than the default panic, so this
>> needs some love.
>>
>> But given the asynchronous nature of the fault, I want to be sure we
>> have decent accounting to avoid bug reports like "silent MCE kills the
>> guest" ;)  At least this way, we can log the fault string somewhere to
>> get a clue.
>>   
>
> I see.
>
> This raises a point I've been thinking of - the symmetrical nature of
> the API vs the asymmetrical nature of guest/host or user/kernel
> interfaces.  This is most pronounced in ->inject(); in the host->guest
> direction this is async (host can continue processing while the guest
> is handling the interrupt), whereas in the guest->host direction it is
> synchronous (the guest is blocked while the host is processing the
> call, unless the host explicitly hands off work to a different thread).

Note that this is exactly what I do (though it is device specific). 
venet-tap has an ioq_notifier registered on its "rx" ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and it wakes its
rx-thread.  This all happens in the context of the hypercall, which then
returns and allows the vcpu to re-enter guest mode immediately.


>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-04-01 12:12         ` Gregory Haskins
@ 2009-04-01 12:24           ` Avi Kivity
  2009-04-01 13:57             ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-01 12:24 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

Gregory Haskins wrote:
> Note that this is exactly what I do (though it is device specific). 
> venet-tap has an ioq_notifier registered on its "rx" ring (which is the
> tx-ring for the guest) that simply calls ioq_notify_disable() (which
> calls shm_signal_disable() under the covers) and it wakes its
> rx-thread.  This all happens in the context of the hypercall, which then
> returns and allows the vcpu to re-enter guest mode immediately.
>   
I think this is suboptimal.  The ring is likely to be cache hot on the 
current cpu, waking a thread will introduce scheduling latency + IPI 
+cache-to-cache transfers.

On a benchmark setup, host resources are likely to exceed guest 
requirements, so you can throw cpu at the problem and no one notices.  
But I think the bits/cycle figure will decrease, even if bits/sec increases.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 12:03   ` Gregory Haskins
@ 2009-04-01 13:23     ` Andi Kleen
  2009-04-01 14:19       ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Andi Kleen @ 2009-04-01 13:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
> Andi Kleen wrote:
> > Gregory Haskins <ghaskins@novell.com> writes:
> >
> > What might be useful is if you could expand a bit more on what the high level
> > use cases for this. 
> >
> > Questions that come to mind and that would be good to answer:
> >
> > This seems to be aimed at having multiple VMs talk
> > to each other, but not talk to the rest of the world, correct? 
> > Is that a common use case? 
> >   
> 
> Actually we didn't design specifically for either type of environment. 

But surely you must have some specific use case in mind? Something
that it does better than the various methods that are available
today. Or rather there must be some problem you're trying
to solve. I'm just not sure what that problem exactly is.

> What we *are* trying to address is making an easy way to declare virtual
> resources directly in the kernel so that they can be accessed more
> efficiently.  Contrast that to the way it's done today, where the models
> live in, say, qemu userspace.
> 
> So instead of having
> guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
> guest->host->[iptables|bridge].  How you make your private network (if

So is the goal more performance or simplicity or what?

> > What would be the use cases for non networking devices?
> >
> > How would the interfaces to the user look like?
> >   
> 
> I am not sure if you are asking about the guest's perspective or the
> host-administrator's perspective.

I was wondering about the host-administrators perspective.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 01/17] shm-signal: shared-memory signals
  2009-04-01 12:24           ` Avi Kivity
@ 2009-04-01 13:57             ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 13:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5505 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Note that this is exactly what I do (though it is device specific).
>> venet-tap has a ioq_notifier registered on its "rx" ring (which is the
>> tx-ring for the guest) that simply calls ioq_notify_disable() (which
>> calls shm_signal_disable() under the covers) and it wakes its
>> rx-thread.  This all happens in the context of the hypercall, which then
>> returns and allows the vcpu to re-enter guest mode immediately.
>>   
> I think this is suboptimal.

Heh, yes I know this is your (well documented) position, but I
respectfully disagree. :)

CPUs are not getting much faster, but they are rapidly getting more
cores.  If we want to continue to make software run increasingly faster,
we need to actually use those cores IMO.  Generally this means splitting
workloads up into as many threads as possible, as long as you can keep
the pipelines filled.

>   The ring is likely to be cache hot on the current cpu, waking a
> thread will introduce scheduling latency + IPI
This part is a valid criticism, though note that Linux is very adept at
scheduling so we are talking mere ns/us range here, which is dwarfed by
the latency of something like your typical IO device (e.g. 36us for an
rtt packet on 10GE bare metal, etc).  The benefit, of course, is the
potential for increased parallelism which I have plenty of data to show
we are very much taking advantage of here (I can saturate two cores
almost completely according to LTT traces, one doing vcpu work, and the
other running my "rx" thread which schedules the packet on the hardware)

> +cache-to-cache transfers.
This one I take exception to.  While it is perfectly true that splitting
the work between two cores has a greater cache impact than staying on
one, you cannot look at this one metric alone and say "this is bad". 
It's also a function of how efficiently the second (or more) cores are
utilized.  There will be a point in the curve where the cost of cache
coherence will become marginalized by the efficiency added by the extra
compute power.  Some workloads will invariably be on the bad end of that
curve, and therefore doing the work on one core is better.  However, we
can't ignore that there will be others that are on the good end of this
spectrum either.  Otherwise, we risk performance stagnation on our
effectively uniprocessor box ;).  In addition, the task-scheduler will
attempt to co-locate tasks that are sharing data according to a best-fit
within the cache hierarchy.  Therefore, we will still be sharing as much
as possible (perhaps only L2, L3, or a local NUMA domain, but this is
still better than nothing)

The way I have been thinking about these issues is something I have been
calling "soft-asics".  In the early days, we had things like a simple
uniprocessor box with a simple dumb ethernet.  People figured out that
if you put more processing power into the NIC, you could offload that
work from the cpu and do more in parallel.   So things like checksum
computation and segmentation duties were a good fit.  More recently, we
see even more advanced hardware where you can do L2 or even L4 packet
classification right in the hardware, etc.  All of these things are
effectively parallel computation, and it occurs in a completely foreign
cache domain!

So a lot of my research has been around the notion of trying to use some
of our cpu cores to do work like some of the advanced asic based offload
engines do.  The cores are often underutilized anyway, and this will
bring some of the features of advanced silicon to commodity resources.
They also have the added flexibility that it's just software, so you can
change or enhance the system at will.

So if you think about it, by using threads like this in venet-tap, I am
effectively using other cores to do csum/segmentation (if the physical
hardware doesn't support it), layer 2 classification (linux bridging),
filtering (iptables in the bridge), queuing, etc as if it was some
"smart" device out on the PCI bus.  The guest just queues up packets
independently in its own memory, while the device just "dma's" the data
on its own (after the initial kick).  The vcpu is keeping the pipeline
filled on its side independently.

>
> On a benchmark setup, host resources are likely to exceed guest
> requirements, so you can throw cpu at the problem and no one notices.
Sure, but with the type of design I have presented this still sorts
itself out naturally even if the host doesn't have the resources.  For
instance, if there is a large number of threads competing for a small
number of cores, we will simply see things like the rx-thread stalling
and going to sleep, or the vcpu thread backpressuring and going idle
(and therefore sleeping).  All of these things are self throttling.  If
you don't have enough resources to run a workload at a desirable
performance level, the system wasn't sized right to begin with. ;)

>   But I think the bits/cycle figure will decrease, even if bits/sec
> increases.
>
Note that this isn't necessarily a bad thing.  I think studies show that
most machines are generally idle a significant percentage of the time,
and this will likely only get worse as we get more and more cores.  So
if I have to consume more cycles to get more bits on the wire, that's
probably OK with most of my customers.   If it's not, it would be trivial
to make the venet threading policy a tunable parameter.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 13:23     ` Andi Kleen
@ 2009-04-01 14:19       ` Gregory Haskins
  2009-04-01 14:42         ` Gregory Haskins
  2009-04-01 17:01         ` Andi Kleen
  0 siblings, 2 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 14:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5594 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
>   
>> Andi Kleen wrote:
>>     
>>> Gregory Haskins <ghaskins@novell.com> writes:
>>>
>>> What might be useful is if you could expand a bit more on what the high level
>>> use cases for this. 
>>>
>>> Questions that come to mind and that would be good to answer:
>>>
>>> This seems to be aimed at having multiple VMs talk
>>> to each other, but not talk to the rest of the world, correct? 
>>> Is that a common use case? 
>>>   
>>>       
>> Actually we didn't design specifically for either type of environment. 
>>     
>
> But surely you must have some specific use case in mind? Something
> that it does better than the various methods that are available
> today. Or rather there must be some problem you're trying
> to solve. I'm just not sure what that problem exactly is.
>   
Performance.  We are trying to create a high performance IO infrastructure.

Ideally we would like to see things like virtual-machines have
bare-metal performance (or as close as possible) using just pure
software on commodity hardware.   The data I provided shows that
something like KVM with virtio-net does a good job on throughput even on
10GE, but the latency is roughly two orders of magnitude worse than
bare-metal.   We are addressing this issue and others like it that are a
result of the current design of out-of-kernel emulation.
>   
>> What we *are* trying to address is making an easy way to declare virtual
>> resources directly in the kernel so that they can be accessed more
>> efficiently.  Contrast that to the way it's done today, where the models
>> live in, say, qemu userspace.
>>
>> So instead of having
>> guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
>> guest->host->[iptables|bridge].  How you make your private network (if
>>     
>
> So is the goal more performance or simplicity or what?
>   

(Answered above)

>   
>>> What would be the use cases for non networking devices?
>>>
>>> How would the interfaces to the user look like?
>>>   
>>>       
>> I am not sure if you are asking about the guest's perspective or the
>> host-administrator's perspective.
>>     
>
> I was wondering about the host-administrators perspective.
>   
Ah, ok.  Sorry about that.  It was probably good to document that other
thing anyway, so no harm.

So about the host-administrator interface.  The whole thing is driven by
configfs, and the basics are already covered in the documentation in
patch 2, so I won't repeat it here.  Here is a reference to the file for
everyone's convenience:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=Documentation/vbus.txt;h=e8a05dafaca2899d37bd4314fb0c7529c167ee0f;hb=f43949f7c340bf667e68af6e6a29552e62f59033

So a sufficiently privileged user can instantiate a new bus (e.g.
container) and devices on that bus via configfs operations.  The types
of devices available to instantiate are dictated by whatever vbus-device
modules you have loaded into your particular kernel.  The loaded modules
available are enumerated under /sys/vbus/deviceclass.

Now presumably the administrator knows what a particular module is and
how to configure it before instantiating it.  Once they instantiate it,
it will present an interface in sysfs with a set of attributes.  For
example, an instantiated venet-tap looks like this:

ghaskins@test:~> tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/venet-tap
    |-- client_mac
    |-- enabled
    |-- host_mac
    |-- ifname
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0


Some of these attributes, like "class" and "interfaces" are default
attributes that are filled in by the infrastructure.  Other attributes,
like "client_mac" and "enabled" are properties defined by the venet-tap
module itself.  So the administrator can then set these attributes as
desired to manipulate the configuration of the instance of the device,
on a per device basis.

So now imagine we have some kind of disk-io vbus device that is designed
to act kind of like a file-loopback device.  It might define an
attribute allowing you to specify the path to the file/block-dev that
you want it to export.

(Warning: completely fictitious "tree" output to follow ;)

ghaskins@test:~> tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/vdisk
    |-- src_path
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0

So the admin would instantiate this "vdisk" device and do:

'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

To point the device to the file on the host that it wants to present as
a vdisk.  Any guest that has access to the particular bus that contains
this device would then see it as a standard "vdisk" ABI device (as if
there were such a thing, yet) and could talk to it using a vdisk
specific driver.

A property of a vbus is that it is inherited by children.  Today, I do
not have direct support in qemu for creating/configuring vbus devices. 
Instead what I do is I set up the vbus and devices from bash, and then
launch qemu-kvm so it inherits the bus.  Someday (soon, unless you guys
start telling me this whole idea is rubbish ;) I will add support so you
could do things like "-net nic,model=venet" and that would trigger qemu
to go out and create the container/device on its own.  TBD.

I hope this helps to clarify!
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 14:19       ` Gregory Haskins
@ 2009-04-01 14:42         ` Gregory Haskins
  2009-04-01 17:01         ` Andi Kleen
  1 sibling, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 14:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty,
	netdev, kvm, linux-rt-users

[-- Attachment #1: Type: text/plain, Size: 2677 bytes --]

Gregory Haskins wrote:
> Andi Kleen wrote:
>   
>> On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
>>   
>>     
>>> Andi Kleen wrote:
>>>     
>>>       
>>>> Gregory Haskins <ghaskins@novell.com> writes:
>>>>
>>>> What might be useful is if you could expand a bit more on what the high level
>>>> use cases for this. 
>>>>
>>>> Questions that come to mind and that would be good to answer:
>>>>
>>>> This seems to be aimed at having multiple VMs talk
>>>> to each other, but not talk to the rest of the world, correct? 
>>>> Is that a common use case? 
>>>>   
>>>>       
>>>>         
>>> Actually we didn't design specifically for either type of environment. 
>>>     
>>>       
>> But surely you must have some specific use case in mind? Something
>> that it does better than the various methods that are available
>> today. Or rather there must be some problem you're trying
>> to solve. I'm just not sure what that problem exactly is.
>>   
>>     
> Performance.  We are trying to create a high performance IO infrastructure.
>   
Actually, I should also state that I am interested in enabling some new
kinds of features based on having in-kernel devices like this.  For
instance (and this is still very theoretical and half-baked), I would
like to try to support RT guests.

[adding linux-rt-users]

I think one of the things that we need in order to do that is being able
to convey vcpu priority state information to the host in an efficient
way.  I was thinking that a shared-page per vcpu could have something
like "current" and "theshold" priorties.  The guest modifies "current"
while the host modifies "threshold".   The guest would be allowed to
increase its "current" priority without a hypercall (after all, if its
already running presumably it is already of sufficient priority that the
scheduler).  But if the guest wants to drop below "threshold", it needs
to hypercall the host to give it an opportunity to schedule() a new task
(vcpu or not).

The host, on the other hand, could apply a mapping so that the guest's
priority of RT1-RT99 might map to RT20-RT30 on the host, or something
like that.  We would also have to take other considerations into account, such as
implicit boosting on IRQ injection (e.g. the guest could be in HLT/IDLE
when an interrupt is injected...but by virtue of injecting that
interrupt we may need to boost it to (guest-relative) RT50).
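
To sketch the idea (purely illustrative; none of this exists today, and
hypercall_prio_change() is a made-up name), the shared page might carry
nothing more than:

	struct vcpu_prio_page {
		__u32 current_prio;   /* written by the guest */
		__u32 threshold;      /* written by the host  */
	};

	/* guest side: raising priority is free; only trap out when we
	 * would drop below what the host says it cares about
	 */
	static void guest_set_prio(struct vcpu_prio_page *p, __u32 newprio)
	{
		__u32 thresh = p->threshold;

		p->current_prio = newprio;
		if (newprio < thresh)
			hypercall_prio_change(newprio);
	}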

Like I said, this is all half-baked right now.  My primary focus is
improving performance, but I did try to lay the groundwork for taking
things in new directions too, RT being an example.

Hope that helps!
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
@ 2009-04-01 16:10   ` Anthony Liguori
  2009-04-05  3:44     ` Rusty Russell
  2009-04-02  3:15   ` Herbert Xu
  2 siblings, 1 reply; 146+ messages in thread
From: Anthony Liguori @ 2009-04-01 16:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Rusty Russell wrote:
> On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
>   
>> Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
>> Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
>> Venet: tput = 4050Mb/s, round-trip = 15255 (65us rtt)
>>     
>
> That rtt time is awful.  I know the notification suppression heuristic
> in qemu sucks.
>
> I could dig through the code, but I'll ask directly: what heuristic do
> you use for notification prevention in your venet_tap driver?
>
> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.
>   

I doubt the userspace exit is the problem.  On a modern system, it takes 
about 1us to do a light-weight exit and about 2us to do a heavy-weight 
exit.  A transition to userspace is only ~150ns; the bulk of the
additional heavy-weight exit cost is from vcpu_put() within KVM.

If you were to switch to another kernel thread, and I'm pretty sure you 
have to, you're going to still see about a 2us exit cost.  Even if you 
factor in the two syscalls, we're still talking about less than .5us 
that you're saving.  Avi mentioned he had some ideas to allow in-kernel 
thread switching without taking a heavy-weight exit but suffice to say, 
we can't do that today.

You have no easy way to generate PCI interrupts in the kernel either.  
You'll most certainly have to drop down to userspace anyway for that.

I believe the real issue is that we cannot get enough information today 
from tun/tap to do proper notification prevention b/c we don't know when 
the packet processing is completed.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 14:19       ` Gregory Haskins
  2009-04-01 14:42         ` Gregory Haskins
@ 2009-04-01 17:01         ` Andi Kleen
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:29           ` Gregory Haskins
  1 sibling, 2 replies; 146+ messages in thread
From: Andi Kleen @ 2009-04-01 17:01 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
> >>     
> >
> > But surely you must have some specific use case in mind? Something
> > that it does better than the various methods that are available
> > today. Or rather there must be some problem you're trying
> > to solve. I'm just not sure what that problem exactly is.
> >   
> Performance.  We are trying to create a high performance IO infrastructure.

Ok. So the goal is to bypass user space qemu completely for better
performance. Can you please put this into the initial patch
description?

> So the administrator can then set these attributes as
> desired to manipulate the configuration of the instance of the device,
> on a per device basis.

How would the guest learn of any changes in there?

I think the interesting part would be how e.g. a vnet device
would be connected to the outside interfaces.

> So the admin would instantiate this "vdisk" device and do:
> 
> 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

So it would act like a loop device? Would you reuse the loop device
or write something new?

How about VFS mount name spaces?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 17:01         ` Andi Kleen
@ 2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
                               ` (2 more replies)
  2009-04-01 20:29           ` Gregory Haskins
  1 sibling, 3 replies; 146+ messages in thread
From: Anthony Liguori @ 2009-04-01 18:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale,
	rusty, netdev, kvm

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>   
>>>>     
>>>>         
>>> But surely you must have some specific use case in mind? Something
>>> that it does better than the various methods that are available
>>> today. Or rather there must be some problem you're trying
>>> to solve. I'm just not sure what that problem exactly is.
>>>   
>>>       
>> Performance.  We are trying to create a high performance IO infrastructure.
>>     
>
> Ok. So the goal is to bypass user space qemu completely for better
> performance. Can you please put this into the initial patch
> description?
>   

FWIW, there's nothing that prevents in-kernel back ends with virtio so 
vbus certainly isn't required for in-kernel backends.

That said, I don't think we're bound today by the fact that we're in 
userspace.  Rather we're bound by the interfaces we have between the 
host kernel and userspace to generate IO.  I'd rather fix those 
interfaces than put more stuff in the kernel.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 17:01         ` Andi Kleen
  2009-04-01 18:45           ` Anthony Liguori
@ 2009-04-01 20:29           ` Gregory Haskins
  2009-04-01 22:23             ` Andi Kleen
  1 sibling, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 20:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5815 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>   
>>>>     
>>>>         
>>> But surely you must have some specific use case in mind? Something
>>> that it does better than the various methods that are available
>>> today. Or rather there must be some problem you're trying
>>> to solve. I'm just not sure what that problem exactly is.
>>>   
>>>       
>> Performance.  We are trying to create a high performance IO infrastructure.
>>     
>
> Ok. So the goal is to bypass user space qemu completely for better
> performance. Can you please put this into the initial patch
> description?
>   
Yes, good point.  I will be sure to be more explicit in the next rev.

>   
>> So the administrator can then set these attributes as
>> desired to manipulate the configuration of the instance of the device,
>> on a per device basis.
>>     
>
> How would the guest learn of any changes in there?
>   
The only events explicitly supported by the infrastructure of this
nature would be device-add and device-remove.  So when an admin adds a
device to, or removes one from, a bus, the guest would see driver::probe() and
driver::remove() callbacks, respectively.  All other events are left (by
design) to be handled by the device ABI itself, presumably over the
provided shm infrastructure.

So for instance, I have on my todo list to add a third shm-ring for
events in the venet ABI.   Among the event-types I would like to
support are LINK_UP and LINK_DOWN.  These events would be coupled to the
administrative manipulation of the "enabled" attribute in sysfs.  Other
event-types could be added as needed/appropriate.

I decided to do it this way because I felt it didn't make sense for me
to expose the attributes directly, since they are often back-end
specific anyway.   Therefore I leave it to the device-specific ABI which
has all the necessary tools for async events built in.
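
(Purely for illustration, here is a rough sketch of what a record on that
third event-ring might look like.  None of these names exist in the posted
code; they are made up for the example:)

------------------
#include <stdint.h>

/* hypothetical event types carried over the venet event ring */
enum venet_event_type {
	VENET_EVENT_LINK_UP   = 1,	/* "enabled" attribute set to 1 */
	VENET_EVENT_LINK_DOWN = 2,	/* "enabled" attribute cleared  */
};

/* hypothetical event record; link events would carry no payload */
struct venet_event {
	uint32_t type;			/* one of enum venet_event_type */
	uint32_t len;			/* length of data[]             */
	uint8_t  data[];		/* event-specific payload       */
};
--------------------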


> I think the interesting part would be how e.g. a vnet device
> would be connected to the outside interfaces.
>   

Ah, good question.  This ties into the statement I made earlier about
how presumably the administrative agent would know what a module is and
how it works.  As part of this, they would also handle any kind of
additional work, such as wiring the backend up.  Here is a script that I
use for testing that demonstrates this:

------------------
#!/bin/bash

set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

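# createtap <name>: create a venet-tap device "<name>-dev", place it on a
# new bus "<name>-bus", enable it (which registers the netif), and attach
# the resulting interface to $bridge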
createtap()
{
    mkdir /config/vbus/devices/$1-dev
    echo venet-tap > /config/vbus/devices/$1-dev/type
    mkdir /config/vbus/instances/$1-bus
    ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus
    echo 1 > /sys/vbus/devices/$1-dev/enabled

    ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
    ifconfig $ifname up
    brctl addif $bridge $ifname
}

createtap client
createtap server

--------------------

This script creates two buses ("client-bus" and "server-bus"),
instantiates a single venet-tap on each of them, and then "wires" them
together with a private bridge instance called "vbus-br0".  To complete
the picture here, you would want to launch two kvms, one for each of the
client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.

# (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
# (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)

(And as noted, someday qemu will be able to do all the setup that the
script did, natively.  It would wire whatever tap it created to an
existing bridge with qemu-ifup, just like we do for tun-taps today)

One of the key details is where I do "ifname=$(cat
/sys/vbus/devices/$1-dev/ifname)".  The "ifname" attribute of the
venet-tap is a read-only attribute that reports back the netif interface
name that was returned when the device did a register_netdev() (e.g.
"eth3").  This register_netdev() operation occurs as a result of echoing
the "1" into the "enabled" attribute.  Deferring the registration until
the admin explicitly does an "enable" gives the admin a chance to change
the MAC address of the virtual-adapter before it is registered (note:
the current code doesn't support rw on the mac attributes yet... I need a
parser first).
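
(For illustration, this is roughly what that deferral looks like in code.
The names below are made up for the example and are not the actual
venet-tap implementation:)

------------------
#include <linux/netdevice.h>
#include <linux/errno.h>

/* illustrative private state; the real venet-tap layout differs */
struct venet_tap_sketch {
	struct net_device *netdev;	/* allocated at device-create time */
	bool registered;		/* set once register_netdev() succeeds */
};

/* sketch of the "enabled" attribute handler: registration with the
 * network stack is deferred until the admin writes "1", so the MAC
 * can still be changed beforehand */
static ssize_t enabled_store_sketch(struct venet_tap_sketch *priv,
				    const char *buf, size_t count)
{
	int err;

	if (!count || buf[0] != '1')
		return -EINVAL;

	if (priv->registered)
		return count;			/* already enabled */

	err = register_netdev(priv->netdev);	/* "eth3" appears here */
	if (err)
		return err;

	priv->registered = true;
	/* the read-only "ifname" attribute can now report priv->netdev->name */
	return count;
}
--------------------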


>   
>> So the admin would instantiate this "vdisk" device and do:
>>
>> 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'
>>     
>
> So it would act like a loop device? Would you reuse the loop device
> or write something new?
>   

Well, keeping in mind that I haven't even looked at writing a block
device for this infrastructure yet... my blanket statement would be
"let's reuse as much as possible" ;)  If the existing loop infrastructure
would work here, great!

> How about VFS mount name spaces?
>   

Yeah, ultimately I would love to be able to support a fairly wide range
of the normal userspace/kernel ABI through this mechanism.  In fact, one
of my original design goals was to somehow expose the syscall ABI
directly via some kind of syscall proxy device on the bus.  I have since
backed away from that idea once I started thinking about things some
more and realized that a significant number of system calls are really
inappropriate for a guest type environment due to their ability to
block.   We really don't want a vcpu to block... however, the AIO type
system calls on the other hand, have much more promise.  ;)  TBD.

For right now I am focused more on the explicit virtual-device type
transport (disk, net, etc).  But in theory we should be able to express
a fairly broad range of services in terms of the call()/shm() interfaces.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
@ 2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:11               ` Gregory Haskins
  2009-04-02  3:11               ` Herbert Xu
  2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  3:09             ` Herbert Xu
  2 siblings, 2 replies; 146+ messages in thread
From: Chris Wright @ 2009-04-01 20:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

* Anthony Liguori (anthony@codemonkey.ws) wrote:
> Andi Kleen wrote:
>> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>>> Performance.  We are trying to create a high performance IO infrastructure.
>>
>> Ok. So the goal is to bypass user space qemu completely for better
>> performance. Can you please put this into the initial patch
>> description?
>
> FWIW, there's nothing that prevents in-kernel back ends with virtio so  
> vbus certainly isn't required for in-kernel backends.

Indeed.

> That said, I don't think we're bound today by the fact that we're in  
> userspace.  Rather we're bound by the interfaces we have between the  
> host kernel and userspace to generate IO.  I'd rather fix those  
> interfaces than put more stuff in the kernel.

And more stuff in the kernel can come at the potential cost of weakening
protection/isolation.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
@ 2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  0:29               ` Anthony Liguori
  2009-04-02  6:51               ` Avi Kivity
  2009-04-02  3:09             ` Herbert Xu
  2 siblings, 2 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 21:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]

Anthony Liguori wrote:
> Andi Kleen wrote:
>> On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
>>  
>>>>>             
>>>> But surely you must have some specific use case in mind? Something
>>>> that it does better than the various methods that are available
>>>> today. Or rather there must be some problem you're trying
>>>> to solve. I'm just not sure what that problem exactly is.
>>>>         
>>> Performance.  We are trying to create a high performance IO
>>> infrastructure.
>>>     
>>
>> Ok. So the goal is to bypass user space qemu completely for better
>> performance. Can you please put this into the initial patch
>> description?
>>   
>
> FWIW, there's nothing that prevents in-kernel back ends with virtio so
> vbus certainly isn't required for in-kernel backends.

I think there is a slight disconnect here.  This is *exactly* what I am
trying to do.  You can of course do this many ways, and I am not denying
it could be done a different way than the path I have chosen.  One
extreme would be to just slam a virtio-net specific chunk of code
directly into kvm on the host.  Another extreme would be to build a
generic framework into Linux for declaring arbitrary IO types,
integrating it with kvm (as well as other environments such as lguest,
userspace, etc), and building a virtio-net model on top of that.

So in case it is not obvious at this point, I have gone with the latter
approach.  I wanted to make sure it wasn't kvm specific or something
like pci specific so it had the broadest applicability to a range of
environments.  So that is why the design is the way it is.  I understand
that this approach is technically "harder/more-complex" than the "slam
virtio-net into kvm" approach, but I've already done that work.  All we
need to do now is agree on the details ;)

>
>
> That said, I don't think we're bound today by the fact that we're in
> userspace.
You will *always* be bound by the fact that you are in userspace.  It's
purely a question of "how much" and "does anyone care".    Right now,
the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
do".  I have no doubt that this can and will change/improve in the
future.  But it will always be true that no matter how much userspace
improves, the kernel-based solution will always be faster.  It's simple
physics.  I'm cutting out the middleman to ultimately reach the same
destination as the userspace path, so userspace can never be equal.

I agree that the "does anyone care" part of the equation will approach
zero as the latency difference shrinks across some threshold (probably
the single microsecond range), but I will believe that is even possible
when I see it ;)

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:40             ` Chris Wright
@ 2009-04-01 21:11               ` Gregory Haskins
  2009-04-01 21:28                 ` Chris Wright
  2009-04-02  3:11               ` Herbert Xu
  1 sibling, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 21:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

Chris Wright wrote:
> And more stuff in the kernel can come at the potential cost of weakening
> protection/isolation.
>   
Note that the design of vbus should prevent any weakening...though if
you see a hole, please point it out.

(On that front, note that I still have some hardening to do, such as not
calling BUG_ON() in venet-tap if the ring is in a funk, etc)

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:11               ` Gregory Haskins
@ 2009-04-01 21:28                 ` Chris Wright
  2009-04-01 22:10                   ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Chris Wright @ 2009-04-01 21:28 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Anthony Liguori, Andi Kleen, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:
> Note that the design of vbus should prevent any weakening

Could you elaborate?

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:28                 ` Chris Wright
@ 2009-04-01 22:10                   ` Gregory Haskins
  2009-04-02  6:00                     ` Chris Wright
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 22:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3362 bytes --]

Chris Wright wrote:
> * Gregory Haskins (ghaskins@novell.com) wrote:
>   
>> Note that the design of vbus should prevent any weakening
>>     
>
> Could you elaborate?
>   

Absolutely.

So you said that something in the kernel could weaken the
protection/isolation.  And I fully agree that whatever we do here has to
be done carefully...more carefully than a userspace derived counterpart,
naturally.

So to address this, I put in various mechanisms to (hopefully? :) ensure
we can still maintain proper isolation, as well as protect the host,
other guests, and other applications from corruption.  Here are some of
the highlights:

*) As I mentioned, a "vbus" is a form of a kernel-resource-container. 
It is designed so that the view of a vbus is a unique namespace of
device-ids.  Each bus has its own individual namespace that consists
solely of the devices that have been placed on that bus.  The only way
to create a bus, and/or create a device on a bus, is via the
administrative interface on the host.

*) A task can only associate with, at most, one vbus at a time.  This
means that a task can only see the device-id namespace of the devices on
its associated bus and that's it.  This is enforced by the host kernel by
placing a reference to the associated vbus on the task-struct itself. 
Again, the only way to modify this association is via a host based
administrative operation.  Note that multiple tasks can associate to the
same vbus, which would commonly be used by all threads in an app, or all
vcpus in a guest, etc.

*) The asynchronous nature of the shm/ring interfaces implies we have
the potential for asynchronous faults.  E.g. "crap" in the ring might
not be discovered at the EIP of the guest vcpu when it actually inserts
the crap, but rather later when the host side tries to update the ring. 
A naive implementation would have the host do a BUG_ON() when it
discovers the discrepancy (note that I still have a few of these to fix
in the venet-tap code).  Instead, what should happen is that we utilize
an asynchronous fault mechanism that allows the guest to always be the
one punished (via something like a machine-check for guests, or SIGABRT
for userspace, etc)

*) "south-to-north path signaling robustness".  Because vbus supports a
variety of different environments, I call guest/userspace "north', and
the host/kernel "south".  When the north wants to communicate with the
kernel, its perfectly ok to stall the north indefinitely if the south is
not ready.  However, it is not really ok to stall the south when
communicating with the north because this is an attack vector.  E.g. a
malicous/broken guest could just stop servicing its ring to cause
threads in the host to jam up.  This is bad. :)  So what we do is we
design all south-to-north signaling paths to be robust against
stalling.  What they do instead is manage backpressure a little bit more
intelligently than simply blocking like they might in the guest.  For
instance, in venet-tap, a "transmit" from netif that has to be injected
in the south-to-north ring when it is full will result in a
netif_stop_queue().   etc.
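
(A rough sketch of that venet-tap example; the ring helpers here are
placeholders for the shm interfaces, not the real API, but
netif_stop_queue()/NETDEV_TX_BUSY are the stock netdev mechanisms being
described:)

------------------
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* placeholder south-to-north ring interface, for illustration only */
struct s2n_ring_sketch;
bool s2n_ring_full(struct s2n_ring_sketch *ring);
void s2n_ring_push(struct s2n_ring_sketch *ring, struct sk_buff *skb);

/* never stall the south: when the guest-bound ring is full, apply
 * backpressure to netif instead of blocking a host thread on a guest
 * that may never service its ring */
static int venet_tap_xmit_sketch(struct sk_buff *skb, struct net_device *dev,
				 struct s2n_ring_sketch *ring)
{
	if (s2n_ring_full(ring)) {
		netif_stop_queue(dev);		/* backpressure, don't block */
		return NETDEV_TX_BUSY;		/* netif will retry later    */
	}

	s2n_ring_push(ring, skb);
	return NETDEV_TX_OK;
}
--------------------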

I can't think of more examples right now, but I will update this list
if/when I come up with more.  I hope that satisfactorily answered your
question, though!

Regards,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:29           ` Gregory Haskins
@ 2009-04-01 22:23             ` Andi Kleen
  2009-04-01 23:05               ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Andi Kleen @ 2009-04-01 22:23 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	rusty, netdev, kvm

On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
> > description?
> >   
> Yes, good point.  I will be sure to be more explicit in the next rev.
> 
> >   
> >> So the administrator can then set these attributes as
> >> desired to manipulate the configuration of the instance of the device,
> >> on a per device basis.
> >>     
> >
> > How would the guest learn of any changes in there?
> >   
> The only events explicitly supported by the infrastructure of this
> nature would be device-add and device-remove.  So when an admin adds or
> removes a device to a bus, the guest would see driver::probe() and
> driver::remove() callbacks, respectively.  All other events are left (by
> design) to be handled by the device ABI itself, presumably over the
> provided shm infrastructure.

Ok so you rely on a transaction model where everything is set up
before it is somehow committed to the guest? I hope that is made
explicit in the interface somehow.

> This script creates two buses ("client-bus" and "server-bus"),
> instantiates a single venet-tap on each of them, and then "wires" them
> together with a private bridge instance called "vbus-br0".  To complete
> the picture here, you would want to launch two kvms, one for each of the
> client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.
> 
> # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
> # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)
> 
> (And as noted, someday qemu will be able to do all the setup that the
> script did, natively.  It would wire whatever tap it created to an
> existing bridge with qemu-ifup, just like we do for tun-taps today)

The usual problem with that is permissions. Just making qemu-ifup suid
is not very nice.  It would be good if any new design addressed this.

> the current code doesn't support rw on the mac attributes yet... I need a
> parser first).

parser in kernel space always sounds scary to me.


> 
> Yeah, ultimately I would love to be able to support a fairly wide range
> of the normal userspace/kernel ABI through this mechanism.  In fact, one
> of my original design goals was to somehow expose the syscall ABI
> directly via some kind of syscall proxy device on the bus.  I have since

That sounds really scary for security. 


> backed away from that idea once I started thinking about things some
> more and realized that a significant number of system calls are really
> inappropriate for a guest type environment due to their ability to
> block.   We really don't want a vcpu to block... however, the AIO type

Not only because of blocking, but also because of security issues.
After all one of the usual reasons to run a guest is security isolation.

In general the more powerful the guest API the more risky it is, so some
self moderation is probably a good thing.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 22:23             ` Andi Kleen
@ 2009-04-01 23:05               ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-01 23:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5003 bytes --]

Andi Kleen wrote:
> On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
>   
>>> description?
>>>   
>>>       
>> Yes, good point.  I will be sure to be more explicit in the next rev.
>>
>>     
>>>   
>>>       
>>>> So the administrator can then set these attributes as
>>>> desired to manipulate the configuration of the instance of the device,
>>>> on a per device basis.
>>>>     
>>>>         
>>> How would the guest learn of any changes in there?
>>>   
>>>       
>> The only events explicitly supported by the infrastructure of this
>> nature would be device-add and device-remove.  So when an admin adds or
>> removes a device to a bus, the guest would see driver::probe() and
>> driver::remove() callbacks, respectively.  All other events are left (by
>> design) to be handled by the device ABI itself, presumably over the
>> provided shm infrastructure.
>>     
>
> Ok so you rely on a transaction model where everything is set up
> before it is somehow committed to the guest? I hope that is made
> explicit in the interface somehow.
>   
Well, it's not an explicit transaction model, but I guess you could think
of it that way.

Generally you set the device up before you launch the guest.  By the
time the guest loads and tries to scan the bus for the initial
discovery, all the devices would be ready to go.

This does bring up the question of hotswap.  Today we fully support
hotswap in and out, but leaving this "enabled" transaction to the
individual device means that the device-id would be visible in the bus
namespace before the device may want to actually communicate.  Hmmm

Perhaps I need to build this in as a more explicit "enabled"
feature...and the guest will not see the driver::probe() until this happens.

>   
>> This script creates two buses ("client-bus" and "server-bus"),
>> instantiates a single venet-tap on each of them, and then "wires" them
>> together with a private bridge instance called "vbus-br0".  To complete
>> the picture here, you would want to launch two kvms, one for each of the
>> client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.
>>
>> # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img....)
>> # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img....)
>>
>> (And as noted, someday qemu will be able to do all the setup that the
>> script did, natively.  It would wire whatever tap it created to an
>> existing bridge with qemu-ifup, just like we do for tun-taps today)
>>     
>
> The usual problem with that is permissions. Just making qemu-ifup suid
> is not very nice.  It would be good if any new design addressed this.
>   

Well, it's kind of out of my control.  venet-tap ultimately creates a
simple netif interface which we must do something with.  Once it's
created, "wiring" it up to something like a linux-bridge is no different
than something like a tun-tap, so the qemu-ifup requirement doesn't change.

The one thing I can think of is it would be possible to build a
"venet-switch" module, and this could be done without using brctl or
qemu-ifup...but then I would lose all the benefits of re-using that
infrastructure.  I do not recommend we actually do this, but it would
technically be a way to address your concern.


>   
>> the current code doesn't support rw on the mac attributes yet... I need a
>> parser first).
>>     
>
> parser in kernel space always sounds scary to me.
>   
Heh... why do you think I keep procrastinating ;)

>
>   
>> Yeah, ultimately I would love to be able to support a fairly wide range
>> of the normal userspace/kernel ABI through this mechanism.  In fact, one
>> of my original design goals was to somehow expose the syscall ABI
>> directly via some kind of syscall proxy device on the bus.  I have since
>>     
>
> That sounds really scary for security. 
>
>
>   
>> backed away from that idea once I started thinking about things some
>> more and realized that a significant number of system calls are really
>> inappropriate for a guest type environment due to their ability to
>> block.   We really don't want a vcpu to block... however, the AIO type
>>     
>
> Not only because of blocking, but also because of security issues.
> After all one of the usual reasons to run a guest is security isolation.
>   
Oh yeah, totally agreed.  Not that I am advocating this, because I have
abandoned the idea.  But back when I was thinking of this, I would have
addressed the security with the vbus and syscall-proxy-device objects
themselves.  E.g. if you don't instantiate a syscall-proxy-device on the
bus, the guest wouldn't have access to syscalls at all.   And you could
put filters into the module to limit what syscalls were allowed, which
UID to make the guest appear as, etc.

> In general the more powerful the guest API the more risky it is, so some
> self moderation is probably a good thing.
>   
:)

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:09             ` Gregory Haskins
@ 2009-04-02  0:29               ` Anthony Liguori
  2009-04-02  3:11                 ` Gregory Haskins
  2009-04-02  6:51               ` Avi Kivity
  1 sibling, 1 reply; 146+ messages in thread
From: Anthony Liguori @ 2009-04-02  0:29 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

Gregory Haskins wrote:
> Anthony Liguori wrote:
>   
> I think there is a slight disconnect here.  This is *exactly* what I am
> trying to do. 

If it were exactly what you were trying to do, you would have posted a 
virtio-net in-kernel backend implementation instead of a whole new 
paravirtual IO framework ;-)

>> That said, I don't think we're bound today by the fact that we're in
>> userspace.
>>     
> You will *always* be bound by the fact that you are in userspace.

Again, let's talk numbers.  A heavy-weight exit is 1us slower than a 
light-weight exit.  Ideally, you're taking < 1 exit per packet because
you're batching notifications.  If your ping latency on bare metal
compared to vbus is 39us to 65us, then all other things being equal,
the cost imposed by doing what you're doing in userspace would make the
latency 66us, taking your latency from 166% of native to 169% of
native.  That's not a huge difference and I'm sure you'll agree there 
are a lot of opportunities to improve that even further.

And you didn't mention whether your latency tests are based on ping or 
something more sophisticated as ping will be a pathological case that 
doesn't allow any notification batching.

> I agree that the "does anyone care" part of the equation will approach
> zero as the latency difference shrinks across some threshold (probably
> the single microsecond range), but I will believe that is even possible
> when I see it ;)
>   

Note the other hat we have to wear is not just virtualization developer 
but Linux developer.  If there are bad userspace interfaces for IO that 
impose artificial restrictions, then we need to identify those and fix them.

Regards,

Anthony Liguori

> Regards,
> -Greg
>
>   


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 11:35   ` Gregory Haskins
@ 2009-04-02  1:24     ` Rusty Russell
  2009-04-02  2:27       ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Rusty Russell @ 2009-04-02  1:24 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
> Rusty Russell wrote:
> > I could dig through the code, but I'll ask directly: what heuristic do
> > you use for notification prevention in your venet_tap driver?
> 
> I am not 100% sure I know what you mean with "notification prevention",
> but let me take a stab at it.

Good stab :)

> I only signal back to the guest to reclaim its skbs every 10
> packets, or if I drain the queue, whichever comes first (note to self:
> make this # configurable).

Good stab, though I was referring to guest->host signals (I'll assume
you use a similar scheme there).

You use a number of packets, qemu uses a timer (150usec), lguest uses a
variable timer (starting at 500usec, dropping by 1 every time but increasing
by 10 every time we get fewer packets than last time).
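
(As a rough sketch of that lguest-style adaptive timer -- names invented
here, and only approximating the real lguest logic:)

------------------
/* mitigation timeout: start at 500us, drop by 1us after every window,
 * but grow by 10us whenever a window saw fewer packets than the last */
struct mitigation_timer_sketch {
	unsigned int timeout_us;
	unsigned int last_window_pkts;
};

static void mitigation_init(struct mitigation_timer_sketch *t)
{
	t->timeout_us = 500;
	t->last_window_pkts = 0;
}

static void mitigation_window_done(struct mitigation_timer_sketch *t,
				   unsigned int pkts_this_window)
{
	if (t->timeout_us > 1)
		t->timeout_us -= 1;	/* drop by 1 every time...          */
	if (pkts_this_window < t->last_window_pkts)
		t->timeout_us += 10;	/* ...grow by 10 when traffic slows */

	t->last_window_pkts = pkts_this_window;
}
--------------------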

So, if the guest sends two packets and stops, you'll hang indefinitely?
That's why we use a timer, otherwise any mitigation scheme has this issue.

Thanks,
Rusty.


> 
> The nice part about this scheme is it significantly reduces the amount
> of guest/host transitions, while still providing the lowest latency
> response for single packets possible.  e.g. Send one packet, and you get
> one hypercall, and one tx-complete interrupt as soon as it queues on the
> hardware.  Send 100 packets, and you get one hypercall and 10
> tx-complete interrupts as frequently as every tenth packet queues on the
> hardware.  There is no timer governing the flow, etc.
> 
> Is that what you were asking?
> 
> > As you point out, 350-450 is possible, which is still bad, and it's at least
> > partially caused by the exit to userspace and two system calls.  If virtio_net
> > had a backend in the kernel, we'd be able to compare numbers properly.
> >   
> :)
> 
> But that is the whole point, isnt it?  I created vbus specifically as a
> framework for putting things in the kernel, and that *is* one of the
> major reasons it is faster than virtio-net...its not the difference in,
> say, IOQs vs virtio-ring (though note I also think some of the
> innovations we have added such as bi-dir napi are helping too, but these
> are not "in-kernel" specific kinds of features and could probably help
> the userspace version too).
> 
> I would be entirely happy if you guys accepted the general concept and
> framework of vbus, and then worked with me to actually convert what I
> have as "venet-tap" into essentially an in-kernel virtio-net.  I am not
> specifically interested in creating a competing pv-net driver...I just
> needed something to showcase the concepts and I didnt want to hack the
> virtio-net infrastructure to do it until I had everyone's blessing. 
> Note to maintainers: I *am* perfectly willing to maintain the venet
> drivers if, for some reason, we decide that we want to keep them as
> is.   Its just an ideal for me to collapse virtio-net and venet-tap
> together, and I suspect our community would prefer this as well.
> 
> -Greg
> 
> 

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  1:24     ` Rusty Russell
@ 2009-04-02  2:27       ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02  2:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3541 bytes --]

Rusty Russell wrote:
> On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
>   
>> Rusty Russell wrote:
>>     
>>> I could dig through the code, but I'll ask directly: what heuristic do
>>> you use for notification prevention in your venet_tap driver?
>>>       
>> I am not 100% sure I know what you mean with "notification prevention",
>> but let me take a stab at it.
>>     
>
> Good stab :)
>
>   
>> I only signal back to the guest to reclaim its skbs every 10
>> packets, or if I drain the queue, whichever comes first (note to self:
>> make this # configurable).
>>     
>
> Good stab, though I was referring to guest->host signals (I'll assume
> you use a similar scheme there).
>   
Oh, actually no.  The guest->host path only uses the "bidir napi" thing
I mentioned.  So the first packet hypercalls the host immediately with no
delay, schedules my host-side "rx" thread, disables subsequent
hypercalls, and returns to the guest.  If the guest tries to send
another packet before the host has drained all previously queued
skbs (in this case, 1), it will simply be queued to the ring with no
additional hypercalls.  Like typical napi ingress processing, the host
will leave hypercalls disabled until it finds the ring empty, so this
process can continue indefinitely until the host catches up.  Once fully
drained,  the host will re-enable the hypercall channel and subsequent
transmissions will repeat the original process.

In summary, infrequent transmissions will tend to have one hypercall per
packet.  Bursty transmissions will have one hypercall per burst
(starting immediately with the first packet).  In both cases, we
minimize the latency to get the first packet "out the door".

So really the only place I am using a funky heuristic is the modulus 10
operation for tx-complete going host->guest.  The rest are kind of
standard napi event mitigation techniques.
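
(A rough sketch of that guest-side transmit path -- the names and the
flag handling are simplified stand-ins for the shm-signal mechanism, not
the actual venet driver code:)

------------------
#include <stdbool.h>

/* assumed primitives: ring_push() queues a descriptor in the shared
 * ring, hypercall_kick() is the guest->host doorbell */
void ring_push(const void *pkt, unsigned int len);
void hypercall_kick(void);

/* stand-in for the notify-enable flag that lives in shared memory */
struct venet_tx_sketch {
	bool host_notify_enabled;	/* host re-enables this once it drains the ring */
};

/* first packet kicks the host immediately and disables further kicks;
 * later packets just queue until the host catches up and re-enables
 * notification, so a burst costs one hypercall */
static void venet_xmit_sketch(struct venet_tx_sketch *tx,
			      const void *pkt, unsigned int len)
{
	ring_push(pkt, len);

	if (tx->host_notify_enabled) {
		tx->host_notify_enabled = false;
		hypercall_kick();
	}
}
--------------------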

> You use a number of packets, qemu uses a timer (150usec), lguest uses a
> variable timer (starting at 500usec, dropping by 1 every time but increasing
> by 10 every time we get fewer packets than last time).
>
> So, if the guest sends two packets and stops, you'll hang indefinitely?
>   
Shouldn't, no.  The host will send tx-complete interrupts at *max* every
10 packets, but if it drains the queue before the modulus 10 expires, it
will send a tx-complete immediately, right before it re-enables
hypercalls.  So there is no hang, and there is no delay.
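
(Roughly, the host-side policy is the following; helper names are
invented here, and the real code is in the links below:)

------------------
/* placeholder ring interface, for illustration only */
struct txc_ring_sketch;
bool txc_ring_empty(struct txc_ring_sketch *ring);
void txc_reclaim_one(struct txc_ring_sketch *ring);
void txc_signal_tx_complete(struct txc_ring_sketch *ring);
void txc_reenable_guest_kicks(struct txc_ring_sketch *ring);

/* signal tx-complete every 10 reclaimed packets, or as soon as the
 * ring drains, whichever comes first; then re-open the hypercall path */
static void venet_drain_sketch(struct txc_ring_sketch *ring)
{
	unsigned int since_signal = 0;

	while (!txc_ring_empty(ring)) {
		txc_reclaim_one(ring);
		if (++since_signal == 10) {
			txc_signal_tx_complete(ring);
			since_signal = 0;
		}
	}

	if (since_signal)
		txc_signal_tx_complete(ring);	/* flush the remainder on drain */

	txc_reenable_guest_kicks(ring);
}
--------------------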

For reference, here is the modulus 10 signaling
(./drivers/vbus/devices/venet-tap.c, line 584):

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l584

Here is the one that happens after the queue is fully drained (line 593)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l593

and finally, here is where I re-enable hypercalls (or system calls if
the driver is in userspace, etc)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l600

> That's why we use a timer, otherwise any mitigation scheme has this issue.
>   

I'm not sure I follow.  I don't think I need a timer at all using this
scheme, but perhaps I am missing something?

Thanks Rusty!
-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 18:45           ` Anthony Liguori
  2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:09             ` Gregory Haskins
@ 2009-04-02  3:09             ` Herbert Xu
  2009-04-02  6:46               ` Avi Kivity
  2 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  3:09 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: andi, ghaskins, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> That said, I don't think we're bound today by the fact that we're in 
> userspace.  Rather we're bound by the interfaces we have between the 
> host kernel and userspace to generate IO.  I'd rather fix those 
> interfaces than put more stuff in the kernel.

I'm sorry but I totally disagree with that.  By having our IO
infrastructure in user-space we've basically given up the main
advantage of kvm, which is that the physical drivers operate in
the same environment as the hypervisor.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  0:29               ` Anthony Liguori
@ 2009-04-02  3:11                 ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02  3:11 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andi Kleen, linux-kernel, agraf, pmullaney, pmorreale, rusty,
	netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3936 bytes --]

Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>   I think there is a slight disconnect here.  This is *exactly* what
>> I am
>> trying to do. 
>
> If it were exactly what you were trying to do, you would have posted a
> virtio-net in-kernel backend implementation instead of a whole new
> paravirtual IO framework ;-)

semantics, semantics ;)

but ok, fair enough.

>
>>> That said, I don't think we're bound today by the fact that we're in
>>> userspace.
>>>     
>> You will *always* be bound by the fact that you are in userspace.
>
> Again, let's talk numbers.  A heavy-weight exit is 1us slower than a
> light-weight exit.  Ideally, you're taking < 1 exit per packet because
> you're batching notifications.  If your ping latency on bare metal
> compared to vbus is 39us to 65us, then all other things being equal,
> the cost imposed by doing what you're doing in userspace would make the
> latency 66us, taking your latency from 166% of native to 169% of
> native.  That's not a huge difference and I'm sure you'll agree there
> are a lot of opportunities to improve that even further.

Ok, so let's see it happen.  Consider the gauntlet thrown :)  Your
challenge, should you choose to accept it, is to take today's 4000us and
hit a 65us latency target while maintaining 10GE line-rate (at least
1500 mtu line-rate).

I personally don't want to even stop at 65.  I want to hit that 36us!  
In case you think that is crazy, my first prototype of venet was hitting
about 140us, and I shaved 10us here, 10us there, eventually getting down
to the 65us we have today.  The low hanging fruit is all but harvested
at this point, but I am not done searching for additional sources of
latency. I just needed to take a breather to get the code out there for
review. :)

>
> And you didn't mention whether your latency tests are based on ping or
> something more sophisticated

Well, the numbers posted were actually from netperf -t UDP_RR.  This
generates a pps from a continuous (but non-bursted) RTT measurement.  So
I invert the pps result of this test to get the average rtt time.  I
have also confirmed that ping jibes with these results (e.g. virtio-net
results were about 4ms, and venet was about 0.065ms as reported by ping).

> as ping will be a pathological case
Ah, but this is not really pathological IMO.  There are plenty of
workloads that exhibit request-reply patterns (e.g. RPC), and this is a
direct measurement of the system's ability to support these
efficiently.   And even unidirectional flows can be hampered by poor
latency (think PTP clock sync, etc).

Massive throughput with poor latency is like Andrew Tanenbaum's
station-wagon full of backup tapes ;)  I think I have proven we can
actually get both with a little creative use of resources.

> that doesn't allow any notification batching.
Well, if we can take anything away from all this: I think I have
demonstrated that you don't need notification batching to get good
throughput.  And batching on the head-end of the queue adds directly to
your latency overhead, so I don't think it's a good technique in general
(though I realize that not everyone cares about latency, per se, so
maybe most are satisfied with the status quo).

>
>> I agree that the "does anyone care" part of the equation will approach
>> zero as the latency difference shrinks across some threshold (probably
>> the single microsecond range), but I will believe that is even possible
>> when I see it ;)
>>   
>
> Note the other hat we have to wear is not just virtualization
> developer but Linux developer.  If there are bad userspace interfaces
> for IO that impose artificial restrictions, then we need to identify
> those and fix them.

Fair enough, and I would love to take that on but alas my
development/debug bandwidth is rather finite these days ;)

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 20:40             ` Chris Wright
  2009-04-01 21:11               ` Gregory Haskins
@ 2009-04-02  3:11               ` Herbert Xu
  1 sibling, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  3:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: anthony, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Chris Wright <chrisw@sous-sol.org> wrote:
>
>> That said, I don't think we're bound today by the fact that we're in  
>> userspace.  Rather we're bound by the interfaces we have between the  
>> host kernel and userspace to generate IO.  I'd rather fix those  
>> interfaces than put more stuff in the kernel.
> 
> And more stuff in the kernel can come at the potential cost of weakening
> protection/isolation.

Protection/isolation always comes at a cost.  Not everyone wants
to pay that, just like health insurance :) We should enable the
users to choose which model they want, based on their needs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01  6:08 ` Rusty Russell
  2009-04-01 11:35   ` Gregory Haskins
  2009-04-01 16:10   ` Anthony Liguori
@ 2009-04-02  3:15   ` Herbert Xu
  2 siblings, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  3:15 UTC (permalink / raw)
  To: Rusty Russell
  Cc: ghaskins, linux-kernel, agraf, pmullaney, pmorreale, anthony,
	netdev, kvm

Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
> As you point out, 350-450 is possible, which is still bad, and it's at least
> partially caused by the exit to userspace and two system calls.  If virtio_net
> had a backend in the kernel, we'd be able to compare numbers properly.

FWIW I don't really care whether we go with this or a kernel
virtio_net backend.  Either way should be good.  However the
status quo where we're stuck with a user-space backend really
sucks!

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 22:10                   ` Gregory Haskins
@ 2009-04-02  6:00                     ` Chris Wright
  0 siblings, 0 replies; 146+ messages in thread
From: Chris Wright @ 2009-04-02  6:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Chris Wright, Anthony Liguori, Andi Kleen, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

* Gregory Haskins (ghaskins@novell.com) wrote:

<snip nice list>

> I cant think of more examples right now, but I will update this list
> if/when I come up with more.  I hope that satisfactorily answered your
> question, though!

Yes, that helps, thanks.

There's still the simple issue of guest/host interface widening w/ a
kernel-resident backend, where a plain ol' bug (good that you thought about
the isolation) can take out more than a single guest.  Always the balance... ;-)

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  3:09             ` Herbert Xu
@ 2009-04-02  6:46               ` Avi Kivity
  2009-04-02  8:54                 ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  6:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> Anthony Liguori <anthony@codemonkey.ws> wrote:
>   
>> That said, I don't think we're bound today by the fact that we're in 
>> userspace.  Rather we're bound by the interfaces we have between the 
>> host kernel and userspace to generate IO.  I'd rather fix those 
>> interfaces than put more stuff in the kernel.
>>     
>
> I'm sorry but I totally disagree with that.  By having our IO
> infrastructure in user-space we've basically given up the main
> advantage of kvm, which is that the physical drivers operate in
> the same environment as the hypervisor.
>   

I don't understand this.  If we had good interfaces, all that userspace 
would do is translate guest physical addresses to host physical 
addresses, and translate the guest->host protocol to host API calls.  I 
don't see anything there that benefits from being in the kernel.

Can you elaborate?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 21:09             ` Gregory Haskins
  2009-04-02  0:29               ` Anthony Liguori
@ 2009-04-02  6:51               ` Avi Kivity
  2009-04-02  8:52                 ` Herbert Xu
  2009-04-02 10:46                 ` Gregory Haskins
  1 sibling, 2 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  6:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>
>
> I think there is a slight disconnect here.  This is *exactly* what I am
> trying to do.  You can of course do this many ways, and I am not denying
> it could be done a different way than the path I have chosen.  One
> extreme would be to just slam a virtio-net specific chunk of code
> directly into kvm on the host.  Another extreme would be to build a
> generic framework into Linux for declaring arbitrary IO types,
> integrating it with kvm (as well as other environments such as lguest,
> userspace, etc), and building a virtio-net model on top of that.
>
> So in case it is not obvious at this point, I have gone with the latter
> approach.  I wanted to make sure it wasn't kvm specific or something
> like pci specific so it had the broadest applicability to a range of
> environments.  So that is why the design is the way it is.  I understand
> that this approach is technically "harder/more-complex" than the "slam
> virtio-net into kvm" approach, but I've already done that work.  All we
> need to do now is agree on the details ;)
>
>   

virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
(s390 uses it).

>> That said, I don't think we're bound today by the fact that we're in
>> userspace.
>>     
> You will *always* be bound by the fact that you are in userspace.  It's
> purely a question of "how much" and "does anyone care".    Right now,
> the answer is "a lot (roughly 45x slower)" and "at least Greg's customers
> do".  I have no doubt that this can and will change/improve in the
> future.  But it will always be true that no matter how much userspace
> improves, the kernel-based solution will always be faster.  It's simple
> physics.  I'm cutting out the middleman to ultimately reach the same
> destination as the userspace path, so userspace can never be equal.
>   

If you have a good exit mitigation scheme you can cut exits by a factor 
of 100; so the userspace exit costs are cut by the same factor.  If you 
have good copyless networking APIs you can cut the cost of copies to 
zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
needs that too).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:51               ` Avi Kivity
@ 2009-04-02  8:52                 ` Herbert Xu
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02 10:46                 ` Gregory Haskins
  1 sibling, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  8:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity <avi@redhat.com> wrote:
>
> virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
> (s390 uses it).

I think Greg's work shows that putting the backend in the kernel
can dramatically reduce the cost of a single guest->host transaction.
I'm sure the same thing would work for virtio too.

> If you have a good exit mitigation scheme you can cut exits by a factor 
> of 100; so the userspace exit costs are cut by the same factor.  If you 
> have good copyless networking APIs you can cut the cost of copies to 
> zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
> needs that too).

Given the choice of having to mitigate or not having the problem
in the first place, guess what I would prefer :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:46               ` Avi Kivity
@ 2009-04-02  8:54                 ` Herbert Xu
  2009-04-02  9:03                   ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  8:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote:
>
> I don't understand this.  If we had good interfaces, all that userspace  
> would do is translate guest physical addresses to host physical  
> addresses, and translate the guest->host protocol to host API calls.  I  
> don't see anything there that benefits from being in the kernel.
>
> Can you elaborate?

I think Greg has expressed it clearly enough.

At the end of the day, the numbers speak for themselves.  So if
and when there's a user-space version that achieves the same or
better results, then I will change my mind :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  8:52                 ` Herbert Xu
@ 2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
                                       ` (2 more replies)
  0 siblings, 3 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  9:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> Avi Kivity <avi@redhat.com> wrote:
>   
>> virtio is already non-kvm-specific (lguest uses it) and non-pci-specific 
>> (s390 uses it).
>>     
>
> I think Greg's work shows that putting the backend in the kernel
> can dramatically reduce the cost of a single guest->host transaction.
> I'm sure the same thing would work for virtio too.
>   

Virtio suffers because we've had no notification of when a packet is 
actually submitted.  With the notification, the only difference should 
be in the cost of a kernel->user switch, which is nowhere near as 
dramatic.

>> If you have a good exit mitigation scheme you can cut exits by a factor 
>> of 100; so the userspace exit costs are cut by the same factor.  If you 
>> have good copyless networking APIs you can cut the cost of copies to 
>> zero (well, to the cost of get_user_pages_fast(), but a kernel solution 
>> needs that too).
>>     
>
> Given the choice of having to mitigate or not having the problem
> in the first place, guess what I would prefer :)
>   

There is no choice.  Exiting from the guest to the kernel to userspace 
is prohibitively expensive, you can't do that on every packet.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  8:54                 ` Herbert Xu
@ 2009-04-02  9:03                   ` Avi Kivity
  2009-04-02  9:05                     ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  9:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 09:46:49AM +0300, Avi Kivity wrote:
>   
>> I don't understand this.  If we had good interfaces, all that userspace  
>> would do is translate guest physical addresses to host physical  
>> addresses, and translate the guest->host protocol to host API calls.  I  
>> don't see anything there that benefits from being in the kernel.
>>
>> Can you elaborate?
>>     
>
> I think Greg has expressed it clearly enough.
>
> At the end of the day, the numbers speak for themselves.  So if
> and when there's a user-space version that achieves the same or
> better results, then I will change my mind :)
>   

Like Anthony said, the problem is with the kernel->user interfaces.  We 
won't have a good user space virtio implementation until that is fixed.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:03                   ` Avi Kivity
@ 2009-04-02  9:05                     ` Herbert Xu
  0 siblings, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, andi, ghaskins, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:03:32PM +0300, Avi Kivity wrote:
>
> Like Anthony said, the problem is with the kernel->user interfaces.  We  
> won't have a good user space virtio implementation until that is fixed.

If it's just the interface that's bad, then it should be possible
to do a proof-of-concept patch to show that this is the case.

Even if we have to redesign the interface, at least you can then
say that you guys were right all along :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
@ 2009-04-02  9:16                     ` Herbert Xu
  2009-04-02  9:27                       ` Avi Kivity
  2009-04-02 10:55                     ` Gregory Haskins
  2009-04-03 10:58                     ` Gerd Hoffmann
  2 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote:
>
> There is no choice.  Exiting from the guest to the kernel to userspace  
> is prohibitively expensive, you can't do that on every packet.

I was referring to the bit between the kernel and userspace.

In any case, I just looked at the virtio mitigation code again
and I am completely baffled at why we need it.  Look at Greg's
code or the netback/netfront notification, why do we need this
completely artificial mitigation when the ring itself provides
a natural way of stemming the flow?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:16                     ` Herbert Xu
@ 2009-04-02  9:27                       ` Avi Kivity
  2009-04-02  9:29                         ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  9:27 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:02:09PM +0300, Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace  
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I was referring to the bit between the kernel and userspace.
>
> In any case, I just looked at the virtio mitigation code again
> and I am completely baffled at why we need it.  Look at Greg's
> code or the netback/netfront notification, why do we need this
> completely artificial mitigation when the ring itself provides
> a natural way of stemming the flow?
>   

If the vcpu thread does the transmit, then it will always complete 
sending immediately:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: copy to tap
  qemu: ??

At this point, qemu must enable notification again, since we have no 
notification from tap that the transmit completed.  The only alternative 
is the timer.

If we do the transmit through an extra thread, then scheduling latency 
buys us some time:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: schedule iothread
  iothread: pop packet
  iothread: copy to tap
  iothread: check for more packets
  iothread: enable notification

If tap told us when the packets were actually transmitted, life would be 
wonderful:

  guest: push packet, notify qemu
  qemu: disable notification
  qemu: pop packet
  qemu: queue on tap
  qemu: return to guest
  hardware: churn churn churn
  tap: packet is out
  iothread: check for more packets
  iothread: enable notification

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:27                       ` Avi Kivity
@ 2009-04-02  9:29                         ` Herbert Xu
  2009-04-02  9:33                           ` Herbert Xu
  2009-04-02  9:38                           ` Avi Kivity
  0 siblings, 2 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>
> If tap told us when the packets were actually transmitted, life would be  
> wonderful:

And why do we need this? Because we are in user space!

I'll continue to wait for your patch and numbers :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:29                         ` Herbert Xu
@ 2009-04-02  9:33                           ` Herbert Xu
  2009-04-02  9:38                           ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:33 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Patrick Ohly, David S. Miller

On Thu, Apr 02, 2009 at 05:29:36PM +0800, Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
> >
> > If tap told us when the packets were actually transmitted, life would be  
> > wonderful:
> 
> And why do we need this? Because we are in user space!
> 
> I'll continue to wait for your patch and numbers :)

And in case you're working on that patch, this might interest
you.  Check out the netdev thread titled "TX time stamping".
Now that we give the tap skb its own sk, these two scenarios
are pretty much identical.

I also noticed that, despite davem's threats to revert the patch, it
has now made Linus's tree :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:29                         ` Herbert Xu
  2009-04-02  9:33                           ` Herbert Xu
@ 2009-04-02  9:38                           ` Avi Kivity
  2009-04-02  9:41                             ` Herbert Xu
  2009-04-02 11:06                             ` Gregory Haskins
  1 sibling, 2 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  9:38 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>   
>> If tap told us when the packets were actually transmitted, life would be  
>> wonderful:
>>     
>
> And why do we need this? Because we are in user space!
>
>   

Why does a kernel solution not need to know when a packet is transmitted?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:38                           ` Avi Kivity
@ 2009-04-02  9:41                             ` Herbert Xu
  2009-04-02  9:43                               ` Avi Kivity
  2009-04-02 11:06                             ` Gregory Haskins
  1 sibling, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote:
>
> Why does a kernel solution not need to know when a packet is transmitted?

Because you can install your own destructor?

I don't know what Greg did, but netback did that nasty page destructor
hack which Jeremy is trying to undo :)
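
Roughly, the destructor idea looks like this (illustrative only -- not
Greg's code and not netback's; the venet_* names are invented):

#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Invented names, for illustration only. */
struct venet_backend;
void venet_tx_credit(struct venet_backend *be); /* e.g. credit the guest ring */

static struct venet_backend *venet_be;          /* simplification: one device */

static void venet_skb_destructor(struct sk_buff *skb)
{
        /*
         * Called when the skb is finally freed, i.e. the stack and the
         * driver below are done with the data.  That is the natural
         * "tx complete" event for a zero-copy backend.
         */
        venet_tx_credit(venet_be);
}

static int venet_xmit_one(struct sk_buff *skb)
{
        /*
         * Note: this clobbers any destructor already set (e.g. socket
         * accounting), which is part of why relying on it is fragile.
         */
        skb->destructor = venet_skb_destructor;
        return dev_queue_xmit(skb);
}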

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:41                             ` Herbert Xu
@ 2009-04-02  9:43                               ` Avi Kivity
  2009-04-02  9:44                                 ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02  9:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 12:38:46PM +0300, Avi Kivity wrote:
>   
>> Why does a kernel solution not need to know when a packet is transmitted?
>>     
>
> Because you can install your own destructor?
>   

So we're back to "the problem is with the kernel->user interface, not 
userspace being cursed into slowness".


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:43                               ` Avi Kivity
@ 2009-04-02  9:44                                 ` Herbert Xu
  0 siblings, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02  9:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: ghaskins, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

On Thu, Apr 02, 2009 at 12:43:54PM +0300, Avi Kivity wrote:
>
> So we're back to "the problem is with the kernel->user interface, not  
> userspace being cursed into slowness".

Well until you have a patch + numbers that's only an allegation :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  6:51               ` Avi Kivity
  2009-04-02  8:52                 ` Herbert Xu
@ 2009-04-02 10:46                 ` Gregory Haskins
  2009-04-02 11:43                   ` Avi Kivity
  1 sibling, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>>
>>
>> I think there is a slight disconnect here.  This is *exactly* what I am
>> trying to do.  You can of course do this many ways, and I am not denying
>> it could be done a different way than the path I have chosen.  One
>> extreme would be to just slam a virtio-net specific chunk of code
>> directly into kvm on the host.  Another extreme would be to build a
>> generic framework into Linux for declaring arbitrary IO types,
>> integrating it with kvm (as well as other environments such as lguest,
>> userspace, etc), and building a virtio-net model on top of that.
>>
>> So in case it is not obvious at this point, I have gone with the latter
>> approach.  I wanted to make sure it wasn't kvm specific or something
>> like pci specific so it had the broadest applicability to a range of
>> environments.  So that is why the design is the way it is.  I understand
>> that this approach is technically "harder/more-complex" than the "slam
>> virtio-net into kvm" approach, but I've already done that work.  All we
>> need to do now is agree on the details ;)
>>
>>   
>
> virtio is already non-kvm-specific (lguest uses it) and
> non-pci-specific (s390 uses it).

Ok, then to be more specific, I need it to be more generic than it
already is.  For instance, I need it to be able to integrate with
shm_signals.  If we can do that without breaking the existing ABI, that
would be great!  Last I looked, it was somewhat entwined here, so I didn't
try...but I admit that I didn't try that hard since I already had the IOQ
library ready to go.

>
>>> That said, I don't think we're bound today by the fact that we're in
>>> userspace.
>>>     
>> You will *always* be bound by the fact that you are in userspace.  Its
>> purely a question of "how much" and "does anyone care".    Right now,
>> the anwer is "a lot (roughly 45x slower)" and "at least Greg's customers
>> do".  I have no doubt that this can and will change/improve in the
>> future.  But it will always be true that no matter how much userspace
>> improves, the kernel based solution will always be faster.  Its simple
>> physics.  I'm cutting out the middleman to ultimately reach the same
>> destination as the userspace path, so userspace can never be equal.
>>   
>
> If you have a good exit mitigation scheme you can cut exits by a
> factor of 100; so the userspace exit costs are cut by the same
> factor.  If you have good copyless networking APIs you can cut the
> cost of copies to zero (well, to the cost of get_user_pages_fast(),
> but a kernel solution needs that too).

"exit mitigation' schemes are for bandwidth, not latency.  For latency
it all comes down to how fast you can signal in both directions.  If
someone is going to do a stand-alone request-reply, its generally always
going to be at least one hypercall and one rx-interrupt.  So your speed
will be governed by your signal path, not your buffer bandwidth.

What Ive done is shown that you can use techniques other than buffering
the head of the queue to do exit mitigation for bandwidth, while still
maintaining a very short signaling path for latency.  And I also argue
that the latter will always be optimal in the kernel, though I know by
which degree is still TBD.  Anthony thinks he can make the difference
negligible, and I would love to see it but am skeptical.

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
@ 2009-04-02 10:55                     ` Gregory Haskins
  2009-04-02 11:48                       ` Avi Kivity
  2009-04-03 10:58                     ` Gerd Hoffmann
  2 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 10:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1626 bytes --]

Avi Kivity wrote:
> Herbert Xu wrote:
>> Avi Kivity <avi@redhat.com> wrote:
>>  
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>>     
>>
>> I think Greg's work shows that putting the backend in the kernel
>> can dramatically reduce the cost of a single guest->host transaction.
>> I'm sure the same thing would work for virtio too.
>>   
>
> Virtio suffers because we've had no notification of when a packet is
> actually submitted.  With the notification, the only difference should
> be in the cost of a kernel->user switch, which is nowhere nearly as
> dramatic.
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>>     
>>
>> Given the choice of having to mitigate or not having the problem
>> in the first place, guess what I would prefer :)
>>   
>
> There is no choice.  Exiting from the guest to the kernel to userspace
> is prohibitively expensive, you can't do that on every packet.
>

Now you are making my point ;)  This is part of the cost of your
signaling path, and it directly adds to your latency.  You can't
buffer packets here if the guest is only going to send one and wait for
a response and still expect that to perform well.  And this is precisely
what drove me to look at avoiding going back to userspace in the first place.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:38                           ` Avi Kivity
  2009-04-02  9:41                             ` Herbert Xu
@ 2009-04-02 11:06                             ` Gregory Haskins
  2009-04-02 11:59                               ` Avi Kivity
  2009-04-02 12:13                               ` Rusty Russell
  1 sibling, 2 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 11:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 664 bytes --]

Avi Kivity wrote:
> Herbert Xu wrote:
>> On Thu, Apr 02, 2009 at 12:27:17PM +0300, Avi Kivity wrote:
>>  
>>> If tap told us when the packets were actually transmitted, life
>>> would be  wonderful:
>>>     
>>
>> And why do we need this? Because we are in user space!
>>
>>   
>
> Why does a kernel solution not need to know when a packet is transmitted?
>

You do not need to know when the packet is copied (which I currently
do).  You only need it for zero-copy (which I would like to support,
but as I understand it there are problems with the reliability of the
callback, i.e. skb->destructor).

It's "fire and forget" :)

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 10:46                 ` Gregory Haskins
@ 2009-04-02 11:43                   ` Avi Kivity
  2009-04-02 12:22                     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 11:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:

  

>> virtio is already non-kvm-specific (lguest uses it) and
>> non-pci-specific (s390 uses it).
>>     
>
> Ok, then to be more specific, I need it to be more generic than it
> already is.  For instance, I need it to be able to integrate with
> shm_signals.  

Why?

  

>> If you have a good exit mitigation scheme you can cut exits by a
>> factor of 100; so the userspace exit costs are cut by the same
>> factor.  If you have good copyless networking APIs you can cut the
>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>> but a kernel solution needs that too).
>>     
>
> "exit mitigation' schemes are for bandwidth, not latency.  For latency
> it all comes down to how fast you can signal in both directions.  If
> someone is going to do a stand-alone request-reply, its generally always
> going to be at least one hypercall and one rx-interrupt.  So your speed
> will be governed by your signal path, not your buffer bandwidth.
>   

The userspace path is longer by 2 microseconds (for two additional 
heavyweight exits) and a few syscalls.  I don't think that's worthy of 
putting all the code in the kernel.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 09/17] net: Add vbus_enet driver
  2009-03-31 20:39   ` Stephen Hemminger
@ 2009-04-02 11:43     ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 11:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 535 bytes --]

Stephen Hemminger wrote:
> On Tue, 31 Mar 2009 14:43:34 -0400
> Gregory Haskins <ghaskins@novell.com> wrote:
>   
>> +struct vbus_enet_priv {
>> +	spinlock_t                 lock;
>> +	struct net_device         *dev;
>> +	struct vbus_device_proxy  *vdev;
>> +	struct napi_struct         napi;
>> +	struct net_device_stats    stats;
>>     
>
> Not needed any more, stats are available in net_device
>
>   

Thanks for the review, Stephen!

I will apply all of your recommended fixes for the next release.
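
For the record, the stats one just means dropping the private copy and
bumping the counters already embedded in struct net_device, roughly
(sketch only; the helper name is made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void vbus_enet_count_rx(struct net_device *dev, struct sk_buff *skb)
{
        dev->stats.rx_packets++;
        dev->stats.rx_bytes += skb->len;
}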

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 10:55                     ` Gregory Haskins
@ 2009-04-02 11:48                       ` Avi Kivity
  0 siblings, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 11:48 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:

  

>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>
>>     
>
> Now you are making my point ;)  This is part of the cost of your
> signaling path, and it directly adds to your latency time.   

It adds a microsecond.  The kvm overhead of putting things in userspace
is low enough that I don't know why people keep mentioning it.  The problem
is the kernel/user networking interfaces.

> You can't
> buffer packets here if the guest is only going to send one and wait for
> a response and expect that to perform well.  And this is precisely what
> drove me to look at avoiding going back to userspace in the first place.
>   

We're not buffering any packets.  What we lack is a way to tell the 
guest that we're done processing all packets in the ring (IOW, re-enable 
notifications).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:06                             ` Gregory Haskins
@ 2009-04-02 11:59                               ` Avi Kivity
  2009-04-02 12:30                                 ` Gregory Haskins
  2009-04-02 12:13                               ` Rusty Russell
  1 sibling, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 11:59 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

Gregory Haskins wrote:

  

>> Why does a kernel solution not need to know when a packet is transmitted?
>>
>>     
>
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).
>
> Its "fire and forget" :)
>   

It's more of a "schedule and forget" which I think brings you the win.  
The host disables notifications and schedules the actual tx work (rx 
from the host's perspective).  So now the guest and host continue 
producing and consuming packets in parallel.  So long as the guest is 
faster (due to the host being throttled?), notifications continue to be 
disabled.

If you changed your rx_isr() to process the packets immediately instead 
of scheduling, I think throughput would drop dramatically.

Mark had a similar change for virtio.  Mark?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:06                             ` Gregory Haskins
  2009-04-02 11:59                               ` Avi Kivity
@ 2009-04-02 12:13                               ` Rusty Russell
  2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 15:10                                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 146+ messages in thread
From: Rusty Russell @ 2009-04-02 12:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> You do not need to know when the packet is copied (which I currently
> do).  You only need it for zero-copy (of which I would like to support,
> but as I understand it there are problems with the reliability of proper
> callback (i.e. skb->destructor).

But if you have a UP guest, there will *never* be another packet in the queue
at this point, since it wasn't running.

As Avi said, you can do the processing in another thread and go back to the
guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
again before the thread did for exactly this kind of reason.

While Avi's point about a "powerful enough userspace API" is probably valid,
I don't think it's going to happen.  It's almost certainly less code to put a
virtio_net server in the kernel than it is to create such a powerful
interface (see vringfd & tap).  And that interface would have one user in
practice.

So, let's roll out a kernel virtio_net server.  Anyone?
Rusty.

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:43                   ` Avi Kivity
@ 2009-04-02 12:22                     ` Gregory Haskins
  2009-04-02 12:42                       ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3660 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> virtio is already non-kvm-specific (lguest uses it) and
>>> non-pci-specific (s390 uses it).
>>>     
>>
>> Ok, then to be more specific, I need it to be more generic than it
>> already is.  For instance, I need it to be able to integrate with
>> shm_signals.  
>
> Why?
Well, shm_signals is what I designed to be the event mechanism for vbus
devices.  One of the design criteria of shm_signal is that it should
support a variety of environments, such as kvm, but also something like
userspace apps.  So I cannot make assumptions about things like "pci
interrupts", etc.

So if I want to use virtio-ring in vbus, it has to be able to use
shm_signals, as opposed to what it does today.  Part of this would be a
natural fit for the "kick()" callback in virtio, but there are other
problems.  For one, virtio-ring (IIUC) does its own event-masking directly
in the virtio metadata.  However, I really want the higher-layer
ring-overlay to do its masking in terms of the lower-layered shm_signal in
order to work the way I envision this stuff.  If you look at the IOQ
implementation, this is exactly what it does.
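
To illustrate the layering I am describing (this is not the actual
shm-signal API from the series, just its shape; the names are placeholders):

/* Not the real shm-signal API; illustrative shape only. */
struct signal;

struct signal_ops {
        void (*inject)(struct signal *s);   /* "kick" the other side   */
        void (*enable)(struct signal *s);   /* unmask notifications    */
        void (*disable)(struct signal *s);  /* mask notifications      */
};

struct signal {
        const struct signal_ops *ops;       /* kvm, userspace, lguest, ... */
};

struct ring {
        /* descriptors, head/tail indices, ... */
        struct signal *signal;              /* transport-agnostic event path */
};

static inline void ring_kick(struct ring *r)
{
        /* The ring code never touches a PCI irq or a hypercall directly;
         * masking and kicking go through the signal abstraction. */
        r->signal->ops->inject(r->signal);
}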

To be clear, and I've stated this in the past: venet is just an example
of this generic, in-kernel concept.  We plan on doing much, much more
with all this.  One of the things we are working on is having userspace
clients be able to access this too, with the ultimate goal of
supporting things like guest-userspace doing bypass, rdma, etc.
We are not there yet, though...only the kvm-host to guest-kernel path is
currently functional and is thus the working example.

I totally "get" the attraction of doing things in userspace.  It's
contained, naturally isolated, easily supports migration, etc.  But it also
comes with a penalty.  Bare-metal userspace apps have a direct path to the
kernel IO.  I want to give guests the same advantage.  Some people will care
more about things like migration than performance, and that is fine.
But others will certainly care more about performance, and that is what
we are trying to address.

>
>  
>
>>> If you have a good exit mitigation scheme you can cut exits by a
>>> factor of 100; so the userspace exit costs are cut by the same
>>> factor.  If you have good copyless networking APIs you can cut the
>>> cost of copies to zero (well, to the cost of get_user_pages_fast(),
>>> but a kernel solution needs that too).
>>>     
>>
>> "exit mitigation' schemes are for bandwidth, not latency.  For latency
>> it all comes down to how fast you can signal in both directions.  If
>> someone is going to do a stand-alone request-reply, its generally always
>> going to be at least one hypercall and one rx-interrupt.  So your speed
>> will be governed by your signal path, not your buffer bandwidth.
>>   
>
> The userspace path is longer by 2 microseconds (for two additional
> heavyweight exits) and a few syscalls.  I don't think that's worthy of
> putting all the code in the kernel.

By your own words, the exit to userspace is "prohibitively expensive",
so that is either true or it's not.  If it's 2 microseconds, show me.  We
need the rtt to go from a "kick" PIO all the way to queuing a packet
on the egress hardware and back.  That is going to define your
latency.  If you can do this such that you can do something like an ICMP
ping in 65us (or anything close to a few dozen microseconds of that),
I'll shut up about how much I think the current path sucks ;)  Even so,
I still propose the concept of a framework for in-kernel devices for
all the other reasons I mentioned above.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 11:59                               ` Avi Kivity
@ 2009-04-02 12:30                                 ` Gregory Haskins
  2009-04-02 12:43                                   ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:30 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> Why does a kernel solution not need to know when a packet is
>>> transmitted?
>>>
>>>     
>>
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>>
>> Its "fire and forget" :)
>>   
>
> It's more of a "schedule and forget" which I think brings you the
> win.  The host disables notifications and schedules the actual tx work
> (rx from the host's perspective).  So now the guest and host continue
> producing and consuming packets in parallel.  So long as the guest is
> faster (due to the host being throttled?), notifications continue to
> be disabled.
Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling. 
When its < 1, we signal roughly once per packet.
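
Schematically, the producer-side rule is something like this (not the
actual IOQ/venet code; the shmring_* helpers are placeholders):

/* Hypothetical helpers; only the signaling rule matters here. */
struct shmring;
struct packet;
int  shmring_empty(struct shmring *r);
int  shmring_notify_enabled(struct shmring *r);  /* consumer's mask bit */
void shmring_push(struct shmring *r, struct packet *p);
void shmring_signal(struct shmring *r);

void ring_produce(struct shmring *r, struct packet *p)
{
        int idle = shmring_empty(r);

        shmring_push(r, p);
        /* a write barrier to publish the entry would go here */

        /*
         * If the consumer is already working through the ring (ratio > 1)
         * it will pick this entry up on its next pass; only signal when it
         * may have gone idle and has left notifications enabled.
         */
        if (idle && shmring_notify_enabled(r))
                shmring_signal(r);
}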

>
> If you changed your rx_isr() to process the packets immediately
> instead of scheduling, I think throughput would drop dramatically.
Right, that is the point. :) This is that "soft asic" thing I was
talking about yesterday.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:22                     ` Gregory Haskins
@ 2009-04-02 12:42                       ` Avi Kivity
  2009-04-02 12:54                         ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 12:42 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>
>>  
>>
>>     
>>>> virtio is already non-kvm-specific (lguest uses it) and
>>>> non-pci-specific (s390 uses it).
>>>>     
>>>>         
>>> Ok, then to be more specific, I need it to be more generic than it
>>> already is.  For instance, I need it to be able to integrate with
>>> shm_signals.  
>>>       
>> Why?
>>     
> Well, shm_signals is what I designed to be the event mechanism for vbus
> devices.  One of the design criteria of shm_signal is that it should
> support a variety of environments, such as kvm, but also something like
> userspace apps.  So I cannot make assumptions about things like "pci
> interrupts", etc.
>   

virtio doesn't make these assumptions either.  The only difference I see 
is that you separate notification from the ring structure.

> By your own words, the exit to userspace is "prohibitively expensive",
> so that is either true or its not.  If its 2 microseconds, show me.

In user/test/x86/vmexit.c, change 'cpuid' to 'out %al, $0'; drop the 
printf() in kvmctl.c's test_outb().

I get something closer to 4 microseconds, but that's on a two-year-old
machine; it will be around two on Nehalems.
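
The measurement boils down to something like this (not the actual
vmexit.c, just the idea: time a trapping PIO instruction with the TSC from
inside the guest, then divide by the TSC rate to get time):

#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;

        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        enum { N = 1 << 20 };
        unsigned long long t0, t1;
        int i;

        t0 = rdtsc();
        for (i = 0; i < N; i++)
                asm volatile("outb %%al, $0" : : "a"(0));  /* PIO -> vmexit */
        t1 = rdtsc();

        printf("%llu cycles per pio exit\n", (t1 - t0) / N);
        return 0;
}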

My 'prohibitively expensive' is true only if you exit every packet.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:30                                 ` Gregory Haskins
@ 2009-04-02 12:43                                   ` Avi Kivity
  2009-04-02 13:03                                     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 12:43 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

Gregory Haskins wrote:

  

>> It's more of a "schedule and forget" which I think brings you the
>> win.  The host disables notifications and schedules the actual tx work
>> (rx from the host's perspective).  So now the guest and host continue
>> producing and consuming packets in parallel.  So long as the guest is
>> faster (due to the host being throttled?), notifications continue to
>> be disabled.
>>     
> Yep, when the "producer::consumer" ratio is > 1, we mitigate signaling. 
> When its < 1, we signal roughly once per packet.
>
>   
>> If you changed your rx_isr() to process the packets immediately
>> instead of scheduling, I think throughput would drop dramatically.
>>     
> Right, that is the point. :) This is that "soft asic" thing I was
> talking about yesterday.
>   

But all that has nothing to do with where the code lives, in the kernel 
or userspace.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:13                               ` Rusty Russell
@ 2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 12:52                                   ` Gregory Haskins
  2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 15:10                                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:50 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2296 bytes --]

Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>   
>> You do not need to know when the packet is copied (which I currently
>> do).  You only need it for zero-copy (of which I would like to support,
>> but as I understand it there are problems with the reliability of proper
>> callback (i.e. skb->destructor).
>>     
>
> But if you have a UP guest,

I assume you mean UP host ;)

>  there will *never* be another packet in the queue
> at this point, since it wasn't running.
>   
Yep, and I'll be the first to admit that my design only looks forward.
It's for high-speed links and multi-core cpus, etc.  If you have a
uniprocessor host, the throughput would likely start to suffer with my
current strategy.  You could probably reclaim some of that throughput
(at the cost of latency) by doing as you are suggesting with the deferred
initial signalling.  However, it is still a tradeoff to account for the
lower-end rig.  I could certainly put a heuristic/timer on the
guest->host path to mitigate this as well, but that is not my target use
case anyway, so I am not sure it is worth it.


> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
>
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
>
> So, let's roll out a kernel virtio_net server.  Anyone?
>   
Hmm... well, I was hoping to be able to work with you guys to make my
proposal fit this role.  If there is no interest in that, I hope that my
infrastructure itself may still be considered for merging (in *some*
tree, not -kvm per se), as I would prefer not to maintain it out of tree
if it can be avoided.  I think people will find that the new logic
touches very few existing kernel lines at all, and it can be completely
disabled with config options, so it should be relatively inconsequential
to those that do not care.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:50                                 ` Gregory Haskins
@ 2009-04-02 12:52                                   ` Gregory Haskins
  2009-04-02 13:07                                   ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 402 bytes --]

Gregory Haskins wrote:
> Rusty Russell wrote:
>   
>
>>  there will *never* be another packet in the queue
>> at this point, since it wasn't running.
>>   
>>     
> Yep, and I'll be the first to admit that my design only looks forward. 
>   
To clarify, I am referring to the internal design of the venet-tap
only.  The general vbus architecture makes no such policy decisions.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:42                       ` Avi Kivity
@ 2009-04-02 12:54                         ` Gregory Haskins
  2009-04-02 13:08                           ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 12:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 246 bytes --]

Avi Kivity wrote:
>
>
> My 'prohibitively expensive' is true only if you exit every packet.
>
>

Understood, and yet you need to do this if you want something like iSCSI
READ transactions to have as low a latency as possible.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:43                                   ` Avi Kivity
@ 2009-04-02 13:03                                     ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, anthony, andi, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm, Mark McLoughlin

[-- Attachment #1: Type: text/plain, Size: 1289 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>
>  
>
>>> It's more of a "schedule and forget" which I think brings you the
>>> win.  The host disables notifications and schedules the actual tx work
>>> (rx from the host's perspective).  So now the guest and host continue
>>> producing and consuming packets in parallel.  So long as the guest is
>>> faster (due to the host being throttled?), notifications continue to
>>> be disabled.
>>>     
>> Yep, when the "producer::consumer" ratio is > 1, we mitigate
>> signaling. When its < 1, we signal roughly once per packet.
>>
>>  
>>> If you changed your rx_isr() to process the packets immediately
>>> instead of scheduling, I think throughput would drop dramatically.
>>>     
>> Right, that is the point. :) This is that "soft asic" thing I was
>> talking about yesterday.
>>   
>
> But all that has nothing to do with where the code lives, in the
> kernel or userspace.

Agreed, but note I've already stated that some of my boost likely comes
from being in-kernel, while some comes from unrelated design elements such
as the "soft-asic" approach (you guys don't read my 10 page emails, do
you? ;).  I don't deny that some of my ideas could be used in userspace as
well (credit, if used, would be appreciated :).

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:50                                 ` Gregory Haskins
  2009-04-02 12:52                                   ` Gregory Haskins
@ 2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 13:22                                     ` Gregory Haskins
  2009-04-02 14:50                                     ` Herbert Xu
  1 sibling, 2 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 13:07 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

Gregory Haskins wrote:
> Rusty Russell wrote:
>   
>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>   
>>     
>>> You do not need to know when the packet is copied (which I currently
>>> do).  You only need it for zero-copy (of which I would like to support,
>>> but as I understand it there are problems with the reliability of proper
>>> callback (i.e. skb->destructor).
>>>     
>>>       
>> But if you have a UP guest,
>>     
>
> I assume you mean UP host ;)
>
>   

I think Rusty did mean a UP guest, and without schedule-and-forget.

> Hmm..well I was hoping to be able to work with you guys to make my
> proposal fit this role.  If there is no interest in that, I hope that my
> infrastructure itself may still be considered for merging (in *some*
> tree, not -kvm per se) as I would prefer to not maintain it out of tree
> if it can be avoided.

The problem is that we already have virtio guest drivers going several 
kernel versions back, as well as Windows drivers.  We can't keep 
changing the infrastructure under people's feet.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:54                         ` Gregory Haskins
@ 2009-04-02 13:08                           ` Avi Kivity
  2009-04-02 13:36                             ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 13:08 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> My 'prohibitively expensive' is true only if you exit every packet.
>>
>>
>>     
>
> Understood, but yet you need to do this if you want something like iSCSI
> READ transactions to have as low-latency as possible.
>   

Dunno, two microseconds is too much?  The wire imposes much more.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:07                                   ` Avi Kivity
@ 2009-04-02 13:22                                     ` Gregory Haskins
  2009-04-02 13:27                                       ` Avi Kivity
  2009-04-02 14:50                                     ` Herbert Xu
  1 sibling, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1983 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Rusty Russell wrote:
>>  
>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>      
>>>> You do not need to know when the packet is copied (which I currently
>>>> do).  You only need it for zero-copy (of which I would like to
>>>> support,
>>>> but as I understand it there are problems with the reliability of
>>>> proper
>>>> callback (i.e. skb->destructor).
>>>>           
>>> But if you have a UP guest,
>>>     
>>
>> I assume you mean UP host ;)
>>
>>   
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.
That doesn't make sense to me, though.  All the testing I did was with a UP
guest, actually.  Why would I be constrained to run without the
scheduling unless the host was also UP?

>
>> Hmm..well I was hoping to be able to work with you guys to make my
>> proposal fit this role.  If there is no interest in that, I hope that my
>> infrastructure itself may still be considered for merging (in *some*
>> tree, not -kvm per se) as I would prefer to not maintain it out of tree
>> if it can be avoided.
>
> The problem is that we already have virtio guest drivers going several
> kernel versions back, as well as Windows drivers.  We can't keep
> changing the infrastructure under people's feet.

Well, IIUC the virtio code itself declares the ABI as unstable, so there
technically *is* an out if we really wanted one.  But I certainly
understand the desire to not change this ABI if at all possible, and
thus the resistance here.

However, there's still the possibility we can make this work in an
ABI-friendly way with cap-bits, or other such features.  For instance, the
virtio-net driver could register both with pci and vbus-proxy and
instantiate a device with a slightly different ops structure for each, or
something.  Alternatively, we could write a host-side shim to expose vbus
devices as pci devices, or something like that.

-Greg

>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:22                                     ` Gregory Haskins
@ 2009-04-02 13:27                                       ` Avi Kivity
  2009-04-02 14:05                                         ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 13:27 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Rusty Russell wrote:
>>>  
>>>       
>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>      
>>>>         
>>>>> You do not need to know when the packet is copied (which I currently
>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>> support,
>>>>> but as I understand it there are problems with the reliability of
>>>>> proper
>>>>> callback (i.e. skb->destructor).
>>>>>           
>>>>>           
>>>> But if you have a UP guest,
>>>>     
>>>>         
>>> I assume you mean UP host ;)
>>>
>>>   
>>>       
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>     
> That doesnt make sense to me, tho.  All the testing I did was a UP
> guest, actually.  Why would I be constrained to run without the
> scheduling unless the host was also UP?
>   

You aren't constrained.  And your numbers show it works.

>>
>> The problem is that we already have virtio guest drivers going several
>> kernel versions back, as well as Windows drivers.  We can't keep
>> changing the infrastructure under people's feet.
>>     
>
> Well, IIUC the virtio code itself declares the ABI as unstable, so there
> technically *is* an out if we really wanted one.  But I certainly
> understand the desire to not change this ABI if at all possible, and
> thus the resistance here.
>   

virtio is a stable ABI.

> However, theres still the possibility we can make this work in an ABI
> friendly way with cap-bits, or other such features.  For instance, the
> virtio-net driver could register both with pci and vbus-proxy and
> instantiate a device with a slightly different ops structure for each or
> something.  Alternatively we could write a host-side shim to expose vbus
> devices as pci devices or something like that.
>   

Sounds complicated...

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:08                           ` Avi Kivity
@ 2009-04-02 13:36                             ` Gregory Haskins
  2009-04-02 13:45                               ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 13:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1350 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>
>>>
>>>     
>>
>> Understood, but yet you need to do this if you want something like iSCSI
>> READ transactions to have as low-latency as possible.
>>   
>
> Dunno, two microseconds is too much?  The wire imposes much more.
>

No, but that's not what we are talking about.  You said signaling on
every packet is prohibitively expensive.  I am saying signaling on every
packet is required for decent latency.  So is it prohibitively expensive
or not?

I think most would agree that adding 2us is not bad, but so far it is
an unproven theory that the IO path in question only adds 2us.  And we
are not just looking at the rate at which we can enter and exit the
guest...we need the whole path...from the PIO kick to the dev_xmit() on
the egress hardware, to the ingress and rx-injection.  This includes any
and all penalties associated with the path, even if they are imposed by
something like the design of tun-tap.

Right now it's way, way, way worse than 2us.  In fact, at my last reading
this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
maintaining line-rate) and I will be impressed.  Heck, shorten it to
80us and I will be impressed.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:36                             ` Gregory Haskins
@ 2009-04-02 13:45                               ` Avi Kivity
  2009-04-02 14:24                                 ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 13:45 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> Avi Kivity wrote:
>>>  
>>>       
>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>>
>>>>
>>>>     
>>>>         
>>> Understood, but yet you need to do this if you want something like iSCSI
>>> READ transactions to have as low-latency as possible.
>>>   
>>>       
>> Dunno, two microseconds is too much?  The wire imposes much more.
>>
>>     
>
> No, but thats not what we are talking about.  You said signaling on
> every packet is prohibitively expensive.  I am saying signaling on every
> packet is required for decent latency.  So is it prohibitively expensive
> or not?
>   

We're heading dangerously into the word-game area.  Let's not do that.

If you have a high throughput workload with many packets per second,
then an exit per packet (whether to userspace or to the kernel) is 
expensive.  So you do exit mitigation.  Latency is not important since 
the packets are going to sit in the output queue anyway.

If you have a request-response workload with the wire idle and latency 
critical, then there's no problem having an exit per packet because (a) 
there aren't that many packets and (b) the guest isn't doing any 
batching, so guest overhead will swamp the hypervisor overhead.

If you have a low latency request-response workload mixed with a high 
throughput workload, then you aren't going to get low latency since your 
low latency packets will sit on the queue behind the high throughput 
packets.  You can fix that with multiqueue and then you're back to one 
of the scenarios above.

> I think most would agree that adding 2us is not bad, but so far that is
> an unproven theory that the IO path in question only adds 2us.   And we
> are not just looking at the rate at which we can enter and exit the
> guest...we need the whole path...from the PIO kick to the dev_xmit() on
> the egress hardware, to the ingress and rx-injection.  This includes any
> and all penalties associated with the path, even if they are imposed by
> something like the design of tun-tap.
>   

Correct, we need to look at the whole path.  That's why the wishing well 
is clogged with my 'give me a better userspace interface' emails.

> Right now its way way way worse than 2us.  In fact, at my last reading
> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
> maintaining line-rate) and I will be impressed.  Heck, shorten it to
> 80us and I will be impressed.
>   

The 3060us thing is a timer, not cpu time.  We aren't starting a JVM for 
each packet.  We could remove it given a notification API, or by
duplicating the sched-and-forget thing, like Rusty did with lguest or
Mark with qemu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:27                                       ` Avi Kivity
@ 2009-04-02 14:05                                         ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 14:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3713 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> Rusty Russell wrote:
>>>>  
>>>>      
>>>>> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
>>>>>             
>>>>>> You do not need to know when the packet is copied (which I currently
>>>>>> do).  You only need it for zero-copy (of which I would like to
>>>>>> support,
>>>>>> but as I understand it there are problems with the reliability of
>>>>>> proper
>>>>>> callback (i.e. skb->destructor).
>>>>>>                     
>>>>> But if you have a UP guest,
>>>>>             
>>>> I assume you mean UP host ;)
>>>>
>>>>         
>>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>>     
>> That doesnt make sense to me, tho.  All the testing I did was a UP
>> guest, actually.  Why would I be constrained to run without the
>> scheduling unless the host was also UP?
>>   
>
> You aren't constrained.  And your numbers show it works.
>
>>>
>>> The problem is that we already have virtio guest drivers going several
>>> kernel versions back, as well as Windows drivers.  We can't keep
>>> changing the infrastructure under people's feet.
>>>     
>>
>> Well, IIUC the virtio code itself declares the ABI as unstable, so there
>> technically *is* an out if we really wanted one.  But I certainly
>> understand the desire to not change this ABI if at all possible, and
>> thus the resistance here.
>>   
>
> virtio is a stable ABI.

Dang!  Scratch that.
>
>> However, theres still the possibility we can make this work in an ABI
>> friendly way with cap-bits, or other such features.  For instance, the
>> virtio-net driver could register both with pci and vbus-proxy and
>> instantiate a device with a slightly different ops structure for each or
>> something.  Alternatively we could write a host-side shim to expose vbus
>> devices as pci devices or something like that.
>>   
>
> Sounds complicated...

Well, the first solution would be relatively trivial...at least on the
guest side.  All the other infrastructure is done and included in the
series I sent out.  The changes to the virtio-net driver on the guest
itself would be minimal.  The bigger effort would be converting
venet-tap to use virtio-ring instead of IOQ.  But this would arguably be
less work than starting a virtio-net backend module from scratch because
you would have to not only code up the entire virtio-net backend, but
also all the pci emulation and irq routing stuff that is required (and
is already done by the vbus infrastructure).  Here all the major pieces
are in place, just the xmit and rx routines need to be converted to
virtio-isms.

For the second option, I agree.  It's probably too nasty, and it would be
better if there were just either a virtio-net-to-kvm-host hack, or a more
pci-oriented version of a vbus-like framework.

That said, there is certainly nothing wrong with having an alternate
option.  There is plenty of precedent for having different drivers for
different subsystems, etc, even if there is overlap.  Heck, even KVM has
realtek, e1000, and virtio-net, etc.  Would our kvm community be willing
to work with me to get these patches merged?  I am perfectly willing to
maintain them.  That said, the general infrastructure should probably
not live in -kvm (perhaps -tip, -mm, or -next, etc is more
appropriate).  So a good plan might be to shoot for the core going into
a more general upstream tree.  When/if that happens, then the kvm
community could consider the kvm-specific parts, etc.  I realize this is
all pending review and acceptance by everyone involved...

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:45                               ` Avi Kivity
@ 2009-04-02 14:24                                 ` Gregory Haskins
  2009-04-02 14:32                                   ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 14:24 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 3816 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> Avi Kivity wrote:
>>>>  
>>>>      
>>>>> My 'prohibitively expensive' is true only if you exit every packet.
>>>>>
>>>>>
>>>>>             
>>>> Understood, but yet you need to do this if you want something like
>>>> iSCSI
>>>> READ transactions to have as low-latency as possible.
>>>>         
>>> Dunno, two microseconds is too much?  The wire imposes much more.
>>>
>>>     
>>
>> No, but thats not what we are talking about.  You said signaling on
>> every packet is prohibitively expensive.  I am saying signaling on every
>> packet is required for decent latency.  So is it prohibitively expensive
>> or not?
>>   
>
> We're heading dangerously into the word-game area.  Let's not do that.
>
> If you have a high throughput workload with many packets per seconds
> then an exit per packet (whether to userspace or to the kernel) is
> expensive.  So you do exit mitigation.  Latency is not important since
> the packets are going to sit in the output queue anyway.

Agreed.  virtio-net currently does this with batching.  I do it with the
bidir napi thing (which effectively crosses the producer::consumer > 1
threshold to mitigate the signal path).


>
> If you have a request-response workload with the wire idle and latency
> critical, then there's no problem having an exit per packet because
> (a) there aren't that many packets and (b) the guest isn't doing any
> batching, so guest overhead will swamp the hypervisor overhead.
Right, so the trick is to use an algorithm that adapts here.  Batching
solves the first case, but not the second.  The bidir napi thing solves
both, but it does assume you have ample host processing power to run the
algorithm concurrently.  This may or may not be suitable to all
applications, I admit.

>
> If you have a low latency request-response workload mixed with a high
> throughput workload, then you aren't going to get low latency since
> your low latency packets will sit on the queue behind the high
> throughput packets.  You can fix that with multiqueue and then you're
> back to one of the scenarios above.
Agreed, and that's ok.  Now we are getting more into 802.1p-type MQ
issues anyway, if the application cares about it that much.

>
>> I think most would agree that adding 2us is not bad, but so far that is
>> an unproven theory that the IO path in question only adds 2us.   And we
>> are not just looking at the rate at which we can enter and exit the
>> guest...we need the whole path...from the PIO kick to the dev_xmit() on
>> the egress hardware, to the ingress and rx-injection.  This includes any
>> and all penalties associated with the path, even if they are imposed by
>> something like the design of tun-tap.
>>   
>
> Correct, we need to look at the whole path.  That's why the wishing
> well is clogged with my 'give me a better userspace interface' emails.
>
>> Right now its way way way worse than 2us.  In fact, at my last reading
>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>> 80us and I will be impressed.
>>   
>
> The 3060us thing is a timer, not cpu time.
Agreed, but it's still "state of the art" from an observer perspective. 
The reason "why", though easily explainable, is inconsequential to most
people.  FWIW, I have seen virtio-net do a much more respectable 350us
on an older version, so I know there is plenty of room for improvement.

>   We aren't starting a JVM for each packet.
Heh...it kind of feels like that right now, so hopefully an improvement
there will be at least one thing that comes out of all this.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:24                                 ` Gregory Haskins
@ 2009-04-02 14:32                                   ` Avi Kivity
  2009-04-02 14:41                                     ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 14:32 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>> If you have a request-response workload with the wire idle and latency
>> critical, then there's no problem having an exit per packet because
>> (a) there aren't that many packets and (b) the guest isn't doing any
>> batching, so guest overhead will swamp the hypervisor overhead.
>>     
> Right, so the trick is to use an algorithm that adapts here.  Batching
> solves the first case, but not the second.  The bidir napi thing solves
> both, but it does assume you have ample host processing power to run the
> algorithm concurrently.  This may or may not be suitable to all
> applications, I admit.
>   

The alternative is to get a notification from the stack that the packet 
is done processing.  Either an skb destructor in the kernel, or my new 
API that everyone is not rushing out to implement.

>>> Right now its way way way worse than 2us.  In fact, at my last reading
>>> this was more like 3060us (3125-65).  So shorten that 3125 to 67 (while
>>> maintaining line-rate) and I will be impressed.  Heck, shorten it to
>>> 80us and I will be impressed.
>>>   
>>>       
>> The 3060us thing is a timer, not cpu time.
>>     
> Agreed, but it's still "state of the art" from an observer perspective. 
> The reason "why", though easily explainable, is inconsequential to most
> people.  FWIW, I have seen virtio-net do a much more respectable 350us
> on an older version, so I know there is plenty of room for improvement.
>   

All I want is the notification, and the timer is headed into the nearest 
landfill.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:32                                   ` Avi Kivity
@ 2009-04-02 14:41                                     ` Avi Kivity
  2009-04-02 14:49                                       ` Anthony Liguori
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 14:41 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
>
> The alternative is to get a notification from the stack that the 
> packet is done processing.  Either an skb destructor in the kernel, or 
> my new API that everyone is not rushing out to implement.

btw, my new api is


   io_submit(..., nr, ...): submit nr packets
   io_getevents(): complete nr packets

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:41                                     ` Avi Kivity
@ 2009-04-02 14:49                                       ` Anthony Liguori
  2009-04-02 16:09                                         ` Anthony Liguori
  0 siblings, 1 reply; 146+ messages in thread
From: Anthony Liguori @ 2009-04-02 14:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> The alternative is to get a notification from the stack that the 
>> packet is done processing.  Either an skb destructor in the kernel, 
>> or my new API that everyone is not rushing out to implement.
>
> btw, my new api is
>
>
>   io_submit(..., nr, ...): submit nr packets
>   io_getevents(): complete nr packets

I don't think we even need that to end this debate.  I'm convinced we 
have a bug somewhere.  Even disabling TX mitigation, I see a ping 
latency of around 300ns whereas it's only 50ns on the host.  This defies 
logic so I'm now looking to isolate why that is.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 13:07                                   ` Avi Kivity
  2009-04-02 13:22                                     ` Gregory Haskins
@ 2009-04-02 14:50                                     ` Herbert Xu
  2009-04-02 15:00                                       ` Avi Kivity
  1 sibling, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02 14:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm

On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>
> I think Rusty did mean a UP guest, and without schedule-and-forget.

Going off on a tangent here, I don't really think it should matter
whether we're UP or SMP.  The ideal state is where we have the
same number of (virtual) TX queues as there are cores in the guest.
On the host side we need the backend to run at least on a core
that shares cache with the corresponding guest queue/core.  If
that happens to be the same core as the guest core then it should
work as well.

IOW we should optimise it as if the host were UP.

> The problem is that we already have virtio guest drivers going several  
> kernel versions back, as well as Windows drivers.  We can't keep  
> changing the infrastructure under people's feet.

Yes I agree that changing the guest-side driver is a no-no.  However,
we should be able to achieve what's shown here without modifying the
guest-side.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:50                                     ` Herbert Xu
@ 2009-04-02 15:00                                       ` Avi Kivity
  2009-04-02 15:40                                         ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 15:00 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 04:07:09PM +0300, Avi Kivity wrote:
>   
>> I think Rusty did mean a UP guest, and without schedule-and-forget.
>>     
>
> Going off on a tangent here, I don't really think it should matter
> whether we're UP or SMP.  The ideal state is where we have the
> same number of (virtual) TX queues as there are cores in the guest.
> On the host side we need the backend to run at least on a core
> that shares cache with the corresponding guest queue/core.  If
> that happens to be the same core as the guest core then it should
> work as well.
>
> IOW we should optimise it as if the host were UP.
>   

Good point - if we rely on having excess cores in the host, large guest 
scalability will drop.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 12:13                               ` Rusty Russell
  2009-04-02 12:50                                 ` Gregory Haskins
@ 2009-04-02 15:10                                 ` Michael S. Tsirkin
  2009-04-03  4:43                                   ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 146+ messages in thread
From: Michael S. Tsirkin @ 2009-04-02 15:10 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, Avi Kivity, Herbert Xu, anthony, andi,
	linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

On Thu, Apr 02, 2009 at 10:43:19PM +1030, Rusty Russell wrote:
> On Thursday 02 April 2009 21:36:07 Gregory Haskins wrote:
> > You do not need to know when the packet is copied (which I currently
> > do).  You only need it for zero-copy (of which I would like to support,
> > but as I understand it there are problems with the reliability of proper
> > callback (i.e. skb->destructor).
> 
> But if you have a UP guest, there will *never* be another packet in the queue
> at this point, since it wasn't running.
> 
> As Avi said, you can do the processing in another thread and go back to the
> guest; lguest pre-virtio did a hacky "weak" wakeup to ensure the guest ran
> again before the thread did for exactly this kind of reason.
> 
> While Avi's point about a "powerful enough userspace API" is probably valid,
> I don't think it's going to happen.  It's almost certainly less code to put a
> virtio_net server in the kernel, than it is to create such a powerful
> interface (see vringfd & tap).  And that interface would have one user in
> practice.
> 
> So, let's roll out a kernel virtio_net server.  Anyone?
> Rusty.

BTW, whatever approach is chosen, to enable zero-copy transmits, it seems that
we still must add tracking of when the skb has actually been transmitted, right?

Rusty, I think this is what you did in your patch from 2008 to add destructor
for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
and it seems that it would make zero-copy possible - or was there some problem with
that approach? Do you happen to remember?
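
To illustrate the general idea (hypothetical helper names; note that
Rusty's patch hooks the skb *data* rather than the skb itself), a
destructor-based completion could look roughly like this:

/*
 * Illustrative sketch only: signal TX completion back to the guest
 * when the stack finally frees the skb, via the existing
 * skb->destructor hook.  The stack itself uses skb->destructor for
 * socket accounting (sock_wfree), and skb->cb is not guaranteed to
 * survive the whole TX path, which is exactly the reliability
 * concern raised earlier in this thread.
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>

extern void backend_complete_tx(void *cookie);	/* hypothetical */

static void zcopy_skb_destructor(struct sk_buff *skb)
{
	void *cookie = *(void **)skb->cb;	/* assumed to still be intact */

	backend_complete_tx(cookie);		/* guest may now reclaim the pages */
}

static int zcopy_xmit(struct sk_buff *skb, void *cookie)
{
	*(void **)skb->cb = cookie;
	skb->destructor = zcopy_skb_destructor;	/* fires when the skb is freed */
	return dev_queue_xmit(skb);
}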

-- 
MST

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:00                                       ` Avi Kivity
@ 2009-04-02 15:40                                         ` Herbert Xu
  2009-04-02 15:57                                           ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02 15:40 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>
> Good point - if we rely on having excess cores in the host, large guest  
> scalability will drop.

Going back to TX mitigation, I wonder if we could avoid it altogether
by having a "wakeup" mechanism that does not involve a vmexit.  We
have two cases:

1) UP, or rather guest runs on the same core/hyperthread as the
backend.  This is the easy one, the guest simply sets a marker
in shared memory and keeps going until its time is up.  Then the
backend takes over, and uses a marker for notification too.

The markers need to be interpreted by the scheduler so that it
knows the guest/backend is runnable, respectively.

2) The guest and backend runs on two cores/hyperthreads.  We'll
assume that they share caches as otherwise mitigation is the last
thing to worry about.  We use the same marker mechanism as above.
The only caveat is that if one core/hyperthread is idle, its
idle thread needs to monitor the marker (this would be a separate
per-core marker) to wake up the scheduler.
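
As a rough sketch of the marker idea (names and layout are purely
illustrative, and memory barriers are omitted):

#include <linux/types.h>

/* One marker per core/queue, in memory shared between guest and host. */
struct wakeup_marker {
	u32 work_pending;
};

/* Guest producer: no vmexit, just flag that the ring has work. */
static void guest_post_packet(struct wakeup_marker *m)
{
	/* ...descriptor already placed in the shared ring... */
	m->work_pending = 1;	/* host scheduler/idle loop polls this */
}

/* Host side: called from the scheduler or the idle loop of the
 * core/hyperthread that runs the backend. */
static bool backend_has_work(struct wakeup_marker *m)
{
	if (!m->work_pending)
		return false;
	m->work_pending = 0;
	return true;
}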

CCing Ingo so that he can flame me if I'm totally off the mark.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:40                                         ` Herbert Xu
@ 2009-04-02 15:57                                           ` Avi Kivity
  2009-04-02 16:09                                             ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 15:57 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:00:17PM +0300, Avi Kivity wrote:
>   
>> Good point - if we rely on having excess cores in the host, large guest  
>> scalability will drop.
>>     
>
> Going back to TX mitigation, I wonder if we could avoid it altogether
> by having a "wakeup" mechanism that does not involve a vmexit.  We
> have two cases:
>
> 1) UP, or rather guest runs on the same core/hyperthread as the
> backend.  This is the easy one, the guest simply sets a marker
> in shared memory and keeps going until its time is up.  Then the
> backend takes over, and uses a marker for notification too.
>
> The markers need to be interpreted by the scheduler so that it
> knows the guest/backend is runnable, respectively.
>   

Let's look at this first.

What if the guest sends N packets, then does some expensive computation 
(say the guest scheduler switches from the benchmark process to 
evolution).  So now we have the marker set at packet N, but the host 
will not see it until the guest timeslice is up?

I think I totally misunderstood you.  Can you repeat in smaller words?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 02/17] vbus: add virtual-bus definitions
  2009-03-31 18:42 ` [RFC PATCH 02/17] vbus: add virtual-bus definitions Gregory Haskins
@ 2009-04-02 16:06   ` Ben Hutchings
  2009-04-02 18:13     ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Ben Hutchings @ 2009-04-02 16:06 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:
[...]
> +Create a device instance
> +------------------------
> +
> +Devices are instantiated by again utilizing the /config/vbus configfs area.
> +At first you may suspect that devices are created as subordinate objects of a
> +bus/container instance, but you would be mistaken.

This is kind of patronising; why don't you simply lay out how things
_do_ work?

>  Devices are actually
> +root-level objects in vbus specifically to allow greater flexibility in the
> +association of a device.  For instance, it may be desirable to have a single
> +device that spans multiple VMs (consider an ethernet switch, or a shared disk
> +for a cluster).  Therefore, device lifecycles are managed by creating/deleting
> +objects in /config/vbus/devices.
> +
> +Note: Creating a device instance is actually a two step process:  We need to
> +give the device instance a unique name, and we also need to give it a specific
> +device type.  It is hard to express both parameters using standard filesystem
> +operations like mkdir, so the design decision was made to require performing
> +the operation in two steps.

How about exposing a subdir for each device class under
/config/vbus/devices/ and allowing device creation only within those?
Two-stage construction is a pain for both users and implementors.

[...]
> +At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
> +namespace with one device, id=0.  Its type is:
> +
> +# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
> +virtual-ethernet
> +
> +"virtual-ethernet"?  Why is it not "venet-tap"?  Device-classes are allowed to
> +register their interfaces under an id that is not required to be the same as
> +their deviceclass.  This supports device polymorphism.   For instance,
> +consider that an interface "virtual-ethernet" may provide basic 802.x packet
> +exchange.  However, we could have various implementations of a device that
> +supports the 802.x interface, while having various implementations behind
> +them.
[...]

It seems to me that your "device-classes" correspond to drivers and
"interfaces" correspond to device classes in the LDM.  To avoid
confusion, I think the vbus terminology should be made consistent with
LDM.  And certainly these should not both be called simply "type" in the
configfs/sysfs interface.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 14:49                                       ` Anthony Liguori
@ 2009-04-02 16:09                                         ` Anthony Liguori
  2009-04-02 16:19                                           ` Avi Kivity
  2009-04-03 12:03                                           ` Gregory Haskins
  0 siblings, 2 replies; 146+ messages in thread
From: Anthony Liguori @ 2009-04-02 16:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1274 bytes --]

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> The alternative is to get a notification from the stack that the 
>>> packet is done processing.  Either an skb destructor in the kernel, 
>>> or my new API that everyone is not rushing out to implement.
>>
>> btw, my new api is
>>
>>
>>   io_submit(..., nr, ...): submit nr packets
>>   io_getevents(): complete nr packets
>
> I don't think we even need that to end this debate.  I'm convinced we 
> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
> latency of around 300ns whereas it's only 50ns on the host.  This 
> defies logic so I'm now looking to isolate why that is.

I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes were 
the big winner... I hate qemu sometimes.

I'm pretty confident I can get at least to Greg's numbers with some 
poking.  After reading his patches carefully I think I understand why 
he's doing better, but I also don't think it'll scale well with many 
guests...  stay tuned.

But most importantly, we are darn near where vbus is with this patch wrt 
added packet latency, and this is totally from userspace with no host 
kernel changes.

So no, userspace is not the issue.

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>


[-- Attachment #2: first-pass.patch --]
[-- Type: text/x-patch, Size: 6596 bytes --]

diff --git a/qemu/exec.c b/qemu/exec.c
index 67f3fa3..1331022 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldl_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
@@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldq_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 9bce3a0..ac77b80 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -36,6 +36,7 @@ typedef struct VirtIONet
     VirtQueue *ctrl_vq;
     VLANClientState *vc;
     QEMUTimer *tx_timer;
+    QEMUBH *bh;
     int tx_timer_active;
     int mergeable_rx_bufs;
     int promisc;
@@ -504,6 +505,10 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
+VirtIODevice *global_vdev = NULL;
+
+extern void tap_try_to_recv(VLANClientState *vc);
+
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -545,42 +550,35 @@ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
+        global_vdev = &n->vdev;
         len += qemu_sendv_packet(n->vc, out_sg, out_num);
+        global_vdev = NULL;
 
         virtqueue_push(vq, &elem, len);
         virtio_notify(&n->vdev, vq);
     }
+
+    tap_try_to_recv(n->vc->vlan->first_client);
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-        virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_timer_active = 0;
-        virtio_net_flush_tx(n, vq);
-    } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-        n->tx_timer_active = 1;
-        virtio_queue_set_notification(vq, 0);
-    }
+#if 0
+    virtio_queue_set_notification(vq, 0);
+    qemu_bh_schedule(n->bh);
+#else
+    virtio_net_flush_tx(n, n->tx_vq);
+#endif
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_handle_tx_bh(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
-
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
-
-    virtio_queue_set_notification(n->tx_vq, 1);
     virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq, 1);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -675,8 +673,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.get_features = virtio_net_get_features;
     n->vdev.set_features = virtio_net_set_features;
     n->vdev.reset = virtio_net_reset;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
     n->ctrl_vq = virtio_add_queue(&n->vdev, 16, virtio_net_handle_ctrl);
     memcpy(n->mac, nd->macaddr, ETH_ALEN);
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -684,10 +682,10 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
                                  virtio_net_receive, virtio_net_can_receive, n);
     n->vc->link_status_changed = virtio_net_set_link_status;
 
+    n->bh = qemu_bh_new(virtio_net_handle_tx_bh, n);
+
     qemu_format_nic_info_str(n->vc, n->mac);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
 
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 577eb5a..1365d11 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -507,6 +507,39 @@ static void virtio_reset(void *opaque)
     }
 }
 
+void virtio_sample_start(VirtIODevice *vdev)
+{
+    vdev->n_samples = 0;
+    virtio_sample(vdev);
+}
+
+void virtio_sample(VirtIODevice *vdev)
+{
+    gettimeofday(&vdev->samples[vdev->n_samples], NULL);
+    vdev->n_samples++;
+}
+
+static unsigned long usec_delta(struct timeval *before, struct timeval *after)
+{
+    return (after->tv_sec - before->tv_sec) * 1000000UL + (after->tv_usec - before->tv_usec);
+}
+
+void virtio_sample_end(VirtIODevice *vdev)
+{
+    int last, i;
+
+    virtio_sample(vdev);
+
+    last = vdev->n_samples - 1;
+
+    printf("Total time = %ldus\n", usec_delta(&vdev->samples[0], &vdev->samples[last]));
+
+    for (i = 1; i < vdev->n_samples; i++)
+        printf("sample[%d .. %d] = %ldus\n", i - 1, i, usec_delta(&vdev->samples[i - 1], &vdev->samples[i]));
+
+    vdev->n_samples = 0;
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
     VirtIODevice *vdev = to_virtio_device(opaque);
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 18c7a1a..a039310 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -17,6 +17,8 @@
 #include "hw.h"
 #include "pci.h"
 
+#include <sys/time.h>
+
 /* from Linux's linux/virtio_config.h */
 
 /* Status byte for guest to report progress, and synchronize features. */
@@ -87,6 +89,8 @@ struct VirtIODevice
     void (*set_config)(VirtIODevice *vdev, const uint8_t *config);
     void (*reset)(VirtIODevice *vdev);
     VirtQueue *vq;
+    int n_samples;
+    struct timeval samples[100];
 };
 
 VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
@@ -122,4 +126,10 @@ int virtio_queue_ready(VirtQueue *vq);
 
 int virtio_queue_empty(VirtQueue *vq);
 
+void virtio_sample_start(VirtIODevice *vdev);
+
+void virtio_sample(VirtIODevice *vdev);
+
+void virtio_sample_end(VirtIODevice *vdev);
+
 #endif
diff --git a/qemu/net.c b/qemu/net.c
index efb64d3..dc872e5 100644
--- a/qemu/net.c
+++ b/qemu/net.c
@@ -733,6 +733,7 @@ typedef struct TAPState {
 } TAPState;
 
 #ifdef HAVE_IOVEC
+
 static ssize_t tap_receive_iov(void *opaque, const struct iovec *iov,
                                int iovcnt)
 {
@@ -853,6 +854,12 @@ static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
+void tap_try_to_recv(VLANClientState *vc)
+{
+    TAPState *s = vc->opaque;
+    tap_send(s);
+}
+
 int tap_has_vnet_hdr(void *opaque)
 {
     VLANClientState *vc = opaque;

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:57                                           ` Avi Kivity
@ 2009-04-02 16:09                                             ` Herbert Xu
  2009-04-02 16:54                                               ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-02 16:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>
> What if the guest sends N packets, then does some expensive computation  
> (say the guest scheduler switches from the benchmark process to  
> evolution).  So now we have the marker set at packet N, but the host  
> will not see it until the guest timeslice is up?

Well that's fine.  The guest will use up the remainder of its
timeslice.  After all we only have one core/hyperthread here so
this is no different than if the packets were held up higher up
in the guest kernel and the guest decided to do some computation.

Once its timeslice completes the backend can start plugging away
at the backlog.

Of course it would be better to put the backend on another core
that shares the cache or a hyperthread on the same core.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                         ` Anthony Liguori
@ 2009-04-02 16:19                                           ` Avi Kivity
  2009-04-02 18:18                                             ` Anthony Liguori
  2009-04-03 12:03                                           ` Gregory Haskins
  1 sibling, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 16:19 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Anthony Liguori wrote:
>> I don't think we even need that to end this debate.  I'm convinced we 
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>> latency of around 300ns whereas it's only 50ns on the host.  This 
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
> were the big winner... I hate qemu sometimes.
>
>

What, this:

> diff --git a/qemu/exec.c b/qemu/exec.c
> index 67f3fa3..1331022 100644
> --- a/qemu/exec.c
> +++ b/qemu/exec.c
> @@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldl_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;
> @@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldq_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;

The way I read it, it will only run slowly once per page, then 
settle to a cache miss per page.

Regardless, it makes a memslot model even more attractive.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                             ` Herbert Xu
@ 2009-04-02 16:54                                               ` Avi Kivity
  2009-04-02 17:06                                                 ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-02 16:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 06:57:38PM +0300, Avi Kivity wrote:
>   
>> What if the guest sends N packets, then does some expensive computation  
>> (say the guest scheduler switches from the benchmark process to  
>> evolution).  So now we have the marker set at packet N, but the host  
>> will not see it until the guest timeslice is up?
>>     
>
> Well that's fine.  The guest will use up the remainder of its
> timeslice.  After all we only have one core/hyperthread here so
> this is no different than if the packets were held up higher up
> in the guest kernel and the guest decided to do some computation.
>
>   

3ms latency for ping?

(ping will always be scheduled immediately when the reply arrives if I 
understand cfs, so guest load won't delay it)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:54                                               ` Avi Kivity
@ 2009-04-02 17:06                                                 ` Herbert Xu
  2009-04-02 17:17                                                   ` Herbert Xu
  2009-04-03 12:25                                                   ` Avi Kivity
  0 siblings, 2 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02 17:06 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:
>
> 3ms latency for ping?
>
> (ping will always be scheduled immediately when the reply arrives if I  
> understand cfs, so guest load won't delay it)

That only happens if the guest immediately does some CPU-intensive
computation for 3ms, and assuming its timeslice lasts that long.

In any case, the same thing will happen right now if the host or
some other guest on the same CPU hogs the CPU for 3ms.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 17:06                                                 ` Herbert Xu
@ 2009-04-02 17:17                                                   ` Herbert Xu
  2009-04-03 12:25                                                   ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-02 17:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

On Fri, Apr 03, 2009 at 01:06:10AM +0800, Herbert Xu wrote:
>
> That only happens if the guest immediately does some CPU-intensive
> computation for 3ms, and assuming its timeslice lasts that long.
> 
> In any case, the same thing will happen right now if the host or
> some other guest on the same CPU hogs the CPU for 3ms.

Even better, look at the packet's TOS.  If it's marked for low-
latency then vmexit immediately.  Otherwise continue.

In the backend you'd just set the marker in shared memory.

Of course invert this for the host => guest direction.
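
Roughly, and purely as an illustration (assuming an IPv4 packet with
the network header already set on the skb):

#include <linux/types.h>
#include <linux/ip.h>
#include <linux/skbuff.h>

/* Decide per packet: kick (vmexit/interrupt) right away, or just set
 * the shared-memory marker and let the peer pick it up later. */
static bool wants_immediate_kick(const struct sk_buff *skb)
{
	const struct iphdr *iph = ip_hdr(skb);

	return (iph->tos & IPTOS_LOWDELAY) != 0;	/* 0x10, "minimize delay" */
}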

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 02/17] vbus: add virtual-bus definitions
  2009-04-02 16:06   ` Ben Hutchings
@ 2009-04-02 18:13     ` Gregory Haskins
  0 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-02 18:13 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: linux-kernel, agraf, pmullaney, pmorreale, anthony, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5609 bytes --]

Hi Ben

Ben Hutchings wrote:
> On Tue, 2009-03-31 at 14:42 -0400, Gregory Haskins wrote:
> [...]
>   
>> +Create a device instance
>> +------------------------
>> +
>> +Devices are instantiated by again utilizing the /config/vbus configfs area.
>> +At first you may suspect that devices are created as subordinate objects of a
>> +bus/container instance, but you would be mistaken.
>>     
>
> This is kind of patronising; why don't you simply lay out how things
> _do_ work?
>   

Ya, point taken.  I think that was really written to myself, because my
first design *had* the device as a subordinate object.  Then I later
realized that I didn't like that design :)

I will fix this.

>   
>>  Devices are actually
>> +root-level objects in vbus specifically to allow greater flexibility in the
>> +association of a device.  For instance, it may be desirable to have a single
>> +device that spans multiple VMs (consider an ethernet switch, or a shared disk
>> +for a cluster).  Therefore, device lifecycles are managed by creating/deleting
>> +objects in /config/vbus/devices.
>> +
>> +Note: Creating a device instance is actually a two step process:  We need to
>> +give the device instance a unique name, and we also need to give it a specific
>> +device type.  It is hard to express both parameters using standard filesystem
>> +operations like mkdir, so the design decision was made to require performing
>> +the operation in two steps.
>>     
>
> How about exposing a subdir for each device class under
> /config/vbus/devices/ and allowing device creation only within those?
> Two-stage construction is a pain for both users and implementors.
>
>   
I am not sure I follow.  It sounds like you are suggesting exactly what
I do today.

> [...]
>   
>> +At this point, we are ready to roll.  Pid 4382 has access to a virtual-bus
>> +namespace with one device, id=0.  Its type is:
>> +
>> +# cat /sys/vbus/instances/beb4df8f-7483-4028-b3f7-767512e2a18c/devices/0/type
>> +virtual-ethernet
>> +
>> +"virtual-ethernet"?  Why is it not "venet-tap"?  Device-classes are allowed to
>>     

I think I worded this awkwardly.  A device-class creates a
device-instance.  A device-instance registers one or more interfaces. 
There are device types (of which I would classify both the device-class
and its instantiated device object as the same "type"), and there are
interface types.  The interface types may overlap across different
device types, as demonstrated below.  I will update the doc to be clearer
here (assuming I didn't muddle it up even more ;)

>> +register their interfaces under an id that is not required to be the same as
>> +their deviceclass.  This supports device polymorphism.   For instance,
>> +consider that an interface "virtual-ethernet" may provide basic 802.x packet
>> +exchange.  However, we could have various implementations of a device that
>> +supports the 802.x interface, while having various implementations behind
>> +them.
>>     
> [...]
>
> It seems to me that your "device-classes" correspond to drivers and
> "interfaces" correspond to device classes in the LDM.
I don't think that is quite right, but I might be missing your point. 
All of these objects exist on the "backend", for which there isn't a
specific precedent in LDM.  Normally in LDM, you would have
some kind of physical device object in the hardware (say a SATA disk),
and an LDM "block device" that represents it in software.  So we call
the LDM model for that disk a "device", but really it's like a proxy or a
software representative of the actual device itself.  And I am not
knocking this designation, as I think it makes a lot of sense.

However, what I will point out is that what we are creating here in vbus
is more akin to the SATA disk itself, not the LDM "block device"
representation of the device.   There was no really great existing way
to express this type of object, which is why I had to create a new
namespace in sysfs.

To dig down into this a little further, the device and interface are
inextricably linked in a relationship very close to this "physical
device" concept.  Therefore the "driver" portion of LDM that you
referenced w.r.t. the device-class doesnt even enter the picture here
(that would actually be up in the guest or userspace, actually. 
Discussed below).

As an example, consider an e1000 network card.  The PCI-ID and REV for
the e1000 card and the associated ABI are like its "interface".  Whether
it is a physical card plugged into a physical pci slot or an emulated
e1000 inside qemu-kvm is like its device-instance.  In theory, I can
transparently substitute either device-instance with any driver that
understands the ABI associated with the e1000 PCI-ID (assuming all the
plumbing is there, etc).  It's the same deal here.  Taking a little
creative license to recast that example in vbus concepts, I would have
one device-class type = "physical-e1000-card" and another
"qemu-e1000-model".  I could instantiate either one of those and they
would ultimately register an interface of type "e1000".

So where traditional LDM starts to come into play is actually on the other
side (i.e. the guest).  The host has this vbus context with our
"e1000" interface registered on it.  When the guest loads, it would
create an LDM "device" object for the e1000, as well as a driver
instance if one was present.  From there, things would look more like
the normal LDM concepts we are used to.

HTH

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:19                                           ` Avi Kivity
@ 2009-04-02 18:18                                             ` Anthony Liguori
  2009-04-03  1:11                                               ` Herbert Xu
  2009-04-20 18:02                                               ` Alex Williamson
  0 siblings, 2 replies; 146+ messages in thread
From: Anthony Liguori @ 2009-04-02 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>> I don't think we even need that to end this debate.  I'm convinced 
>>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>>> latency of around 300ns whereas it's only 50ns on the host.  This 
>>> defies logic so I'm now looking to isolate why that is.
>>
>> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
>> were the big winner... I hate qemu sometimes.
>>
>>
>
> What, this:

The UDP_RR test was limited by CPU consumption.  QEMU was pegging a CPU with 
only about 4000 packets per second whereas the host could do 14000.  An 
oprofile run showed that phys_page_find/cpu_physical_memory_rw were at 
the top by a wide margin, which makes little sense since virtio is zero 
copy in kvm-userspace today.

That leaves the ring queue accessors that use ld[wlq]_phys and friends, 
which happen to make use of the above.  That led me to try this terrible 
hack below and, lo and behold, we immediately jumped to 10000 pps.  This 
only works because almost nothing uses ld[wlq]_phys in practice except 
for virtio, so breaking it for the non-RAM case didn't matter.

We didn't encounter this before because when I changed this behavior, I 
tested streaming and ping.  Both remained the same.  You can only expose 
this issue if you first disable tx mitigation.

Anyway, if we're able to send this many packets, I suspect we'll be able 
to also handle much higher throughputs without TX mitigation so that's 
what I'm going to look at now.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 18:18                                             ` Anthony Liguori
@ 2009-04-03  1:11                                               ` Herbert Xu
  2009-04-20 18:02                                               ` Alex Williamson
  1 sibling, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-03  1:11 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: avi, ghaskins, andi, linux-kernel, agraf, pmullaney, pmorreale,
	rusty, netdev, kvm

Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> Anyway, if we're able to send this many packets, I suspect we'll be able 
> to also handle much higher throughputs without TX mitigation so that's 
> what I'm going to look at now.

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 15:10                                 ` Michael S. Tsirkin
@ 2009-04-03  4:43                                   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 146+ messages in thread
From: Jeremy Fitzhardinge @ 2009-04-03  4:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, Gregory Haskins, Avi Kivity, Herbert Xu, anthony,
	andi, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Michael S. Tsirkin wrote:
> Rusty, I think this is what you did in your patch from 2008 to add destructor
> for skb data ( http://kerneltrap.org/mailarchive/linux-netdev/2008/4/18/1464944 ):
> and it seems that it would make zero-copy possible - or was there some problem with
> that approach? Do you happen to remember?
>   

I'm planning on resurrecting it to replace the page destructor used by 
Xen netback.

    J


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02  9:02                   ` Avi Kivity
  2009-04-02  9:16                     ` Herbert Xu
  2009-04-02 10:55                     ` Gregory Haskins
@ 2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
                                         ` (2 more replies)
  2 siblings, 3 replies; 146+ messages in thread
From: Gerd Hoffmann @ 2009-04-03 10:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Herbert Xu, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Avi Kivity wrote:
> There is no choice.  Exiting from the guest to the kernel to userspace
> is prohibitively expensive, you can't do that on every packet.

I didn't look at virtio-net very closely yet.  I wonder why the
notification is that a big issue though.  It is easy to keep the number
of notifications low without increasing latency:

Check shared ring status when stuffing a request.  If there are requests
not (yet) consumed by the other end there is no need to send a
notification.  That scheme can even span multiple rings (nics with rx
and tx for example).

The host backend can put a limit on the number of requests it takes out of
the queue at once.  I.e. the block backend can take out some requests, throw
them at the block layer, check whether any request in flight is done,
and if so send back replies and start over again.  The guest can put more
requests into the queue meanwhile without having to notify the host.  I've
seen the number of notifications go down to zero when running disk
benchmarks in the guest ;)

Of course that works best with one or more I/O threads, so the vcpu
doesn't have to stop running anyway to get the I/O work done ...
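
A minimal sketch of that suppression scheme (generic illustrative
names, not the actual virtio ring code; barriers omitted):

#include <linux/types.h>

struct shared_ring {
	u32 prod;	/* written by the producer */
	u32 cons;	/* written by the consumer */
};

/* Producer side: only notify when the peer had already drained
 * everything we previously posted; otherwise it is still (or will be)
 * working on the ring, and a second notification buys nothing. */
static void post_request(struct shared_ring *r, void (*notify)(void))
{
	bool peer_idle = (r->prod == r->cons);

	/* ...fill in the descriptor... */
	r->prod++;

	if (peer_idle)
		notify();
}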

cheers,
  Gerd

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
@ 2009-04-03 11:03                       ` Avi Kivity
  2009-04-03 11:12                         ` Herbert Xu
  2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 11:03 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Herbert Xu, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I didn't look at virtio-net very closely yet.  I wonder why the
> notification is that a big issue though.  It is easy to keep the number
> of notifications low without increasing latency:
>
> Check shared ring status when stuffing a request.  If there are requests
> not (yet) consumed by the other end there is no need to send a
> notification.  That scheme can even span multiple rings (nics with rx
> and tx for example).
>   

If the host is able to consume a request immediately, and the guest is 
not able to batch requests, this breaks down.  And that is the current 
situation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:03                       ` Avi Kivity
@ 2009-04-03 11:12                         ` Herbert Xu
  2009-04-03 11:46                           ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-03 11:12 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:03:45PM +0300, Avi Kivity wrote:
>
> If the host is able to consume a request immediately, and the guest is  
> not able to batch requests, this breaks down.  And that is the current  
> situation.

Hang on, why is the host consuming the request immediately? It
has to write the packet to tap, which then calls netif_rx_ni so
it should actually go all the way, no?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
@ 2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:34                         ` Herbert Xu
  2009-04-03 11:46                         ` Avi Kivity
  2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 2 replies; 146+ messages in thread
From: Andi Kleen @ 2009-04-03 11:18 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Avi Kivity, Herbert Xu, ghaskins, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

> Check shared ring status when stuffing a request.  If there are requests

That means you're bouncing cache lines all the time. Probably not a big
issue on single socket but could be on larger systems.

-Andi


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 10:58                     ` Gerd Hoffmann
  2009-04-03 11:03                       ` Avi Kivity
  2009-04-03 11:18                       ` Andi Kleen
@ 2009-04-03 11:28                       ` Gregory Haskins
  2 siblings, 0 replies; 146+ messages in thread
From: Gregory Haskins @ 2009-04-03 11:28 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Avi Kivity, Herbert Xu, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

Gerd Hoffmann wrote:
> Avi Kivity wrote:
>   
>> There is no choice.  Exiting from the guest to the kernel to userspace
>> is prohibitively expensive, you can't do that on every packet.
>>     
>
> I didn't look at virtio-net very closely yet.  I wonder why the
> notification is that a big issue though.  It is easy to keep the number
> of notifications low without increasing latency:
>
> Check shared ring status when stuffing a request.  If there are requests
> not (yet) consumed by the other end there is no need to send a
> notification.  That scheme can even span multiple rings (nics with rx
> and tx for example).
>   

FWIW: I employ this scheme.  The shm-signal construct has "dirty" and
"pending" flags (both on the same cacheline, which may or may not address
Andi's point about bouncing).  The first time you dirty the shm, it sets both
flags.  The consumer side has to clear "pending" before any subsequent
signals are sent.  Normally the consumer side will also clear "enabled"
(as part of the bidir napi thing) to further disable signals.
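
Roughly, the logic looks like this (an illustrative rendering of the
description above, not the actual shm-signal code or ABI):

#include <linux/types.h>

struct shm_signal_flags {
	u32 enabled;	/* consumer wants signals at all */
	u32 dirty;	/* producer has published new work */
	u32 pending;	/* a signal was sent and not yet acknowledged */
};

static void shm_signal_dirty(struct shm_signal_flags *s, void (*inject)(void))
{
	s->dirty = 1;
	if (s->enabled && !s->pending) {
		s->pending = 1;	/* consumer clears this when it starts draining */
		inject();	/* e.g. interrupt/hypercall to the other side */
	}
}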

-Greg




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:18                       ` Andi Kleen
@ 2009-04-03 11:34                         ` Herbert Xu
  2009-04-03 11:46                         ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-03 11:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Hoffmann, Avi Kivity, ghaskins, anthony, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 01:18:54PM +0200, Andi Kleen wrote:
> > Check shared ring status when stuffing a request.  If there are requests
> 
> That means you're bouncing cache lines all the time. Probably not a big
> issue on single socket but could be on larger systems.

If the backend is running on a core that doesn't share caches
with the guest queue then you've got bigger problems.

Right, this is unavoidable for guests with many CPUs, but that
should go away once we support multiqueue in virtio-net.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:12                         ` Herbert Xu
@ 2009-04-03 11:46                           ` Avi Kivity
  2009-04-03 11:48                             ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 11:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:03:45PM +0300, Avi Kivity wrote:
>   
>> If the host is able to consume a request immediately, and the guest is  
>> not able to batch requests, this breaks down.  And that is the current  
>> situation.
>>     
>
> Hang on, why is the host consuming the request immediately? It
> has to write the packet to tap, which then calls netif_rx_ni so
> it should actually go all the way, no?
>   

The host writes the packet to tap, at which point it is consumed from 
its point of view.  The host would like to mention that if there was an 
API to notify it when the packet was actually consumed, then it would 
gladly use it.  Bonus points if this involves not copying the packet.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:18                       ` Andi Kleen
  2009-04-03 11:34                         ` Herbert Xu
@ 2009-04-03 11:46                         ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 11:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Gerd Hoffmann, Herbert Xu, ghaskins, anthony, linux-kernel,
	agraf, pmullaney, pmorreale, rusty, netdev, kvm

Andi Kleen wrote:
>> Check shared ring status when stuffing a request.  If there are requests
>>     
>
> That means you're bouncing cache lines all the time. Probably not a big
> issue on single socket but could be on larger systems.
>   

That's why I'd like requests to be handled on the vcpu thread rather 
than on an auxiliary thread.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:46                           ` Avi Kivity
@ 2009-04-03 11:48                             ` Herbert Xu
  2009-04-03 11:54                               ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-03 11:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote:
>
> The host writes the packet to tap, at which point it is consumed from  
> its point of view.  The host would like to mention that if there was an  
> API to notify it when the packet was actually consumed, then it would  
> gladly use it.  Bonus points if this involves not copying the packet.

We're using write(2) for this, no? That should invoke netif_rx_ni
which blocks until the packet is "processed", which usually means
that it's placed on the NIC's hardware queue.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:48                             ` Herbert Xu
@ 2009-04-03 11:54                               ` Avi Kivity
  2009-04-03 11:55                                 ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 11:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:46:04PM +0300, Avi Kivity wrote:
>   
>> The host writes the packet to tap, at which point it is consumed from  
>> its point of view.  The host would like to mention that if there was an  
>> API to notify it when the packet was actually consumed, then it would  
>> gladly use it.  Bonus points if this involves not copying the packet.
>>     
>
> We're using write(2) for this, no? 

Yes.

> That should invoke netif_rx_ni
> which blocks until the packet is "processed", which usually means
> that it's placed on the NIC's hardware queue.
>   

It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so 
we can tell when we can queue without blocking.
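
For the record, the pattern on our side is roughly this (simplified
sketch, not the actual qemu code):

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Simplified sketch: write a packet to a tap fd opened with
 * O_NONBLOCK, and only poll() when the queue is full. */
static ssize_t tap_send(int tapfd, const void *pkt, size_t len)
{
        for (;;) {
                ssize_t n = write(tapfd, pkt, len);
                if (n >= 0 || errno != EAGAIN)
                        return n;

                struct pollfd pfd = { .fd = tapfd, .events = POLLOUT };
                poll(&pfd, 1, -1);      /* wait until we can queue again */
        }
}

The point is that a successful write() only tells us the packet was
queued somewhere, not that anything actually consumed it.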

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:54                               ` Avi Kivity
@ 2009-04-03 11:55                                 ` Herbert Xu
  2009-04-03 12:02                                   ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Herbert Xu @ 2009-04-03 11:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 02:54:02PM +0300, Avi Kivity wrote:
>
> It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so  
> we can tell when we can queue without blocking.

Well netif_rx queues the packet, but netif_rx_ni is netif_rx plus
an immediate flush.
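
That is, roughly (paraphrasing the net/core/dev.c code from memory, so
double-check the source):

#include <linux/netdevice.h>
#include <linux/interrupt.h>
#include <linux/preempt.h>

/* Approximately what netif_rx_ni() does: queue the skb like
 * netif_rx(), then run any pending softirq right away instead of
 * waiting for the next softirq point. */
int netif_rx_ni(struct sk_buff *skb)
{
        int err;

        preempt_disable();
        err = netif_rx(skb);
        if (local_softirq_pending())
                do_softirq();
        preempt_enable();

        return err;
}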

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 11:55                                 ` Herbert Xu
@ 2009-04-03 12:02                                   ` Avi Kivity
  2009-04-03 13:05                                     ` Herbert Xu
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 12:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

Herbert Xu wrote:
> On Fri, Apr 03, 2009 at 02:54:02PM +0300, Avi Kivity wrote:
>   
>> It doesn't copy and queue the packet?  We use O_NONBLOCK and poll() so  
>> we can tell when we can queue without blocking.
>>     
>
> Well netif_rx queues the packet, but netif_rx_ni is netif_rx plus
> an immediate flush.
>   

But it flushes the tap device; the packet still has to go through the 
bridge + real interface?

Even if it's queued there, I want to know when the packet is on the 
wire, not on some random software or hardware queue in the middle.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 16:09                                         ` Anthony Liguori
  2009-04-02 16:19                                           ` Avi Kivity
@ 2009-04-03 12:03                                           ` Gregory Haskins
  2009-04-03 12:15                                             ` Avi Kivity
  1 sibling, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-03 12:03 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 2438 bytes --]

Anthony Liguori wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Avi Kivity wrote:
>>>>
>>>> The alternative is to get a notification from the stack that the
>>>> packet is done processing.  Either an skb destructor in the kernel,
>>>> or my new API that everyone is not rushing out to implement.
>>>
>>> btw, my new api is
>>>
>>>
>>>   io_submit(..., nr, ...): submit nr packets
>>>   io_getevents(): complete nr packets
>>
>> I don't think we even need that to end this debate.  I'm convinced we
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping
>> latency of around 300ns whereas it's only 50ns on the host.  This
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes
> were the big winner... I hate qemu sometimes.

[ I've already said this privately to Anthony on IRC, but .. ]

Hey, congrats!  That's impressive, actually.

So I realize that perhaps you guys are not quite seeing my long-term
vision here, which I think will offer some new features that we don't
have today.  I hope to change that over the coming weeks.  However, I
should also point out that even if, as of right now, my one and only
working module (venet-tap) were all I could offer, it does give us a
"rivalry" position between the two, and this has historically been a
good thing on many projects.  It helps foster innovation through
competition that potentially benefits both.  Case in point: a little
competition provoked an investigation that brought virtio-net's latency
down from 3125us to 90us.  I realize it's not a production-ready patch
quite yet, but I am confident Anthony will find something that is
suitable to check in very soon.  That's a huge improvement to a problem
that was just sitting around unnoticed because there was nothing to
compare it with.

So again, I am proposing my work for consideration (either in its
current form, or something we agree on after the normal review
process), not only on the basis of the future development of the
platform, but also to keep current components running to their
completely off to the side, can be completely disabled with config
options, and I will maintain it.  Therefore the only real impact is to
people who care to even try it, and to me.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:03                                           ` Gregory Haskins
@ 2009-04-03 12:15                                             ` Avi Kivity
  2009-04-03 13:13                                               ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 12:15 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> So again, I am proposing for consideration of accepting my work (either
> in its current form, or something we agree on after the normal review
> process) not only on the basis of the future development of the
> platform, but also to keep current components in their running to their
> full potential.  I will again point out that the code is almost
> completely off to the side, can be completely disabled with config
> options, and I will maintain it.  Therefore the only real impact is to
> people who care to even try it, and to me.
>   

Your work is a whole stack.  Let's look at the constituents.

- a new virtual bus for enumerating devices.

Sorry, I still don't see the point.  It will just make writing drivers 
more difficult.  The only advantage I've heard from you is that it gets 
rid of the gunk.  Well, we still have to support the gunk for non-pv 
devices so the gunk is basically free.  The clean version is expensive 
since we need to port it to all guests and implement exciting features 
like hotplug.

- finer-grained point-to-point communication abstractions

Where virtio has ring+signalling together, you layer the two.  For 
networking, it doesn't matter.  For other applications, it may be 
helpful, perhaps you have something in mind.

- your "bidirectional napi" model for the network device

virtio implements exactly the same thing, except for the case of tx 
mitigation, due to my (perhaps pig-headed) rejection of doing things in 
a separate thread, and due to the total lack of sane APIs for packet 
traffic.

- a kernel implementation of the host networking device

Given the continuous rejection (or rather, their continuous 
non-adoption-and-implementation) of my ideas re zerocopy networking aio, 
that seems like a pragmatic approach.  I wish it were otherwise.

- a promise of more wonderful things yet to come

Obviously I can't evaluate this.

Did I miss anything?

Right now my preferred course of action is to implement a prototype 
userspace notification for networking.  Second choice is to move the 
host virtio implementation into the kernel.  I simply don't see how the 
rest of the stack is cost effective.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 17:06                                                 ` Herbert Xu
  2009-04-02 17:17                                                   ` Herbert Xu
@ 2009-04-03 12:25                                                   ` Avi Kivity
  1 sibling, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 12:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Gregory Haskins, Rusty Russell, anthony, andi, linux-kernel,
	agraf, pmullaney, pmorreale, netdev, kvm, Ingo Molnar

Herbert Xu wrote:
> On Thu, Apr 02, 2009 at 07:54:21PM +0300, Avi Kivity wrote:
>   
>> 3ms latency for ping?
>>
>> (ping will always be scheduled immediately when the reply arrives if I  
>> understand cfs, so guest load won't delay it)
>>     
>
> That only happens if the guest immediately does some CPU-intensive
> computation 3ms and assuming its timeslice lasts that long.
>   

 Note that this happens even if the computation is SCHED_BATCH.

> In any case, the same thing will happen right now if the host or
> some other guest on the same CPU hogs the CPU for 3ms.
>
>   

If the host is overloaded, that's fair.  But millisecond latencies 
without host contention is not a good result.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:02                                   ` Avi Kivity
@ 2009-04-03 13:05                                     ` Herbert Xu
  0 siblings, 0 replies; 146+ messages in thread
From: Herbert Xu @ 2009-04-03 13:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gerd Hoffmann, ghaskins, anthony, andi, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Fri, Apr 03, 2009 at 03:02:22PM +0300, Avi Kivity wrote:
>
> But it flushes the tap device, the packet still has to go through the  
> bridge + real interface?

Which under normal circumstances should occur before netif_rx_ni
returns.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 12:15                                             ` Avi Kivity
@ 2009-04-03 13:13                                               ` Gregory Haskins
  2009-04-03 13:37                                                 ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-03 13:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 5298 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> So again, I am proposing for consideration of accepting my work (either
>> in its current form, or something we agree on after the normal review
>> process) not only on the basis of the future development of the
>> platform, but also to keep current components in their running to their
>> full potential.  I will again point out that the code is almost
>> completely off to the side, can be completely disabled with config
>> options, and I will maintain it.  Therefore the only real impact is to
>> people who care to even try it, and to me.
>>   
>
> Your work is a whole stack.  Let's look at the constituents.
>
> - a new virtual bus for enumerating devices.
>
> Sorry, I still don't see the point.  It will just make writing drivers
> more difficult.  The only advantage I've heard from you is that it
> gets rid of the gunk.  Well, we still have to support the gunk for
> non-pv devices so the gunk is basically free.  The clean version is
> expensive since we need to port it to all guests and implement
> exciting features like hotplug.
My real objection to PCI is fast-path related.  I don't object, per se,
to using PCI for discovery and hotplug.  If you use PCI just for these
types of things, but then allow fastpath to use more hypercall oriented
primitives, then I would agree with you.  We can leave PCI emulation in
user-space, and we get it for free, and things are relatively tidy.

It's once you start requiring that we stay ABI-compatible with something
like the existing virtio-net in x86 KVM where I think it starts to get
ugly when you try to move it into the kernel.  So that is what I had a
real objection to.  I think as long as we are not talking about trying
to make something like that work, it's a much more viable prospect.

So what I propose is the following: 

1) The core vbus design stays the same (or close to it)
2) the vbus-proxy and kvm-guest patch go away
3) the kvm-host patch changes to work with coordination from the
userspace-pci emulation for things like MSI routing
4) qemu will know to create some MSI shim 1:1 with whatever it
instantiates on the bus (and can communicate changes)
5) any drivers that are written for these new PCI-IDs that might be
present are allowed to use a hypercall ABI to talk after they have been
probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
access methods).

Once I get here, I might have greater clarity to see how hard it would
be to emulate fast-path components as well.  It might be easier than I
think.

This is all off the cuff, so it might need some fine-tuning before it's
actually workable.

Does that sound reasonable?

>
> - finer-grained point-to-point communication abstractions
>
> Where virtio has ring+signalling together, you layer the two.  For
> networking, it doesn't matter.  For other applications, it may be
> helpful, perhaps you have something in mind.

Yeah, actually.  Thanks for bringing that up.

So the reason why signaling and the ring are distinct constructs in the
design is to facilitate constructs other than rings.  For instance,
there may be some models where having a flat shared page is better than
a ring.  A ring will naturally preserve all values in flight, whereas a
flat shared page would not (last update is always current).  There are
some algorithms where a previously posted value is obsoleted by an
update, and therefore rings are inherently bad for this update model. 
And as we know, there are plenty of algorithms where a ring works
perfectly.  So I wanted that flexibility to be able to express both.
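
As a concrete (and purely hypothetical, untested) example of the "last
update is always current" model, the shared state can be as simple as
a sequence-protected slot instead of a ring:

#include <stdint.h>

/* Hypothetical "last value wins" shared slot: the producer overwrites
 * in place, the consumer rereads until it sees a stable, even
 * sequence number (seqlock style). */
struct shm_slot {
        volatile uint32_t seq;          /* odd while an update is in flight */
        volatile uint32_t value;        /* e.g. the current vcpu priority */
};

static void slot_publish(struct shm_slot *s, uint32_t v)
{
        s->seq++;                       /* odd: update in progress */
        __sync_synchronize();
        s->value = v;
        __sync_synchronize();
        s->seq++;                       /* even: update complete */
}

static uint32_t slot_read(struct shm_slot *s)
{
        uint32_t seq, v;

        do {
                while ((seq = s->seq) & 1)
                        ;               /* writer active, wait */
                __sync_synchronize();
                v = s->value;
                __sync_synchronize();
        } while (s->seq != seq);        /* changed underneath us, reread */

        return v;
}

A stale value simply gets overwritten before anyone reads it, which is
exactly the behavior a ring cannot give you.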

One of the things I have in mind for the flat page model is that RT vcpu
priority thing.  Another thing I am thinking of is coming up with a PV
LAPIC type replacement (where we can avoid doing the EOI trap by having
the PIC's state shared).

>
> - your "bidirectional napi" model for the network device
>
> virtio implements exactly the same thing, except for the case of tx
> mitigation, due to my (perhaps pig-headed) rejection of doing things
> in a separate thread, and due to the total lack of sane APIs for
> packet traffic.

Yeah, and this part is not vbus, nor in-kernel specific.  That was just
a design element of venet-tap.  Though note, I did design the
vbus/shm-signal infrastructure with rich support for such a notion in
mind, so it wasn't accidental or anything like that.

>
> - a kernel implementation of the host networking device
>
> Given the continuous rejection (or rather, their continuous
> non-adoption-and-implementation) of my ideas re zerocopy networking
> aio, that seems like a pragmatic approach.  I wish it were otherwise.

Well, that gives me hope, at least ;)


>
> - a promise of more wonderful things yet to come
>
> Obviously I can't evaluate this.

Right, sorry.  I wish I had more concrete examples to show you, but we
only have the venet-tap working at this time.  I was going for the
"release early/often" approach in getting the core reviewed before we
got too far down a path, but perhaps that was the wrong thing in this
case.  We will certainly be sending updates as we get some of the more
advanced models and concepts working.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 13:13                                               ` Gregory Haskins
@ 2009-04-03 13:37                                                 ` Avi Kivity
  2009-04-03 16:28                                                   ` Gregory Haskins
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-03 13:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> So again, I am proposing for consideration of accepting my work (either
>>> in its current form, or something we agree on after the normal review
>>> process) not only on the basis of the future development of the
>>> platform, but also to keep current components in their running to their
>>> full potential.  I will again point out that the code is almost
>>> completely off to the side, can be completely disabled with config
>>> options, and I will maintain it.  Therefore the only real impact is to
>>> people who care to even try it, and to me.
>>>   
>>>       
>> Your work is a whole stack.  Let's look at the constituents.
>>
>> - a new virtual bus for enumerating devices.
>>
>> Sorry, I still don't see the point.  It will just make writing drivers
>> more difficult.  The only advantage I've heard from you is that it
>> gets rid of the gunk.  Well, we still have to support the gunk for
>> non-pv devices so the gunk is basically free.  The clean version is
>> expensive since we need to port it to all guests and implement
>> exciting features like hotplug.
>>     
> My real objection to PCI is fast-path related.  I don't object, per se,
> to using PCI for discovery and hotplug.  If you use PCI just for these
> types of things, but then allow fastpath to use more hypercall oriented
> primitives, then I would agree with you.  We can leave PCI emulation in
> user-space, and we get it for free, and things are relatively tidy.
>   

PCI has very little to do with the fast path (nothing, if we use MSI).

> Its once you start requiring that we stay ABI compatible with something
> like the existing virtio-net in x86 KVM where I think it starts to get
> ugly when you try to move it into the kernel.  So that is what I had a
> real objection to.  I think as long as we are not talking about trying
> to make something like that work, its a much more viable prospect.
>   

I don't see why the fast path of virtio-net would be bad.  Can you 
elaborate?

Obviously all the pci glue stays in userspace.

> So what I propose is the following: 
>
> 1) The core vbus design stays the same (or close to it)
>   

Sorry, I still don't see what advantage this has over PCI, and how you 
deal with the disadvantages.

> 2) the vbus-proxy and kvm-guest patch go away
> 3) the kvm-host patch changes to work with coordination from the
> userspace-pci emulation for things like MSI routing
> 4) qemu will know to create some MSI shim 1:1 with whatever it
> instantiates on the bus (and can communicate changes
>   

Don't understand.  What's this MSI shim?

> 5) any drivers that are written for these new PCI-IDs that might be
> present are allowed to use a hypercall ABI to talk after they have been
> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
> access methods).
>   

The way we'd do it with virtio is to add a feature bit that says "you can 
hypercall here instead of pio".  This way old drivers continue to work.

Note that nothing prevents us from trapping pio in the kernel (in fact, 
we do) and forwarding it to the device.  It shouldn't be any slower than 
hypercalls.
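
On the guest side that would look something like this (sketch only;
the feature bit, hypercall number, and queue structure are made up for
illustration, not an existing ABI):

#include <linux/types.h>
#include <asm/io.h>
#include <asm/kvm_para.h>

#define VIRTIO_F_NOTIFY_HCALL   30      /* hypothetical feature bit   */
#define HC_VIRTIO_NOTIFY        42      /* hypothetical hypercall nr. */

struct my_vq {                          /* made-up stand-in for a queue */
        u32     features;
        u16     notify_port;            /* legacy PIO notify address */
        u16     queue_index;
};

static void vq_kick(struct my_vq *vq)
{
        if (vq->features & (1u << VIRTIO_F_NOTIFY_HCALL))
                kvm_hypercall2(HC_VIRTIO_NOTIFY, vq->queue_index, 0);
        else
                outw(vq->queue_index, vq->notify_port); /* old drivers */
}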

> Once I get here, I might have greater clarity to see how hard it would
> make to emulate fast path components as well.  It might be easier than I
> think.
>
> This is all off the cuff so it might need some fine tuning before its
> actually workable.
>
> Does that sound reasonable?
>   

The vbus part (I assume you mean device enumeration) worries me.  I 
don't think you've yet set down what its advantages are.  Being pure and 
clean doesn't count, unless you rip out PCI from all existing installed 
hardware and from Windows.

>> - finer-grained point-to-point communication abstractions
>>
>> Where virtio has ring+signalling together, you layer the two.  For
>> networking, it doesn't matter.  For other applications, it may be
>> helpful, perhaps you have something in mind.
>>     
>
> Yeah, actually.  Thanks for bringing that up.
>
> So the reason why signaling and the ring are distinct constructs in the
> design is to facilitate constructs other than rings.  For instance,
> there may be some models where having a flat shared page is better than
> a ring.  A ring will naturally preserve all values in flight, where as a
> flat shared page would not (last update is always current).  There are
> some algorithms where a previously posted value is obsoleted by an
> update, and therefore rings are inherently bad for this update model. 
> And as we know, there are plenty of algorithms where a ring works
> perfectly.  So I wanted that flexibility to be able to express both.
>   

I agree that there is significant potential here.

> One of the things I have in mind for the flat page model is that RT vcpu
> priority thing.  Another thing I am thinking of is coming up with a PV
> LAPIC type replacement (where we can avoid doing the EOI trap by having
> the PICs state shared).
>   

You keep falling into the paravirtualize the entire universe trap.  If 
you look deep down, you can see Jeremy struggling in there trying to 
bring dom0 support to Linux/Xen.

The lapic is a huge ball of gunk but ripping it out is a monumental job 
with no substantial benefits.  We can at much lower effort avoid the EOI 
trap by paravirtualizing that small bit of ugliness.  Sure the result 
isn't a pure and clean room implementation.  It's a band aid.  But I'll 
take a 50-line band aid over a 3000-line implementation split across 
guest and host, which only works with Linux.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 13:37                                                 ` Avi Kivity
@ 2009-04-03 16:28                                                   ` Gregory Haskins
  2009-04-05 10:00                                                     ` Avi Kivity
  0 siblings, 1 reply; 146+ messages in thread
From: Gregory Haskins @ 2009-04-03 16:28 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

[-- Attachment #1: Type: text/plain, Size: 10320 bytes --]

Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> So again, I am proposing for consideration of accepting my work
>>>> (either
>>>> in its current form, or something we agree on after the normal review
>>>> process) not only on the basis of the future development of the
>>>> platform, but also to keep current components in their running to
>>>> their
>>>> full potential.  I will again point out that the code is almost
>>>> completely off to the side, can be completely disabled with config
>>>> options, and I will maintain it.  Therefore the only real impact is to
>>>> people who care to even try it, and to me.
>>>>         
>>> Your work is a whole stack.  Let's look at the constituents.
>>>
>>> - a new virtual bus for enumerating devices.
>>>
>>> Sorry, I still don't see the point.  It will just make writing drivers
>>> more difficult.  The only advantage I've heard from you is that it
>>> gets rid of the gunk.  Well, we still have to support the gunk for
>>> non-pv devices so the gunk is basically free.  The clean version is
>>> expensive since we need to port it to all guests and implement
>>> exciting features like hotplug.
>>>     
>> My real objection to PCI is fast-path related.  I don't object, per se,
>> to using PCI for discovery and hotplug.  If you use PCI just for these
>> types of things, but then allow fastpath to use more hypercall oriented
>> primitives, then I would agree with you.  We can leave PCI emulation in
>> user-space, and we get it for free, and things are relatively tidy.
>>   
>
> PCI has very little to do with the fast path (nothing, if we use MSI).

At the very least, PIOs are slightly slower than hypercalls.  Perhaps
not enough to care, but the last time I measured them they were slower,
and therefore my clean slate design doesn't use them.

But I digress.  I think I was actually kind of agreeing with you that we
could do this. :P

>
>> Its once you start requiring that we stay ABI compatible with something
>> like the existing virtio-net in x86 KVM where I think it starts to get
>> ugly when you try to move it into the kernel.  So that is what I had a
>> real objection to.  I think as long as we are not talking about trying
>> to make something like that work, its a much more viable prospect.
>>   
>
> I don't see why the fast path of virtio-net would be bad.  Can you
> elaborate?

I'm not.  I am saying I think we might be able to do this.

>
> Obviously all the pci glue stays in userspace.
>
>> So what I propose is the following:
>> 1) The core vbus design stays the same (or close to it)
>>   
>
> Sorry, I still don't see what advantage this has over PCI, and how you
> deal with the disadvantages.

I think you are confusing the vbus-proxy (guest side) with the vbus
backend.  (1) is saying "keep the vbus backend" and (2) is saying drop
the guest side stuff.  In this proposal, the guest would speak a PCI ABI
as far as it's concerned.  Devices in the vbus backend would render as
PCI objects in the ICH (or whatever) model in userspace.

>
>> 2) the vbus-proxy and kvm-guest patch go away
>> 3) the kvm-host patch changes to work with coordination from the
>> userspace-pci emulation for things like MSI routing
>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>> instantiates on the bus (and can communicate changes
>>   
>
> Don't userstand.  What's this MSI shim?

Well, if the device model was an object in vbus down in the kernel, yet
PCI emulation was up in qemu, presumably we would want something to
handle things like PCI config-cycles up in userspace.  Like, for
instance, if the guest re-routes the MSI.  The shim/proxy would handle
the config-cycle, and then turn around and do an ioctl to the kernel to
configure the change with the in-kernel device model (or the irq
infrastructure, as required).

But, TBH, I haven't really looked into what's actually required to make
this work yet.  I am just spitballing to try to find a compromise.

>
>> 5) any drivers that are written for these new PCI-IDs that might be
>> present are allowed to use a hypercall ABI to talk after they have been
>> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
>> access methods).
>>   
>
> The way we'd to it with virtio is to add a feature bit that say "you
> can hypercall here instead of pio".  This way old drivers continue to
> work.

Yep, agreed.  This is what I was thinking we could do.  But now that I
have the possibility that I just need to write a virtio-vbus module to
co-exist with virtio-pci, perhaps it doesn't even need to be explicit.

>
> Note that nothing prevents us from trapping pio in the kernel (in
> fact, we do) and forwarding it to the device.  It shouldn't be any
> slower than hypercalls.
Sure, it's just slightly slower, so I would prefer pure hypercalls if at
all possible.

>
>> Once I get here, I might have greater clarity to see how hard it would
>> make to emulate fast path components as well.  It might be easier than I
>> think.
>>
>> This is all off the cuff so it might need some fine tuning before its
>> actually workable.
>>
>> Does that sound reasonable?
>>   
>
> The vbus part (I assume you mean device enumeration) worries me

No, you are confusing the front-end and back-end again ;)

The back-end remains, and holds the device models as before.  This is
the "vbus core".  Today the front-end interacts with the hypervisor to
render "vbus" specific devices.  The proposal is to eliminate the
front-end, and have the back end render the objects on the bus as PCI
devices to the guest.  I am not sure if I can make it work, yet.  It
needs more thought.

> .  I don't think you've yet set down what its advantages are.  Being
> pure and clean doesn't count, unless you rip out PCI from all existing
> installed hardware and from Windows.

You are being overly dramatic.  No one has ever said we are talking
about ripping something out.  In fact, I've explicitly stated that PCI
can coexist peacefully.  Having more than one bus in a system is
certainly not without precedent (PCI, SCSI, USB, etc.).

Rather, PCI is PCI, and will always be.  PCI was designed as a
software-to-hardware interface.  It works well for its intention.  When
we do full emulation of guests, we still do PCI so that all that
software that was designed to work software-to-hardware still continues
to work, even though technically its now software-to-software.  When we
do PV, on the other hand, we no longer need to pretend it is
software-to-hardware.  We can continue to use an interface designed for
software-to-hardware if we choose, or we can use something else such as
an interface designed specifically for software-to-software.

As I have stated, PCI was designed with hardware constraints in mind. 
What if I don't want to be governed by those constraints?  What if I
don't want an interrupt per device (I don't)?   What do I need BARs for
(I don't)?  Is a PCI PIO address relevant to me (no, hypercalls are more
direct)?  Etc.  It's crap I don't need.

All I really need is a way to a) discover and enumerate devices,
preferably dynamically (hotswap), and b) a way to communicate with those
devices.  I think you are overstating the importance that PCI plays
in (a), and are overstating the complexity associated with doing an
alternative.  I think you are understating the level of hackiness
required to continue to support PCI as we move to new paradigms, like
in-kernel models.  And I think I have already stated that I can
establish a higher degree of flexibility, and arguably, performance for
(b).  Therefore, I have come to the conclusion that I don't want it and
thus eradicated the dependence on it in my design.  I understand the
design tradeoffs that are associated with that decision.

>
>>> - finer-grained point-to-point communication abstractions
>>>
>>> Where virtio has ring+signalling together, you layer the two.  For
>>> networking, it doesn't matter.  For other applications, it may be
>>> helpful, perhaps you have something in mind.
>>>     
>>
>> Yeah, actually.  Thanks for bringing that up.
>>
>> So the reason why signaling and the ring are distinct constructs in the
>> design is to facilitate constructs other than rings.  For instance,
>> there may be some models where having a flat shared page is better than
>> a ring.  A ring will naturally preserve all values in flight, where as a
>> flat shared page would not (last update is always current).  There are
>> some algorithms where a previously posted value is obsoleted by an
>> update, and therefore rings are inherently bad for this update model.
>> And as we know, there are plenty of algorithms where a ring works
>> perfectly.  So I wanted that flexibility to be able to express both.
>>   
>
> I agree that there is significant potential here.
>
>> One of the things I have in mind for the flat page model is that RT vcpu
>> priority thing.  Another thing I am thinking of is coming up with a PV
>> LAPIC type replacement (where we can avoid doing the EOI trap by having
>> the PICs state shared).
>>   
>
> You keep falling into the paravirtualize the entire universe trap.  If
> you look deep down, you can see Jeremy struggling in there trying to
> bring dom0 support to Linux/Xen.
>
> The lapic is a huge ball of gunk but ripping it out is a monumental
> job with no substantial benefits.  We can at much lower effort avoid
> the EOI trap by paravirtualizing that small bit of ugliness.  Sure the
> result isn't a pure and clean room implementation.  It's a band aid. 
> But I'll take a 50-line band aid over a 3000-line implementation split
> across guest and host, which only works with Linux.
Well, keep in mind that I was really just giving you an example of
something that might want a shared-page instead of a shared-ring model. 
The possibility that such a device may be desirable in the future was
enough for me to decide that I wanted the shm model to be flexible,
instead of, say, designed specifically for virtio.  We may never, in
fact, do anything with the LAPIC idea.

-Greg

>
>



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-01 16:10   ` Anthony Liguori
@ 2009-04-05  3:44     ` Rusty Russell
  2009-04-05  8:06       ` Avi Kivity
  2009-04-05 14:13       ` Anthony Liguori
  0 siblings, 2 replies; 146+ messages in thread
From: Rusty Russell @ 2009-04-05  3:44 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
> Rusty Russell wrote:
> > As you point out, 350-450 is possible, which is still bad, and it's at least
> > partially caused by the exit to userspace and two system calls.  If virtio_net
> > had a backend in the kernel, we'd be able to compare numbers properly.
> 
> I doubt the userspace exit is the problem.  On a modern system, it takes 
> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
> exit.  A transition to userspace is only about ~150ns, the bulk of the 
> additional heavy-weight exit cost is from vcpu_put() within KVM.

Just to inject some facts, servicing a ping via tap (ie host->guest then
guest->host response) takes 26 system calls from one qemu thread, 7 from
another (see strace below). Judging by those futex calls, multiple context
switches, too.

> If you were to switch to another kernel thread, and I'm pretty sure you 
> have to, you're going to still see about a 2us exit cost.

He switches to another thread, too, but with the right infrastructure (ie.
skb data destructors) we could skip this as well.  (It'd be interesting to
see how virtual-bus performed on a single cpu host).

Cheers,
Rusty.

Pid 10260:
12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.000019>
12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.000022>
12:37:40.255001 read(4, "\0", 512)      = 1 <0.000021>
12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.000019>
12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.000019>
12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.000018>
12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 14000000}}, NULL) = 0 <0.000021>
12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.000019>
12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.000018>
12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>

Pid 10262:
12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.000018>
12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.000021>
12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0 <0.000024>
12:37:40.252868 ioctl(11, 0xae80, 0)    = 0 <0.001171>
12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.000019>
12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.000017>
12:37:40.254693 ioctl(11, 0xae80 <unfinished ...>

fd:
lrwx------ 1 root root 64 2009-04-05 12:31 0 -> /dev/pts/1 
lrwx------ 1 root root 64 2009-04-05 12:31 1 -> /dev/pts/1 
lrwx------ 1 root root 64 2009-04-05 12:35 10 -> /home/rusty/qemu-images/ubuntu-8.10                                                                            
lrwx------ 1 root root 64 2009-04-05 12:35 11 -> anon_inode:kvm-vcpu            
lrwx------ 1 root root 64 2009-04-05 12:35 12 -> socket:[31414]                 
lrwx------ 1 root root 64 2009-04-05 12:35 13 -> socket:[31416]                 
lrwx------ 1 root root 64 2009-04-05 12:35 14 -> anon_inode:[eventfd]           
lrwx------ 1 root root 64 2009-04-05 12:35 15 -> anon_inode:[eventfd]           
lrwx------ 1 root root 64 2009-04-05 12:35 16 -> anon_inode:[signalfd]          
lrwx------ 1 root root 64 2009-04-05 12:31 2 -> /dev/pts/1                      
lr-x------ 1 root root 64 2009-04-05 12:31 3 -> /dev/kvm
lr-x------ 1 root root 64 2009-04-05 12:35 4 -> pipe:[31406]
l-wx------ 1 root root 64 2009-04-05 12:35 5 -> pipe:[31406]
lrwx------ 1 root root 64 2009-04-05 12:35 6 -> /dev/net/tun
lrwx------ 1 root root 64 2009-04-05 12:35 7 -> anon_inode:kvm-vm
lrwx------ 1 root root 64 2009-04-05 12:35 8 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 2009-04-05 12:35 9 -> /tmp/vl.OL1kd9 (deleted)

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05  3:44     ` Rusty Russell
@ 2009-04-05  8:06       ` Avi Kivity
  2009-04-05 14:13       ` Anthony Liguori
  1 sibling, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-05  8:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Anthony Liguori, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Rusty Russell wrote:
> On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
>   
>> Rusty Russell wrote:
>>     
>>> As you point out, 350-450 is possible, which is still bad, and it's at least
>>> partially caused by the exit to userspace and two system calls.  If virtio_net
>>> had a backend in the kernel, we'd be able to compare numbers properly.
>>>       
>> I doubt the userspace exit is the problem.  On a modern system, it takes 
>> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
>> exit.  A transition to userspace is only about ~150ns, the bulk of the 
>> additional heavy-weight exit cost is from vcpu_put() within KVM.
>>     
>
> Just to inject some facts, servicing a ping via tap (ie host->guest then
> guest->host response) takes 26 system calls from one qemu thread, 7 from
> another (see strace below). Judging by those futex calls, multiple context
> switches, too.
>   

Interesting stuff.  Even if amortized over half a ring's worth of 
packets, that's quite a lot.

Two threads are involved (we complete on the iothread, since we don't 
know which vcpu will end up processing the interrupt, if any).

>
> Pid 10260:
> 12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
>   

Should switch to epoll with its lower wait costs.  Unfortunately the 
relative timeout requires reading the clock.
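
i.e. something along these lines (sketch, not the actual qemu main
loop; handle_fd() is a made-up dispatch hook):

#include <sys/epoll.h>

static void handle_fd(int fd)
{
        (void)fd;       /* read/dispatch for this fd goes here */
}

/* Sketch: fds are registered once up front, so the per-iteration cost
 * is just epoll_wait() itself (which, like select(), still takes a
 * relative timeout). */
static void io_loop(int tapfd, int timeout_ms)
{
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = tapfd };
        int ep = epoll_create(64);

        epoll_ctl(ep, EPOLL_CTL_ADD, tapfd, &ev);

        for (;;) {
                struct epoll_event events[16];
                int i, n = epoll_wait(ep, events, 16, timeout_ms);

                for (i = 0; i < n; i++)
                        handle_fd(events[i].data.fd);
        }
}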

> 12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.000051>
> 12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
>   

I hope this is your addition.

> 12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
> 12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.000019>
>   

This wouldn't be necessary with io_getevents().

> 12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.000085>
>   

...

> 12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.000020>
> 12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.000019>
>   

Great.

> 12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
> 12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.000026>
> 12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.000022>
> 12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.000018>
> 12:37:40.253477 write(5, "\0", 1)       = 1 <0.000022>
>   

The write is to wake someone up.  Who?

> 12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
>   

Clearing up signalfd...

> 12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.000019>
> 12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304RT\0\0224V\10\0E\0\0T\255\262\0\0@\1G"..., 98}], 2) = 108 <0.000062>
> 12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
>   

Injecting the interrupt.

> 12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.000019>
> 12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
> 12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.000022>
> 12:37:40.255001 read(4, "\0", 512)      = 1 <0.000021>
>   

Great,  the write() was to wake ourselves up.

> 12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.000018>
> 12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.000019>
> 12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.000019>
> 12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.000018>
> 12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 14000000}}, NULL) = 0 <0.000021>
> 12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.000019>
> 12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.000018>
> 12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>
>
>   

This is the vcpu thread:

> Pid 10262:
> 12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.000018>
> 12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.000021>
> 12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0 <0.000024>
> 12:37:40.252868 ioctl(11, 0xae80, 0)    = 0 <0.001171>
> 12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
> 12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.000019>
> 12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.000017>
> 12:37:40.254693 ioctl(11, 0xae80 <unfinished ...>
>   

Looks like the interrupt from the iothread was injected and delivered 
before the iothread could give up the mutex, so we needed to wait here.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-03 16:28                                                   ` Gregory Haskins
@ 2009-04-05 10:00                                                     ` Avi Kivity
  0 siblings, 0 replies; 146+ messages in thread
From: Avi Kivity @ 2009-04-05 10:00 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Anthony Liguori, Andi Kleen, linux-kernel, agraf, pmullaney,
	pmorreale, rusty, netdev, kvm

Gregory Haskins wrote:
>   
>>> 2) the vbus-proxy and kvm-guest patch go away
>>> 3) the kvm-host patch changes to work with coordination from the
>>> userspace-pci emulation for things like MSI routing
>>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>>> instantiates on the bus (and can communicate changes
>>>   
>>>       
>> Don't userstand.  What's this MSI shim?
>>     
>
> Well, if the device model was an object in vbus down in the kernel, yet
> PCI emulation was up in qemu, presumably we would want something to
> handle things like PCI config-cycles up in userspace.  Like, for
> instance, if the guest re-routes the MSI.  The shim/proxy would handle
> the config-cycle, and then turn around and do an ioctl to the kernel to
> configure the change with the in-kernel device model (or the irq
> infrastructure, as required).
>   

Right, this is how it should work.  All the gunk in userspace.

> But, TBH, I haven't really looked into whats actually required to make
> this work yet.  I am just spitballing to try to find a compromise.
>   

One thing I thought of trying to get this generic is to use file 
descriptors as irq handles.  So:

- userspace exposes a PCI device (same as today)
- guest configures its PCI IRQ (using MSI if it supports it)
- userspace handles this by calling KVM_IRQ_FD which converts the irq to 
a file descriptor
- userspace passes this fd to the kernel, or another userspace process
- end user triggers guest irqs by writing to this fd

We could do the same with hypercalls:

- guest and host userspace negotiate hypercall use through PCI config space
- userspace passes an fd to the kernel
- whenever the guest issues a hypercall, the kernel writes the 
arguments to the fd
- other end (in kernel or userspace) processes the hypercall
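
In userspace terms the "irq handle" end of this could be as simple as
the sketch below; the binding step would be the hypothetical KVM_IRQ_FD
ioctl from above, which doesn't exist today, so it only appears as a
comment:

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* The eventfd is the handle that gets passed around.  Binding it to a
 * guest irq would be done with the hypothetical KVM_IRQ_FD ioctl
 * discussed above (not implemented anywhere yet). */
static int make_irq_handle(void)
{
        return eventfd(0, 0);
}

static void raise_irq(int irqfd)
{
        uint64_t one = 1;

        write(irqfd, &one, sizeof(one));        /* end user triggers the irq */
}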


> No, you are confusing the front-end and back-end again ;)
>
> The back-end remains, and holds the device models as before.  This is
> the "vbus core".  Today the front-end interacts with the hypervisor to
> render "vbus" specific devices.  The proposal is to eliminate the
> front-end, and have the back end render the objects on the bus as PCI
> devices to the guest.  I am not sure if I can make it work, yet.  It
> needs more thought.
>   

It seems to me this already exists: it's the qemu device model.

The host kernel doesn't need any knowledge of how the devices are 
connected, even if it does implement some of them.

>> .  I don't think you've yet set down what its advantages are.  Being
>> pure and clean doesn't count, unless you rip out PCI from all existing
>> installed hardware and from Windows.
>>     
>
> You are being overly dramatic.  No one has ever said we are talking
> about ripping something out.  In fact, I've explicitly stated that PCI
> can coexist peacefully.    Having more than one bus in a system is
> certainly not without precedent (PCI, scsi, usb, etc).
>
> Rather, PCI is PCI, and will always be.  PCI was designed as a
> software-to-hardware interface.  It works well for its intention.  When
> we do full emulation of guests, we still do PCI so that all that
> software that was designed to work software-to-hardware still continue
> to work, even though technically its now software-to-software.  When we
> do PV, on the other hand, we no longer need to pretend it is
> software-to-hardware.  We can continue to use an interface designed for
> software-to-hardware if we choose, or we can use something else such as
> an interface designed specifically for software-to-software.
>
> As I have stated, PCI was designed with hardware constraints in mind. 
> What if I don't want to be governed by those constraints?  

I'd agree with all this if I actually saw a constraint in PCI.  But I don't.

> What if I
> don't want an interrupt per device (I don't)?   

Don't.  Though I think you do want one, even multiple interrupts per device.

> What do I need BARs for
> (I don't)?  

Don't use them.

> Is a PCI PIO address relevant to me (no, hypercalls are more
> direct)?  Etc.  Its crap I dont need.
>   

So use hypercalls.

> All I really need is a way to a) discover and enumerate devices,
> preferably dynamically (hotswap), and b) a way to communicate with those
> devices.  I think you are overstating the the importance that PCI plays
> in (a), and are overstating the complexity associated with doing an
> alternative.  

Given that we have PCI, why would we do an alternative?

It works, it works with Windows, the nasty stuff is in userspace.  Why 
expend effort on an alternative?  Instead make it go faster.

> I think you are understating the level of hackiness
> required to continue to support PCI as we move to new paradigms, like
> in-kernel models.  

The kernel need know nothing about PCI, so I don't see how you work this 
out.

> And I think I have already stated that I can
> establish a higher degree of flexibility, and arguably, performance for
> (b).  

You've stated it, but failed to provide arguments for it.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05  3:44     ` Rusty Russell
  2009-04-05  8:06       ` Avi Kivity
@ 2009-04-05 14:13       ` Anthony Liguori
  2009-04-05 16:10         ` Avi Kivity
  1 sibling, 1 reply; 146+ messages in thread
From: Anthony Liguori @ 2009-04-05 14:13 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, linux-kernel, agraf, pmullaney, pmorreale, netdev, kvm

Rusty Russell wrote:
> On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
>   
>> Rusty Russell wrote:
>>     
>>> As you point out, 350-450 is possible, which is still bad, and it's at least
>>> partially caused by the exit to userspace and two system calls.  If virtio_net
>>> had a backend in the kernel, we'd be able to compare numbers properly.
>>>       
>> I doubt the userspace exit is the problem.  On a modern system, it takes 
>> about 1us to do a light-weight exit and about 2us to do a heavy-weight 
>> exit.  A transition to userspace is only about ~150ns, the bulk of the 
>> additional heavy-weight exit cost is from vcpu_put() within KVM.
>>     
>
> Just to inject some facts, servicing a ping via tap (ie host->guest then
> guest->host response) takes 26 system calls from one qemu thread, 7 from
> another (see strace below). Judging by those futex calls, multiple context
> switches, too.
>   

N.B. we're not optimized for latency today.  With the right 
infrastructure in userspace, I'm confident we could get this down.

What we need is:

1) Lockless MMIO/PIO dispatch (there should be two IO registration 
interfaces, a new lockless one and the legacy one)
2) A virtio-net thread that's independent of the IO thread.

It would be interesting to count the number of syscalls required in the 
lguest path since that should be a lot closer to optimal.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05 14:13       ` Anthony Liguori
@ 2009-04-05 16:10         ` Avi Kivity
  2009-04-05 16:45           ` Anthony Liguori
  0 siblings, 1 reply; 146+ messages in thread
From: Avi Kivity @ 2009-04-05 16:10 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Rusty Russell, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Anthony Liguori wrote:
>
> What we need is:
>
> 1) Lockless MMIO/PIO dispatch (there should be two IO registration 
> interfaces, a new lockless one and the legacy one)

Not sure exactly how much this is needed, since when there is no 
contention, locks are almost free (there's the atomic and cacheline 
bounce, but no syscall).

For any long operations, we should drop the lock (of course we need some 
kind of read/write lock or rcu to avoid hotunplug or reconfiguration).
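
For the lookup itself I'm thinking of something like this hand-waving
sketch (not actual kvm code; the update/hotunplug side would use
rcu_assign_pointer() plus synchronize_rcu() before freeing an entry):

#include <linux/rcupdate.h>
#include <linux/types.h>

/* Hand-waving sketch: find the device for a pio/mmio address under
 * rcu, so the fast path takes no lock at all. */
struct io_range {
        u64             start, len;
        void            (*handler)(void *opaque, u64 addr, u32 val);
        void            *opaque;
        struct io_range *next;
};

static struct io_range *io_ranges;

static bool io_dispatch(u64 addr, u32 val)
{
        struct io_range *r;
        bool handled = false;

        rcu_read_lock();
        for (r = rcu_dereference(io_ranges); r; r = rcu_dereference(r->next)) {
                if (addr >= r->start && addr - r->start < r->len) {
                        r->handler(r->opaque, addr, val);
                        handled = true;
                        break;
                }
        }
        rcu_read_unlock();

        return handled;
}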

> 2) A virtio-net thread that's independent of the IO thread.

Yes -- that saves us all the select() prologue (calculating new timeout) 
and the select() itself.



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [RFC PATCH 00/17] virtual-bus
  2009-04-05 16:10         ` Avi Kivity
@ 2009-04-05 16:45           ` Anthony Liguori
  0 siblings, 0 replies; 146+ messages in thread
From: Anthony Liguori @ 2009-04-05 16:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, Gregory Haskins, linux-kernel, agraf, pmullaney,
	pmorreale, netdev, kvm

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> What we need is:
>>
>> 1) Lockless MMIO/PIO dispatch (there should be two IO registration 
>> interfaces, a new lockless one and the legacy one)
>
> Not sure exactly how much this is needed, since when there is no 
> contention, locks are almost free (there's the atomic and cacheline 
> bounce, but no syscall).

There should be no contention, but I strongly suspect there is contention 
more often than we think.  The IO thread can potentially hold the lock for 
a very long period of time; consider things like qcow2 metadata 
read/write, VNC server updates, etc.

> For any long operations, we should drop the lock (of course we need 
> some kind of read/write lock or rcu to avoid racing with hotunplug or 
> reconfiguration).
>
>> 2) A virtio-net thread that's independent of the IO thread.
>
> Yes -- that saves us all the select() prologue (calculating new 
> timeout) and the select() itself.

In an ideal world, we could do the submission via io_submit in the VCPU 
context and not worry about the copy latency (because we're zero copy).  
Then our packet transmission latency is consistently low because the 
path is consistent and lockless.  This is why dropping the lock is so 
important: it's not enough to usually have low latency; we need latency 
to be as low as possible, as often as possible.
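
To make the shape of that concrete, the sketch below uses Linux AIO (libaio)
to submit a write from the caller's context and reap completions elsewhere.
The io_setup/io_prep_pwrite/io_submit/io_getevents calls are the real libaio
API, but the surrounding function names are hypothetical, this is not the
actual zero-copy virtio-net path, and whether the kernel completes the request
truly asynchronously depends on the fd type.  Build with -laio:

#include <libaio.h>
#include <stdio.h>

/* Queue a write from the caller's (e.g. vcpu) context without waiting.
 * buf must stay valid until the completion is reaped. */
static int submit_write(io_context_t ctx, int fd, void *buf, size_t len)
{
    struct iocb  cb;
    struct iocb *cbs[1] = { &cb };

    io_prep_pwrite(&cb, fd, buf, len, 0);
    return io_submit(ctx, 1, cbs);      /* number submitted, or -errno */
}

/* A different thread can collect completions later. */
static int reap_completions(io_context_t ctx)
{
    struct io_event events[16];

    return io_getevents(ctx, 1, 16, events, NULL);   /* NULL: wait forever */
}

int main(void)
{
    io_context_t ctx = 0;

    if (io_setup(16, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }
    /* ... open an fd, submit_write(), reap_completions() ... */
    io_destroy(ctx);
    return 0;
}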

Regards,

Anthony Liguori
>
>

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: Re: [RFC PATCH 00/17] virtual-bus
  2009-04-02 18:18                                             ` Anthony Liguori
  2009-04-03  1:11                                               ` Herbert Xu
@ 2009-04-20 18:02                                               ` Alex Williamson
  1 sibling, 0 replies; 146+ messages in thread
From: Alex Williamson @ 2009-04-20 18:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Gregory Haskins, Andi Kleen, linux-kernel, agraf,
	pmullaney, pmorreale, rusty, netdev, kvm

On Thu, 2009-04-02 at 13:18 -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
> > Anthony Liguori wrote:
> >>> I don't think we even need that to end this debate.  I'm convinced 
> >>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping 
> >>> latency of around 300ns whereas it's only 50ns on the host.  This 
> >>> defies logic so I'm now looking to isolate why that is.
> >>
> >> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
> >> were the big winner... I hate qemu sometimes.
> 
> Anyway, if we're able to send this many packets, I suspect we'll also be 
> able to handle much higher throughputs without TX mitigation, so that's 
> what I'm going to look at now.

Anthony,

Any news on this?  I'm anxious to see virtio-net performance on par with
the virtual-bus results.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 146+ messages in thread

end of thread, other threads:[~2009-04-20 18:02 UTC | newest]

Thread overview: 146+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-31 18:42 [RFC PATCH 00/17] virtual-bus Gregory Haskins
2009-03-31 18:42 ` [RFC PATCH 01/17] shm-signal: shared-memory signals Gregory Haskins
2009-03-31 20:44   ` Avi Kivity
2009-03-31 20:58     ` Gregory Haskins
2009-03-31 21:05       ` Avi Kivity
2009-04-01 12:12         ` Gregory Haskins
2009-04-01 12:24           ` Avi Kivity
2009-04-01 13:57             ` Gregory Haskins
2009-03-31 18:42 ` [RFC PATCH 02/17] vbus: add virtual-bus definitions Gregory Haskins
2009-04-02 16:06   ` Ben Hutchings
2009-04-02 18:13     ` Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 03/17] vbus: add connection-client helper infrastructure Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 04/17] vbus: add bus-registration notifiers Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 05/17] vbus: add a "vbus-proxy" bus model for vbus_driver objects Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 06/17] ioq: Add basic definitions for a shared-memory, lockless queue Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 07/17] ioq: add vbus helpers Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 08/17] venet: add the ABI definitions for an 802.x packet interface Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 09/17] net: Add vbus_enet driver Gregory Haskins
2009-03-31 20:39   ` Stephen Hemminger
2009-04-02 11:43     ` Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 10/17] venet-tap: Adds a "venet" compatible "tap" device to VBUS Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 11/17] venet: add scatter-gather support Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 12/17] venettap: " Gregory Haskins
2009-03-31 18:43 ` [RFC PATCH 13/17] x86: allow the irq->vector translation to be determined outside of ioapic Gregory Haskins
2009-03-31 19:16   ` Alan Cox
2009-03-31 20:02     ` Gregory Haskins
2009-03-31 18:44 ` [RFC PATCH 14/17] kvm: add a reset capability Gregory Haskins
2009-03-31 19:22   ` Avi Kivity
2009-03-31 20:02     ` Gregory Haskins
2009-03-31 20:18       ` Avi Kivity
2009-03-31 20:37         ` Gregory Haskins
2009-03-31 18:44 ` [RFC PATCH 15/17] kvm: add dynamic IRQ support Gregory Haskins
2009-03-31 19:20   ` Avi Kivity
2009-03-31 19:39     ` Gregory Haskins
2009-03-31 20:13       ` Avi Kivity
2009-03-31 20:32         ` Gregory Haskins
2009-03-31 20:59           ` Avi Kivity
2009-03-31 18:44 ` [RFC PATCH 16/17] kvm: Add VBUS support to the host Gregory Haskins
2009-03-31 18:44 ` [RFC PATCH 17/17] kvm: Add guest-side support for VBUS Gregory Haskins
2009-03-31 20:18 ` [RFC PATCH 00/17] virtual-bus Andi Kleen
2009-04-01 12:03   ` Gregory Haskins
2009-04-01 13:23     ` Andi Kleen
2009-04-01 14:19       ` Gregory Haskins
2009-04-01 14:42         ` Gregory Haskins
2009-04-01 17:01         ` Andi Kleen
2009-04-01 18:45           ` Anthony Liguori
2009-04-01 20:40             ` Chris Wright
2009-04-01 21:11               ` Gregory Haskins
2009-04-01 21:28                 ` Chris Wright
2009-04-01 22:10                   ` Gregory Haskins
2009-04-02  6:00                     ` Chris Wright
2009-04-02  3:11               ` Herbert Xu
2009-04-01 21:09             ` Gregory Haskins
2009-04-02  0:29               ` Anthony Liguori
2009-04-02  3:11                 ` Gregory Haskins
2009-04-02  6:51               ` Avi Kivity
2009-04-02  8:52                 ` Herbert Xu
2009-04-02  9:02                   ` Avi Kivity
2009-04-02  9:16                     ` Herbert Xu
2009-04-02  9:27                       ` Avi Kivity
2009-04-02  9:29                         ` Herbert Xu
2009-04-02  9:33                           ` Herbert Xu
2009-04-02  9:38                           ` Avi Kivity
2009-04-02  9:41                             ` Herbert Xu
2009-04-02  9:43                               ` Avi Kivity
2009-04-02  9:44                                 ` Herbert Xu
2009-04-02 11:06                             ` Gregory Haskins
2009-04-02 11:59                               ` Avi Kivity
2009-04-02 12:30                                 ` Gregory Haskins
2009-04-02 12:43                                   ` Avi Kivity
2009-04-02 13:03                                     ` Gregory Haskins
2009-04-02 12:13                               ` Rusty Russell
2009-04-02 12:50                                 ` Gregory Haskins
2009-04-02 12:52                                   ` Gregory Haskins
2009-04-02 13:07                                   ` Avi Kivity
2009-04-02 13:22                                     ` Gregory Haskins
2009-04-02 13:27                                       ` Avi Kivity
2009-04-02 14:05                                         ` Gregory Haskins
2009-04-02 14:50                                     ` Herbert Xu
2009-04-02 15:00                                       ` Avi Kivity
2009-04-02 15:40                                         ` Herbert Xu
2009-04-02 15:57                                           ` Avi Kivity
2009-04-02 16:09                                             ` Herbert Xu
2009-04-02 16:54                                               ` Avi Kivity
2009-04-02 17:06                                                 ` Herbert Xu
2009-04-02 17:17                                                   ` Herbert Xu
2009-04-03 12:25                                                   ` Avi Kivity
2009-04-02 15:10                                 ` Michael S. Tsirkin
2009-04-03  4:43                                   ` Jeremy Fitzhardinge
2009-04-02 10:55                     ` Gregory Haskins
2009-04-02 11:48                       ` Avi Kivity
2009-04-03 10:58                     ` Gerd Hoffmann
2009-04-03 11:03                       ` Avi Kivity
2009-04-03 11:12                         ` Herbert Xu
2009-04-03 11:46                           ` Avi Kivity
2009-04-03 11:48                             ` Herbert Xu
2009-04-03 11:54                               ` Avi Kivity
2009-04-03 11:55                                 ` Herbert Xu
2009-04-03 12:02                                   ` Avi Kivity
2009-04-03 13:05                                     ` Herbert Xu
2009-04-03 11:18                       ` Andi Kleen
2009-04-03 11:34                         ` Herbert Xu
2009-04-03 11:46                         ` Avi Kivity
2009-04-03 11:28                       ` Gregory Haskins
2009-04-02 10:46                 ` Gregory Haskins
2009-04-02 11:43                   ` Avi Kivity
2009-04-02 12:22                     ` Gregory Haskins
2009-04-02 12:42                       ` Avi Kivity
2009-04-02 12:54                         ` Gregory Haskins
2009-04-02 13:08                           ` Avi Kivity
2009-04-02 13:36                             ` Gregory Haskins
2009-04-02 13:45                               ` Avi Kivity
2009-04-02 14:24                                 ` Gregory Haskins
2009-04-02 14:32                                   ` Avi Kivity
2009-04-02 14:41                                     ` Avi Kivity
2009-04-02 14:49                                       ` Anthony Liguori
2009-04-02 16:09                                         ` Anthony Liguori
2009-04-02 16:19                                           ` Avi Kivity
2009-04-02 18:18                                             ` Anthony Liguori
2009-04-03  1:11                                               ` Herbert Xu
2009-04-20 18:02                                               ` Alex Williamson
2009-04-03 12:03                                           ` Gregory Haskins
2009-04-03 12:15                                             ` Avi Kivity
2009-04-03 13:13                                               ` Gregory Haskins
2009-04-03 13:37                                                 ` Avi Kivity
2009-04-03 16:28                                                   ` Gregory Haskins
2009-04-05 10:00                                                     ` Avi Kivity
2009-04-02  3:09             ` Herbert Xu
2009-04-02  6:46               ` Avi Kivity
2009-04-02  8:54                 ` Herbert Xu
2009-04-02  9:03                   ` Avi Kivity
2009-04-02  9:05                     ` Herbert Xu
2009-04-01 20:29           ` Gregory Haskins
2009-04-01 22:23             ` Andi Kleen
2009-04-01 23:05               ` Gregory Haskins
2009-04-01  6:08 ` Rusty Russell
2009-04-01 11:35   ` Gregory Haskins
2009-04-02  1:24     ` Rusty Russell
2009-04-02  2:27       ` Gregory Haskins
2009-04-01 16:10   ` Anthony Liguori
2009-04-05  3:44     ` Rusty Russell
2009-04-05  8:06       ` Avi Kivity
2009-04-05 14:13       ` Anthony Liguori
2009-04-05 16:10         ` Avi Kivity
2009-04-05 16:45           ` Anthony Liguori
2009-04-02  3:15   ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).